TAP LMS Architecture

Live single-source-of-truth for the Summer Program (SP) pipeline as it stands on 2026-05-16, post CR-001, CR-002 v2, CR-003, CR-004, CR-005, and CR-006. Cite this doc before designing new work; supersede it via CRs and ADRs rather than editing in place. References throughout use path/to/file.py:LINE form.

1. System overview

TAP LMS is a Frappe v15 custom app (tap_lms) running on Postgres. The Summer Program is a multi-week, archetype-driven, Glific-integrated student learning journey. A cohort lives on a Batch, a per-cohort run lives on a BatchProgramRun (BPR), and a per-student row lives on a ProgramEnrollment (PE). Each PE carries a resolved_flow_state (the state-machine label) and a next_action_at (when the scheduler should next look at it). State transitions are numbered T0–T25 in summer_program/state_machine.py. Every state change calls Glific to maintain group membership and to push contact fields back so the WhatsApp flows can branch on backend state.

Integrations the SP depends on:

  • Glific — WhatsApp messaging. Two surfaces: contact fields (push state to Glific via update_contact_fields) and groups (CR-005 kind-keyed collections: main, escalation, binge_paused, program_dropped, program_completed). Glific flows call back into the backend via the whitelisted update_flow_status endpoint and the save_submission endpoint.
  • RabbitMQ — feedback pipeline broker. Submissions are published to RabbitMQ by save_submission.enqueue_submission; the FeedbackConsumer external service drains the queue, generates AI feedback, writes back to the Submission row, then calls back into the backend via feedback_consumer_hook.on_feedback_ready.
  • Vocallabs — parent-call automation. CR-003 added a parent_call escalation channel; summer_program/vocallabs.py is the integration module with its own retry+DLQ budget.

2. Tech stack and invariants

| Layer | Choice | Notes |
|---|---|---|
| Web framework | Frappe v15 | Custom app tap_lms, module path tap_lms.summer_program.* |
| Database | Postgres | No MariaDB compat layer; MariaDB fallback branches are dead code (L-007) |
| Background jobs | RQ workers via frappe.enqueue | Queues used: short, default, long |
| Broker | RabbitMQ (pika) | Publisher-confirms + persistent delivery (delivery_mode=2); see save_submission.py:725-737 |
| Messaging | Glific GraphQL API | Auth token cached in Glific Settings; mutations: updateContact, updateGroupContacts, startContactFlow, createGroup |
| Voice | Vocallabs HTTP API | Token cached in Redis under VOCALLABS_TOKEN_CACHE_KEY; default TTL 3600s |

Cross-cutting invariants (project-wide rules — see .claude/memory/lessons.md):

  • L-005 (IN/array trap) — Frappe's PG param wrapper mangles WHERE col IN %s when binding a 2-tuple with an inner sequence. Use flat IN (%s, %s, ...) with scalar params, or = ANY(%s::text[]) with explicit cast. Repeats across pe_dispatcher.py:99-122, pe_dispatcher.py:172-187, save_submission.py:292-307, _record_delivery_failure at pe_dispatcher.py:721-733.
  • L-008 (Glific flat-map contract) — Glific flows read @results.webhook.<field> directly. Whitelisted endpoints write the response via frappe.local.response.update(...) (no message: wrapper, scalar values only). Contact-field names are public contract — never rename without auditing every flow. See flow_callback.py:56-89 and docs/api-standard-glific.md.
  • L-010 (journey_label idempotency) — every state-changing UPDATE includes WHERE journey_label = <expected> plus RETURNING name. A 0-row return means a parallel handler already moved the state; skip gracefully. Anchor pattern in the dispatcher's atomic claim — pe_dispatcher.py:139-152.
  • L-011 (atomic counter) — counters that race with the T19 weekly reset use SET col = COALESCE(col, 0) + %s. Read-then-write is unsafe. See activity_points.py:157-176, quiz_points.py:97-106, _record_delivery_failure at pe_dispatcher.py:724-733.
  • L-012 (per-attempt audit lock) — write the audit field (e.g. StudentQuizAttempt.points_earned) FIRST, then do rollups. The audit field is the idempotency anchor — quiz_points.py:75-78.
  • L-015 (retry+DLQ) — every external write (Glific, RabbitMQ, Vocallabs, group-write) carries P-007 retry+DLQ. State has already committed in PG; a silent failure leaves Glific stale or submissions lost.
  • L-017 (no commit in handlers) — controllers and flow handlers must not call frappe.db.commit(). Frappe commits at request-end; mid-handler commits poison transaction isolation and break test rollback.
  • L-024 (two-folder deploy): app/ is the working folder; ~/frappe-bench/apps/tap_lms/ is the bench checkout. Edits flow working → bench → git, then git pull back into the working folder. Don't re-copy after the bench has pulled.
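The L-005 workaround (flat scalar placeholders instead of a nested sequence bound to a single IN %s) can be sketched as a small helper. This is a minimal illustration rather than code from the repository: the helper name flat_in_clause and the example query are hypothetical.

```python
from typing import Sequence, Tuple


def flat_in_clause(values: Sequence[str]) -> Tuple[str, tuple]:
    """Return ("(%s, %s, ...)", params) with one scalar placeholder per value.

    Hypothetical helper illustrating L-005: never bind a nested sequence
    to a single "IN %s" placeholder.
    """
    if not values:
        # "IN ()" is invalid SQL, so the caller must handle the empty case.
        raise ValueError("flat_in_clause needs at least one value")
    placeholders = ", ".join(["%s"] * len(values))
    return f"({placeholders})", tuple(values)


clause, params = flat_in_clause(["active", "paused"])
query = f'SELECT name FROM "tabProgramEnrollment" WHERE program_status IN {clause}'
```

The resulting query string and the flat params tuple would then be handed to the database layer together, so every bound value is a scalar.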

3. Key DocTypes

3.1 Student

The contact-level identity row. Lives in tap_lms/doctype/student/.

  • Student.name1 — canonical display name (NOT student_name, which does not exist). Resolved via summer_program/utils.py:get_student_display_name.
  • Student.name — the document name and the student ID (e.g. ST00051383); there is no separate student_id column.
  • Student.glific_id — Glific contact ID; null for students never synced.
  • Student.archetype — one of dormant | fence_sitter | irregular_submitter | submitter (lowercase snake_case per constants.py:18-28). Upstream-supplied data: set by CSV import, BigQuery sync, or manual entry. The SP backend never auto-assigns, randomizes, or recomputes this field. Task #13 (splitter) was explicitly cancelled — ADR-004 audit log.
  • Student.experiment_arm — one of default | arm_a | arm_b. Also upstream-supplied.
  • Student.phone — used by Vocallabs as the parent's phone (the student uses the parent's device — CR-003 §"Phone resolution").

3.2 Batch

Per-cohort container. tap_lms/doctype/batch/batch.json:1-100.

| Field | Type | Purpose |
|---|---|---|
| batch_id | Data (unique) | Public batch identifier used in collection labels |
| start_date | Date | Cohort start; drives check_auto_activate and week calendar |
| end_date | Date | Cohort end |
| program_type | Select (Summer / Regular) | Owned by Batch (NOT BPR; clarification 2026) |
| total_weeks | Int | Length of the program (typically 8) |
| grace_window_days | Int | Per-cohort grace duration; per-week clock duration (CR-003) |
| current_calendar_week | Int | The "weekly Monday" cursor advanced by batch_admin.auto_advance_batch_week |

3.3 BatchProgramRun (BPR)

The per-run wrapper around a Batch, holding pipeline status, validation report, Glific flow IDs, and the embedded pg_collections child table. tap_lms/doctype/batchprogramrun/batchprogramrun.json:1-100.

Key fields:

  • status — Select: draft | importing | enrolling | collections_ready | active | completed (constants.py:172-182).
  • validation_status — Select: not_run | passed | failed. Set by validate_bpr at batch_activation.py:39-151.
  • total_imported, total_enrolled — counters incremented atomically by enrollment.py:213-220.
  • content_delivery_flow, escalation_flow, reengagement_flow, grace_notification_flow, program_complete_flow, binge_info_flow — Glific flow UUIDs. Resolved per action type via constants.py:153-163 ACTION_FLOW_FIELD_MAP. (reengagement_flow and grace_notification_flow reminders are retired per CR-003; the columns remain for legacy callers.)
  • pg_collections — child table of PGCollection rows.
  • pg_onboarding_sets — child table linking to Backend Student Onboarding documents.

3.4 PGCollection (CR-005 kind-keyed)

Child table embedded under BatchProgramRun (istable=1). tap_lms/doctype/pgcollection/pgcollection.json:1-83.

Kind values (collection_membership.py:42-48 COLLECTION_KINDS):

| Kind | Glific group label | Purpose |
|---|---|---|
| main | SP_<batch>_main | Source-of-truth for the weekly content delivery sweep |
| escalation | SP_<batch>_escalation | All PEs in normal_escalation or remedial_escalation |
| binge_paused | SP_<batch>_binge_paused | PEs in paused_binge |
| program_dropped | SP_<batch>_program_dropped | Terminal: PEs dropped via T17 grace expiry or T23 |
| program_completed | SP_<batch>_program_completed | Terminal: PEs completed via T16 |

  • collection_label — Glific group label.
  • glific_group_id — Glific group ID (returned by create_group_if_missing).
  • is_active — Check. CR-005 backfill marks new kind-keyed rows as is_active=1; the legacy archetype-keyed rows (archetype/experiment_arm columns populated, kind null) get marked is_active=0 by the migration patch.
  • member_count — denormalized integer counter, maintained by _bump_member_count at collection_membership.py:191-221. Glific itself is the source of truth; this column lets the weekly cron cheaply skip empty BPRs.
  • archetype, experiment_arm, student_count — legacy fields retained for historical rows; not used by new code.
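The kind-to-label convention above can be sketched in a few lines. This is an illustrative sketch only: the function names and the batch ID "B2026" are hypothetical, and the label pattern follows the SP_<batch>_<kind> form shown in the table.

```python
from typing import Dict

# The five CR-005 kinds, as listed in the table above.
COLLECTION_KINDS = (
    "main",
    "escalation",
    "binge_paused",
    "program_dropped",
    "program_completed",
)


def collection_label(batch_id: str, kind: str) -> str:
    """Build the Glific group label for one kind-keyed collection."""
    if kind not in COLLECTION_KINDS:
        raise ValueError(f"unknown collection kind: {kind!r}")
    return f"SP_{batch_id}_{kind}"


def expected_labels(batch_id: str) -> Dict[str, str]:
    """Map every kind to its label for a batch (hypothetical helper)."""
    return {kind: collection_label(batch_id, kind) for kind in COLLECTION_KINDS}


labels = expected_labels("B2026")  # "B2026" is a made-up batch_id
```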

3.5 ProgramEnrollment (PE)

The per-student-per-batch row that holds every piece of state the dispatcher and state machine read or write. tap_lms/doctype/programenrollment/programenrollment.json.

Identity:

  • enrollment (Data, unique) — public enrollment ID.
  • student, batch, glific_id, course_level, language, experiment_arm, archetype, program_type.

State:

  • resolved_flow_state — Select with 11 values (see §5).
  • journey_label — Select with 12 values (constants.py:82-99). The atomic-claim guard column.
  • last_label_change_at — Datetime.
  • program_status — Select: active | paused | completed | dropped.
  • drop_reason — Data; valid values include grace_expired, delivery_failure, admin, manual.
  • current_expected_submission_type — Select: emoji | word_text_voice | summary_text_voice | image | video | photo_video_artefact.
  • current_path — Select: Core | Remedial.
  • current_tier — Select: Basic | Intermediate | Advanced. Resolved from TIER_BY_WEEK (constants.py:265-268).

Week / scheduling:

  • current_week, max_allowed_week — Int. Week advance respects max_allowed_week to gate binge progression.
  • next_action_at, next_action_type — the per-PE clock the dispatcher reads. Action types: content_delivery, escalation, week_advancement, feedback_notification, grace_check, pause_check, feedback_timeout (constants.py:121-127).
  • current_escalation_step — Int. Set by T2/T4/T8/T10.
  • current_escalation_type — Data. CR-003: written alongside the step counter, pushed to Glific as the contact field escalation_type.

Grace clock (per-week, CR-003):

  • in_grace_window (Check), grace_window_start (Datetime), grace_window_end_at (Datetime). Armed by activity_points.handle_content_log on the week's first VideoClass via atomic Postgres CASE WHEN. Cleared by T3/T7/T9/T17 (primary submission) or by T17/T23 drop.

Submission + delivery:

  • submission_count — Int. Owned exclusively by save_submission._try_claim_primary (save_submission.py:298-307); state-machine transitions no longer bump it (H-AUDIT-4).
  • quiz_completed — Int.
  • last_submission_at — Datetime.
  • delivery_failure_count — Int. Bumped by _record_delivery_failure; T23 fires at MAX_DELIVERY_FAILURES = 3.
  • last_flow_triggered, last_flow_triggered_at — Data / Datetime. Tracking.

Gamification (CR-002 v2, 9 fields):

  • total_activity_points, weekly_activity_points — VideoClass points.
  • total_quiz_points, weekly_quiz_points, bonus_quiz_points — quiz points.
  • total_submission_points, weekly_submission_points — submission points.
  • special_gems — Int. Floored at 0.
  • current_streak — Int.
  • total_points — cumulative roll-up across all three streams.
  • Sticky weekly flags: weekly_video_done (Check, internal-only — NOT pushed to Glific per constants.py:218), weekly_submission_done (Check, pushed as weekly_submission_done contact field).

ProgramEnrollment is the canonical read-side source for all Summer Program helpers (2026-05-19 incident fix). The legacy Student.enrollment child table is written by backend onboarding for non-SP flows; it is NOT updated by start_program_enrollment for new SP batches. Reading from the child table mis-resolves students who are enrolled in the current cohort but whose child rows still point at a previous batch. The following helpers now query ProgramEnrollment directly with program_status IN ('active', 'paused') and ORDER BY creation DESC:

  • _get_active_bpr_for_student (student_progression_sp.py:1802-1845) — finds the active BPR + Batch for a student. Used by get_next_content / get_weekly_content to fail fast with no_active_batch when no enrollment exists.
  • _get_course_level_for_student (student_progression_sp.py:1848-1867) — reads course_level from the matching PE. No legacy fallback (both callers are SP-only paths already gated by _get_active_bpr_for_student).
  • get_student_sp_status (api.py:191-249) — status endpoint; iterates PEs by creation desc and returns the first BPR match.

Regression coverage: tests/test_active_bpr_resolution.py pins the contract (empty child table + active PE → resolves; non-Summer batch skipped; dropped PE ignored).

3.6 ArchetypeConfig

Per-(batch, archetype, experiment_arm, week) row. Replaced the always-16 invariant after the 2026-05-13 ADR-004 supersession (per-tuple validation rule). Read by pe_dispatcher._get_week_rule (pe_dispatcher.py:658-674) and by the validator _validate_archetype_config_before_activation called from activate_bpr (batch_activation.py:192-195).

Fields used by the runtime:

  • expected_submission_type — drives PE.current_expected_submission_type after week advance.
  • core_learning_unit, remedial_learning_unit — content references for the week.

3.7 Submission

The output of save_submission. Lives outside the SP module but is owned conceptually by it. Fields the SP reads:

  • student_id, program_enrollment, week, is_primary.
  • status: Pending / Processing / Completed / Failed; the overall lifecycle.
  • result_status: Select with Pending / Success - Original / Success - Flagged / Failed. Drives the points decision in _compute_submission_points. Writers: feedback_processor._determine_result_status writes Success - Original when the AI says clean and Success - Flagged when plagiarized or AI-generated; save_submission._apply_duplicate_submission_feedback ALSO writes Success - Flagged synchronously for duplicate submissions. The Failed Select option exists but no production code writes it today; it is kept as a defensive value in the points gate.
  • submission_validity: Valid / Invalid. Drives the routing decision in on_feedback_ready; "Invalid" plus WeekRule.submission_validation_enabled=1 routes to Remedial via T6b.
  • overall_feedback, overall_feedback_translated, audio_feedback_url, assign_id, submission_type, submission_text, submission_url, escalation_step_at_submit, created_at.

3.8 StudentQuizAttempt, StudentContentLog, ProgramEventLog

  • StudentQuizAttempt — wired via doc_events["StudentQuizAttempt"]["on_update"] to quiz_points.handle_attempt_update. The points_earned column is the L-012 audit anchor.
  • StudentContentLog — wired via doc_events["StudentContentLog"]["after_insert"] to activity_points.handle_content_log. Filtered to content_type == "VideoClass" AND action == "completed". The points_awarded column is the audit anchor.
  • ProgramEventLog — append-only audit table. Written by summer_program/event_log.py:log_event and log_state_transition. Event types include state_transition, submission_received, flow_completed, path_changed, week_advanced, program_dropped, escalation_sent, delivery_failed, grace_window_entered, activity_points_awarded, quiz_points_awarded, bonus_quiz_points_awarded, feedback_delivered.

4. Lifecycle — enrollment to BPR activation

The pipeline that takes a draft BPR to active. Each step is gated by status (constants.py:172-182 BPR_STATUS_FLOW).

4.1 Bulk enrollment (steps 3a and 3b)

summer_program/enrollment.py:40-154:

  1. start_enrollment(bpr_name) — gathers students from the BPR's onboarding sets via _get_students_for_bpr. Marks BPR status = enrolling. Chunks students at ENROLLMENT_CHUNK_SIZE=100 and enqueues _process_enrollment_chunk on the long queue.
  2. _process_enrollment_chunk — for each student, builds the 6-field bootstrap (archetype, experiment_arm, program_type=Summer, batch_id, course_level, student_name) and enqueues _sync_contact_fields_job with pe_name=f"pre-pe:{sid}" (synthetic so the DLQ log entry is correlatable when there is not yet a PE row). Atomically bumps BatchProgramRun.total_enrolled via raw SQL (enrollment.py:213-220).
  3. setup_collections(bpr_name) — for the 4×2 = 8 archetype/arm combinations, creates the legacy Glific groups via create_or_get_collection and bulk-adds students via add_contacts_to_group_bulk (COLLECTION_BATCH_SIZE=500 per call). Records each in bpr.pg_collections. Marks BPR status = collections_ready. Note: these legacy archetype-keyed PGCollection rows still exist post-CR-005 but are is_active=0 after the migration patch and not used by new code.
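The chunking in step 1 and the synthetic pe_name in step 2 can be sketched as follows. This is a minimal illustration under assumptions: the helper names chunk_students and synthetic_pe_name are hypothetical, and the real code enqueues a background job per chunk rather than returning lists.

```python
from typing import Iterator, List, Sequence

ENROLLMENT_CHUNK_SIZE = 100  # value quoted in the text above


def chunk_students(
    student_ids: Sequence[str], size: int = ENROLLMENT_CHUNK_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` student IDs."""
    for start in range(0, len(student_ids), size):
        yield list(student_ids[start:start + size])


def synthetic_pe_name(student_id: str) -> str:
    """Correlation key used before a ProgramEnrollment row exists."""
    return f"pre-pe:{student_id}"


# 250 made-up student IDs split into chunks of 100, 100 and 50.
chunks = list(chunk_students([f"ST{i:08d}" for i in range(250)]))
```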

4.2 Program enrollment (step 3c — separate)

summer_program/program_enrollment_api.py:40-...:

  • start_program_enrollment(bpr_name) — bulk-creates ProgramEnrollment records. Idempotent — filters out students who already have a PE for the batch. Background chunks via frappe.enqueue.
  • create_program_enrollment(student_id, batch_name) — single-PE entry point (API A6). Sets the 14 baseline Glific contact fields and writes journey_label='enrolled', program_status='active', current_path='Core', current_week=1, resolved_flow_state='normal_content_delivery'.

4.3 Validation (step 4)

batch_activation.validate_bpr (batch_activation.py:38-151) checks:

  1. BPR is in collections_ready.
  2. 8 archetype×arm legacy collections exist on the BPR.
  3. At least one Glific flow is configured.
  4. total_enrolled > 0 and at least one PE row exists.
  5. Batch has start_date and total_weeks.

Result is persisted as bpr.validation_status plus JSON bpr.validation_report.

4.4 Activation (step 5)

batch_activation.activate_bpr (batch_activation.py:154-228) is gated on validation_status == passed:

  1. Hard-fail on per-tuple ArchetypeConfig completeness via _validate_archetype_config_before_activation (replaced the old always-16 rule — ADR-004 audit log 2026-05-13).
  2. Flip BPR status = active, set activated_at = now().
  3. CR-005 _ensure_kind_keyed_pg_collections (batch_activation.py:382-439): idempotently create the 5 kind-keyed PGCollection rows + the matching Glific groups via create_group_if_missing. Each row carries kind, collection_label = f"SP_{batch}_{kind}", glific_group_id, member_count=0, is_active=1.
  4. No per-PE seeding of next_action_at — CR-005 locked decision #4: content delivery is now batch-triggered every Tuesday at 09:00 IST (03:30 UTC) via weekly_content_delivery_trigger against the main group. The legacy _seed_pe_actions function is kept as dead code (operator escape hatch).

4.5 Auto-activation cron

batch_activation.check_auto_activate (batch_activation.py:334-377) is wired to the daily scheduler. It activates BPRs in collections_ready with validation_status=passed whose Batch.start_date <= today.
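The selection rule of the auto-activation cron reduces to a three-part predicate. The sketch below expresses it as a pure function; the function name and argument shape are assumptions for illustration, and the real cron queries the database rather than receiving values.

```python
from datetime import date
from typing import Optional


def is_due_for_auto_activation(
    status: str,
    validation_status: str,
    start_date: Optional[date],
    today: date,
) -> bool:
    """True when a BPR matches the auto-activation criteria described above."""
    return (
        status == "collections_ready"
        and validation_status == "passed"
        and start_date is not None
        and start_date <= today
    )


today = date(2026, 5, 16)
due = is_due_for_auto_activation("collections_ready", "passed", date(2026, 5, 16), today)
not_yet = is_due_for_auto_activation("collections_ready", "passed", date(2026, 6, 1), today)
unvalidated = is_due_for_auto_activation("collections_ready", "failed", date(2026, 5, 1), today)
```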

5. The state machine

summer_program/state_machine.py defines 11 valid resolved_flow_state values (constants.py:50-79) and the named transitions T0–T25. Every transition flows through the transition(...) engine at state_machine.py:64-104, which:

  1. Updates pe.resolved_flow_state and pe.last_label_change_at.
  2. Applies any extra_updates dict.
  3. Calls pe.save(ignore_permissions=True).
  4. Logs to ProgramEventLog via log_state_transition.
  5. Enqueues _sync_contact_fields_job (P-007 retry+DLQ).
  6. CR-005: calls collection_membership.maintain_collections(pe, from_state, to_state) to keep Glific group membership current.

5.1 The 11 states

| State | Path | In active set? | Audit kind |
|---|---|---|---|
| normal_content_delivery | Core | yes (main) | none |
| normal_escalation | Core | no | escalation |
| remedial_content_delivery | Remedial | yes (main) | none |
| remedial_escalation | Remedial | no | escalation |
| grace_waiting | either | no | none (Gap-fill A, CR-005) |
| paused_binge | Core | no | binge_paused |
| submitted_awaiting_feedback | either | yes (main) | none |
| feedback_ready | either | yes (main) | none |
| week_completed | either | yes (main) | none |
| program_completed | terminal | no | program_completed |
| program_dropped | terminal | no | program_dropped |

MAIN_ELIGIBLE_STATES is the set of the 5 states marked yes (main) in the table above (collection_membership.py:22-28). STATE_TO_AUDIT_KIND maps each state to the value in the Audit kind column (collection_membership.py:31-39). Note that STATE_PAUSED_NO_ACTIVITY is RETIRED per CR-003: the Select option remains for legacy rows, but no new transition writes it.

5.2 Transition catalogue (T0–T25)

| ID | From → To | Trigger source | Notes |
|---|---|---|---|
| T0 | (new PE) → normal_content_delivery | scheduler | T0 NO LONGER arms the grace clock; it is armed by the activity-points handler on first VideoClass. state_machine.py:322-339 |
| T1 | normal_content_delivery → normal_content_delivery | flow_callback | Schedules next escalation. state_machine.py:343-358 |
| T2 | normal_content_delivery → normal_escalation | scheduler | Writes current_escalation_step + current_escalation_type. state_machine.py:362-374 |
| T3 | normal_escalation → submitted_awaiting_feedback | flow_callback | Bumps submission points + streak + gems + clears grace clock. state_machine.py:378-414 |
| T4 | normal_escalation → normal_escalation | scheduler | Next escalation step. state_machine.py:418-430 |
| T5 | normal_escalation → grace_waiting | scheduler | Steps exhausted. Defensive grace-clock backfill for legacy PEs. state_machine.py:434-461 |
| T6 | REMOVED (CR-006) | n/a | state_machine.py:464-481 raises if called. Remedial is now reserved for failed-feedback only. |
| T6b | submitted_awaiting_feedback → remedial_content_delivery | microservice | CR-004 path. AI feedback set submission_validity='Invalid' AND WeekRule.submission_validation_enabled=1. (Originally keyed on result_status='failed' per CR-004; switched to submission_validity post-ship; see §9.3.) state_machine.py:485-512 |
| T7 | normal_content_delivery → submitted_awaiting_feedback | flow_callback | Same shape as T3. state_machine.py:516-550 |
| T8 | remedial_content_delivery → remedial_escalation | scheduler | Mirrors T2. state_machine.py:554-564 |
| T9 | remedial_content_delivery → submitted_awaiting_feedback | flow_callback | Mirrors T3. state_machine.py:568-601 |
| T10 | remedial_escalation → remedial_escalation | scheduler | Mirrors T4. state_machine.py:605-617 |
| T11 | remedial_escalation → grace_waiting | scheduler | Mirrors T5. state_machine.py:621-638 |
| T12 | submitted_awaiting_feedback → feedback_ready | microservice | Called from on_feedback_ready. Clears next_action_at/next_action_type; the Glific F5 callback drives the next transition. state_machine.py:642-656 |
| T13 | feedback_ready → week_completed | flow_callback | Glific SP_Feedback_Delivery completed. Schedules week_advancement. state_machine.py:660-666 |
| T14 (a.k.a. T19) | week_completed → normal_content_delivery | scheduler | The week-advance. CR-002 v2 streak/gem compute + weekly resets. Function is named t14_week_advance for historical reasons. state_machine.py:682-768 |
| T15 | week_completed → paused_binge | scheduler | Binge limit (calendar not yet caught up). state_machine.py:772-780 |
| T16 | week_completed → program_completed | scheduler | current_week > total_weeks. state_machine.py:784-791 |
| T17 (sub.) | grace_waiting → submitted_awaiting_feedback | flow_callback | Submission during grace. Clears grace fields. state_machine.py:795-823 |
| T17 (exp.) | grace_waiting → program_dropped | scheduler | CR-003 direct drop. drop_reason='grace_expired'. state_machine.py:837-859. Alias t18_grace_expired retained for one cycle (state_machine.py:867). |
| T18 | DELETED (CR-003) | n/a | Pre-CR-003 transitioned to paused_no_activity + re-engagement loop. |
| T19 (reactivate) | DELETED (CR-003) | n/a | Re-engagement is now inbound-only via SP_Incoming_Router. |
| T20 (reactivate-remedial) | DELETED (CR-003) | n/a | Same reason. |
| T21 | paused_binge → normal_content_delivery | scheduler | Calendar caught up. state_machine.py:886-894 |
| T22 | submitted_awaiting_feedback (no change) | flow_callback | Duplicate submission. Log-only. state_machine.py:898-902 |
| T23 | ANY → program_dropped | dispatcher | System-initiated. reason in {delivery_failure, admin, manual}. Idempotent on already-terminal PEs. state_machine.py:906-937 |
| T24 | ANY → program_dropped | admin | Admin drop. state_machine.py:941-948 |
| T25 | (no change) | scheduler | Delivery-failure log only. state_machine.py:952-959 |

5.3 Idempotency: the journey_label primitive (L-010)

Every state-changing UPDATE in this codebase guards on WHERE journey_label = <expected> and uses RETURNING name. The dispatcher's atomic claim is the canonical example (pe_dispatcher.py:139-152):

UPDATE "tabProgramEnrollment"
   SET next_action_at = NULL
 WHERE name = %s
   AND journey_label = %s
   AND next_action_at IS NOT NULL
RETURNING name

A 0-row return is the signal that a parallel worker, a Glific callback, or save_submission moved the PE under us. The handler then skips without raising. This is the entire reason journey_label exists as a separate column from resolved_flow_state — the label is the "before-image" of the most recent transition and is cheap to compare against.
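The claim-or-skip behaviour can be demonstrated end to end with an in-memory database. The sketch below is a stand-in, not the production code: it uses SQLite instead of Postgres, checks cursor.rowcount instead of RETURNING name, and the table name pe, the row names, and the label "some_other_label" are all hypothetical.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, pe_name: str, expected_label: str) -> bool:
    """Clear next_action_at only if the label still matches what we read.

    Returns True when exactly one row was claimed, False when another
    actor already moved or claimed the row (the 0-row case).
    """
    cur = conn.execute(
        "UPDATE pe SET next_action_at = NULL "
        "WHERE name = ? AND journey_label = ? AND next_action_at IS NOT NULL",
        (pe_name, expected_label),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pe (name TEXT PRIMARY KEY, journey_label TEXT, next_action_at TEXT)"
)
conn.execute("INSERT INTO pe VALUES ('PE-0001', 'enrolled', '2026-05-16 03:30:00')")
conn.execute("INSERT INTO pe VALUES ('PE-0002', 'some_other_label', '2026-05-16 03:30:00')")

first = try_claim(conn, "PE-0001", "enrolled")   # wins the claim
second = try_claim(conn, "PE-0001", "enrolled")  # already claimed, skip
stale = try_claim(conn, "PE-0002", "enrolled")   # label moved under us, skip
```

The second and third calls return False without raising, which mirrors the skip-gracefully rule of L-010.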

6. Per-PE dispatcher

summer_program/pe_dispatcher.py is the brain of the time-based automation. It runs every minute (hooks.py:73-76, cron */1 * * * *) and routes PEs whose next_action_at <= NOW() to per-action handlers.

6.1 The dispatch loop

process_program_actions at pe_dispatcher.py:63-199:

  1. Selects up to DISPATCH_BATCH_SIZE = 1000 PEs (pe_dispatcher.py:60) with:
       • next_action_at <= NOW()
       • program_status IN ('active', 'paused') (paused included so binge-resume is reachable; fix B3)
       • next_action_type != ''
       • ORDER BY next_action_at ASC
       • FOR UPDATE SKIP LOCKED (L-003) so 4 parallel workers each take a different slice.
  2. For each row, runs the atomic claim (L-010), which clears next_action_at conditional on journey_label matching the value read in the SELECT.
  3. On successful claim, calls _dispatch_single(pe_row) which looks up the handler in HANDLER_MAP (pe_dispatcher.py:590-603) and runs it inside a try/except.
  4. After the loop, logs a structured sp_dispatcher line with claimed, skipped, errors, queue_depth.

Scaling envelope (architecture §8.8 sizing math, ADR-003 audit 2026-05-13): 1000-row batch × 1-min cron × 4 workers ≈ 240K actions/hour. A 100K-student week-boundary T19 burst drains in ~25 min. Prerequisite: partial index idx_pe_next_action (task #24).
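The sizing figures follow from simple arithmetic, reproduced here as a worked check. The variable names are illustrative; the inputs are the numbers quoted above.

```python
DISPATCH_BATCH_SIZE = 1000   # rows claimed per worker per cron tick
WORKERS = 4                  # parallel workers
CRON_RUNS_PER_HOUR = 60      # one tick per minute

actions_per_minute = DISPATCH_BATCH_SIZE * WORKERS
actions_per_hour = actions_per_minute * CRON_RUNS_PER_HOUR

burst_size = 100_000         # week-boundary burst of 100K students
drain_minutes = burst_size / actions_per_minute
```

This gives 240,000 actions per hour and a 25-minute drain for the 100K burst, assuming every tick fills its batch and all four workers stay busy.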

6.2 The handlers

| Action type | Handler | What it does |
|---|---|---|
| content_delivery | handle_content_delivery (pe_dispatcher.py:245-275) | Preserved escape hatch; NOT reached in normal flow under CR-005. Logs that it fired and triggers the content flow for a single contact. |
| escalation | handle_escalation (pe_dispatcher.py:278-389) | CR-003 channel-aware. Resolves the step's escalation_type from ArchetypeConfig, transitions PE via T2/T4/T8/T10, then branches: parent_call → enqueue Vocallabs; other types → fire SP_Escalation flow. On step exhaustion, routes to T5/T11 (grace). |
| feedback_timeout | handle_feedback_timeout (pe_dispatcher.py:399-451) | Safety-net fallback. Checks Submission status; if feedback arrived and PE wasn't advanced, fires T12 fallback. Otherwise reschedules for +1h, max 3 retries. |
| week_advancement | handle_week_advancement (pe_dispatcher.py:454-499) | Decides T16 (program done) vs T15 (binge pause) vs T14/T19 (normal advance). |
| grace_check | handle_grace_check (pe_dispatcher.py:502-547) | CR-003 direct drop. If weekly_submission_done=1, no-op. If clock not yet expired, reschedule. Otherwise fire T17 grace-expired. |
| pause_check | handle_pause_check (pe_dispatcher.py:557-583) | Binge-resume check. T21 if calendar caught up; else reschedule +7 days. |

Retired handlers per CR-003: handle_re_engagement and handle_grace_reminder — re-engagement is inbound-only via SP_Incoming_Router and per-week escalation steps replace proactive grace reminders.

6.3 Delivery failure chain

_record_delivery_failure(pe_name) at pe_dispatcher.py:687-740 is the bridge from a failed Glific call to T23 auto-drop:

  1. Atomic UPDATE: delivery_failure_count = COALESCE(delivery_failure_count, 0) + 1 with program_status IN ('active', 'paused') folded into the WHERE so a PE that went terminal can't get its counter bumped (M1 fix 2026-05-13).
  2. RETURNING gives the post-update value; if 0 rows returned, no-op.
  3. If new_count >= MAX_DELIVERY_FAILURES (3), fires t23_auto_drop(pe, reason='delivery_failure').

The helper is shipped but not yet wired in — _trigger_flow is fire-and-forget (pe_dispatcher.py:628-643) and there is no delivery-confirmation webhook from Glific yet. The webhook wire-up is M-AUDIT-8 (deferred Phase 1).
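The counting and threshold logic of the three steps above can be modelled in memory. This is a simplified sketch: the real helper performs the increment as one atomic UPDATE with RETURNING, whereas this version mutates a plain dict, and the function names here are hypothetical.

```python
from typing import Optional

MAX_DELIVERY_FAILURES = 3
COUNTABLE_STATUSES = ("active", "paused")


def record_delivery_failure(pe: dict) -> Optional[int]:
    """Bump the counter and return the new value, or None for terminal PEs."""
    if pe.get("program_status") not in COUNTABLE_STATUSES:
        return None  # mirrors the 0-row no-op
    pe["delivery_failure_count"] = (pe.get("delivery_failure_count") or 0) + 1
    return pe["delivery_failure_count"]


def should_auto_drop(new_count: Optional[int]) -> bool:
    """True once the post-update count reaches the T23 threshold."""
    return new_count is not None and new_count >= MAX_DELIVERY_FAILURES


pe = {"program_status": "active", "delivery_failure_count": None}
counts = [record_delivery_failure(pe) for _ in range(3)]
decisions = [should_auto_drop(c) for c in counts]
```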

7. Content delivery — CR-005 Approach B (state-driven)

The current weekly content delivery model. Replaces the previous per-PE content_delivery scheduling that armed ACTION_CONTENT_DELIVERY on every T0 and T19.

7.1 Continuous membership maintenance

Every state-machine transition (via transition() at state_machine.py:64-104) calls maintain_collections(pe, from_state, to_state) at collection_membership.py:51-86. This:

  1. Computes from_main, to_main (membership in MAIN_ELIGIBLE_STATES) and from_audit, to_audit via STATE_TO_AUDIT_KIND.
  2. If from_main != to_main, enqueues an add/remove on the BPR's main PGCollection.
  3. If from_audit != to_audit, enqueues remove of the old audit kind and add of the new audit kind.

Membership churn is fully delta-based — re-running with same from_state==to_state is a no-op.
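The delta computation can be sketched as a pure function that returns the group operations a transition would enqueue. The state sets below are transcribed from the table in §5.1; the function name membership_ops and the (operation, kind) tuple shape are assumptions for illustration.

```python
from typing import List, Optional, Tuple

MAIN_ELIGIBLE_STATES = frozenset({
    "normal_content_delivery",
    "remedial_content_delivery",
    "submitted_awaiting_feedback",
    "feedback_ready",
    "week_completed",
})

STATE_TO_AUDIT_KIND = {
    "normal_escalation": "escalation",
    "remedial_escalation": "escalation",
    "paused_binge": "binge_paused",
    "program_completed": "program_completed",
    "program_dropped": "program_dropped",
}


def membership_ops(from_state: Optional[str], to_state: str) -> List[Tuple[str, str]]:
    """Return the (operation, kind) pairs implied by one state transition."""
    ops: List[Tuple[str, str]] = []
    from_main = from_state in MAIN_ELIGIBLE_STATES
    to_main = to_state in MAIN_ELIGIBLE_STATES
    if from_main != to_main:
        ops.append(("add" if to_main else "remove", "main"))
    from_audit = STATE_TO_AUDIT_KIND.get(from_state)
    to_audit = STATE_TO_AUDIT_KIND.get(to_state)
    if from_audit != to_audit:
        if from_audit:
            ops.append(("remove", from_audit))
        if to_audit:
            ops.append(("add", to_audit))
    return ops
```

For example, a T2 transition (normal_content_delivery to normal_escalation) yields a remove from main and an add to escalation, while a T4 self-transition yields no operations.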

7.2 Glific group writes (retry + DLQ + counter)

_enqueue_group_write (collection_membership.py:89-111) looks up the kind-keyed PGCollection row and enqueues _group_write_job with enqueue_after_commit=True. The job (collection_membership.py:114-188) wraps add_contact_to_group / remove_contact_from_group (both in glific_integration.py) with P-007 retry (5 immediate retries) + DLQ.

On successful Glific call, _bump_member_count (collection_membership.py:191-221) atomically maintains PGCollection.member_count:

-- add
UPDATE "tabPGCollection"
   SET member_count = COALESCE(member_count, 0) + 1
 WHERE name = %s;
-- remove
UPDATE "tabPGCollection"
   SET member_count = GREATEST(0, COALESCE(member_count, 0) - 1)
 WHERE name = %s;

The GREATEST(0, ...) guard prevents negative drift if a remove fires for a contact that was never in the group.
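The arithmetic of the counter update can be mirrored in plain Python to make the NULL handling and the floor explicit. This is only a model of the SQL expressions, with a hypothetical function name; the real update runs atomically inside Postgres.

```python
from typing import Optional


def next_member_count(current: Optional[int], operation: str) -> int:
    """Model the member_count expressions for an add or a remove."""
    base = current if current is not None else 0  # COALESCE(member_count, 0)
    if operation == "add":
        return base + 1
    if operation == "remove":
        return max(0, base - 1)                   # GREATEST(0, ... - 1)
    raise ValueError(f"unknown operation: {operation!r}")
```

A remove against a NULL or zero counter stays at 0 rather than going negative.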

7.3 The weekly fire

scheduler.weekly_content_delivery_trigger (scheduler.py:223-292) runs Tuesday 03:30 UTC = 09:00 IST (cron 30 3 * * 2 in hooks.py:87-89):

  1. Selects active BPRs with a configured content_delivery_flow.
  2. For each, reads the main PGCollection row, skipping when member_count <= 0.
  3. Calls start_group_flow(content_delivery_flow, main_glific_group_id) — exactly one Glific API call per BPR.

No recompute, no reconcile. Idempotency is operator discipline ("don't manually invoke during the cron window"); Glific deduplicates identical group-flow starts within a short window.
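The three steps of the weekly fire can be sketched as a loop over BPR records with the Glific call injected as a callable. The dict layout (a main_collection entry holding glific_group_id and member_count) and the function name weekly_fire are assumptions for illustration; only the argument order of start_group_flow is taken from the text above.

```python
from typing import Callable, Iterable, List, Mapping, Tuple


def weekly_fire(
    bprs: Iterable[Mapping],
    start_group_flow: Callable[[str, str], None],
) -> int:
    """Fire one group flow per eligible BPR and return how many were fired."""
    fired = 0
    for bpr in bprs:
        # Step 1: only active BPRs with a configured content flow.
        if bpr.get("status") != "active" or not bpr.get("content_delivery_flow"):
            continue
        # Step 2: skip BPRs whose main group is empty.
        main = bpr.get("main_collection") or {}
        if (main.get("member_count") or 0) <= 0:
            continue
        # Step 3: exactly one call per BPR.
        start_group_flow(bpr["content_delivery_flow"], main["glific_group_id"])
        fired += 1
    return fired


calls: List[Tuple[str, str]] = []
sample_bprs = [
    {"status": "active", "content_delivery_flow": "flow-1",
     "main_collection": {"glific_group_id": "g1", "member_count": 12}},
    {"status": "active", "content_delivery_flow": "flow-2",
     "main_collection": {"glific_group_id": "g2", "member_count": 0}},
    {"status": "active", "content_delivery_flow": None,
     "main_collection": {"glific_group_id": "g3", "member_count": 5}},
]
fired = weekly_fire(sample_bprs, lambda flow, group: calls.append((flow, group)))
```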

8. Escalation, grace, and drop

The "weekly nudge" loop, per CR-001 + CR-003 + CR-006.

8.1 Per-week grace clock

activity_points.award_activity_points (activity_points.py:75-202) is the only arming path. On the week's first VideoClass completion (detected via pe.weekly_video_done == 0 pre-UPDATE), it both:

  1. Logs grace_window_entered to ProgramEventLog with the expected grace_window_end_at.
  2. Issues a single atomic UPDATE that bumps activity points, sets weekly_video_done=1, AND arms grace_window_start = NOW(), grace_window_end_at = NOW() + (X || ' days')::interval, in_grace_window=1 — gated by CASE WHEN weekly_video_done = 0 (Postgres evaluates against the OLD row value within the same UPDATE). Second-video-same-week is a no-op for the grace fields.

X is Batch.grace_window_days, falling back to DEFAULT_GRACE_WINDOW_DAYS = 14 (constants.py:281).
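The first-video-only arming rule can be modelled in memory as follows. This is a sketch under assumptions: it omits the activity-points bump, treats an unset or zero grace_window_days as missing, and replaces the single atomic UPDATE with a dict copy; the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_GRACE_WINDOW_DAYS = 14


def apply_video_completion(
    pe: dict, now: datetime, grace_window_days: Optional[int]
) -> dict:
    """Return the PE row after a VideoClass completion.

    The grace fields are armed only when weekly_video_done was 0 on the
    OLD row, mirroring the CASE WHEN gate.
    """
    days = grace_window_days or DEFAULT_GRACE_WINDOW_DAYS
    first_video_this_week = not pe.get("weekly_video_done")
    updated = dict(pe)
    updated["weekly_video_done"] = 1
    if first_video_this_week:
        updated["in_grace_window"] = 1
        updated["grace_window_start"] = now
        updated["grace_window_end_at"] = now + timedelta(days=days)
    return updated


now = datetime(2026, 5, 19, 10, 0)
first = apply_video_completion({"weekly_video_done": 0}, now, None)
second = apply_video_completion(first, now + timedelta(hours=2), None)
```

The second completion in the same week leaves grace_window_end_at untouched, which is the no-op behaviour described above.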

T19 re-arming is automatic: T19 resets weekly_video_done = 0, so the next VideoClass watch re-trips the CASE WHEN and writes a fresh grace_window_end_at. T0 no longer arms the clock (the activity-points handler does — CR-003 follow-up 2026-05-13).

Grace clearing: T3/T7/T9/T17 (primary submission) explicitly null in_grace_window, grace_window_end_at, grace_window_start.

8.2 Escalation channel routing

handle_escalation (pe_dispatcher.py:278-389) resolves the step's escalation_type from ArchetypeConfig and branches:

  • help_note_a / help_note_b / voice_note → fire SP_Escalation Glific flow. Glific reads the contact fields escalation_order + escalation_type and renders the per-channel content.
  • parent_call → enqueue vocallabs.initiate_parent_call. Glific is NOT involved for parent calls.

Defaults: escalation_type defaults to help_note_a if missing. hours_after_previous defaults to 24.
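The branch and its defaults can be sketched as a pure function. route_escalation and its return shape are hypothetical, and raising on an unknown escalation_type is an assumption; the text does not say how handle_escalation treats one.

```python
from typing import Any, Dict, Tuple

GLIFIC_TYPES = {"help_note_a", "help_note_b", "voice_note"}
DEFAULT_ESCALATION_TYPE = "help_note_a"
DEFAULT_HOURS_AFTER_PREVIOUS = 24


def route_escalation(step: Dict[str, Any]) -> Tuple[str, str, int]:
    """Return (channel, target, hours_after_previous) for one escalation step."""
    esc_type = step.get("escalation_type") or DEFAULT_ESCALATION_TYPE
    hours = step.get("hours_after_previous") or DEFAULT_HOURS_AFTER_PREVIOUS
    if esc_type == "parent_call":
        # Glific is not involved; the call is enqueued for Vocallabs.
        return ("vocallabs", "vocallabs.initiate_parent_call", hours)
    if esc_type in GLIFIC_TYPES:
        # One flow for all three types; Glific branches on contact fields.
        return ("glific", "SP_Escalation", hours)
    raise ValueError(f"unknown escalation_type: {esc_type!r}")  # assumed behavior
```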

8.2.1 Vocallabs parent-call flow (cache-on-Student + auto-backfill — task #80/#81, CR-010)

vocallabs.py is the integration module. The 2026-05-24 hardening (CR-010) changed the flow from "addMultipleContactsToGroup every call" to a permanently-cached prospect_id on the Student doctype.

Why the change: Vocallabs enforces a uniqueness constraint on (client_id, prospect_group_id, phone) in their Hasura backend. Without a cache, every repeat call to the same parent's phone hits prospects_client_id_prospect_group_id_phone_key violation and never dials. Original behavior was to treat this as transient and retry 6 times (~75s of wasted worker time per failure, then DLQ). The cached design hits the insert ONCE then reuses the returned prospect_id forever.

Storage: Student.vocallabs_prospect_id (Data, read-only in UI, no_copy=1). Populated by _store_prospect_id(student, prospect_id) via frappe.db.set_value with update_modified=False (no audit-trail bump for what is essentially a tech-side cache).

Call sequencing (_call_vocallabs):

  1. Cache hit (Student.vocallabs_prospect_id populated):
    • POST /b2b/vocallabs/updateContactData with the freshly rendered data block (best-effort: a failure logs to SP Vocallabs UpdateData and the call proceeds).
    • POST /b2b/vocallabs/initiateVocallabsCall with the cached prospect_id.
    • Total: 2 HTTP calls (plus cached auth token).

  2. Cache miss + first encounter (cache empty AND parent phone NOT in Vocallabs):
    • POST /b2b/vocallabs/addMultipleContactsToGroup → returns prospect_id.
    • _store_prospect_id writes to Student.vocallabs_prospect_id (before initiateVocallabsCall fires, so a mid-flight call failure still leaves the cache populated).
    • POST /b2b/vocallabs/initiateVocallabsCall.

  3. Cache miss + auto-backfill (cache empty AND parent phone already in Vocallabs from a prior insert):
    • addMultipleContactsToGroup returns the Hasura uniqueness violation.
    • _is_duplicate_prospect_response detects the prospects_client_id_prospect_group_id_phone_key constraint name OR the GraphQL code: "constraint-violation".
    • _lookup_prospect_id_by_phone paginates GET /b2b/vocallabs/getContacts (limit=200, max 50 pages = 10,000 prospects). Matches by exact phone OR last-10-digits (handles E.164 ↔ 10-digit drift across Indian numbers).
    • On match: cache the recovered UUID, run updateContactData, proceed to initiateVocallabsCall.
    • On miss (phone genuinely not there, or unrecognized response shape): raise PermanentVocallabsError → fail-fast under SP Vocallabs Duplicate Prospect (lookup required) Error Log title.
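The three cases can be sketched against an injected client. This is an illustration under assumptions: the api object and its method names (add_contact, update_data, list_contacts, initiate_call) are hypothetical stand-ins for the endpoints above, the {"duplicate": True} response shape is invented for the sketch, and treating the backfill refresh as best-effort is also an assumption.

```python
from typing import Any, Dict, Optional


class PermanentVocallabsError(Exception):
    """Duplicate prospect that the phone lookup could not recover."""


def phones_match(a: str, b: str) -> bool:
    """Exact match, or equal last 10 digits (E.164 vs bare 10-digit)."""
    da = "".join(ch for ch in a if ch.isdigit())
    db = "".join(ch for ch in b if ch.isdigit())
    return a == b or (len(da) >= 10 and len(db) >= 10 and da[-10:] == db[-10:])


def refresh_status(api: Any, prospect_id: str, data: Dict[str, str]) -> None:
    """Best-effort updateContactData; a failure must not block the call."""
    try:
        api.update_data(prospect_id, data)
    except Exception:
        pass  # the real module logs under "SP Vocallabs UpdateData"


def place_parent_call(student: Dict[str, Optional[str]], phone: str,
                      data: Dict[str, str], api: Any) -> str:
    prospect_id = student.get("vocallabs_prospect_id")
    if prospect_id:  # case 1: cache hit
        refresh_status(api, prospect_id, data)
        return api.initiate_call(prospect_id)

    result = api.add_contact(phone, data)
    if not result.get("duplicate"):  # case 2: first encounter
        prospect_id = result["prospect_id"]
        student["vocallabs_prospect_id"] = prospect_id  # cache before dialing
        return api.initiate_call(prospect_id)

    # case 3: auto-backfill after the uniqueness violation
    match = next((c for c in api.list_contacts()
                  if phones_match(c["phone"], phone)), None)
    if match is None:
        raise PermanentVocallabsError(f"no prospect found for {phone}")
    student["vocallabs_prospect_id"] = match["id"]
    refresh_status(api, match["id"], data)
    return api.initiate_call(match["id"])
```

Writing the cache before initiate_call in cases 2 and 3 is what keeps a failed dial from forcing another insert attempt on the next run.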

Status template freshness: data.status is bound to the prospect record (not the call). On cache hit we MUST refresh via updateContactData before each call, otherwise the agent reads the variables from the original insert (week 1's text on a week 5 call). The team configures the template through ParentCallConfig.status_template, set per week via UnitContentItem on LearningUnit, falling back to VoiceAgentSettings.default_parent_call_config. The code never hard-codes content.

Sibling race caveat: Two Student rows can share one parent phone (multiple kids enrolled, single device). For the cache to work, both Student rows must store the same prospect_id. The auto-backfill handles this: the first sibling's call recovers the existing UUID, and the second sibling's call recovers the same UUID independently. But data.status updates race: if both siblings reach the updateContactData step within the same second, the second updateContactData overwrites the first's variables before the first call has dialed out. Team-side mitigation is to make ParentCallConfig.status_template sibling-agnostic ("Your child hasn't submitted") rather than name-specific.

Single agent limitation (task #48): Today there's one VoiceAgentSettings.agent_id for all calls. Multi-language students get the English agent. Vocallabs is exploring language-detection-from-voice on their side; if that ships, task #48 may become moot. Otherwise the plan is one Vocallabs agent per language with a Student.language → agent_id lookup at call time.

Error Log taxonomy (all under method-column titles starting with SP Vocallabs):

  • SP Vocallabs Retry: transient failure; will retry (P-007 budget).
  • SP Vocallabs DLQ — manual replay required: retries exhausted.
  • SP Vocallabs Duplicate Prospect (lookup required): permanent fail; cache empty AND lookup couldn't recover prospect_id.
  • SP Vocallabs Lookup: pagination diagnostics (only logged on miss/error, not success).
  • SP Vocallabs UpdateData: updateContactData failed; call still placed with stale prospect data.
  • SP Vocallabs Config / SP Vocallabs Template / SP Vocallabs Skipped: early-exit branches.

Verified end-to-end on test server 2026-05-24: 3 test parents (ST00051383 / ST00051387 / ST00051238) all had polluted phones from prior test runs. Phase C confirmed fail-fast under the dedicated Error Log title (3 entries × ~0.6s each vs. yesterday's 12 retries + 2 DLQ over 75s). Phase B (post-fix) confirmed auto-backfill: cache populated with real UUIDs (9f5f108c-..., 9e27ba86-..., 41556126-...), three subsequent cache-hit calls succeeded end-to-end on the API side with valid call_ids returned. Vocallabs-side dial-out failed (unrelated config issue on their end — call_ids handed off to Vocallabs team).

8.3 Step exhaustion (CR-006 unified)

When next_step > len(steps), escalation has exhausted within the week. CR-006 (2026-05-15) removes T6 (escalation_to_remedial) — remedial is now reserved for failed-feedback students via T6b. All zero-submission exhausters take the same path:

  • Core state (normal_content_delivery / normal_escalation) → t5_escalation_to_grace → grace_waiting.
  • Remedial state (remedial_content_delivery / remedial_escalation) → t11_remedial_to_grace → grace_waiting.
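As a sketch, the exhaustion check and the two paths reduce to a small lookup. exhaustion_transition is a hypothetical helper, and returning None for states outside the four listed is an assumption.

```python
from typing import Optional

CORE_STATES = {"normal_content_delivery", "normal_escalation"}
REMEDIAL_STATES = {"remedial_content_delivery", "remedial_escalation"}


def exhaustion_transition(state: str, next_step: int,
                          step_count: int) -> Optional[str]:
    """Return the transition to fire once escalation steps run out."""
    if next_step <= step_count:
        return None  # steps remain this week; nothing is exhausted yet
    if state in CORE_STATES:
        return "t5_escalation_to_grace"
    if state in REMEDIAL_STATES:
        return "t11_remedial_to_grace"
    return None  # other states are not described above; assumed no-op
```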

8.4 Grace expiry → direct drop (CR-001)

handle_grace_check (pe_dispatcher.py:502-547) fires at grace_window_end_at:

  1. If PE has moved out of grace_waiting, no-op.
  2. If weekly_submission_done=1, no-op (student submitted within the window).
  3. If clock hasn't actually expired (skew), reschedule to the exact expiry time.
  4. Otherwise fire t17_grace_expired(pe, "dispatcher") → program_dropped with drop_reason='grace_expired'. No paused_no_activity hop, no re-engagement (those branches were deleted in CR-001 / CR-003).
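The four checks can be read as an ordered decision function. This is a sketch only: grace_check_decision and its tuple return are hypothetical, and the real handler works on the PE document rather than on loose arguments.

```python
from datetime import datetime
from typing import Optional, Tuple, Union

Decision = Tuple[str, Optional[Union[datetime, str]]]


def grace_check_decision(state: str, weekly_submission_done: int,
                         grace_window_end_at: datetime,
                         now: datetime) -> Decision:
    """Apply the checks in the documented order; first match wins."""
    if state != "grace_waiting":
        return ("noop", None)                       # PE already moved on
    if weekly_submission_done == 1:
        return ("noop", None)                       # submitted inside the window
    if now < grace_window_end_at:
        return ("reschedule", grace_window_end_at)  # clock skew: wait for expiry
    return ("drop", "grace_expired")                # t17 with this drop_reason
```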

9. Submission pipeline

The path from a student WhatsApp message to AI feedback delivered back to the student.

9.1 save_submission (save_submission.py:46-237)

Whitelisted as allow_guest=True so Glific can call it. Flow:

  1. Resolve student_id (utils.resolve_student accepts name, glific_id, or phone).
  2. Resolve active PE via get_active_pe (state_machine.py:1005-1034) — uses program_status IN ('active', 'paused') and ORDER BY modified DESC to pick the live row when multiple PEs exist for the same student.
  3. Reject if PE is in a terminal state.
  4. Normalize the payload — _normalize_submission_payload infers submission_type as text | emoji | audio | image | video from the input shape (URL → infer by extension; non-URL → emoji if no alnum chars else text).
  5. Insert Submission FIRST inside a savepoint (save_submission.py:135-160, task #81 audit 2026-05-15). is_primary=False is provisional; flipped after the atomic claim succeeds. Order matters: if the insert failed AFTER claiming primary, retries would see "duplicate" and silently lose the real submission.
  6. _try_claim_primary(pe, week) (save_submission.py:279-315) is the atomic primary claim. It uses UPDATE ... RETURNING:

         UPDATE `tabProgramEnrollment`
         SET journey_label = 'submitted',
             last_label_change_at = NOW(),
             submission_count = COALESCE(submission_count, 0) + 1,
             last_submission_at = NOW()
         WHERE name = %s
           AND journey_label IN (%s, %s, %s, %s, %s)
         RETURNING name

     The 5 valid pre-submission labels are enrolled | content_delivered | grace_window | resumed | week_advanced. RETURNING-empty means duplicate. submission_count is bumped here, exactly once; T3/T7/T9/T17 no longer touch it (H-AUDIT-4, 2026-05-15).
  7. If primary: apply state transition via apply_submission_transition(pe, points=0, "flow_callback") which dispatches on pe.resolved_flow_state to T7/T3/T9/T17 or T22 for duplicates. CR-007 (2026-05-19): points are NOT computed here. AI validation runs asynchronously after save_submission, so the actual point award (depending on Submission.result_status) is deferred to feedback_consumer_hook.on_feedback_ready (§9.3). The transition still bumps streak / gems / weekly_submission_done on every submission regardless of validity (user spec).
  8. Update EngagementState (legacy bridge) and log to StudentContentLog with a savepoint (so a bridge-log insert failure doesn't poison the outer transaction — L-030).
  9. If primary: enqueue process_submission_async (save_submission.py:629-666) which uploads media to GCS, marks submission.status = "Processing", then enqueues enqueue_submission.
  10. enqueue_submission (save_submission.py:669-815) publishes to RabbitMQ via pika:
    • Publisher confirms enabled (channel.confirm_delivery()).
    • Durable queue (queue_declare(durable=True)).
    • Persistent delivery (delivery_mode=2).
    • Mandatory routing (mandatory=True).
    • P-007 retry: 5 retries via FEEDBACK_PIPELINE_MAX_RETRIES, DLQ on exhaustion with a JSON-structured Error Log entry including submission_id, student_id, full pe_context.
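The type inference in step 4 can be sketched as follows. infer_submission_type is a hypothetical name, the extension map is illustrative (the real handler's list is not shown in this document), and falling back to text for an unknown extension is an assumption.

```python
import os
from urllib.parse import urlparse

# Hypothetical extension map, for illustration only.
EXTENSION_TYPES = {
    ".mp3": "audio", ".ogg": "audio",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp4": "video",
}


def infer_submission_type(value: str) -> str:
    """URL input is typed by extension; other input is emoji or text."""
    text = value.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        ext = os.path.splitext(parsed.path)[1].lower()
        return EXTENSION_TYPES.get(ext, "text")  # assumed fallback
    if not any(ch.isalnum() for ch in text):
        return "emoji"  # no letters or digits at all
    return "text"
```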

9.2 FeedbackConsumer (external)

Runs outside this app. Drains RabbitMQ, generates AI feedback, writes back to the Submission row (overall_feedback, overall_feedback_translated, audio_feedback_url, result_status, status="Completed"), and calls back into the backend via feedback_consumer_hook.on_feedback_ready(submission_name, student_id).

9.3 on_feedback_ready (feedback_consumer_hook.py)

  1. Resolve student from submission if not provided.
  2. Find the active PE in submitted_awaiting_feedback. If none, return {"status": "no_pe"}.
  3. Verify the submission's week matches the PE's current_week; mismatch → skip.
  4. Points: CR-007 (2026-05-19), revised 2026-06-15. _compute_submission_points(pe, submission_name, result_status) evaluates branches in this order. The AI-verdict gate sits at the top and overrides every later branch, including the escalation-tier reward for late submissions:
    • Branch 0: result_status == 'Pending' → 0 points. Defensive: the hook contract is that FeedbackConsumer writes the AI verdict before calling on_feedback_ready, so Pending should never reach here; if a race delivers it anyway, treat it as no-verdict and award nothing.
    • Branch 1: WeekRule.submission_validation_enabled = 1 AND result_status in ('Failed', 'Success - Flagged') → 0 points. Pre-2026-06-15 the late branch ran first and shielded sent_count >= 1 + Failed/Flagged from this gate; the spec now zeroes them out so the AI verdict universally trumps the tier reward. 'Failed' is dead schema in production (no writer emits it); it is kept defensively.
    • Branch 2: sent_count >= 1 (late submission, escalation fired) → EscalationStep[sent_count].points_awarded.
    • Branch 3: on-time + valid (result_status == 'Success - Original', OR any non-Pending value in lax mode) → Assignment.points_per_item.

     The WeekRule.submission_validation_enabled flag lives on the ArchetypeConfig.week_rules child rows, keyed (archetype, arm, week). It is operator-owned per archetype. There is no week-number-hardcoded "weeks 1-2 are lax" rule in code; that pattern, where it exists, is a deployment convention set by ArchetypeConfig values.

     Points are applied via atomic SQL UPDATE (_award_submission_points_atomic, L-011 COALESCE pattern). pe.reload() after the bump ensures the subsequent transition's pe.save() doesn't clobber the freshly-bumped columns.
  5. CR-004 + CR-007 routing (current contract): read Submission.submission_validity. If submission_validity == 'Invalid' AND WeekRule.submission_validation_enabled = 1 → fire t6b_failed_feedback_to_remedial. Otherwise → fire t12_feedback_ready. In lax mode (validation off), even invalid submissions stay on Core. (CR-004 originally specified branching on result_status; the implementation was later switched to submission_validity to separate the AI-scoring signal from the validation-gate signal. result_status still governs the points decision in step 4 above; submission_validity governs the routing decision here.)
  6. _sync_contact_fields(pe) re-pushes PE state to Glific so the post-award totals land on the contact fields.
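The branch order for points and the separate routing decision can be sketched as two pure functions. Names and argument shapes are hypothetical (the real function takes the PE and submission name), step_points stands in for EscalationStep[...].points_awarded, and the final return 0 covers a case the text does not specify.

```python
from typing import Mapping

ZEROED_WHEN_VALIDATING = {"Failed", "Success - Flagged"}


def compute_submission_points(result_status: str, validation_enabled: bool,
                              sent_count: int, step_points: Mapping[int, int],
                              points_per_item: int) -> int:
    if result_status == "Pending":                       # Branch 0
        return 0
    if validation_enabled and result_status in ZEROED_WHEN_VALIDATING:
        return 0                                         # Branch 1: AI gate first
    if sent_count >= 1:                                  # Branch 2: late tier
        return step_points.get(sent_count, 0)
    if result_status == "Success - Original" or not validation_enabled:
        return points_per_item                           # Branch 3: on-time, valid
    return 0  # strict mode with an unlisted status: unspecified, assumed 0


def route_after_feedback(submission_validity: str,
                         validation_enabled: bool) -> str:
    """Routing reads submission_validity, not result_status."""
    if submission_validity == "Invalid" and validation_enabled:
        return "t6b_failed_feedback_to_remedial"
    return "t12_feedback_ready"
```

Under these assumptions a late, flagged submission earns 0 in strict mode but still earns its tier reward in lax mode, because Branch 1 only fires when validation is enabled.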

Remedial routing is per-week, not permanent. T14 (week advance) at state_machine.py:772 unconditionally sets current_path = PATH_CORE on every advance, so a failed submission only routes the student to Remedial for the current week — the next week starts on Core again.

9.4 Glific F5 callback

flow_callback.update_flow_status (flow_callback.py:92-149) dispatches by flow name. SP_Feedback_Delivery routes to _handle_feedback_delivery (flow_callback.py:225-245):

  1. Verify PE is still in feedback_ready (else "already_advanced" response).
  2. Fire t13_feedback_delivered(pe, "flow_callback") → week_completed, schedules week_advancement.
  3. Reload PE and return _response(pe, "week_completed", current_week=pe.current_week).

9.5 Watchdog

pre_launch.feedback_ready_watchdog (pre_launch.py:129-195) runs hourly (hooks.py:94-96) to find PEs stuck in feedback_ready longer than stuck_hours=2. Log-only — does NOT auto-transition. Silent auto-advance would skip feedback delivery entirely (the reason feedback_ready exists). Operator action: retrigger F5 via Glific UI or call update_flow_status(action='feedback_complete') from bench shell.

10. Gamification — CR-002 v2

Nine PE fields capture the three streams (activity / quiz / submission) plus gems and streak.

10.1 Field roles

| Field | Cumulative? | Resets at T19? | Pushed to Glific? |
|---|---|---|---|
| total_activity_points | yes | no | yes (total_activity_points) |
| weekly_activity_points | no | yes (→ 0) | yes (weekly_activity_points) |
| total_quiz_points | yes | no | yes (total_quiz_points) |
| weekly_quiz_points | no | yes (→ 0) | yes (weekly_quiz_points) |
| bonus_quiz_points | yes (independent) | no | yes (bonus_quiz_points) |
| total_submission_points | yes | no | yes (total_submission_points) |
| weekly_submission_points | no | yes (→ 0) | yes (weekly_submission_points) |
| special_gems | yes | no | yes (special_gems) |
| current_streak | yes | conditional | yes (current_streak) |
| total_points | yes (roll-up) | no | yes (total_points) |
| weekly_video_done (sticky flag) | n/a | yes (→ 0) | NO (internal-only; constants.py:218) |
| weekly_submission_done (sticky flag) | n/a | yes (→ 0) | yes (weekly_submission_done) |

10.2 Activity points (activity_points.py)

Wired on StudentContentLog.after_insert. Filtered to content_type == "VideoClass" AND action == "completed". The handler:

  1. Guards on scl.points_awarded > 0 (L-012 idempotency).
  2. Resolves VideoClass.points; returns early on 0/null (E11: zero-point pedagogical videos do not count toward "assigned this week").
  3. Resolves active PE; returns if none.
  4. Pre-UPDATE Python read of pe.weekly_video_done to detect first-video-of-week → logs grace_window_entered event BEFORE the UPDATE.
  5. Single atomic UPDATE that bumps activity points, sets weekly_video_done=1, and arms the grace clock via CASE WHEN weekly_video_done = 0 (see §8.1).
  6. Writes scl.points_awarded = pts (audit anchor written AFTER PE update so retries are safe).
  7. Enqueues contact-field sync; logs activity_points_awarded.

10.3 Quiz points (quiz_points.py)

Wired on StudentQuizAttempt.on_update. Filtered to completed_at IS NOT NULL AND points_earned == 0 (idempotency).

Per-question rule (compute_quiz_points at quiz_points.py:180-196):

  • Correct answer → QuizQuestion.points
  • Wrong answer → QuizQuestion.failed_points
  • Independent of attempt-level pass/fail

Cumulative-vs-weekly split (quiz_points.award_quiz_points at quiz_points.py:65-125):

  • total_quiz_points, total_points: apply DELTA vs the previous-latest attempt for the same (student, quiz) pair (latest-score semantics).
  • weekly_quiz_points: ALWAYS adds the new attempt's full earned value (effort semantics).

bonus_quiz_points is independent — award_bonus_quiz_points only touches that field and the matching Glific contact field, never affects total_points / weekly_quiz_points.
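A worked sketch of the two rules, using plain dicts instead of the PE document. compute_quiz_points here takes hypothetical (is_correct, points, failed_points) tuples, and apply_attempt is an illustrative helper; it assumes a lower retake score produces a negative delta on the totals.

```python
from typing import Dict, Iterable, Tuple


def compute_quiz_points(answers: Iterable[Tuple[bool, int, int]]) -> int:
    """Sum points for correct answers and failed_points for wrong ones."""
    return sum(points if ok else failed for ok, points, failed in answers)


def apply_attempt(pe: Dict[str, int], previous_latest: int,
                  earned: int) -> Dict[str, int]:
    """Totals track the latest score; the weekly counter tracks effort."""
    delta = earned - previous_latest  # may be negative on a worse retake
    out = dict(pe)
    out["total_quiz_points"] = pe.get("total_quiz_points", 0) + delta
    out["total_points"] = pe.get("total_points", 0) + delta
    out["weekly_quiz_points"] = pe.get("weekly_quiz_points", 0) + earned
    return out


first = compute_quiz_points([(True, 2, 0), (False, 2, 1), (True, 3, 0)])
pe = apply_attempt({}, previous_latest=0, earned=first)
pe = apply_attempt(pe, previous_latest=first, earned=8)  # retake scores 8
```

After the retake the totals reflect only the latest score of 8, while weekly_quiz_points has accumulated both attempts.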

10.4 Submission streak and gems

T3/T7/T9/T17 each add to total_submission_points, weekly_submission_points, current_streak += 1, special_gems += 1, and set weekly_submission_done = 1 — all in the same atomic save (state_machine.py:398-414, 534-550, 585-601, 808-823).

10.5 T19 (week advance) streak/gem compute

t14_week_advance (state_machine.py:682-768) runs the compute in two phases:

  1. Phase 1: read the sticky weekly flags BEFORE reset.
    • if weekly_video_done = 1 AND weekly_submission_done = 0 (assigned but didn't submit): current_streak → 0, special_gems → max(0, special_gems - 1).
    • else: streak/gems unchanged.
  2. Phase 2: reset weekly_* counters + both sticky flags to 0; advance week; write the streak/gem values computed in Phase 1. Total counters are NEVER reset (E10).

Gem floor is enforced in Python (max(0, ...)). SQL GREATEST is not needed because the value is plain-int by the time we write it.
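The two phases can be sketched on a plain dict. advance_week is a hypothetical helper, not the transition itself; it only shows why the flags must be read before they are reset.

```python
from typing import Dict

WEEKLY_COUNTERS = ("weekly_activity_points", "weekly_quiz_points",
                   "weekly_submission_points")


def advance_week(pe: Dict[str, int]) -> Dict[str, int]:
    # Phase 1: decide streak and gems from the flags as they stand now.
    missed = pe["weekly_video_done"] == 1 and pe["weekly_submission_done"] == 0
    streak = 0 if missed else pe["current_streak"]
    gems = max(0, pe["special_gems"] - 1) if missed else pe["special_gems"]

    # Phase 2: reset weekly state, advance, and write the Phase 1 values.
    out = dict(pe)  # total_* counters are copied through untouched
    for column in WEEKLY_COUNTERS:
        out[column] = 0
    out["weekly_video_done"] = 0
    out["weekly_submission_done"] = 0
    out["current_week"] = pe["current_week"] + 1
    out["current_streak"] = streak
    out["special_gems"] = gems
    return out
```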

11. Glific integration

glific_integration.py houses the wrappers. The SP pushes 28 contact fields and reads/writes 5 kind-keyed groups.

11.1 Contact field catalogue (28 fields)

| Category | Fields |
|---|---|
| Identity (6) | student_id, student_name, batch_id, archetype, language, course_level |
| State (7) | resolved_flow_state, current_week, current_path, current_tier, program_status, experiment_arm, grace_window_end_at |
| Submission (3) | current_expected_submission_type, submission_count, weekly_submission_done |
| Escalation (3) | last_escalation_step, escalation_order, escalation_type |
| Gamification (9) | total_points, current_streak, total_activity_points, weekly_activity_points, total_quiz_points, weekly_quiz_points, bonus_quiz_points, total_submission_points, weekly_submission_points, special_gems |

weekly_video_done is internal-only and is intentionally NOT pushed (constants.py:218).

11.2 Push mechanics

state_machine._enqueue_contact_field_sync(pe) (state_machine.py:107-159) serializes the 28 fields into a dict and enqueues _sync_contact_fields_job with enqueue_after_commit=True. The job (state_machine.py:162-258) wraps update_contact_fields(glific_id, fields) with P-007 retry+DLQ:

  • GLIFIC_SYNC_MAX_RETRIES = 5 (constants.py:250).
  • Retries are IMMEDIATE (no backoff — known limitation, follow-up task).
  • DLQ log title: SP Glific Sync DLQ — manual replay required.
  • DLQ payload is JSON with student_id, pe_name, glific_id, fields, final_error, retries_attempted.
  • Double-fault handling: if the re-enqueue itself raises, the original error goes straight to DLQ.

update_contact_fields itself (glific_integration.py:133-251) uses a 3-step fetch-merge-update pattern: fetch existing fields → merge ours in → write back via updateContact. Without the fetch, Glific's blob-replace semantics would clobber fields set by other tools or flows.
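The fetch-merge-update shape, sketched with injected callables so it runs without Glific. update_fields, fetch, and write are hypothetical names; the real wrapper talks to the Glific GraphQL API.

```python
from typing import Callable, Dict

Fields = Dict[str, str]


def update_fields(fetch: Callable[[], Fields],
                  write: Callable[[Fields], None],
                  ours: Fields) -> Fields:
    """Fetch the current blob, overlay our keys, write the whole blob back."""
    existing = fetch()
    merged = {**existing, **ours}  # our values win; unrelated keys survive
    write(merged)
    return merged


# Usage with an in-memory stand-in for the contact's field blob.
store: Fields = {"teacher_note": "set by another flow", "current_week": "2"}
update_fields(lambda: dict(store),
              lambda blob: (store.clear(), store.update(blob)),
              {"current_week": "3", "total_points": "40"})
```

Writing only the ours dict would replace the blob and drop teacher_note, which is the clobbering the paragraph above describes.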

11.3 Group write helpers

  • add_contact_to_group(contact_id, group_id) → updateGroupContacts(addContactIds=[contact_id], deleteContactIds=[]). glific_integration.py:705-754.
  • remove_contact_from_group(contact_id, group_id) → updateGroupContacts(addContactIds=[], deleteContactIds=[contact_id]). CR-005 addition. glific_integration.py:614-676.
  • create_group_if_missing(label, description) — looks up by label, creates if absent, returns Glific group ID. Used by _ensure_kind_keyed_pg_collections and the backfill patch. glific_integration.py:679-702.
  • start_contact_flow(flow_id, contact_id, default_results) — single-contact flow trigger; fire-and-forget. glific_integration.py:392-432.
  • start_group_flow(flow_id, group_id) — group-level flow trigger; used by the weekly Tuesday cron. Lives in summer_program/glific_extensions.py.

12. Flow callbacks (whitelisted endpoints)

summer_program/flow_callback.py is the bridge from Glific to the backend state machine. All endpoints comply with the L-008 flat-map contract.

12.1 The L-008 response helper

_response(pe, status_value, **extras) at flow_callback.py:56-89:

  • Writes directly to frappe.local.response via update(...) — no message: wrapper.
  • Always emits: success, status, action (deprecated alias for one release cycle — task #73), resolved_flow_state, next_action_type, next_action_at, program_status.
  • All values are scalar strings (timestamps stringified, no nested objects, no arrays).
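A sketch of the flat-map shape. flat_response and _scalar are hypothetical, the helper returns a dict instead of writing to frappe.local.response, and how booleans and nulls are rendered as strings is an assumption.

```python
from datetime import datetime
from typing import Any, Dict


def _scalar(value: Any) -> str:
    """Render one value as a scalar string; nested values are rejected."""
    if value is None:
        return ""                                # assumed rendering
    if isinstance(value, bool):
        return "true" if value else "false"      # assumed rendering
    if isinstance(value, datetime):
        return value.isoformat(sep=" ")
    if isinstance(value, (dict, list, tuple, set)):
        raise TypeError("flat-map responses cannot carry nested values")
    return str(value)


def flat_response(pe: Dict[str, Any], status_value: str,
                  success: bool = True, **extras: Any) -> Dict[str, str]:
    body: Dict[str, Any] = {
        "success": success,
        "status": status_value,
        "action": status_value,  # deprecated alias of status
        "resolved_flow_state": pe.get("resolved_flow_state"),
        "next_action_type": pe.get("next_action_type"),
        "next_action_at": pe.get("next_action_at"),
        "program_status": pe.get("program_status"),
    }
    body.update(extras)
    return {key: _scalar(value) for key, value in body.items()}
```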

12.2 update_flow_status (flow_callback.py:92-149)

Whitelisted as allow_guest=False. Args: student_id, flow_name, status, metadata (optional). Dispatches by flow_name via _get_handler (flow_callback.py:152-173):

| Flow name | Status values | Handler | Resulting transition |
|---|---|---|---|
| SP_Content_Delivery | no_response / timeout | _handle_content_delivery | T1: schedule first escalation |
| SP_Content_Delivery | completed | _handle_content_delivery | no transition; "delivery_confirmed" |
| SP_Escalation | any | _handle_escalation | no transition; "escalation_confirmed" |
| SP_Feedback_Delivery | any | _handle_feedback_delivery | T13: feedback_ready → week_completed |
| SP_Submission | any | _handle_submission_flow | no transition; "submission_flow_completed" |
| SP_Paused_Binge | any | _handle_binge_info | no transition; "binge_info_delivered" |
| SP_Program_Complete | any | _handle_info_flow | no transition; "info_delivered" |

Retired handlers per CR-003: _handle_grace_flow (SP_Grace_Entry), _handle_reengagement (SP_Paused_Reengagement), SP_Grace_Reminder. SP_Week_Summary removed per CR-002 v2.

12.3 save_submission

See §9.1. Whitelisted as allow_guest=True (Glific's flow webhook calls it directly). Returns the L-008 flat-map response with status one of {accepted, duplicate, terminal_state, no_active_enrollment, not_found, missing_param, insert_failed}.

12.4 award_bonus_quiz_points

quiz_points.award_bonus_quiz_points(student_id, points) (quiz_points.py:128-173) — decorated with @glific_response (so the helper handles the flat-map write). Awards independent bonus points (does not affect total_points, total_quiz_points, or weekly_quiz_points).

13. Scheduler topology

Wired in hooks.py:42-98. All times are UTC.

| Schedule | Path | Purpose |
|---|---|---|
| daily | tap_lms.tap_lms.page.onboarding_flow_trigger.onboarding_flow_trigger.update_incomplete_stages | Legacy onboarding sweep (out of scope here) |
| daily | tap_lms.summer_program.scheduler.run_daily_actions | SP daily housekeeping: per-active-BPR collection actions, program-complete trigger when week > total_weeks |
| daily | tap_lms.summer_program.batch_activation.check_auto_activate | Auto-activate BPRs whose Batch.start_date <= today |
| */1 * * * * | tap_lms.summer_program.pe_dispatcher.process_program_actions | The per-PE dispatcher loop (1000-row batch × 4 workers ≈ 240K/hour) |
| 0 */2 * * * | tap_lms.summer_program.escalation_runner.run_escalation_check | 6-hour bulk escalation sweep (legacy batcher; to be replaced when collection-mode rollout lands) |
| 0 0 * * 1 | tap_lms.summer_program.batch_admin.auto_advance_batch_week | Weekly Monday: bumps Batch.current_calendar_week; unblocks max_allowed_week on PEs |
| 30 3 * * 2 | tap_lms.summer_program.scheduler.weekly_content_delivery_trigger | CR-005 weekly content delivery: Tuesday 09:00 IST against each active BPR's main group |
| 0 * * * * | tap_lms.summer_program.pre_launch.feedback_ready_watchdog | Hourly log-only watchdog for stuck feedback_ready PEs (task #56, 2026-05-16) |

Doc events (hooks.py:14-40):

  • StudentContentLog.after_insert → summer_program.activity_points.handle_content_log
  • StudentQuizAttempt.on_update → summer_program.quiz_points.handle_attempt_update
  • (Plus pre-existing onboarding hooks on School, Teacher, StudentStageProgress.)

14. Operational tooling

summer_program/pre_launch.py and summer_program/dev_tools.py.

14.1 audit_and_null_stale_next_action_at

pre_launch.audit_and_null_stale_next_action_at(dry_run=False, stale_days=7) (pre_launch.py:32-122). Pre-launch one-shot audit (also re-runnable). Identifies and nulls stale next_action_at values in two buckets:

  • Overdue: next_action_at < NOW() - stale_days AND not terminal.
  • Terminal: next_action_at IS NOT NULL AND resolved_flow_state IN ('program_completed', 'program_dropped').

Single atomic UPDATE covers both buckets. dry_run=True returns counts without writing. Returns {stale_overdue, stale_terminal, total_nulled, by_state}.

Invocation:

bench --site tap_lms.dev execute \
    tap_lms.summer_program.pre_launch.audit_and_null_stale_next_action_at
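The two buckets can be expressed as a per-row classifier. This is a sketch of the selection logic only; classify_stale is a hypothetical helper, and the real function does the work in one SQL UPDATE rather than row by row.

```python
from datetime import datetime, timedelta
from typing import Optional

TERMINAL_STATES = {"program_completed", "program_dropped"}


def classify_stale(state: str, next_action_at: Optional[datetime],
                   now: datetime, stale_days: int = 7) -> Optional[str]:
    """Return the bucket a PE falls into, or None if it is left alone."""
    if next_action_at is None:
        return None
    if state in TERMINAL_STATES:
        return "stale_terminal"   # any scheduled action on a terminal PE
    if next_action_at < now - timedelta(days=stale_days):
        return "stale_overdue"    # non-terminal and older than the cutoff
    return None
```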

14.2 feedback_ready_watchdog

See §9.5. Hourly cron, log-only, structured Error Log per stuck PE.

14.3 dev_tools — reset utilities

summer_program/dev_tools.py provides destructive reset helpers for dev/test sites:

  • list_pes_for_batch(batch_name) — read-only; prints PEs with key state fields.
  • reset_pe_to_state_0(pe_name, dry_run=False) — resets a single PE to T0 state. Destroys Submission / StudentQuizAttempt / StudentContentLog / ProgramEventLog rows.
  • reset_pes_for_batch(batch_name) — bulk reset across a batch.

Safety guard _assert_dev_site (dev_tools.py:59-80) refuses to run on sites whose name matches prod | live | production unless the caller passes i_know_this_is_destructive=True. Patches maintain_collections and _enqueue_contact_field_sync at module-level for stable test patching.
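A minimal sketch of such a guard. The matching rule is an assumption (case-insensitive substring search on the site name); the real _assert_dev_site may match differently, and the exception type is illustrative.

```python
import re

# Assumption: any occurrence of these words in the site name counts.
_PROD_PATTERN = re.compile(r"prod|live|production", re.IGNORECASE)


def assert_dev_site(site_name: str,
                    i_know_this_is_destructive: bool = False) -> None:
    """Refuse destructive helpers on production-looking sites."""
    if _PROD_PATTERN.search(site_name) and not i_know_this_is_destructive:
        raise RuntimeError(
            f"refusing destructive reset on site {site_name!r}; "
            "pass i_know_this_is_destructive=True to override"
        )
```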

15. Resilience patterns

Named patterns referenced throughout the codebase.

P-001 — idempotency primitive (atomic UPDATE-WHERE-RETURNING)

Atomic UPDATE guarding on journey_label (or another precondition column) + RETURNING name. 0-row return = parallel actor moved state; skip gracefully. Lesson L-010.

Canonical example — the dispatcher claim (pe_dispatcher.py:139-152):

UPDATE "tabProgramEnrollment"
SET next_action_at = NULL
WHERE name = %s AND journey_label = %s AND next_action_at IS NOT NULL
RETURNING name

Other instances: _try_claim_primary (save_submission.py:298-307), _record_delivery_failure (pe_dispatcher.py:724-733).
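The claim semantics can be tried locally with sqlite3 from the standard library. This is a sketch, not the dispatcher code: the pe table is a toy, and cursor.rowcount stands in for the RETURNING name check used on Postgres.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, name: str, expected_label: str) -> bool:
    """Claim the row only if it is still in the expected state."""
    cur = conn.execute(
        "UPDATE pe SET next_action_at = NULL "
        "WHERE name = ? AND journey_label = ? AND next_action_at IS NOT NULL",
        (name, expected_label),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows: another actor got there first


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pe (name TEXT PRIMARY KEY, journey_label TEXT, "
    "next_action_at TEXT)"
)
conn.execute(
    "INSERT INTO pe VALUES ('PE-1', 'content_delivered', '2026-05-16 09:00:00')"
)
first = try_claim(conn, "PE-1", "content_delivered")
second = try_claim(conn, "PE-1", "content_delivered")
```

The second claim matches no row because the first one already nulled next_action_at, so the caller skips without raising.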

P-002 — atomic counter (COALESCE-update)

SET col = COALESCE(col, 0) + %s so a counter survives races with a peer write or a weekly reset. Lesson L-011.

Instances: activity_points._sql_update_pe_columns (activity_points.py:157-176), quiz_points.award_quiz_points (quiz_points.py:97-106), _bump_member_count (collection_membership.py:204-220), _record_delivery_failure (pe_dispatcher.py:724-733), total_enrolled bump (enrollment.py:213-220).
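A small sqlite3 sketch of why the COALESCE matters when a counter column is still NULL. The pe table and bump helper are hypothetical; the column allow-list exists because the column name is interpolated into the statement.

```python
import sqlite3

COUNTER_COLUMNS = {"total_activity_points", "weekly_activity_points"}


def bump(conn: sqlite3.Connection, name: str, column: str, amount: int) -> None:
    """Increment a counter in SQL so NULL is treated as 0."""
    if column not in COUNTER_COLUMNS:
        raise ValueError(f"not a counter column: {column}")
    conn.execute(
        f"UPDATE pe SET {column} = COALESCE({column}, 0) + ? WHERE name = ?",
        (amount, name),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pe (name TEXT PRIMARY KEY, total_activity_points INTEGER, "
    "weekly_activity_points INTEGER)"
)
conn.execute("INSERT INTO pe (name) VALUES ('PE-1')")  # counters start NULL
bump(conn, "PE-1", "weekly_activity_points", 5)
bump(conn, "PE-1", "weekly_activity_points", 3)
weekly = conn.execute(
    "SELECT weekly_activity_points FROM pe WHERE name = 'PE-1'"
).fetchone()[0]
```

Without the COALESCE, NULL + 5 evaluates to NULL in SQL and the first award would be lost.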

P-003 — per-attempt audit lock

Write the audit field first (e.g. StudentQuizAttempt.points_earned), then do rollups. The audit field is the L-012 idempotency anchor for retries.

Instances: quiz_points.award_quiz_points writes points_earned at line 75-78 before the PE bump; activity_points.award_activity_points writes scl.points_awarded at line 179-182 AFTER the PE update (audit-after-PE so retries skip correctly). The pattern shape differs by whether the audit write itself is the side-effect anchor.

P-005 — write-once anchor

A column whose presence (points_earned > 0, points_awarded > 0) is checked at entry to short-circuit re-runs. Same field that L-012 protects.

P-006 — backward-compat parameter alias

Whitelisted parameter renames keep the old name as an alias for one release cycle and log a deprecation. Lesson L-009.

Instance: save_submission(content_id=) (deprecated) → assignment_id (save_submission.py:68-77). Aliased and logged via frappe.log_error titled SP API Deprecation.

P-007 — retry + DLQ

Every external write goes through this. On exception: increment retry_count, re-enqueue self via frappe.enqueue up to MAX_RETRIES; on exhaustion, log a structured DLQ entry to Frappe Error Log so operators can replay manually. Lesson L-015.

| Subsystem | Max retries | Constant | DLQ log title |
|---|---|---|---|
| Glific contact-field sync | 5 | GLIFIC_SYNC_MAX_RETRIES | SP Glific Sync DLQ — manual replay required |
| Glific group writes (CR-005) | 5 | (shares GLIFIC_SYNC_MAX_RETRIES) | SP Glific Sync DLQ — manual replay required |
| RabbitMQ feedback publish | 5 | FEEDBACK_PIPELINE_MAX_RETRIES | SP Feedback Pipeline DLQ — manual replay required |
| Vocallabs parent call | 5 | VOCALLABS_MAX_RETRIES | SP Vocallabs DLQ — manual replay required |

All currently use IMMEDIATE retries (no backoff). Exponential backoff is a deferred follow-up; the current scheme covers short Glific 502/503s and Redis hiccups, while a sustained outage (>~30s) drops to DLQ.
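The retry loop can be sketched with injected callables in place of frappe.enqueue and the Error Log. run_with_retry, enqueue, and dlq are hypothetical names; the double-fault branch follows the rule that a failed re-enqueue sends the original error to the DLQ.

```python
import json
from typing import Any, Callable, Dict

MAX_RETRIES = 5


def run_with_retry(write: Callable[[Dict[str, Any]], None],
                   payload: Dict[str, Any], retry_count: int,
                   enqueue: Callable[[Dict[str, Any], int], None],
                   dlq: Callable[[str], None],
                   max_retries: int = MAX_RETRIES) -> str:
    """Run one attempt; re-enqueue on failure, DLQ once the budget is spent."""
    try:
        write(payload)
        return "ok"
    except Exception as exc:
        if retry_count < max_retries:
            try:
                enqueue(payload, retry_count + 1)  # immediate, no backoff
                return "retrying"
            except Exception:
                pass  # double fault: fall through with the original error
        dlq(json.dumps({
            "payload": payload,
            "final_error": str(exc),
            "retries_attempted": retry_count,
        }))
        return "dlq"
```

With a write that always fails, this makes one initial attempt plus five retries and then records a single DLQ entry.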

L-003 — FOR UPDATE SKIP LOCKED

PG-specific scheduler primitive. Lets N parallel workers each take a different slice of the candidate set without contention. Instance: process_program_actions SELECT (pe_dispatcher.py:104-122).

CR-005 — partial unique index (defense-in-depth)

A partial unique index on tabPGCollection (parent, kind) WHERE is_active = 1 enforces "at most one active row per (BPR, kind)" at the database level. Defense-in-depth against double-bootstrap; the bootstrap code itself is also idempotent (_ensure_kind_keyed_pg_collections at batch_activation.py:382-439).
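The constraint's behavior can be demonstrated with sqlite3, which also supports partial unique indexes. The table and index names below are toy stand-ins, not the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pg_collection (parent TEXT, kind TEXT, is_active INTEGER)"
)
# Uniqueness applies only to rows where is_active = 1.
conn.execute(
    "CREATE UNIQUE INDEX one_active_per_kind "
    "ON pg_collection (parent, kind) WHERE is_active = 1"
)


def add_row(parent: str, kind: str, is_active: int) -> bool:
    """Insert a row; return False if the partial index rejects it."""
    try:
        conn.execute(
            "INSERT INTO pg_collection VALUES (?, ?, ?)",
            (parent, kind, is_active),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first_active = add_row("BPR-1", "main", 1)
second_active = add_row("BPR-1", "main", 1)   # double bootstrap attempt
retired = add_row("BPR-1", "main", 0)         # inactive rows are unconstrained
```

The second active insert is rejected at the database level, while any number of inactive rows for the same (parent, kind) can coexist.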

16. Future work / explicitly deferred

These are NOT shipped today. Tracked in the task system; ADR-level commitments live in docs/adrs/.

  • M-AUDIT-8 — Glific delivery-confirmation webhook. Wires _trigger_flow failures into _record_delivery_failure → T23. Currently _trigger_flow is fire-and-forget; the chain is plumbed but inactive. Phase 1.
  • Exponential backoff scheduler. Replace the immediate-retry P-007 implementation with a delay-scheduled backoff. Covers sustained Glific or RabbitMQ outages. Follow-up to the Phase 1 webhook work.
  • TimescaleDB hypertable for ProgramEventLog. ADR-001. Not justified by current load.
  • Redis hot-state cache for PE. ADR-002. Not justified by current load.
  • Collection-mode batchers for escalation / feedback delivery. ADR-003 successors. The legacy escalation_runner 6-hourly sweep is the current implementation.
  • PE.last_dispatched_at column. Audit-trail field referenced in the dispatcher loop comments (pe_dispatcher.py:135-138) but not yet added; filed for DocType-UI follow-up.
  • Drop deprecated action alias from _response flat-map. After all Glific flows are audited to read @results.webhook.status instead of @results.webhook.action (task #73).
  • Cleanup CR for T6 stub. The t6_escalation_to_remedial function in state_machine.py:464-481 raises if called; will be deleted entirely once no caller imports it.
  • Cleanup CR for legacy archetype-keyed PGCollection rows. Currently is_active=0 post-CR-005 backfill; eventually delete.

17. Known gaps in this document

  • Course content authoring (CourseLevel, VideoClass, Quiz, QuizQuestion configuration) is not covered. Lives outside summer_program/ and is a separate concern.
  • School / teacher onboarding (onboarding_flow_trigger, Backend Student Onboarding) is referenced only where it touches the SP enrollment ingress. The full onboarding pipeline is owned by a different team and lives in the tap_lms/page/onboarding_flow_trigger/ namespace.
  • The legacy escalation_runner 6-hourly sweep is registered but not described — it predates the per-PE dispatcher and is in the process of being replaced.
  • SP_Incoming_Router — the inbound-only re-engagement entry point referenced in CR-003. This is a Glific flow (not backend code); the only backend touchpoint is the standard update_flow_status callback when the rejoin path completes.
  • reactivate_student — mentioned in state_machine.py:12 as a state-machine entry point but the implementation lives outside the files surveyed here.
  • escalation_runner.run_escalation_check internals — referenced in hooks.py:77-78 but its code was not opened for this doc.
  • Week-1 content for late activations — content delivery is now batch-driven via weekly_content_delivery_trigger; week-1 students added to the main collection mid-week will not get content until the following Tuesday. The admin's mitigation is to time activation Mon/Tue (batch_activation.py:223-227).