TAP LMS Architecture

Live single-source-of-truth for the Summer Program (SP) pipeline as it stands on 2026-05-16, post CR-001, CR-002 v2, CR-003, CR-004, CR-005, and CR-006. Cite this doc before designing new work; supersede it via CRs and ADRs rather than editing in place. References throughout use path/to/file.py:LINE form.

1. System overview

TAP LMS is a Frappe v15 custom app (tap_lms) running on Postgres. The Summer Program is a multi-week, archetype-driven, Glific-integrated student learning journey. A cohort lives on a Batch, a per-cohort run lives on a BatchProgramRun (BPR), and a per-student row lives on a ProgramEnrollment (PE). Each PE carries a resolved_flow_state (the state-machine label) and a next_action_at (when the scheduler should next look at it). State transitions are numbered T0–T25 in summer_program/state_machine.py. Every state change calls Glific to maintain group membership and to push contact fields back so the WhatsApp flows can branch on backend state.

Integrations the SP depends on:

Glific — WhatsApp messaging. Two surfaces: contact fields (push state to Glific via update_contact_fields) and groups (CR-005 kind-keyed collections: main, escalation, binge_paused, program_dropped, program_completed). Glific flows call back into the backend via the whitelisted update_flow_status endpoint and the save_submission endpoint.
RabbitMQ — feedback pipeline broker. Submissions are published to RabbitMQ by save_submission.enqueue_submission; the FeedbackConsumer external service drains the queue, generates AI feedback, writes back to the Submission row, then calls back into the backend via feedback_consumer_hook.on_feedback_ready.
Vocallabs — parent-call automation. CR-003 added a parent_call escalation channel; summer_program/vocallabs.py is the integration module with its own retry+DLQ budget.

2. Tech stack and invariants

Layer	Choice	Notes
Web framework	Frappe v15	Custom app `tap_lms`, module path `tap_lms.summer_program.*`
Database	Postgres	No MariaDB compat layer — `MariaDB fallback` branches are dead code (L-007)
Background jobs	RQ workers via `frappe.enqueue`	Queues used: `short`, `default`, `long`
Broker	RabbitMQ (`pika`)	Publisher-confirms + persistent delivery (`delivery_mode=2`) — `save_submission.py:725-737`
Messaging	Glific GraphQL API	Auth token cached in `Glific Settings`; mutations: `updateContact`, `updateGroupContacts`, `startContactFlow`, `createGroup`
Voice	Vocallabs HTTP API	Token cached in Redis under `VOCALLABS_TOKEN_CACHE_KEY`; default TTL 3600s

Cross-cutting invariants (project-wide rules — see .claude/memory/lessons.md):

L-005 (IN/array trap) — Frappe's PG param wrapper mangles WHERE col IN %s when binding a 2-tuple with an inner sequence. Use flat IN (%s, %s, ...) with scalar params, or = ANY(%s::text[]) with explicit cast. Repeats across pe_dispatcher.py:99-122, pe_dispatcher.py:172-187, save_submission.py:292-307, _record_delivery_failure at pe_dispatcher.py:721-733.
L-008 (Glific flat-map contract) — Glific flows read @results.webhook.<field> directly. Whitelisted endpoints write the response via frappe.local.response.update(...) (no message: wrapper, scalar values only). Contact-field names are public contract — never rename without auditing every flow. See flow_callback.py:56-89 and docs/api-standard-glific.md.
L-010 (journey_label idempotency) — every state-changing UPDATE includes WHERE journey_label = <expected> plus RETURNING name. A 0-row return means a parallel handler already moved the state; skip gracefully. Anchor pattern in the dispatcher's atomic claim — pe_dispatcher.py:139-152.
L-011 (atomic counter) — counters that race with the T19 weekly reset use SET col = COALESCE(col, 0) + %s. Read-then-write is unsafe. See activity_points.py:157-176, quiz_points.py:97-106, _record_delivery_failure at pe_dispatcher.py:724-733.
L-012 (per-attempt audit lock) — write the audit field (e.g. StudentQuizAttempt.points_earned) FIRST, then do rollups. The audit field is the idempotency anchor — quiz_points.py:75-78.
L-015 (retry+DLQ) — every external write (Glific, RabbitMQ, Vocallabs, group-write) carries P-007 retry+DLQ. State has already committed in PG; a silent failure leaves Glific stale or submissions lost.
L-017 (no commit in handlers) — controllers and flow handlers must not call frappe.db.commit(). Frappe commits at request-end; mid-handler commits poison transaction isolation and break test rollback.
L-024 (two-folder deploy) — app/ is the working folder; ~/frappe-bench/apps/tap_lms/ is the bench checkout. Edits flow working → bench → git, then git pull back into the working folder. Don't re-copy after the bench has pulled.

3. Key DocTypes

3.1 Student

The contact-level identity row. Lives in tap_lms/doctype/student/.

Student.name1 — canonical display name (NOT student_name, which does not exist). Resolved via summer_program/utils.py:get_student_display_name.
Student.name — the document name and the student ID (e.g. ST00051383); there is no separate student_id column.
Student.glific_id — Glific contact ID; null for students never synced.
Student.archetype — one of dormant | fence_sitter | irregular_submitter | submitter (lowercase snake_case per constants.py:18-28). Upstream-supplied data: set by CSV import, BigQuery sync, or manual entry. The SP backend never auto-assigns, randomizes, or recomputes this field. Task #13 (splitter) was explicitly cancelled — ADR-004 audit log.
Student.experiment_arm — one of default | arm_a | arm_b. Also upstream-supplied.
Student.phone — used by Vocallabs as the parent's phone (the student uses the parent's device — CR-003 §"Phone resolution").

3.2 Batch

Per-cohort container. tap_lms/doctype/batch/batch.json:1-100.

Field	Type	Purpose
`batch_id`	Data (unique)	Public batch identifier used in collection labels
`start_date`	Date	Cohort start; drives `check_auto_activate` and week calendar
`end_date`	Date	Cohort end
`program_type`	Select(`Summer\\|Regular`)	Owned by Batch (NOT BPR — clarification 2026)
`total_weeks`	Int	Length of the program (typically 8)
`grace_window_days`	Int	Per-cohort grace duration; per-week clock duration (CR-003)
`current_calendar_week`	Int	The "weekly Monday" cursor advanced by `batch_admin.auto_advance_batch_week`

3.3 BatchProgramRun (BPR)

The per-run wrapper around a Batch, holding pipeline status, validation report, Glific flow IDs, and the embedded pg_collections child table. tap_lms/doctype/batchprogramrun/batchprogramrun.json:1-100.

Key fields:

status — Select: draft | importing | enrolling | collections_ready | active | completed (constants.py:172-182).
validation_status — Select: not_run | passed | failed. Set by validate_bpr at batch_activation.py:39-151.
total_imported, total_enrolled — counters incremented atomically by enrollment.py:213-220.
content_delivery_flow, escalation_flow, reengagement_flow, grace_notification_flow, program_complete_flow, binge_info_flow — Glific flow UUIDs. Resolved per action type via constants.py:153-163 ACTION_FLOW_FIELD_MAP. (reengagement_flow and grace_notification_flow reminders are retired per CR-003; the columns remain for legacy callers.)
pg_collections — child table of PGCollection rows.
pg_onboarding_sets — child table linking to Backend Student Onboarding documents.

3.4 PGCollection (CR-005 kind-keyed)

Child table embedded under BatchProgramRun (istable=1). tap_lms/doctype/pgcollection/pgcollection.json:1-83.

Kind values (collection_membership.py:42-48 COLLECTION_KINDS):

Kind	Glific group label	Purpose
`main`	`SP_<batch>_main`	Source-of-truth for the weekly content delivery sweep
`escalation`	`SP_<batch>_escalation`	All PEs in `normal_escalation` or `remedial_escalation`
`binge_paused`	`SP_<batch>_binge_paused`	PEs in `paused_binge`
`program_dropped`	`SP_<batch>_program_dropped`	Terminal — PEs dropped via T17 grace expiry or T23
`program_completed`	`SP_<batch>_program_completed`	Terminal — PEs completed via T16

collection_label — Glific group label.
glific_group_id — Glific group ID (returned by create_group_if_missing).
is_active — Check. CR-005 backfill marks new kind-keyed rows as is_active=1; the legacy archetype-keyed rows (archetype/experiment_arm columns populated, kind null) get marked is_active=0 by the migration patch.
member_count — denormalized integer counter, maintained by _bump_member_count at collection_membership.py:191-221. Glific itself is the source of truth; this column lets the weekly cron cheaply skip empty BPRs.
archetype, experiment_arm, student_count — legacy fields retained for historical rows; not used by new code.

3.5 ProgramEnrollment (PE)

The per-student-per-batch row that holds every piece of state the dispatcher and state machine read or write. tap_lms/doctype/programenrollment/programenrollment.json.

Identity:

enrollment (Data, unique) — public enrollment ID.
student, batch, glific_id, course_level, language, experiment_arm, archetype, program_type.

State:

resolved_flow_state — Select with 11 values (see §5).
journey_label — Select with 12 values (constants.py:82-99). The atomic-claim guard column.
last_label_change_at — Datetime.
program_status — Select: active | paused | completed | dropped.
drop_reason — Data; valid values include grace_expired, delivery_failure, admin, manual.
current_expected_submission_type — Select: emoji | word_text_voice | summary_text_voice | image | video | photo_video_artefact.
current_path — Select: Core | Remedial.
current_tier — Select: Basic | Intermediate | Advanced. Resolved from TIER_BY_WEEK (constants.py:265-268).

Week / scheduling:

current_week, max_allowed_week — Int. Week advance respects max_allowed_week to gate binge progression.
next_action_at, next_action_type — the per-PE clock the dispatcher reads. Action types: content_delivery, escalation, week_advancement, feedback_notification, grace_check, pause_check, feedback_timeout (constants.py:121-127).
current_escalation_step — Int. Set by T2/T4/T8/T10.
current_escalation_type — Data. CR-003: written alongside the step counter, pushed to Glific as the contact field escalation_type.

Grace clock (per-week, CR-003):

in_grace_window (Check), grace_window_start (Datetime), grace_window_end_at (Datetime). Armed by activity_points.handle_content_log on the week's first VideoClass via atomic Postgres CASE WHEN. Cleared by T3/T7/T9/T17 (primary submission) or by T17/T23 drop.

Submission + delivery:

submission_count — Int. Owned exclusively by save_submission._try_claim_primary (save_submission.py:298-307); state-machine transitions no longer bump it (H-AUDIT-4).
quiz_completed — Int.
last_submission_at — Datetime.
delivery_failure_count — Int. Bumped by _record_delivery_failure; T23 fires at MAX_DELIVERY_FAILURES = 3.
last_flow_triggered, last_flow_triggered_at — Data / Datetime. Tracking.

Gamification (CR-002 v2, 9 fields):

total_activity_points, weekly_activity_points — VideoClass points.
total_quiz_points, weekly_quiz_points, bonus_quiz_points — quiz points.
total_submission_points, weekly_submission_points — submission points.
special_gems — Int. Floored at 0.
current_streak — Int.
total_points — cumulative roll-up across all three streams.
Sticky weekly flags: weekly_video_done (Check, internal-only — NOT pushed to Glific per constants.py:218), weekly_submission_done (Check, pushed as weekly_submission_done contact field).

ProgramEnrollment is the canonical read-side source for all Summer Program helpers (2026-05-19 incident fix). The legacy Student.enrollment child table is written by backend onboarding for non-SP flows; it is NOT updated by start_program_enrollment for new SP batches. Reading from the child table miss-resolves students enrolled into the current cohort but whose child rows still point at a previous batch. The following helpers now query ProgramEnrollment directly with program_status IN ('active', 'paused') and ORDER BY creation DESC:

_get_active_bpr_for_student (student_progression_sp.py:1802-1845) — finds the active BPR + Batch for a student. Used by get_next_content / get_weekly_content to fail fast with no_active_batch when no enrollment exists.
_get_course_level_for_student (student_progression_sp.py:1848-1867) — reads course_level from the matching PE. No legacy fallback (both callers are SP-only paths already gated by _get_active_bpr_for_student).
get_student_sp_status (api.py:191-249) — status endpoint; iterates PEs by creation desc and returns the first BPR match.

Regression coverage: tests/test_active_bpr_resolution.py pins the contract (empty child table + active PE → resolves; non-Summer batch skipped; dropped PE ignored).

3.6 ArchetypeConfig

Per-(batch, archetype, experiment_arm, week) row. Replaced the always-16 invariant after the 2026-05-13 ADR-004 supersession (per-tuple validation rule). Read by pe_dispatcher._get_week_rule (pe_dispatcher.py:658-674) and by the validator _validate_archetype_config_before_activation called from activate_bpr (batch_activation.py:192-195).

Fields used by the runtime:

expected_submission_type — drives PE.current_expected_submission_type after week advance.
core_learning_unit, remedial_learning_unit — content references for the week.

3.7 Submission

The output of save_submission. Lives outside the SP module but is owned conceptually by it. Fields the SP reads: student_id, program_enrollment, week, is_primary, status (Pending/Processing/Completed/Failed — overall lifecycle), result_status (Select: Pending / Success - Original / Success - Flagged / Failed — drives the points decision in _compute_submission_points; writers: feedback_processor._determine_result_status writes Success - Original when AI says clean and Success - Flagged when plagiarized or AI-generated; save_submission._apply_duplicate_submission_feedback ALSO writes Success - Flagged synchronously for duplicate submissions; the Failed Select option exists but no production code writes it today — kept as a defensive value in the points gate), submission_validity (Valid/Invalid — drives the routing decision in on_feedback_ready; "Invalid" plus WeekRule.submission_validation_enabled=1 routes to Remedial via T6b), overall_feedback, overall_feedback_translated, audio_feedback_url, assign_id, submission_type, submission_text, submission_url, escalation_step_at_submit, created_at.

3.8 StudentQuizAttempt, StudentContentLog, ProgramEventLog

StudentQuizAttempt — wired via doc_events["StudentQuizAttempt"]["on_update"] to quiz_points.handle_attempt_update. The points_earned column is the L-012 audit anchor.
StudentContentLog — wired via doc_events["StudentContentLog"]["after_insert"] to activity_points.handle_content_log. Filtered to content_type == "VideoClass" AND action == "completed". The points_awarded column is the audit anchor.
ProgramEventLog — append-only audit table. Written by summer_program/event_log.py:log_event and log_state_transition. Event types include state_transition, submission_received, flow_completed, path_changed, week_advanced, program_dropped, escalation_sent, delivery_failed, grace_window_entered, activity_points_awarded, quiz_points_awarded, bonus_quiz_points_awarded, feedback_delivered.

4. Lifecycle — enrollment to BPR activation

The pipeline that takes a draft BPR to active. Each step is gated by status (constants.py:172-182 BPR_STATUS_FLOW).

4.1 Bulk enrollment (steps 3a and 3b)

summer_program/enrollment.py:40-154:

start_enrollment(bpr_name) — gathers students from the BPR's onboarding sets via _get_students_for_bpr. Marks BPR status = enrolling. Chunks students at ENROLLMENT_CHUNK_SIZE=100 and enqueues _process_enrollment_chunk on the long queue.
_process_enrollment_chunk — for each student, builds the 6-field bootstrap (archetype, experiment_arm, program_type=Summer, batch_id, course_level, student_name) and enqueues _sync_contact_fields_job with pe_name=f"pre-pe:{sid}" (synthetic so the DLQ log entry is correlatable when there is not yet a PE row). Atomically bumps BatchProgramRun.total_enrolled via raw SQL (enrollment.py:213-220).
setup_collections(bpr_name) — for the 4×2 = 8 archetype/arm combinations, creates the legacy Glific groups via create_or_get_collection and bulk-adds students via add_contacts_to_group_bulk (COLLECTION_BATCH_SIZE=500 per call). Records each in bpr.pg_collections. Marks BPR status = collections_ready. Note: these legacy archetype-keyed PGCollection rows still exist post-CR-005 but are is_active=0 after the migration patch and not used by new code.

4.2 Program enrollment (step 3c — separate)

summer_program/program_enrollment_api.py:40-...:

start_program_enrollment(bpr_name) — bulk-creates ProgramEnrollment records. Idempotent — filters out students who already have a PE for the batch. Background chunks via frappe.enqueue.
create_program_enrollment(student_id, batch_name) — single-PE entry point (API A6). Sets the 14 baseline Glific contact fields and writes journey_label='enrolled', program_status='active', current_path='Core', current_week=1, resolved_flow_state='normal_content_delivery'.

4.3 Validation (step 4)

batch_activation.validate_bpr (batch_activation.py:38-151) checks:

BPR is in collections_ready.
8 archetype×arm legacy collections exist on the BPR.
At least one Glific flow is configured.
total_enrolled > 0 and at least one PE row exists.
Batch has start_date and total_weeks.

Result is persisted as bpr.validation_status plus JSON bpr.validation_report.

4.4 Activation (step 5)

batch_activation.activate_bpr (batch_activation.py:154-228) is gated on validation_status == passed:

Hard-fail on per-tuple ArchetypeConfig completeness via _validate_archetype_config_before_activation (replaced the old always-16 rule — ADR-004 audit log 2026-05-13).
Flip BPR status = active, set activated_at = now().
CR-005 _ensure_kind_keyed_pg_collections (batch_activation.py:382-439): idempotently create the 5 kind-keyed PGCollection rows + the matching Glific groups via create_group_if_missing. Each row carries kind, collection_label = f"SP_{batch}_{kind}", glific_group_id, member_count=0, is_active=1.
No per-PE seeding of next_action_at — CR-005 locked decision #4: content delivery is now batch-triggered every Tuesday at 09:00 IST (03:30 UTC) via weekly_content_delivery_trigger against the main group. The legacy _seed_pe_actions function is kept as dead code (operator escape hatch).

4.5 Auto-activation cron

batch_activation.check_auto_activate (batch_activation.py:334-377) is wired to the daily scheduler. It activates BPRs in collections_ready with validation_status=passed whose Batch.start_date <= today.

5. The state machine

summer_program/state_machine.py defines 11 valid resolved_flow_state values (constants.py:50-79) and the named transitions T0–T25. Every transition flows through the transition(...) engine at state_machine.py:64-104, which:

Updates pe.resolved_flow_state and pe.last_label_change_at.
Applies any extra_updates dict.
Calls pe.save(ignore_permissions=True).
Logs to ProgramEventLog via log_state_transition.
Enqueues _sync_contact_fields_job (P-007 retry+DLQ).
CR-005: calls collection_membership.maintain_collections(pe, from_state, to_state) to keep Glific group membership current.

5.1 The 11 states

State	Path	In active set?	Audit kind
`normal_content_delivery`	Core	yes (main)	none
`normal_escalation`	Core	no	escalation
`remedial_content_delivery`	Remedial	yes (main)	none
`remedial_escalation`	Remedial	no	escalation
`grace_waiting`	either	no	none (Gap-fill A, CR-005)
`paused_binge`	Core	no	binge_paused
`submitted_awaiting_feedback`	either	yes (main)	none
`feedback_ready`	either	yes (main)	none
`week_completed`	either	yes (main)	none
`program_completed`	terminal	no	program_completed
`program_dropped`	terminal	no	program_dropped

MAIN_ELIGIBLE_STATES is the set of 5 yes-yes rows (collection_membership.py:22-28). STATE_TO_AUDIT_KIND maps the right column (collection_membership.py:31-39). Note STATE_PAUSED_NO_ACTIVITY is RETIRED per CR-003 — the Select option remains for legacy rows but no new transition writes it.

5.2 Transition catalogue (T0–T25)

ID	From → To	Trigger source	Notes
T0	(new PE) → normal_content_delivery	scheduler	T0 NO LONGER arms the grace clock — armed by activity-points handler on first VideoClass. `state_machine.py:322-339`
T1	normal_content_delivery → normal_content_delivery	flow_callback	Schedules next escalation. `state_machine.py:343-358`
T2	normal_content_delivery → normal_escalation	scheduler	Writes `current_escalation_step` + `current_escalation_type`. `state_machine.py:362-374`
T3	normal_escalation → submitted_awaiting_feedback	flow_callback	Bumps submission points + streak + gems + clears grace clock. `state_machine.py:378-414`
T4	normal_escalation → normal_escalation	scheduler	Next escalation step. `state_machine.py:418-430`
T5	normal_escalation → grace_waiting	scheduler	Steps exhausted. Defensive grace-clock backfill for legacy PEs. `state_machine.py:434-461`
T6	REMOVED (CR-006)	n/a	`state_machine.py:464-481` raises if called. Remedial is now reserved for failed-feedback only.
T6b	submitted_awaiting_feedback → remedial_content_delivery	microservice	CR-004 path. AI feedback set `submission_validity='Invalid'` AND `WeekRule.submission_validation_enabled=1`. (Originally keyed on `result_status='failed'` per CR-004; switched to `submission_validity` post-ship — see §9.3.) `state_machine.py:485-512`
T7	normal_content_delivery → submitted_awaiting_feedback	flow_callback	Same shape as T3. `state_machine.py:516-550`
T8	remedial_content_delivery → remedial_escalation	scheduler	Mirrors T2. `state_machine.py:554-564`
T9	remedial_content_delivery → submitted_awaiting_feedback	flow_callback	Mirrors T3. `state_machine.py:568-601`
T10	remedial_escalation → remedial_escalation	scheduler	Mirrors T4. `state_machine.py:605-617`
T11	remedial_escalation → grace_waiting	scheduler	Mirrors T5. `state_machine.py:621-638`
T12	submitted_awaiting_feedback → feedback_ready	microservice	Called from `on_feedback_ready`. Clears `next_action_at`/`next_action_type` — the Glific F5 callback drives the next transition. `state_machine.py:642-656`
T13	feedback_ready → week_completed	flow_callback	Glific `SP_Feedback_Delivery` completed. Schedules `week_advancement`. `state_machine.py:660-666`
T14 (a.k.a. T19)	week_completed → normal_content_delivery	scheduler	The week-advance. CR-002 v2 streak/gem compute + weekly resets. Function is named `t14_week_advance` for historical reasons. `state_machine.py:682-768`
T15	week_completed → paused_binge	scheduler	Binge limit (calendar not yet caught up). `state_machine.py:772-780`
T16	week_completed → program_completed	scheduler	`current_week > total_weeks`. `state_machine.py:784-791`
T17 (sub.)	grace_waiting → submitted_awaiting_feedback	flow_callback	Submission during grace. Clears grace fields. `state_machine.py:795-823`
T17 (exp.)	grace_waiting → program_dropped	scheduler	CR-003 direct drop. `drop_reason='grace_expired'`. `state_machine.py:837-859`. Alias `t18_grace_expired` retained for one cycle (`state_machine.py:867`).
T18	DELETED (CR-003)	n/a	Pre-CR-003 transitioned to `paused_no_activity` + re-engagement loop.
T19 (reactivate)	DELETED (CR-003)	n/a	Re-engagement is now inbound-only via `SP_Incoming_Router`.
T20 (reactivate-remedial)	DELETED (CR-003)	n/a	Same reason.
T21	paused_binge → normal_content_delivery	scheduler	Calendar caught up. `state_machine.py:886-894`
T22	submitted_awaiting_feedback (no change)	flow_callback	Duplicate submission. Log-only. `state_machine.py:898-902`
T23	ANY → program_dropped	dispatcher	System-initiated. `reason in {delivery_failure, admin, manual}`. Idempotent on already-terminal PEs. `state_machine.py:906-937`
T24	ANY → program_dropped	admin	Admin drop. `state_machine.py:941-948`
T25	(no change)	scheduler	Delivery-failure log only. `state_machine.py:952-959`

5.3 Idempotency: the journey_label primitive (L-010)

Every state-changing UPDATE in this codebase guards on WHERE journey_label = <expected> and uses RETURNING name. The dispatcher's atomic claim is the canonical example (pe_dispatcher.py:139-152):

UPDATE "tabProgramEnrollment"
   SET next_action_at = NULL
 WHERE name = %s
   AND journey_label = %s
   AND next_action_at IS NOT NULL
RETURNING name

A 0-row return is the signal that a parallel worker, a Glific callback, or save_submission moved the PE under us. The handler then skips without raising. This is the entire reason journey_label exists as a separate column from resolved_flow_state — the label is the "before-image" of the most recent transition and is cheap to compare against.

6. Per-PE dispatcher

summer_program/pe_dispatcher.py is the brain of the time-based automation. It runs every minute (hooks.py:73-76, cron */1 * * * *) and routes PEs whose next_action_at <= NOW() to per-action handlers.

6.1 The dispatch loop

process_program_actions at pe_dispatcher.py:63-199:

Selects up to DISPATCH_BATCH_SIZE = 1000 PEs (pe_dispatcher.py:60) with:
next_action_at <= NOW()
program_status IN ('active', 'paused') (paused included so binge-resume is reachable — fix B3)
next_action_type != ''
ORDER BY next_action_at ASC
FOR UPDATE SKIP LOCKED (L-003) so 4 parallel workers each take a different slice.
For each row, runs the atomic claim (L-010) — clears next_action_at conditional on journey_label matching the value read in the SELECT.
On successful claim, calls _dispatch_single(pe_row) which looks up the handler in HANDLER_MAP (pe_dispatcher.py:590-603) and runs it inside a try/except.
After the loop, logs a structured sp_dispatcher line with claimed, skipped, errors, queue_depth.

Scaling envelope (architecture §8.8 sizing math, ADR-003 audit 2026-05-13): 1000-row batch × 1-min cron × 4 workers ≈ 240K actions/hour. A 100K-student week-boundary T19 burst drains in ~25 min. Prerequisite: partial index idx_pe_next_action (task #24).

6.2 The handlers

Action type	Handler	What it does
`content_delivery`	`handle_content_delivery` (`pe_dispatcher.py:245-275`)	Preserved escape hatch; NOT reached in normal flow under CR-005. Logs that it fired and triggers the content flow for a single contact.
`escalation`	`handle_escalation` (`pe_dispatcher.py:278-389`)	CR-003 channel-aware. Resolves the step's `escalation_type` from ArchetypeConfig, transitions PE via T2/T4/T8/T10, then branches: `parent_call` → enqueue Vocallabs; other types → fire `SP_Escalation` flow. On step exhaustion, routes to T5/T11 (grace).
`feedback_timeout`	`handle_feedback_timeout` (`pe_dispatcher.py:399-451`)	Safety-net fallback. Checks Submission status; if feedback arrived and PE wasn't advanced, fires T12 fallback. Otherwise reschedules for +1h, max 3 retries.
`week_advancement`	`handle_week_advancement` (`pe_dispatcher.py:454-499`)	Decides T16 (program done) vs T15 (binge pause) vs T14/T19 (normal advance).
`grace_check`	`handle_grace_check` (`pe_dispatcher.py:502-547`)	CR-003 direct drop. If `weekly_submission_done=1` no-op. If clock not yet expired, reschedule. Otherwise fire T17 grace-expired.
`pause_check`	`handle_pause_check` (`pe_dispatcher.py:557-583`)	Binge-resume check. T21 if calendar caught up; else reschedule +7 days.

Retired handlers per CR-003: handle_re_engagement and handle_grace_reminder — re-engagement is inbound-only via SP_Incoming_Router and per-week escalation steps replace proactive grace reminders.

6.3 Delivery failure chain

_record_delivery_failure(pe_name) at pe_dispatcher.py:687-740 is the bridge from a failed Glific call to T23 auto-drop:

Atomic UPDATE: delivery_failure_count = COALESCE(delivery_failure_count, 0) + 1 with program_status IN ('active', 'paused') folded into the WHERE so a PE that went terminal can't get its counter bumped (M1 fix 2026-05-13).
RETURNING gives the post-update value; if 0 rows returned, no-op.
If new_count >= MAX_DELIVERY_FAILURES (3), fires t23_auto_drop(pe, reason='delivery_failure').

The helper is shipped but not yet wired in — _trigger_flow is fire-and-forget (pe_dispatcher.py:628-643) and there is no delivery-confirmation webhook from Glific yet. The webhook wire-up is M-AUDIT-8 (deferred Phase 1).

7. Content delivery — CR-005 Approach B (state-driven)

The current weekly content delivery model. Replaces the previous per-PE content_delivery scheduling that armed ACTION_CONTENT_DELIVERY on every T0 and T19.

7.1 Continuous membership maintenance

Every state-machine transition (via transition() at state_machine.py:64-104) calls maintain_collections(pe, from_state, to_state) at collection_membership.py:51-86. This:

Computes from_main, to_main (membership in MAIN_ELIGIBLE_STATES) and from_audit, to_audit via STATE_TO_AUDIT_KIND.
If from_main != to_main, enqueues an add/remove on the BPR's main PGCollection.
If from_audit != to_audit, enqueues remove of the old audit kind and add of the new audit kind.

Membership churn is fully delta-based — re-running with same from_state==to_state is a no-op.

7.2 Glific group writes (retry + DLQ + counter)

_enqueue_group_write (collection_membership.py:89-111) looks up the kind-keyed PGCollection row and enqueues _group_write_job with enqueue_after_commit=True. The job (collection_membership.py:114-188) wraps add_contact_to_group / remove_contact_from_group (both in glific_integration.py) with P-007 retry (5 immediate retries) + DLQ.

On successful Glific call, _bump_member_count (collection_membership.py:191-221) atomically maintains PGCollection.member_count:

UPDATE "tabPGCollection"
   SET member_count = COALESCE(member_count, 0) + 1                       -- add
       member_count = GREATEST(0, COALESCE(member_count, 0) - 1)          -- remove
 WHERE name = %s

The GREATEST(0, ...) guard prevents negative drift if a remove fires for a contact that was never in the group.

7.3 The weekly fire

scheduler.weekly_content_delivery_trigger (scheduler.py:223-292) runs Tuesday 03:30 UTC = 09:00 IST (cron 30 3 * * 2 in hooks.py:87-89):

Selects active BPRs with a configured content_delivery_flow.
For each, reads the main PGCollection row, skipping when member_count <= 0.
Calls start_group_flow(content_delivery_flow, main_glific_group_id) — exactly one Glific API call per BPR.

No recompute, no reconcile. Idempotency is operator discipline ("don't manually invoke during the cron window"); Glific deduplicates identical group-flow starts within a short window.

8. Escalation, grace, and drop

The "weekly nudge" loop, per CR-001 + CR-003 + CR-006.

8.1 Per-week grace clock

activity_points.award_activity_points (activity_points.py:75-202) is the only arming path. On the week's first VideoClass completion (detected via pe.weekly_video_done == 0 pre-UPDATE), it both:

Logs grace_window_entered to ProgramEventLog with the expected grace_window_end_at.
Issues a single atomic UPDATE that bumps activity points, sets weekly_video_done=1, AND arms grace_window_start = NOW(), grace_window_end_at = NOW() + (X || ' days')::interval, in_grace_window=1 — gated by CASE WHEN weekly_video_done = 0 (Postgres evaluates against the OLD row value within the same UPDATE). Second-video-same-week is a no-op for the grace fields.

X is Batch.grace_window_days, falling back to DEFAULT_GRACE_WINDOW_DAYS = 14 (constants.py:281).

T19 re-arming is automatic: T19 resets weekly_video_done = 0, so the next VideoClass watch re-trips the CASE WHEN and writes a fresh grace_window_end_at. T0 no longer arms the clock (the activity-points handler does — CR-003 follow-up 2026-05-13).

Grace clearing: T3/T7/T9/T17 (primary submission) explicitly null in_grace_window, grace_window_end_at, grace_window_start.

8.2 Escalation channel routing

handle_escalation (pe_dispatcher.py:278-389) resolves the step's escalation_type from ArchetypeConfig and branches:

help_note_a / help_note_b / voice_note → fire SP_Escalation Glific flow. Glific reads the contact fields escalation_order + escalation_type and renders the per-channel content.
parent_call → enqueue vocallabs.initiate_parent_call. Glific is NOT involved for parent calls.

Defaults: escalation_type defaults to help_note_a if missing. hours_after_previous defaults to 24.

8.2.1 Vocallabs parent-call flow (cache-on-Student + auto-backfill — task #80/#81, CR-010)

vocallabs.py is the integration module. The 2026-05-24 hardening (CR-010) changed the flow from "addMultipleContactsToGroup every call" to a permanently-cached prospect_id on the Student doctype.

Why the change: Vocallabs enforces a uniqueness constraint on (client_id, prospect_group_id, phone) in their Hasura backend. Without a cache, every repeat call to the same parent's phone hits prospects_client_id_prospect_group_id_phone_key violation and never dials. Original behavior was to treat this as transient and retry 6 times (~75s of wasted worker time per failure, then DLQ). The cached design hits the insert ONCE then reuses the returned prospect_id forever.

Storage: Student.vocallabs_prospect_id (Data, read-only in UI, no_copy=1). Populated by _store_prospect_id(student, prospect_id) via frappe.db.set_value with update_modified=False (no audit-trail bump for what is essentially a tech-side cache).

Call sequencing (_call_vocallabs):

Cache hit (Student.vocallabs_prospect_id populated):
POST /b2b/vocallabs/updateContactData with the freshly rendered data block (best-effort — failure logs to SP Vocallabs UpdateData and proceeds).
POST /b2b/vocallabs/initiateVocallabsCall with the cached prospect_id.
2 HTTP calls (plus cached auth token).
Cache miss + first encounter (cache empty AND parent phone NOT in Vocallabs):
POST /b2b/vocallabs/addMultipleContactsToGroup → returns prospect_id.
_store_prospect_id writes to Student.vocallabs_prospect_id (before initiateVocallabsCall fires, so a mid-flight call failure still leaves the cache populated).
POST /b2b/vocallabs/initiateVocallabsCall.
Cache miss + auto-backfill (cache empty AND parent phone already in Vocallabs from a prior insert):
addMultipleContactsToGroup returns the Hasura uniqueness violation.
_is_duplicate_prospect_response detects the prospects_client_id_prospect_group_id_phone_key constraint name OR the GraphQL code: "constraint-violation".
_lookup_prospect_id_by_phone paginates GET /b2b/vocallabs/getContacts (limit=200, max 50 pages = 10,000 prospects). Matches by exact phone OR last-10-digits (handles E.164 ↔ 10-digit drift across Indian numbers).
On match: cache the recovered UUID, run updateContactData, proceed to initiateVocallabsCall.
On miss (phone genuinely not there, or unrecognized response shape): raise PermanentVocallabsError → fail-fast under SP Vocallabs Duplicate Prospect (lookup required) Error Log title.

Status template freshness: data.status is bound to the prospect record (not the call). On cache hit we MUST refresh via updateContactData before each call, otherwise the agent reads the variables from the original insert (week 1's text on a week 5 call). The team configures the template via ParentCallConfig.status_template per-week via UnitContentItem on LearningUnit, falling back to VoiceAgentSettings.default_parent_call_config. The code never hard-codes content.

Sibling race caveat: Two Student rows can share one parent phone (multiple kids enrolled, single device). For the cache to work, both Student rows must store the same prospect_id. The auto-backfill handles this — first sibling's call recovers the existing UUID, second sibling's call recovers the same UUID independently. But data.status updates race: if both siblings hit step 4 within the same second, the second updateContactData overwrites the first's variables before the first call has dialed out. Team-side mitigation is to make ParentCallConfig.status_template sibling-agnostic ("Your child hasn't submitted") rather than name-specific.

Single agent limitation (task #48): Today there's one VoiceAgentSettings.agent_id for all calls. Multi-language students get the English agent. Vocallabs is exploring language-detection-from-voice on their side; if that ships, task #48 may become moot. Otherwise the plan is one Vocallabs agent per language with a Student.language → agent_id lookup at call time.

Error Log taxonomy (all under method-column titles starting with SP Vocallabs): - SP Vocallabs Retry — transient failure; will retry (P-007 budget). - SP Vocallabs DLQ — manual replay required — retries exhausted. - SP Vocallabs Duplicate Prospect (lookup required) — permanent fail; cache empty AND lookup couldn't recover prospect_id. - SP Vocallabs Lookup — pagination diagnostics (only logged on miss/error, not success). - SP Vocallabs UpdateData — updateContactData failed; call still placed with stale prospect data. - SP Vocallabs Config / SP Vocallabs Template / SP Vocallabs Skipped — early-exit branches.

Verified end-to-end on test server 2026-05-24: 3 test parents (ST00051383 / ST00051387 / ST00051238) all had polluted phones from prior test runs. Phase C confirmed fail-fast under the dedicated Error Log title (3 entries × ~0.6s each vs. yesterday's 12 retries + 2 DLQ over 75s). Phase B (post-fix) confirmed auto-backfill: cache populated with real UUIDs (9f5f108c-..., 9e27ba86-..., 41556126-...), three subsequent cache-hit calls succeeded end-to-end on the API side with valid call_ids returned. Vocallabs-side dial-out failed (unrelated config issue on their end — call_ids handed off to Vocallabs team).

8.3 Step exhaustion (CR-006 unified)

When next_step > len(steps), escalation has exhausted within the week. CR-006 (2026-05-15) removes T6 (escalation_to_remedial) — remedial is now reserved for failed-feedback students via T6b. All zero-submission exhausters take the same path:

Core state (normal_content_delivery / normal_escalation) → t5_escalation_to_grace → grace_waiting.
Remedial state (remedial_content_delivery / remedial_escalation) → t11_remedial_to_grace → grace_waiting.

8.4 Grace expiry → direct drop (CR-001)

handle_grace_check (pe_dispatcher.py:502-547) fires at grace_window_end_at:

If PE has moved out of grace_waiting, no-op.
If weekly_submission_done=1, no-op (student submitted within the window).
If clock hasn't actually expired (skew), reschedule to the exact expiry time.
Otherwise fire t17_grace_expired(pe, "dispatcher") → program_dropped with drop_reason='grace_expired'. No paused_no_activity hop, no re-engagement (those branches were deleted in CR-001 / CR-003).

9. Submission pipeline

The path from a student WhatsApp message to AI feedback delivered back to the student.

9.1 save_submission (`save_submission.py:46-237`)

Whitelisted as allow_guest=True so Glific can call it. Flow:

Resolve student_id (utils.resolve_student accepts name, glific_id, or phone).
Resolve active PE via get_active_pe (state_machine.py:1005-1034) — uses program_status IN ('active', 'paused') and ORDER BY modified DESC to pick the live row when multiple PEs exist for the same student.
Reject if PE is in a terminal state.
Normalize the payload — _normalize_submission_payload infers submission_type as text | emoji | audio | image | video from the input shape (URL → infer by extension; non-URL → emoji if no alnum chars else text).
Insert Submission FIRST inside a savepoint (save_submission.py:135-160, task #81 audit 2026-05-15). is_primary=False is provisional; flipped after the atomic claim succeeds. Order matters: if the insert failed AFTER claiming primary, retries would see "duplicate" and silently lose the real submission.
_try_claim_primary(pe, week) (save_submission.py:279-315) — the atomic primary claim. Uses UPDATE ... RETURNING: UPDATE `tabProgramEnrollment` SET journey_label = 'submitted', last_label_change_at = NOW(), submission_count = COALESCE(submission_count, 0) + 1, last_submission_at = NOW() WHERE name = %s AND journey_label IN (%s, %s, %s, %s, %s) RETURNING name The 5 valid pre-submission labels are enrolled | content_delivered | grace_window | resumed | week_advanced. RETURNING-empty means duplicate. submission_count is bumped here, exactly once — T3/T7/T9/T17 no longer touch it (H-AUDIT-4, 2026-05-15).
If primary: apply state transition via apply_submission_transition(pe, points=0, "flow_callback") which dispatches on pe.resolved_flow_state to T7/T3/T9/T17 or T22 for duplicates. CR-007 (2026-05-19): points are NOT computed here. AI validation runs asynchronously after save_submission, so the actual point award (depending on Submission.result_status) is deferred to feedback_consumer_hook.on_feedback_ready (§9.3). The transition still bumps streak / gems / weekly_submission_done on every submission regardless of validity (user spec).
Update EngagementState (legacy bridge) and log to StudentContentLog with a savepoint (so a bridge-log insert failure doesn't poison the outer transaction — L-030).
If primary: enqueue process_submission_async (save_submission.py:629-666) which uploads media to GCS, marks submission.status = "Processing", then enqueues enqueue_submission.
enqueue_submission (save_submission.py:669-815) publishes to RabbitMQ via pika:
- Publisher confirms enabled (channel.confirm_delivery()).
- Durable queue (queue_declare(durable=True)).
- Persistent delivery (delivery_mode=2).
- Mandatory routing (mandatory=True).
- P-007 retry: 5 retries via FEEDBACK_PIPELINE_MAX_RETRIES, DLQ on exhaustion with a JSON-structured Error Log entry including submission_id, student_id, full pe_context.

9.2 FeedbackConsumer (external)

Runs outside this app. Drains RabbitMQ, generates AI feedback, writes back to the Submission row (overall_feedback, overall_feedback_translated, audio_feedback_url, result_status, status="Completed"), and calls back into the backend via feedback_consumer_hook.on_feedback_ready(submission_name, student_id).

9.3 on_feedback_ready (`feedback_consumer_hook.py`)

Resolve student from submission if not provided.
Find the active PE in submitted_awaiting_feedback. If none, return {"status": "no_pe"}.
Verify the submission's week matches the PE's current_week; mismatch → skip.
Points: CR-007 (2026-05-19), revised 2026-06-15. _compute_submission_points(pe, submission_name, result_status) evaluates branches in this order — the AI-verdict gate sits at the top and overrides every later branch, including the escalation-tier reward for late submissions:
Branch 0 — result_status == 'Pending': 0 points (defensive — the hook contract is that FeedbackConsumer writes the AI verdict before calling on_feedback_ready, so Pending should never reach here; if a race delivers it anyway, treat as no-verdict and award nothing).
Branch 1 — WeekRule.submission_validation_enabled = 1 AND result_status in ('Failed', 'Success - Flagged'): 0 points. Pre-2026-06-15 the late branch ran first and shielded sent_count >= 1 + Failed/Flagged from this gate; the spec now zeroes them out so the AI verdict universally trumps the tier reward. 'Failed' is dead schema in production (no writer emits it) — kept defensively.
Branch 2 — sent_count >= 1 (late submission, escalation fired): EscalationStep[sent_count].points_awarded.
Branch 3 — on-time + valid (result_status == 'Success - Original', OR any non-Pending value in lax mode): Assignment.points_per_item.

The WeekRule.submission_validation_enabled flag lives on the ArchetypeConfig.week_rules child rows, keyed (archetype, arm, week). It is operator-owned per archetype — there is no week-number-hardcoded "weeks 1-2 are lax" rule in code; that pattern, where it exists, is a deployment convention set by ArchetypeConfig values.

Points are applied via atomic SQL UPDATE (_award_submission_points_atomic, L-011 COALESCE pattern). pe.reload() after the bump ensures the subsequent transition's pe.save() doesn't clobber the freshly-bumped columns. 5. CR-004 + CR-007 routing (current contract): read Submission.submission_validity. If submission_validity == 'Invalid' AND WeekRule.submission_validation_enabled = 1 → fire t6b_failed_feedback_to_remedial. Otherwise → fire t12_feedback_ready. In lax mode (validation off), even invalid submissions stay on Core. (CR-004 originally specified branching on result_status; the implementation was later switched to submission_validity to separate the AI-scoring signal from the validation-gate signal. result_status still governs the points decision in step 4 above; submission_validity governs the routing decision here.) 6. _sync_contact_fields(pe) re-pushes PE state to Glific so the post-award totals land on the contact fields.

Remedial routing is per-week, not permanent. T14 (week advance) at state_machine.py:772 unconditionally sets current_path = PATH_CORE on every advance, so a failed submission only routes the student to Remedial for the current week — the next week starts on Core again.

9.4 Glific F5 callback

flow_callback.update_flow_status (flow_callback.py:92-149) dispatches by flow name. SP_Feedback_Delivery routes to _handle_feedback_delivery (flow_callback.py:225-245):

Verify PE is still in feedback_ready (else "already_advanced" response).
Fire t13_feedback_delivered(pe, "flow_callback") → week_completed, schedules week_advancement.
Reload PE and return _response(pe, "week_completed", current_week=pe.current_week).

9.5 Watchdog

pre_launch.feedback_ready_watchdog (pre_launch.py:129-195) runs hourly (hooks.py:94-96) to find PEs stuck in feedback_ready longer than stuck_hours=2. Log-only — does NOT auto-transition. Silent auto-advance would skip feedback delivery entirely (the reason feedback_ready exists). Operator action: retrigger F5 via Glific UI or call update_flow_status(action='feedback_complete') from bench shell.

10. Gamification — CR-002 v2

Nine PE fields capture the three streams (activity / quiz / submission) plus gems and streak.

10.1 Field roles

Field	Cumulative?	Resets at T19?	Pushed to Glific?
`total_activity_points`	yes	no	yes (`total_activity_points`)
`weekly_activity_points`	no	yes (→ 0)	yes (`weekly_activity_points`)
`total_quiz_points`	yes	no	yes (`total_quiz_points`)
`weekly_quiz_points`	no	yes (→ 0)	yes (`weekly_quiz_points`)
`bonus_quiz_points`	yes (independent)	no	yes (`bonus_quiz_points`)
`total_submission_points`	yes	no	yes (`total_submission_points`)
`weekly_submission_points`	no	yes (→ 0)	yes (`weekly_submission_points`)
`special_gems`	yes	no	yes (`special_gems`)
`current_streak`	yes	conditional	yes (`current_streak`)
`total_points`	yes (roll-up)	no	yes (`total_points`)
`weekly_video_done` (sticky flag)	n/a	yes (→ 0)	NO (internal-only — `constants.py:218`)
`weekly_submission_done` (sticky flag)	n/a	yes (→ 0)	yes (`weekly_submission_done`)

10.2 Activity points (`activity_points.py`)

Wired on StudentContentLog.after_insert. Filtered to content_type == "VideoClass" AND action == "completed". The handler:

Guards on scl.points_awarded > 0 (L-012 idempotency).
Resolves VideoClass.points; returns early on 0/null (E11: zero-point pedagogical videos do not count toward "assigned this week").
Resolves active PE; returns if none.
Pre-UPDATE Python read of pe.weekly_video_done to detect first-video-of-week → logs grace_window_entered event BEFORE the UPDATE.
Single atomic UPDATE that bumps activity points, sets weekly_video_done=1, and arms the grace clock via CASE WHEN weekly_video_done = 0 (see §8.1).
Writes scl.points_awarded = pts (audit anchor written AFTER PE update so retries are safe).
Enqueues contact-field sync; logs activity_points_awarded.

10.3 Quiz points (`quiz_points.py`)

Wired on StudentQuizAttempt.on_update. Filtered to completed_at IS NOT NULL AND points_earned == 0 (idempotency).

Per-question rule (compute_quiz_points at quiz_points.py:180-196):

Correct answer → QuizQuestion.points
Wrong answer → QuizQuestion.failed_points
Independent of attempt-level pass/fail

Cumulative-vs-weekly split (quiz_points.award_quiz_points at quiz_points.py:65-125):

total_quiz_points, total_points: apply DELTA vs the previous-latest attempt for the same (student, quiz) pair (latest-score semantics).
weekly_quiz_points: ALWAYS adds the new attempt's full earned value (effort semantics).

bonus_quiz_points is independent — award_bonus_quiz_points only touches that field and the matching Glific contact field, never affects total_points / weekly_quiz_points.

10.4 Submission streak and gems

T3/T7/T9/T17 each add to total_submission_points, weekly_submission_points, current_streak += 1, special_gems += 1, and set weekly_submission_done = 1 — all in the same atomic save (state_machine.py:398-414, 534-550, 585-601, 808-823).

10.5 T19 (week advance) streak/gem compute

t14_week_advance (state_machine.py:682-768) runs the compute in two phases:

Phase 1 — read the sticky weekly flags BEFORE reset:
if weekly_video_done = 1 AND weekly_submission_done = 0 (assigned but didn't submit): current_streak → 0, special_gems → max(0, special_gems - 1).
else: streak/gems unchanged.
Phase 2 — reset weekly_* counters + both sticky flags to 0; advance week; write the streak/gem values computed in Phase 1. Total counters are NEVER reset (E10).

Gem floor is enforced in Python (max(0, ...)). SQL GREATEST is not needed because the value is plain-int by the time we write it.

11. Glific integration

glific_integration.py houses the wrappers. The SP pushes 28 contact fields and reads/writes 5 kind-keyed groups.

11.1 Contact field catalogue (28 fields)

Category	Fields
Identity (6)	`student_id`, `student_name`, `batch_id`, `archetype`, `language`, `course_level`
State (7)	`resolved_flow_state`, `current_week`, `current_path`, `current_tier`, `program_status`, `experiment_arm`, `grace_window_end_at`
Submission (3)	`current_expected_submission_type`, `submission_count`, `weekly_submission_done`
Escalation (3)	`last_escalation_step`, `escalation_order`, `escalation_type`
Gamification (9)	`total_points`, `current_streak`, `total_activity_points`, `weekly_activity_points`, `total_quiz_points`, `weekly_quiz_points`, `bonus_quiz_points`, `total_submission_points`, `weekly_submission_points`, `special_gems`

weekly_video_done is internal-only and is intentionally NOT pushed (constants.py:218).

11.2 Push mechanics

state_machine._enqueue_contact_field_sync(pe) (state_machine.py:107-159) serializes the 28 fields into a dict and enqueues _sync_contact_fields_job with enqueue_after_commit=True. The job (state_machine.py:162-258) wraps update_contact_fields(glific_id, fields) with P-007 retry+DLQ:

GLIFIC_SYNC_MAX_RETRIES = 5 (constants.py:250).
Retries are IMMEDIATE (no backoff — known limitation, follow-up task).
DLQ log title: SP Glific Sync DLQ — manual replay required.
DLQ payload is JSON with student_id, pe_name, glific_id, fields, final_error, retries_attempted.
Double-fault handling: if the re-enqueue itself raises, the original error goes straight to DLQ.

update_contact_fields itself (glific_integration.py:133-251) uses a 3-step fetch-merge-update pattern: fetch existing fields → merge ours in → write back via updateContact. Without the fetch, Glific's blob-replace semantics would clobber fields set by other tools or flows.

11.3 Group write helpers

add_contact_to_group(contact_id, group_id) — updateGroupContacts(addContactIds=[contact_id], deleteContactIds=[]). glific_integration.py:705-754.
remove_contact_from_group(contact_id, group_id) — updateGroupContacts(addContactIds=[], deleteContactIds=[contact_id]). CR-005 addition. glific_integration.py:614-676.
create_group_if_missing(label, description) — looks up by label, creates if absent, returns Glific group ID. Used by _ensure_kind_keyed_pg_collections and the backfill patch. glific_integration.py:679-702.
start_contact_flow(flow_id, contact_id, default_results) — single-contact flow trigger; fire-and-forget. glific_integration.py:392-432.
start_group_flow(flow_id, group_id) — group-level flow trigger; used by the weekly Tuesday cron. Lives in summer_program/glific_extensions.py.

12. Flow callbacks (whitelisted endpoints)

summer_program/flow_callback.py is the bridge from Glific to the backend state machine. All endpoints comply with the L-008 flat-map contract.

12.1 The L-008 response helper

_response(pe, status_value, **extras) at flow_callback.py:56-89:

Writes directly to frappe.local.response via update(...) — no message: wrapper.
Always emits: success, status, action (deprecated alias for one release cycle — task #73), resolved_flow_state, next_action_type, next_action_at, program_status.
All values are scalar strings (timestamps stringified, no nested objects, no arrays).

12.2 update_flow_status (`flow_callback.py:92-149`)

Whitelisted as allow_guest=False. Args: student_id, flow_name, status, metadata (optional). Dispatches by flow_name via _get_handler (flow_callback.py:152-173):

Flow name	Status values	Handler	Resulting transition
`SP_Content_Delivery`	`no_response` / `timeout`	`_handle_content_delivery`	T1 — schedule first escalation
`SP_Content_Delivery`	`completed`	`_handle_content_delivery`	no transition; "delivery_confirmed"
`SP_Escalation`	any	`_handle_escalation`	no transition; "escalation_confirmed"
`SP_Feedback_Delivery`	any	`_handle_feedback_delivery`	T13 — feedback_ready → week_completed
`SP_Submission`	any	`_handle_submission_flow`	no transition; "submission_flow_completed"
`SP_Paused_Binge`	any	`_handle_binge_info`	no transition; "binge_info_delivered"
`SP_Program_Complete`	any	`_handle_info_flow`	no transition; "info_delivered"

Retired handlers per CR-003: _handle_grace_flow (SP_Grace_Entry), _handle_reengagement (SP_Paused_Reengagement), SP_Grace_Reminder. SP_Week_Summary removed per CR-002 v2.

12.3 save_submission

See §9.1. Whitelisted as allow_guest=True (Glific's flow webhook calls it directly). Returns the L-008 flat-map response with status ∈ {accepted, duplicate, terminal_state, no_active_enrollment, not_found, missing_param, insert_failed}.

12.4 award_bonus_quiz_points

quiz_points.award_bonus_quiz_points(student_id, points) (quiz_points.py:128-173) — decorated with @glific_response (so the helper handles the flat-map write). Awards independent bonus points (does not affect total_points, total_quiz_points, or weekly_quiz_points).

13. Scheduler topology

Wired in hooks.py:42-98. All times are UTC.

Schedule	Path	Purpose
`daily`	`tap_lms.tap_lms.page.onboarding_flow_trigger.onboarding_flow_trigger.update_incomplete_stages`	Legacy onboarding sweep (out of scope here)
`daily`	`tap_lms.summer_program.scheduler.run_daily_actions`	SP daily housekeeping: per-active-BPR collection actions, program-complete trigger when week > total_weeks
`daily`	`tap_lms.summer_program.batch_activation.check_auto_activate`	Auto-activate BPRs whose `Batch.start_date <= today`
`/1 * * *`	`tap_lms.summer_program.pe_dispatcher.process_program_actions`	The per-PE dispatcher loop (1000-row batch × 4 workers ≈ 240K/hour)
`0 /2 * *`	`tap_lms.summer_program.escalation_runner.run_escalation_check`	6-hour bulk escalation sweep (legacy batcher; to be replaced when collection-mode rollout lands)
`0 0 * * 1`	`tap_lms.summer_program.batch_admin.auto_advance_batch_week`	Weekly Monday — bumps `Batch.current_calendar_week`; unblocks `max_allowed_week` on PEs
`30 3 * * 2`	`tap_lms.summer_program.scheduler.weekly_content_delivery_trigger`	CR-005 weekly content delivery — Tuesday 09:00 IST against each active BPR's `main` group
`0 * * * *`	`tap_lms.summer_program.pre_launch.feedback_ready_watchdog`	Hourly log-only watchdog for stuck `feedback_ready` PEs (task #56, 2026-05-16)

Doc events (hooks.py:14-40):

StudentContentLog.after_insert → summer_program.activity_points.handle_content_log
StudentQuizAttempt.on_update → summer_program.quiz_points.handle_attempt_update
(Plus pre-existing onboarding hooks on School, Teacher, StudentStageProgress.)

14. Operational tooling

summer_program/pre_launch.py and summer_program/dev_tools.py.

14.1 audit_and_null_stale_next_action_at

pre_launch.audit_and_null_stale_next_action_at(dry_run=False, stale_days=7) (pre_launch.py:32-122). Pre-launch one-shot audit (also re-runnable). Identifies and nulls stale next_action_at values in two buckets:

Overdue — next_action_at < NOW() - stale_days AND not terminal.
Terminal — next_action_at IS NOT NULL AND resolved_flow_state IN ('program_completed', 'program_dropped').

Single atomic UPDATE covers both buckets. dry_run=True returns counts without writing. Returns {stale_overdue, stale_terminal, total_nulled, by_state}.

Invocation:

bench --site tap_lms.dev execute \
    tap_lms.summer_program.pre_launch.audit_and_null_stale_next_action_at

14.2 feedback_ready_watchdog

See §9.5. Hourly cron, log-only, structured Error Log per stuck PE.

14.3 dev_tools — reset utilities

summer_program/dev_tools.py provides destructive reset helpers for dev/test sites:

list_pes_for_batch(batch_name) — read-only; prints PEs with key state fields.
reset_pe_to_state_0(pe_name, dry_run=False) — resets a single PE to T0 state. Destroys Submission / StudentQuizAttempt / StudentContentLog / ProgramEventLog rows.
reset_pes_for_batch(batch_name) — bulk reset across a batch.

Safety guard _assert_dev_site (dev_tools.py:59-80) refuses to run on sites whose name matches prod | live | production unless the caller passes i_know_this_is_destructive=True. Patches maintain_collections and _enqueue_contact_field_sync at module-level for stable test patching.

15. Resilience patterns

Named patterns referenced throughout the codebase.

P-001 — idempotency primitive (atomic UPDATE-WHERE-RETURNING)

Atomic UPDATE guarding on journey_label (or another precondition column) + RETURNING name. 0-row return = parallel actor moved state; skip gracefully. Lesson L-010.

Canonical example — the dispatcher claim (pe_dispatcher.py:139-152):

UPDATE "tabProgramEnrollment"
SET next_action_at = NULL
WHERE name = %s AND journey_label = %s AND next_action_at IS NOT NULL
RETURNING name

Other instances: _try_claim_primary (save_submission.py:298-307), _record_delivery_failure (pe_dispatcher.py:724-733).

P-002 — atomic counter (COALESCE-update)

SET col = COALESCE(col, 0) + %s so a counter survives races with a peer write or a weekly reset. Lesson L-011.

Instances: activity_points._sql_update_pe_columns (activity_points.py:157-176), quiz_points.award_quiz_points (quiz_points.py:97-106), _bump_member_count (collection_membership.py:204-220), _record_delivery_failure (pe_dispatcher.py:724-733), total_enrolled bump (enrollment.py:213-220).

P-003 — per-attempt audit lock

Write the audit field first (e.g. StudentQuizAttempt.points_earned), then do rollups. The audit field is the L-012 idempotency anchor for retries.

Instances: quiz_points.award_quiz_points writes points_earned at line 75-78 before the PE bump; activity_points.award_activity_points writes scl.points_awarded at line 179-182 AFTER the PE update (audit-after-PE so retries skip correctly). The pattern shape differs by whether the audit write itself is the side-effect anchor.

P-005 — write-once anchor

A column whose presence (points_earned > 0, points_awarded > 0) is checked at entry to short-circuit re-runs. Same field that L-012 protects.

P-006 — backward-compat parameter alias

Whitelisted parameter renames keep the old name as an alias for one release cycle and log a deprecation. Lesson L-009.

Instance: save_submission(content_id=) (deprecated) → assignment_id (save_submission.py:68-77). Aliased and logged via frappe.log_error titled SP API Deprecation.

P-007 — retry + DLQ

Every external write goes through this. On exception: increment retry_count, re-enqueue self via frappe.enqueue up to MAX_RETRIES; on exhaustion, log a structured DLQ entry to Frappe Error Log so operators can replay manually. Lesson L-015.

Subsystem	Max retries	Constant	DLQ log title
Glific contact-field sync	5	`GLIFIC_SYNC_MAX_RETRIES`	`SP Glific Sync DLQ — manual replay required`
Glific group writes (CR-005)	5	(shares `GLIFIC_SYNC_MAX_RETRIES`)	`SP Glific Sync DLQ — manual replay required`
RabbitMQ feedback publish	5	`FEEDBACK_PIPELINE_MAX_RETRIES`	`SP Feedback Pipeline DLQ — manual replay required`
Vocallabs parent call	5	`VOCALLABS_MAX_RETRIES`	`SP Vocallabs DLQ — manual replay required`

All currently use IMMEDIATE retries (no backoff). Exponential backoff is a deferred follow-up; the current scheme covers short Glific 502/503s and Redis hiccups, while a sustained outage (>~30s) drops to DLQ.

L-003 — FOR UPDATE SKIP LOCKED

PG-specific scheduler primitive. Lets N parallel workers each take a different slice of the candidate set without contention. Instance: process_program_actions SELECT (pe_dispatcher.py:104-122).

CR-005 — partial unique index (defense-in-depth)

A partial unique index on tabPGCollection (parent, kind) WHERE is_active = 1 enforces "at most one active row per (BPR, kind)" at the database level. Defense-in-depth against double-bootstrap; the bootstrap code itself is also idempotent (_ensure_kind_keyed_pg_collections at batch_activation.py:382-439).

16. Future work / explicitly deferred

These are NOT shipped today. Tracked in the task system; ADR-level commitments live in docs/adrs/.

M-AUDIT-8 — Glific delivery-confirmation webhook. Wires _trigger_flow failures into _record_delivery_failure → T23. Currently _trigger_flow is fire-and-forget; the chain is plumbed but inactive. Phase 1.
Exponential backoff scheduler. Replace the immediate-retry P-007 implementation with a delay-scheduled backoff. Covers sustained Glific or RabbitMQ outages. Follow-up to the Phase 1 webhook work.
TimescaleDB hypertable for ProgramEventLog. ADR-001. Not justified by current load.
Redis hot-state cache for PE. ADR-002. Not justified by current load.
Collection-mode batchers for escalation / feedback delivery. ADR-003 successors. The legacy escalation_runner 6-hourly sweep is the current implementation.
PE.last_dispatched_at column. Audit-trail field referenced in the dispatcher loop comments (pe_dispatcher.py:135-138) but not yet added; filed for DocType-UI follow-up.
Drop deprecated action alias from _response flat-map. After all Glific flows are audited to read @results.webhook.status instead of @results.webhook.action (task #73).
Cleanup CR for T6 stub. The t6_escalation_to_remedial function in state_machine.py:464-481 raises if called; will be deleted entirely once no caller imports it.
Cleanup CR for legacy archetype-keyed PGCollection rows. Currently is_active=0 post-CR-005 backfill; eventually delete.

17. Known gaps in this document

Course content authoring (CourseLevel, VideoClass, Quiz, QuizQuestion configuration) is not covered. Lives outside summer_program/ and is a separate concern.
School / teacher onboarding (onboarding_flow_trigger, Backend Student Onboarding) is referenced only where it touches the SP enrollment ingress. The full onboarding pipeline is owned by a different team and lives in the tap_lms/page/onboarding_flow_trigger/ namespace.
The legacy escalation_runner 6-hourly sweep is registered but not described — it predates the per-PE dispatcher and is in the process of being replaced.
SP_Incoming_Router — the inbound-only re-engagement entry point referenced in CR-003. This is a Glific flow (not backend code); the only backend touchpoint is the standard update_flow_status callback when the rejoin path completes.
reactivate_student — mentioned in state_machine.py:12 as a state-machine entry point but the implementation lives outside the files surveyed here.
escalation_runner.run_escalation_check internals — referenced in hooks.py:77-78 but its code was not opened for this doc.
Week-1 content for late activations — content delivery is now batch-driven via weekly_content_delivery_trigger; week-1 students added to the main collection mid-week will not get content until the following Tuesday. The admin's mitigation is to time activation Mon/Tue (batch_activation.py:223-227).