TAP LMS Architecture
Live single-source-of-truth for the Summer Program (SP) pipeline as it stands on 2026-05-16, post CR-001, CR-002 v2, CR-003, CR-004, CR-005, and CR-006. Cite this doc before designing new work; supersede it via CRs and ADRs rather than editing in place. References throughout use path/to/file.py:LINE form.
1. System overview
TAP LMS is a Frappe v15 custom app (tap_lms) running on Postgres. The Summer Program is a multi-week, archetype-driven, Glific-integrated student learning journey. A cohort lives on a Batch, a per-cohort run lives on a BatchProgramRun (BPR), and a per-student row lives on a ProgramEnrollment (PE). Each PE carries a resolved_flow_state (the state-machine label) and a next_action_at (when the scheduler should next look at it). State transitions are numbered T0–T25 in summer_program/state_machine.py. Every state change calls Glific to maintain group membership and to push contact fields back so the WhatsApp flows can branch on backend state.
Integrations the SP depends on:
- Glific — WhatsApp messaging. Two surfaces: contact fields (push state to Glific via
update_contact_fields) and groups (CR-005 kind-keyed collections:main,escalation,binge_paused,program_dropped,program_completed). Glific flows call back into the backend via the whitelistedupdate_flow_statusendpoint and thesave_submissionendpoint. - RabbitMQ — feedback pipeline broker. Submissions are published to RabbitMQ by
save_submission.enqueue_submission; the FeedbackConsumer external service drains the queue, generates AI feedback, writes back to the Submission row, then calls back into the backend viafeedback_consumer_hook.on_feedback_ready. - Vocallabs — parent-call automation. CR-003 added a
parent_callescalation channel;summer_program/vocallabs.pyis the integration module with its own retry+DLQ budget.
2. Tech stack and invariants
| Layer | Choice | Notes |
|---|---|---|
| Web framework | Frappe v15 | Custom app tap_lms, module path tap_lms.summer_program.* |
| Database | Postgres | No MariaDB compat layer — MariaDB fallback branches are dead code (L-007) |
| Background jobs | RQ workers via frappe.enqueue |
Queues used: short, default, long |
| Broker | RabbitMQ (pika) |
Publisher-confirms + persistent delivery (delivery_mode=2) — save_submission.py:725-737 |
| Messaging | Glific GraphQL API | Auth token cached in Glific Settings; mutations: updateContact, updateGroupContacts, startContactFlow, createGroup |
| Voice | Vocallabs HTTP API | Token cached in Redis under VOCALLABS_TOKEN_CACHE_KEY; default TTL 3600s |
Cross-cutting invariants (project-wide rules — see .claude/memory/lessons.md):
- L-005 (IN/array trap) — Frappe's PG param wrapper mangles
WHERE col IN %swhen binding a 2-tuple with an inner sequence. Use flatIN (%s, %s, ...)with scalar params, or= ANY(%s::text[])with explicit cast. Repeats acrosspe_dispatcher.py:99-122,pe_dispatcher.py:172-187,save_submission.py:292-307,_record_delivery_failureatpe_dispatcher.py:721-733. - L-008 (Glific flat-map contract) — Glific flows read
@results.webhook.<field>directly. Whitelisted endpoints write the response viafrappe.local.response.update(...)(nomessage:wrapper, scalar values only). Contact-field names are public contract — never rename without auditing every flow. Seeflow_callback.py:56-89anddocs/api-standard-glific.md. - L-010 (journey_label idempotency) — every state-changing UPDATE includes
WHERE journey_label = <expected>plusRETURNING name. A 0-row return means a parallel handler already moved the state; skip gracefully. Anchor pattern in the dispatcher's atomic claim —pe_dispatcher.py:139-152. - L-011 (atomic counter) — counters that race with the T19 weekly reset use
SET col = COALESCE(col, 0) + %s. Read-then-write is unsafe. Seeactivity_points.py:157-176,quiz_points.py:97-106,_record_delivery_failureatpe_dispatcher.py:724-733. - L-012 (per-attempt audit lock) — write the audit field (e.g.
StudentQuizAttempt.points_earned) FIRST, then do rollups. The audit field is the idempotency anchor —quiz_points.py:75-78. - L-015 (retry+DLQ) — every external write (Glific, RabbitMQ, Vocallabs, group-write) carries P-007 retry+DLQ. State has already committed in PG; a silent failure leaves Glific stale or submissions lost.
- L-017 (no commit in handlers) — controllers and flow handlers must not call
frappe.db.commit(). Frappe commits at request-end; mid-handler commits poison transaction isolation and break test rollback. - L-024 (two-folder deploy) —
app/is the working folder;~/frappe-bench/apps/tap_lms/is the bench checkout. Edits flow working → bench → git, thengit pullback into the working folder. Don't re-copy after the bench has pulled.
3. Key DocTypes
3.1 Student
The contact-level identity row. Lives in tap_lms/doctype/student/.
Student.name1— canonical display name (NOTstudent_name, which does not exist). Resolved viasummer_program/utils.py:get_student_display_name.Student.name— the document name and the student ID (e.g.ST00051383); there is no separatestudent_idcolumn.Student.glific_id— Glific contact ID; null for students never synced.Student.archetype— one ofdormant | fence_sitter | irregular_submitter | submitter(lowercase snake_case perconstants.py:18-28). Upstream-supplied data: set by CSV import, BigQuery sync, or manual entry. The SP backend never auto-assigns, randomizes, or recomputes this field. Task #13 (splitter) was explicitly cancelled — ADR-004 audit log.Student.experiment_arm— one ofdefault | arm_a | arm_b. Also upstream-supplied.Student.phone— used by Vocallabs as the parent's phone (the student uses the parent's device — CR-003 §"Phone resolution").
3.2 Batch
Per-cohort container. tap_lms/doctype/batch/batch.json:1-100.
| Field | Type | Purpose |
|---|---|---|
batch_id |
Data (unique) | Public batch identifier used in collection labels |
start_date |
Date | Cohort start; drives check_auto_activate and week calendar |
end_date |
Date | Cohort end |
program_type |
Select(Summer\|Regular) |
Owned by Batch (NOT BPR — clarification 2026) |
total_weeks |
Int | Length of the program (typically 8) |
grace_window_days |
Int | Per-cohort grace duration; per-week clock duration (CR-003) |
current_calendar_week |
Int | The "weekly Monday" cursor advanced by batch_admin.auto_advance_batch_week |
3.3 BatchProgramRun (BPR)
The per-run wrapper around a Batch, holding pipeline status, validation report, Glific flow IDs, and the embedded pg_collections child table. tap_lms/doctype/batchprogramrun/batchprogramrun.json:1-100.
Key fields:
status— Select:draft | importing | enrolling | collections_ready | active | completed(constants.py:172-182).validation_status— Select:not_run | passed | failed. Set byvalidate_bpratbatch_activation.py:39-151.total_imported,total_enrolled— counters incremented atomically byenrollment.py:213-220.content_delivery_flow,escalation_flow,reengagement_flow,grace_notification_flow,program_complete_flow,binge_info_flow— Glific flow UUIDs. Resolved per action type viaconstants.py:153-163ACTION_FLOW_FIELD_MAP. (reengagement_flowandgrace_notification_flowreminders are retired per CR-003; the columns remain for legacy callers.)pg_collections— child table of PGCollection rows.pg_onboarding_sets— child table linking toBackend Student Onboardingdocuments.
3.4 PGCollection (CR-005 kind-keyed)
Child table embedded under BatchProgramRun (istable=1). tap_lms/doctype/pgcollection/pgcollection.json:1-83.
Kind values (collection_membership.py:42-48 COLLECTION_KINDS):
| Kind | Glific group label | Purpose |
|---|---|---|
main |
SP_<batch>_main |
Source-of-truth for the weekly content delivery sweep |
escalation |
SP_<batch>_escalation |
All PEs in normal_escalation or remedial_escalation |
binge_paused |
SP_<batch>_binge_paused |
PEs in paused_binge |
program_dropped |
SP_<batch>_program_dropped |
Terminal — PEs dropped via T17 grace expiry or T23 |
program_completed |
SP_<batch>_program_completed |
Terminal — PEs completed via T16 |
collection_label— Glific group label.glific_group_id— Glific group ID (returned bycreate_group_if_missing).is_active— Check. CR-005 backfill marks new kind-keyed rows asis_active=1; the legacy archetype-keyed rows (archetype/experiment_armcolumns populated,kindnull) get markedis_active=0by the migration patch.member_count— denormalized integer counter, maintained by_bump_member_countatcollection_membership.py:191-221. Glific itself is the source of truth; this column lets the weekly cron cheaply skip empty BPRs.archetype,experiment_arm,student_count— legacy fields retained for historical rows; not used by new code.
3.5 ProgramEnrollment (PE)
The per-student-per-batch row that holds every piece of state the dispatcher and state machine read or write. tap_lms/doctype/programenrollment/programenrollment.json.
Identity:
enrollment(Data, unique) — public enrollment ID.student,batch,glific_id,course_level,language,experiment_arm,archetype,program_type.
State:
resolved_flow_state— Select with 11 values (see §5).journey_label— Select with 12 values (constants.py:82-99). The atomic-claim guard column.last_label_change_at— Datetime.program_status— Select:active | paused | completed | dropped.drop_reason— Data; valid values includegrace_expired,delivery_failure,admin,manual.current_expected_submission_type— Select:emoji | word_text_voice | summary_text_voice | image | video | photo_video_artefact.current_path— Select:Core | Remedial.current_tier— Select:Basic | Intermediate | Advanced. Resolved fromTIER_BY_WEEK(constants.py:265-268).
Week / scheduling:
current_week,max_allowed_week— Int. Week advance respectsmax_allowed_weekto gate binge progression.next_action_at,next_action_type— the per-PE clock the dispatcher reads. Action types:content_delivery,escalation,week_advancement,feedback_notification,grace_check,pause_check,feedback_timeout(constants.py:121-127).current_escalation_step— Int. Set by T2/T4/T8/T10.current_escalation_type— Data. CR-003: written alongside the step counter, pushed to Glific as the contact fieldescalation_type.
Grace clock (per-week, CR-003):
in_grace_window(Check),grace_window_start(Datetime),grace_window_end_at(Datetime). Armed byactivity_points.handle_content_logon the week's first VideoClass via atomic PostgresCASE WHEN. Cleared by T3/T7/T9/T17 (primary submission) or by T17/T23 drop.
Submission + delivery:
submission_count— Int. Owned exclusively bysave_submission._try_claim_primary(save_submission.py:298-307); state-machine transitions no longer bump it (H-AUDIT-4).quiz_completed— Int.last_submission_at— Datetime.delivery_failure_count— Int. Bumped by_record_delivery_failure; T23 fires atMAX_DELIVERY_FAILURES = 3.last_flow_triggered,last_flow_triggered_at— Data / Datetime. Tracking.
Gamification (CR-002 v2, 9 fields):
total_activity_points,weekly_activity_points— VideoClass points.total_quiz_points,weekly_quiz_points,bonus_quiz_points— quiz points.total_submission_points,weekly_submission_points— submission points.special_gems— Int. Floored at 0.current_streak— Int.total_points— cumulative roll-up across all three streams.- Sticky weekly flags:
weekly_video_done(Check, internal-only — NOT pushed to Glific perconstants.py:218),weekly_submission_done(Check, pushed asweekly_submission_donecontact field).
ProgramEnrollment is the canonical read-side source for all Summer Program helpers (2026-05-19 incident fix). The legacy Student.enrollment child table is written by backend onboarding for non-SP flows; it is NOT updated by start_program_enrollment for new SP batches. Reading from the child table miss-resolves students enrolled into the current cohort but whose child rows still point at a previous batch. The following helpers now query ProgramEnrollment directly with program_status IN ('active', 'paused') and ORDER BY creation DESC:
_get_active_bpr_for_student(student_progression_sp.py:1802-1845) — finds the active BPR + Batch for a student. Used byget_next_content/get_weekly_contentto fail fast withno_active_batchwhen no enrollment exists._get_course_level_for_student(student_progression_sp.py:1848-1867) — readscourse_levelfrom the matching PE. No legacy fallback (both callers are SP-only paths already gated by_get_active_bpr_for_student).get_student_sp_status(api.py:191-249) — status endpoint; iterates PEs bycreation descand returns the first BPR match.
Regression coverage: tests/test_active_bpr_resolution.py pins the contract (empty child table + active PE → resolves; non-Summer batch skipped; dropped PE ignored).
3.6 ArchetypeConfig
Per-(batch, archetype, experiment_arm, week) row. Replaced the always-16 invariant after the 2026-05-13 ADR-004 supersession (per-tuple validation rule). Read by pe_dispatcher._get_week_rule (pe_dispatcher.py:658-674) and by the validator _validate_archetype_config_before_activation called from activate_bpr (batch_activation.py:192-195).
Fields used by the runtime:
expected_submission_type— drivesPE.current_expected_submission_typeafter week advance.core_learning_unit,remedial_learning_unit— content references for the week.
3.7 Submission
The output of save_submission. Lives outside the SP module but is owned conceptually by it. Fields the SP reads: student_id, program_enrollment, week, is_primary, status (Pending/Processing/Completed/Failed — overall lifecycle), result_status (Select: Pending / Success - Original / Success - Flagged / Failed — drives the points decision in _compute_submission_points; writers: feedback_processor._determine_result_status writes Success - Original when AI says clean and Success - Flagged when plagiarized or AI-generated; save_submission._apply_duplicate_submission_feedback ALSO writes Success - Flagged synchronously for duplicate submissions; the Failed Select option exists but no production code writes it today — kept as a defensive value in the points gate), submission_validity (Valid/Invalid — drives the routing decision in on_feedback_ready; "Invalid" plus WeekRule.submission_validation_enabled=1 routes to Remedial via T6b), overall_feedback, overall_feedback_translated, audio_feedback_url, assign_id, submission_type, submission_text, submission_url, escalation_step_at_submit, created_at.
3.8 StudentQuizAttempt, StudentContentLog, ProgramEventLog
StudentQuizAttempt— wired viadoc_events["StudentQuizAttempt"]["on_update"]toquiz_points.handle_attempt_update. Thepoints_earnedcolumn is the L-012 audit anchor.StudentContentLog— wired viadoc_events["StudentContentLog"]["after_insert"]toactivity_points.handle_content_log. Filtered tocontent_type == "VideoClass" AND action == "completed". Thepoints_awardedcolumn is the audit anchor.ProgramEventLog— append-only audit table. Written bysummer_program/event_log.py:log_eventandlog_state_transition. Event types includestate_transition,submission_received,flow_completed,path_changed,week_advanced,program_dropped,escalation_sent,delivery_failed,grace_window_entered,activity_points_awarded,quiz_points_awarded,bonus_quiz_points_awarded,feedback_delivered.
4. Lifecycle — enrollment to BPR activation
The pipeline that takes a draft BPR to active. Each step is gated by status (constants.py:172-182 BPR_STATUS_FLOW).
4.1 Bulk enrollment (steps 3a and 3b)
summer_program/enrollment.py:40-154:
start_enrollment(bpr_name)— gathers students from the BPR's onboarding sets via_get_students_for_bpr. Marks BPRstatus = enrolling. Chunks students atENROLLMENT_CHUNK_SIZE=100and enqueues_process_enrollment_chunkon thelongqueue._process_enrollment_chunk— for each student, builds the 6-field bootstrap (archetype,experiment_arm,program_type=Summer,batch_id,course_level,student_name) and enqueues_sync_contact_fields_jobwithpe_name=f"pre-pe:{sid}"(synthetic so the DLQ log entry is correlatable when there is not yet a PE row). Atomically bumpsBatchProgramRun.total_enrolledvia raw SQL (enrollment.py:213-220).setup_collections(bpr_name)— for the 4×2 = 8 archetype/arm combinations, creates the legacy Glific groups viacreate_or_get_collectionand bulk-adds students viaadd_contacts_to_group_bulk(COLLECTION_BATCH_SIZE=500per call). Records each inbpr.pg_collections. Marks BPRstatus = collections_ready. Note: these legacy archetype-keyed PGCollection rows still exist post-CR-005 but areis_active=0after the migration patch and not used by new code.
4.2 Program enrollment (step 3c — separate)
summer_program/program_enrollment_api.py:40-...:
start_program_enrollment(bpr_name)— bulk-creates ProgramEnrollment records. Idempotent — filters out students who already have a PE for the batch. Background chunks viafrappe.enqueue.create_program_enrollment(student_id, batch_name)— single-PE entry point (API A6). Sets the 14 baseline Glific contact fields and writesjourney_label='enrolled',program_status='active',current_path='Core',current_week=1,resolved_flow_state='normal_content_delivery'.
4.3 Validation (step 4)
batch_activation.validate_bpr (batch_activation.py:38-151) checks:
- BPR is in
collections_ready. - 8 archetype×arm legacy collections exist on the BPR.
- At least one Glific flow is configured.
total_enrolled > 0and at least one PE row exists.- Batch has
start_dateandtotal_weeks.
Result is persisted as bpr.validation_status plus JSON bpr.validation_report.
4.4 Activation (step 5)
batch_activation.activate_bpr (batch_activation.py:154-228) is gated on validation_status == passed:
- Hard-fail on per-tuple ArchetypeConfig completeness via
_validate_archetype_config_before_activation(replaced the old always-16 rule — ADR-004 audit log 2026-05-13). - Flip BPR
status = active, setactivated_at = now(). - CR-005
_ensure_kind_keyed_pg_collections(batch_activation.py:382-439): idempotently create the 5 kind-keyed PGCollection rows + the matching Glific groups viacreate_group_if_missing. Each row carrieskind,collection_label = f"SP_{batch}_{kind}",glific_group_id,member_count=0,is_active=1. - No per-PE seeding of
next_action_at— CR-005 locked decision #4: content delivery is now batch-triggered every Tuesday at 09:00 IST (03:30 UTC) viaweekly_content_delivery_triggeragainst themaingroup. The legacy_seed_pe_actionsfunction is kept as dead code (operator escape hatch).
4.5 Auto-activation cron
batch_activation.check_auto_activate (batch_activation.py:334-377) is wired to the daily scheduler. It activates BPRs in collections_ready with validation_status=passed whose Batch.start_date <= today.
5. The state machine
summer_program/state_machine.py defines 11 valid resolved_flow_state values (constants.py:50-79) and the named transitions T0–T25. Every transition flows through the transition(...) engine at state_machine.py:64-104, which:
- Updates
pe.resolved_flow_stateandpe.last_label_change_at. - Applies any
extra_updatesdict. - Calls
pe.save(ignore_permissions=True). - Logs to
ProgramEventLogvialog_state_transition. - Enqueues
_sync_contact_fields_job(P-007 retry+DLQ). - CR-005: calls
collection_membership.maintain_collections(pe, from_state, to_state)to keep Glific group membership current.
5.1 The 11 states
| State | Path | In active set? | Audit kind |
|---|---|---|---|
normal_content_delivery |
Core | yes (main) | none |
normal_escalation |
Core | no | escalation |
remedial_content_delivery |
Remedial | yes (main) | none |
remedial_escalation |
Remedial | no | escalation |
grace_waiting |
either | no | none (Gap-fill A, CR-005) |
paused_binge |
Core | no | binge_paused |
submitted_awaiting_feedback |
either | yes (main) | none |
feedback_ready |
either | yes (main) | none |
week_completed |
either | yes (main) | none |
program_completed |
terminal | no | program_completed |
program_dropped |
terminal | no | program_dropped |
MAIN_ELIGIBLE_STATES is the set of 5 yes-yes rows (collection_membership.py:22-28). STATE_TO_AUDIT_KIND maps the right column (collection_membership.py:31-39). Note STATE_PAUSED_NO_ACTIVITY is RETIRED per CR-003 — the Select option remains for legacy rows but no new transition writes it.
5.2 Transition catalogue (T0–T25)
| ID | From → To | Trigger source | Notes |
|---|---|---|---|
| T0 | (new PE) → normal_content_delivery | scheduler | T0 NO LONGER arms the grace clock — armed by activity-points handler on first VideoClass. state_machine.py:322-339 |
| T1 | normal_content_delivery → normal_content_delivery | flow_callback | Schedules next escalation. state_machine.py:343-358 |
| T2 | normal_content_delivery → normal_escalation | scheduler | Writes current_escalation_step + current_escalation_type. state_machine.py:362-374 |
| T3 | normal_escalation → submitted_awaiting_feedback | flow_callback | Bumps submission points + streak + gems + clears grace clock. state_machine.py:378-414 |
| T4 | normal_escalation → normal_escalation | scheduler | Next escalation step. state_machine.py:418-430 |
| T5 | normal_escalation → grace_waiting | scheduler | Steps exhausted. Defensive grace-clock backfill for legacy PEs. state_machine.py:434-461 |
| T6 | REMOVED (CR-006) | n/a | state_machine.py:464-481 raises if called. Remedial is now reserved for failed-feedback only. |
| T6b | submitted_awaiting_feedback → remedial_content_delivery | microservice | CR-004 path. AI feedback set submission_validity='Invalid' AND WeekRule.submission_validation_enabled=1. (Originally keyed on result_status='failed' per CR-004; switched to submission_validity post-ship — see §9.3.) state_machine.py:485-512 |
| T7 | normal_content_delivery → submitted_awaiting_feedback | flow_callback | Same shape as T3. state_machine.py:516-550 |
| T8 | remedial_content_delivery → remedial_escalation | scheduler | Mirrors T2. state_machine.py:554-564 |
| T9 | remedial_content_delivery → submitted_awaiting_feedback | flow_callback | Mirrors T3. state_machine.py:568-601 |
| T10 | remedial_escalation → remedial_escalation | scheduler | Mirrors T4. state_machine.py:605-617 |
| T11 | remedial_escalation → grace_waiting | scheduler | Mirrors T5. state_machine.py:621-638 |
| T12 | submitted_awaiting_feedback → feedback_ready | microservice | Called from on_feedback_ready. Clears next_action_at/next_action_type — the Glific F5 callback drives the next transition. state_machine.py:642-656 |
| T13 | feedback_ready → week_completed | flow_callback | Glific SP_Feedback_Delivery completed. Schedules week_advancement. state_machine.py:660-666 |
| T14 (a.k.a. T19) | week_completed → normal_content_delivery | scheduler | The week-advance. CR-002 v2 streak/gem compute + weekly resets. Function is named t14_week_advance for historical reasons. state_machine.py:682-768 |
| T15 | week_completed → paused_binge | scheduler | Binge limit (calendar not yet caught up). state_machine.py:772-780 |
| T16 | week_completed → program_completed | scheduler | current_week > total_weeks. state_machine.py:784-791 |
| T17 (sub.) | grace_waiting → submitted_awaiting_feedback | flow_callback | Submission during grace. Clears grace fields. state_machine.py:795-823 |
| T17 (exp.) | grace_waiting → program_dropped | scheduler | CR-003 direct drop. drop_reason='grace_expired'. state_machine.py:837-859. Alias t18_grace_expired retained for one cycle (state_machine.py:867). |
| T18 | DELETED (CR-003) | n/a | Pre-CR-003 transitioned to paused_no_activity + re-engagement loop. |
| T19 (reactivate) | DELETED (CR-003) | n/a | Re-engagement is now inbound-only via SP_Incoming_Router. |
| T20 (reactivate-remedial) | DELETED (CR-003) | n/a | Same reason. |
| T21 | paused_binge → normal_content_delivery | scheduler | Calendar caught up. state_machine.py:886-894 |
| T22 | submitted_awaiting_feedback (no change) | flow_callback | Duplicate submission. Log-only. state_machine.py:898-902 |
| T23 | ANY → program_dropped | dispatcher | System-initiated. reason in {delivery_failure, admin, manual}. Idempotent on already-terminal PEs. state_machine.py:906-937 |
| T24 | ANY → program_dropped | admin | Admin drop. state_machine.py:941-948 |
| T25 | (no change) | scheduler | Delivery-failure log only. state_machine.py:952-959 |
5.3 Idempotency: the journey_label primitive (L-010)
Every state-changing UPDATE in this codebase guards on WHERE journey_label = <expected> and uses RETURNING name. The dispatcher's atomic claim is the canonical example (pe_dispatcher.py:139-152):
UPDATE "tabProgramEnrollment"
SET next_action_at = NULL
WHERE name = %s
AND journey_label = %s
AND next_action_at IS NOT NULL
RETURNING name
A 0-row return is the signal that a parallel worker, a Glific callback, or save_submission moved the PE under us. The handler then skips without raising. This is the entire reason journey_label exists as a separate column from resolved_flow_state — the label is the "before-image" of the most recent transition and is cheap to compare against.
6. Per-PE dispatcher
summer_program/pe_dispatcher.py is the brain of the time-based automation. It runs every minute (hooks.py:73-76, cron */1 * * * *) and routes PEs whose next_action_at <= NOW() to per-action handlers.
6.1 The dispatch loop
process_program_actions at pe_dispatcher.py:63-199:
- Selects up to
DISPATCH_BATCH_SIZE = 1000PEs (pe_dispatcher.py:60) with: next_action_at <= NOW()program_status IN ('active', 'paused')(paused included so binge-resume is reachable — fix B3)next_action_type != ''ORDER BY next_action_at ASCFOR UPDATE SKIP LOCKED(L-003) so 4 parallel workers each take a different slice.- For each row, runs the atomic claim (L-010) — clears
next_action_atconditional onjourney_labelmatching the value read in the SELECT. - On successful claim, calls
_dispatch_single(pe_row)which looks up the handler inHANDLER_MAP(pe_dispatcher.py:590-603) and runs it inside a try/except. - After the loop, logs a structured
sp_dispatcherline withclaimed,skipped,errors,queue_depth.
Scaling envelope (architecture §8.8 sizing math, ADR-003 audit 2026-05-13): 1000-row batch × 1-min cron × 4 workers ≈ 240K actions/hour. A 100K-student week-boundary T19 burst drains in ~25 min. Prerequisite: partial index idx_pe_next_action (task #24).
6.2 The handlers
| Action type | Handler | What it does |
|---|---|---|
content_delivery |
handle_content_delivery (pe_dispatcher.py:245-275) |
Preserved escape hatch; NOT reached in normal flow under CR-005. Logs that it fired and triggers the content flow for a single contact. |
escalation |
handle_escalation (pe_dispatcher.py:278-389) |
CR-003 channel-aware. Resolves the step's escalation_type from ArchetypeConfig, transitions PE via T2/T4/T8/T10, then branches: parent_call → enqueue Vocallabs; other types → fire SP_Escalation flow. On step exhaustion, routes to T5/T11 (grace). |
feedback_timeout |
handle_feedback_timeout (pe_dispatcher.py:399-451) |
Safety-net fallback. Checks Submission status; if feedback arrived and PE wasn't advanced, fires T12 fallback. Otherwise reschedules for +1h, max 3 retries. |
week_advancement |
handle_week_advancement (pe_dispatcher.py:454-499) |
Decides T16 (program done) vs T15 (binge pause) vs T14/T19 (normal advance). |
grace_check |
handle_grace_check (pe_dispatcher.py:502-547) |
CR-003 direct drop. If weekly_submission_done=1 no-op. If clock not yet expired, reschedule. Otherwise fire T17 grace-expired. |
pause_check |
handle_pause_check (pe_dispatcher.py:557-583) |
Binge-resume check. T21 if calendar caught up; else reschedule +7 days. |
Retired handlers per CR-003: handle_re_engagement and handle_grace_reminder — re-engagement is inbound-only via SP_Incoming_Router and per-week escalation steps replace proactive grace reminders.
6.3 Delivery failure chain
_record_delivery_failure(pe_name) at pe_dispatcher.py:687-740 is the bridge from a failed Glific call to T23 auto-drop:
- Atomic UPDATE:
delivery_failure_count = COALESCE(delivery_failure_count, 0) + 1withprogram_status IN ('active', 'paused')folded into the WHERE so a PE that went terminal can't get its counter bumped (M1 fix 2026-05-13). - RETURNING gives the post-update value; if 0 rows returned, no-op.
- If
new_count >= MAX_DELIVERY_FAILURES (3), firest23_auto_drop(pe, reason='delivery_failure').
The helper is shipped but not yet wired in — _trigger_flow is fire-and-forget (pe_dispatcher.py:628-643) and there is no delivery-confirmation webhook from Glific yet. The webhook wire-up is M-AUDIT-8 (deferred Phase 1).
7. Content delivery — CR-005 Approach B (state-driven)
The current weekly content delivery model. Replaces the previous per-PE content_delivery scheduling that armed ACTION_CONTENT_DELIVERY on every T0 and T19.
7.1 Continuous membership maintenance
Every state-machine transition (via transition() at state_machine.py:64-104) calls maintain_collections(pe, from_state, to_state) at collection_membership.py:51-86. This:
- Computes
from_main,to_main(membership inMAIN_ELIGIBLE_STATES) andfrom_audit,to_auditviaSTATE_TO_AUDIT_KIND. - If
from_main != to_main, enqueues an add/remove on the BPR'smainPGCollection. - If
from_audit != to_audit, enqueues remove of the old audit kind and add of the new audit kind.
Membership churn is fully delta-based — re-running with same from_state==to_state is a no-op.
7.2 Glific group writes (retry + DLQ + counter)
_enqueue_group_write (collection_membership.py:89-111) looks up the kind-keyed PGCollection row and enqueues _group_write_job with enqueue_after_commit=True. The job (collection_membership.py:114-188) wraps add_contact_to_group / remove_contact_from_group (both in glific_integration.py) with P-007 retry (5 immediate retries) + DLQ.
On successful Glific call, _bump_member_count (collection_membership.py:191-221) atomically maintains PGCollection.member_count:
UPDATE "tabPGCollection"
SET member_count = COALESCE(member_count, 0) + 1 -- add
member_count = GREATEST(0, COALESCE(member_count, 0) - 1) -- remove
WHERE name = %s
The GREATEST(0, ...) guard prevents negative drift if a remove fires for a contact that was never in the group.
7.3 The weekly fire
scheduler.weekly_content_delivery_trigger (scheduler.py:223-292) runs Tuesday 03:30 UTC = 09:00 IST (cron 30 3 * * 2 in hooks.py:87-89):
- Selects active BPRs with a configured
content_delivery_flow. - For each, reads the
mainPGCollection row, skipping whenmember_count <= 0. - Calls
start_group_flow(content_delivery_flow, main_glific_group_id)— exactly one Glific API call per BPR.
No recompute, no reconcile. Idempotency is operator discipline ("don't manually invoke during the cron window"); Glific deduplicates identical group-flow starts within a short window.
8. Escalation, grace, and drop
The "weekly nudge" loop, per CR-001 + CR-003 + CR-006.
8.1 Per-week grace clock
activity_points.award_activity_points (activity_points.py:75-202) is the only arming path. On the week's first VideoClass completion (detected via pe.weekly_video_done == 0 pre-UPDATE), it both:
- Logs
grace_window_enteredto ProgramEventLog with the expectedgrace_window_end_at. - Issues a single atomic UPDATE that bumps activity points, sets
weekly_video_done=1, AND armsgrace_window_start = NOW(),grace_window_end_at = NOW() + (X || ' days')::interval,in_grace_window=1— gated byCASE WHEN weekly_video_done = 0(Postgres evaluates against the OLD row value within the same UPDATE). Second-video-same-week is a no-op for the grace fields.
X is Batch.grace_window_days, falling back to DEFAULT_GRACE_WINDOW_DAYS = 14 (constants.py:281).
T19 re-arming is automatic: T19 resets weekly_video_done = 0, so the next VideoClass watch re-trips the CASE WHEN and writes a fresh grace_window_end_at. T0 no longer arms the clock (the activity-points handler does — CR-003 follow-up 2026-05-13).
Grace clearing: T3/T7/T9/T17 (primary submission) explicitly null in_grace_window, grace_window_end_at, grace_window_start.
8.2 Escalation channel routing
handle_escalation (pe_dispatcher.py:278-389) resolves the step's escalation_type from ArchetypeConfig and branches:
help_note_a/help_note_b/voice_note→ fireSP_EscalationGlific flow. Glific reads the contact fieldsescalation_order+escalation_typeand renders the per-channel content.parent_call→ enqueuevocallabs.initiate_parent_call. Glific is NOT involved for parent calls.
Defaults: escalation_type defaults to help_note_a if missing. hours_after_previous defaults to 24.
8.2.1 Vocallabs parent-call flow (cache-on-Student + auto-backfill — task #80/#81, CR-010)
vocallabs.py is the integration module. The 2026-05-24 hardening (CR-010) changed the flow from "addMultipleContactsToGroup every call" to a permanently-cached prospect_id on the Student doctype.
Why the change: Vocallabs enforces a uniqueness constraint on (client_id, prospect_group_id, phone) in their Hasura backend. Without a cache, every repeat call to the same parent's phone hits prospects_client_id_prospect_group_id_phone_key violation and never dials. Original behavior was to treat this as transient and retry 6 times (~75s of wasted worker time per failure, then DLQ). The cached design hits the insert ONCE then reuses the returned prospect_id forever.
Storage: Student.vocallabs_prospect_id (Data, read-only in UI, no_copy=1). Populated by _store_prospect_id(student, prospect_id) via frappe.db.set_value with update_modified=False (no audit-trail bump for what is essentially a tech-side cache).
Call sequencing (_call_vocallabs):
- Cache hit (
Student.vocallabs_prospect_idpopulated): POST /b2b/vocallabs/updateContactDatawith the freshly rendereddatablock (best-effort — failure logs toSP Vocallabs UpdateDataand proceeds).POST /b2b/vocallabs/initiateVocallabsCallwith the cachedprospect_id.-
2 HTTP calls (plus cached auth token).
-
Cache miss + first encounter (cache empty AND parent phone NOT in Vocallabs):
POST /b2b/vocallabs/addMultipleContactsToGroup→ returnsprospect_id._store_prospect_idwrites toStudent.vocallabs_prospect_id(beforeinitiateVocallabsCallfires, so a mid-flight call failure still leaves the cache populated).-
POST /b2b/vocallabs/initiateVocallabsCall. -
Cache miss + auto-backfill (cache empty AND parent phone already in Vocallabs from a prior insert):
addMultipleContactsToGroupreturns the Hasura uniqueness violation._is_duplicate_prospect_responsedetects theprospects_client_id_prospect_group_id_phone_keyconstraint name OR the GraphQLcode: "constraint-violation"._lookup_prospect_id_by_phonepaginatesGET /b2b/vocallabs/getContacts(limit=200, max 50 pages = 10,000 prospects). Matches by exact phone OR last-10-digits (handles E.164 ↔ 10-digit drift across Indian numbers).- On match: cache the recovered UUID, run
updateContactData, proceed toinitiateVocallabsCall. - On miss (phone genuinely not there, or unrecognized response shape): raise
PermanentVocallabsError→ fail-fast underSP Vocallabs Duplicate Prospect (lookup required)Error Log title.
Status template freshness: data.status is bound to the prospect record (not the call). On cache hit we MUST refresh via updateContactData before each call, otherwise the agent reads the variables from the original insert (week 1's text on a week 5 call). The team configures the template via ParentCallConfig.status_template per-week via UnitContentItem on LearningUnit, falling back to VoiceAgentSettings.default_parent_call_config. The code never hard-codes content.
Sibling race caveat: Two Student rows can share one parent phone (multiple kids enrolled, single device). For the cache to work, both Student rows must store the same prospect_id. The auto-backfill handles this — first sibling's call recovers the existing UUID, second sibling's call recovers the same UUID independently. But data.status updates race: if both siblings hit step 4 within the same second, the second updateContactData overwrites the first's variables before the first call has dialed out. Team-side mitigation is to make ParentCallConfig.status_template sibling-agnostic ("Your child hasn't submitted") rather than name-specific.
Single agent limitation (task #48): Today there's one VoiceAgentSettings.agent_id for all calls. Multi-language students get the English agent. Vocallabs is exploring language-detection-from-voice on their side; if that ships, task #48 may become moot. Otherwise the plan is one Vocallabs agent per language with a Student.language → agent_id lookup at call time.
Error Log taxonomy (all under method-column titles starting with SP Vocallabs):
- SP Vocallabs Retry — transient failure; will retry (P-007 budget).
- SP Vocallabs DLQ — manual replay required — retries exhausted.
- SP Vocallabs Duplicate Prospect (lookup required) — permanent fail; cache empty AND lookup couldn't recover prospect_id.
- SP Vocallabs Lookup — pagination diagnostics (only logged on miss/error, not success).
- SP Vocallabs UpdateData — updateContactData failed; call still placed with stale prospect data.
- SP Vocallabs Config / SP Vocallabs Template / SP Vocallabs Skipped — early-exit branches.
Verified end-to-end on test server 2026-05-24: 3 test parents (ST00051383 / ST00051387 / ST00051238) all had polluted phones from prior test runs. Phase C confirmed fail-fast under the dedicated Error Log title (3 entries × ~0.6s each vs. yesterday's 12 retries + 2 DLQ over 75s). Phase B (post-fix) confirmed auto-backfill: cache populated with real UUIDs (9f5f108c-..., 9e27ba86-..., 41556126-...), three subsequent cache-hit calls succeeded end-to-end on the API side with valid call_ids returned. Vocallabs-side dial-out failed (unrelated config issue on their end — call_ids handed off to Vocallabs team).
8.3 Step exhaustion (CR-006 unified)
When next_step > len(steps), escalation has exhausted within the week. CR-006 (2026-05-15) removes T6 (escalation_to_remedial) — remedial is now reserved for failed-feedback students via T6b. All zero-submission exhausters take the same path:
- Core state (
normal_content_delivery/normal_escalation) →t5_escalation_to_grace→grace_waiting. - Remedial state (
remedial_content_delivery/remedial_escalation) →t11_remedial_to_grace→grace_waiting.
8.4 Grace expiry → direct drop (CR-001)
handle_grace_check (pe_dispatcher.py:502-547) fires at grace_window_end_at:
- If PE has moved out of
grace_waiting, no-op. - If
weekly_submission_done=1, no-op (student submitted within the window). - If clock hasn't actually expired (skew), reschedule to the exact expiry time.
- Otherwise fire
t17_grace_expired(pe, "dispatcher")→program_droppedwithdrop_reason='grace_expired'. No paused_no_activity hop, no re-engagement (those branches were deleted in CR-001 / CR-003).
9. Submission pipeline
The path from a student WhatsApp message to AI feedback delivered back to the student.
9.1 save_submission (save_submission.py:46-237)
Whitelisted as allow_guest=True so Glific can call it. Flow:
- Resolve
student_id(utils.resolve_studentaccepts name, glific_id, or phone). - Resolve active PE via
get_active_pe(state_machine.py:1005-1034) — usesprogram_status IN ('active', 'paused')andORDER BY modified DESCto pick the live row when multiple PEs exist for the same student. - Reject if PE is in a terminal state.
- Normalize the payload —
_normalize_submission_payloadinferssubmission_typeastext | emoji | audio | image | videofrom the input shape (URL → infer by extension; non-URL →emojiif no alnum chars elsetext). - Insert Submission FIRST inside a savepoint (
save_submission.py:135-160, task #81 audit 2026-05-15).is_primary=Falseis provisional; flipped after the atomic claim succeeds. Order matters: if the insert failed AFTER claiming primary, retries would see "duplicate" and silently lose the real submission. _try_claim_primary(pe, week)(save_submission.py:279-315) — the atomic primary claim. UsesUPDATE ... RETURNING:UPDATE `tabProgramEnrollment` SET journey_label = 'submitted', last_label_change_at = NOW(), submission_count = COALESCE(submission_count, 0) + 1, last_submission_at = NOW() WHERE name = %s AND journey_label IN (%s, %s, %s, %s, %s) RETURNING nameThe 5 valid pre-submission labels areenrolled | content_delivered | grace_window | resumed | week_advanced. RETURNING-empty means duplicate.submission_countis bumped here, exactly once — T3/T7/T9/T17 no longer touch it (H-AUDIT-4, 2026-05-15).- If primary: apply state transition via
apply_submission_transition(pe, points=0, "flow_callback")which dispatches onpe.resolved_flow_stateto T7/T3/T9/T17 or T22 for duplicates. CR-007 (2026-05-19): points are NOT computed here. AI validation runs asynchronously after save_submission, so the actual point award (depending onSubmission.result_status) is deferred tofeedback_consumer_hook.on_feedback_ready(§9.3). The transition still bumpsstreak / gems / weekly_submission_doneon every submission regardless of validity (user spec). - Update
EngagementState(legacy bridge) and log toStudentContentLogwith a savepoint (so a bridge-log insert failure doesn't poison the outer transaction — L-030). - If primary: enqueue
process_submission_async(save_submission.py:629-666) which uploads media to GCS, markssubmission.status = "Processing", then enqueuesenqueue_submission. enqueue_submission(save_submission.py:669-815) publishes to RabbitMQ viapika:- Publisher confirms enabled (
channel.confirm_delivery()). - Durable queue (
queue_declare(durable=True)). - Persistent delivery (
delivery_mode=2). - Mandatory routing (
mandatory=True). - P-007 retry: 5 retries via
FEEDBACK_PIPELINE_MAX_RETRIES, DLQ on exhaustion with a JSON-structured Error Log entry including submission_id, student_id, full pe_context.
- Publisher confirms enabled (
9.2 FeedbackConsumer (external)
Runs outside this app. Drains RabbitMQ, generates AI feedback, writes back to the Submission row (overall_feedback, overall_feedback_translated, audio_feedback_url, result_status, status="Completed"), and calls back into the backend via feedback_consumer_hook.on_feedback_ready(submission_name, student_id).
9.3 on_feedback_ready (feedback_consumer_hook.py)
- Resolve student from submission if not provided.
- Find the active PE in
submitted_awaiting_feedback. If none, return{"status": "no_pe"}. - Verify the submission's week matches the PE's
current_week; mismatch → skip. - Points: CR-007 (2026-05-19), revised 2026-06-15.
_compute_submission_points(pe, submission_name, result_status)evaluates branches in this order — the AI-verdict gate sits at the top and overrides every later branch, including the escalation-tier reward for late submissions: - Branch 0 —
result_status == 'Pending': 0 points (defensive — the hook contract is that FeedbackConsumer writes the AI verdict before callingon_feedback_ready, so Pending should never reach here; if a race delivers it anyway, treat as no-verdict and award nothing). - Branch 1 —
WeekRule.submission_validation_enabled = 1ANDresult_status in ('Failed', 'Success - Flagged'): 0 points. Pre-2026-06-15 the late branch ran first and shieldedsent_count >= 1 + Failed/Flaggedfrom this gate; the spec now zeroes them out so the AI verdict universally trumps the tier reward.'Failed'is dead schema in production (no writer emits it) — kept defensively. - Branch 2 —
sent_count >= 1(late submission, escalation fired):EscalationStep[sent_count].points_awarded. - Branch 3 — on-time + valid (
result_status == 'Success - Original', OR any non-Pending value in lax mode):Assignment.points_per_item.
The WeekRule.submission_validation_enabled flag lives on the ArchetypeConfig.week_rules child rows, keyed (archetype, arm, week). It is operator-owned per archetype — there is no week-number-hardcoded "weeks 1-2 are lax" rule in code; that pattern, where it exists, is a deployment convention set by ArchetypeConfig values.
Points are applied via atomic SQL UPDATE (_award_submission_points_atomic, L-011 COALESCE pattern). pe.reload() after the bump ensures the subsequent transition's pe.save() doesn't clobber the freshly-bumped columns.
5. CR-004 + CR-007 routing (current contract): read Submission.submission_validity. If submission_validity == 'Invalid' AND WeekRule.submission_validation_enabled = 1 → fire t6b_failed_feedback_to_remedial. Otherwise → fire t12_feedback_ready. In lax mode (validation off), even invalid submissions stay on Core. (CR-004 originally specified branching on result_status; the implementation was later switched to submission_validity to separate the AI-scoring signal from the validation-gate signal. result_status still governs the points decision in step 4 above; submission_validity governs the routing decision here.)
6. _sync_contact_fields(pe) re-pushes PE state to Glific so the post-award totals land on the contact fields.
Remedial routing is per-week, not permanent. T14 (week advance) at state_machine.py:772 unconditionally sets current_path = PATH_CORE on every advance, so a failed submission only routes the student to Remedial for the current week — the next week starts on Core again.
9.4 Glific F5 callback
flow_callback.update_flow_status (flow_callback.py:92-149) dispatches by flow name. SP_Feedback_Delivery routes to _handle_feedback_delivery (flow_callback.py:225-245):
- Verify PE is still in
feedback_ready(else"already_advanced"response). - Fire
t13_feedback_delivered(pe, "flow_callback")→week_completed, schedulesweek_advancement. - Reload PE and return
_response(pe, "week_completed", current_week=pe.current_week).
9.5 Watchdog
pre_launch.feedback_ready_watchdog (pre_launch.py:129-195) runs hourly (hooks.py:94-96) to find PEs stuck in feedback_ready longer than stuck_hours=2. Log-only — does NOT auto-transition. Silent auto-advance would skip feedback delivery entirely (the reason feedback_ready exists). Operator action: retrigger F5 via Glific UI or call update_flow_status(action='feedback_complete') from bench shell.
10. Gamification — CR-002 v2
Nine PE fields capture the three streams (activity / quiz / submission) plus gems and streak.
10.1 Field roles
| Field | Cumulative? | Resets at T19? | Pushed to Glific? |
|---|---|---|---|
total_activity_points |
yes | no | yes (total_activity_points) |
weekly_activity_points |
no | yes (→ 0) | yes (weekly_activity_points) |
total_quiz_points |
yes | no | yes (total_quiz_points) |
weekly_quiz_points |
no | yes (→ 0) | yes (weekly_quiz_points) |
bonus_quiz_points |
yes (independent) | no | yes (bonus_quiz_points) |
total_submission_points |
yes | no | yes (total_submission_points) |
weekly_submission_points |
no | yes (→ 0) | yes (weekly_submission_points) |
special_gems |
yes | no | yes (special_gems) |
current_streak |
yes | conditional | yes (current_streak) |
total_points |
yes (roll-up) | no | yes (total_points) |
weekly_video_done (sticky flag) |
n/a | yes (→ 0) | NO (internal-only — constants.py:218) |
weekly_submission_done (sticky flag) |
n/a | yes (→ 0) | yes (weekly_submission_done) |
10.2 Activity points (activity_points.py)
Wired on StudentContentLog.after_insert. Filtered to content_type == "VideoClass" AND action == "completed". The handler:
- Guards on
scl.points_awarded > 0(L-012 idempotency). - Resolves
VideoClass.points; returns early on 0/null (E11: zero-point pedagogical videos do not count toward "assigned this week"). - Resolves active PE; returns if none.
- Pre-UPDATE Python read of
pe.weekly_video_doneto detect first-video-of-week → logsgrace_window_enteredevent BEFORE the UPDATE. - Single atomic UPDATE that bumps activity points, sets
weekly_video_done=1, and arms the grace clock viaCASE WHEN weekly_video_done = 0(see §8.1). - Writes
scl.points_awarded = pts(audit anchor written AFTER PE update so retries are safe). - Enqueues contact-field sync; logs
activity_points_awarded.
10.3 Quiz points (quiz_points.py)
Wired on StudentQuizAttempt.on_update. Filtered to completed_at IS NOT NULL AND points_earned == 0 (idempotency).
Per-question rule (compute_quiz_points at quiz_points.py:180-196):
- Correct answer →
QuizQuestion.points - Wrong answer →
QuizQuestion.failed_points - Independent of attempt-level pass/fail
Cumulative-vs-weekly split (quiz_points.award_quiz_points at quiz_points.py:65-125):
total_quiz_points,total_points: apply DELTA vs the previous-latest attempt for the same (student, quiz) pair (latest-score semantics).weekly_quiz_points: ALWAYS adds the new attempt's full earned value (effort semantics).
bonus_quiz_points is independent — award_bonus_quiz_points only touches that field and the matching Glific contact field, never affects total_points / weekly_quiz_points.
10.4 Submission streak and gems
T3/T7/T9/T17 each add to total_submission_points, weekly_submission_points, current_streak += 1, special_gems += 1, and set weekly_submission_done = 1 — all in the same atomic save (state_machine.py:398-414, 534-550, 585-601, 808-823).
10.5 T19 (week advance) streak/gem compute
t14_week_advance (state_machine.py:682-768) runs the compute in two phases:
- Phase 1 — read the sticky weekly flags BEFORE reset:
- if
weekly_video_done = 1 AND weekly_submission_done = 0(assigned but didn't submit):current_streak → 0,special_gems → max(0, special_gems - 1). - else: streak/gems unchanged.
- Phase 2 — reset
weekly_*counters + both sticky flags to 0; advance week; write the streak/gem values computed in Phase 1. Total counters are NEVER reset (E10).
Gem floor is enforced in Python (max(0, ...)). SQL GREATEST is not needed because the value is plain-int by the time we write it.
11. Glific integration
glific_integration.py houses the wrappers. The SP pushes 28 contact fields and reads/writes 5 kind-keyed groups.
11.1 Contact field catalogue (28 fields)
| Category | Fields |
|---|---|
| Identity (6) | student_id, student_name, batch_id, archetype, language, course_level |
| State (7) | resolved_flow_state, current_week, current_path, current_tier, program_status, experiment_arm, grace_window_end_at |
| Submission (3) | current_expected_submission_type, submission_count, weekly_submission_done |
| Escalation (3) | last_escalation_step, escalation_order, escalation_type |
| Gamification (9) | total_points, current_streak, total_activity_points, weekly_activity_points, total_quiz_points, weekly_quiz_points, bonus_quiz_points, total_submission_points, weekly_submission_points, special_gems |
weekly_video_done is internal-only and is intentionally NOT pushed (constants.py:218).
11.2 Push mechanics
state_machine._enqueue_contact_field_sync(pe) (state_machine.py:107-159) serializes the 28 fields into a dict and enqueues _sync_contact_fields_job with enqueue_after_commit=True. The job (state_machine.py:162-258) wraps update_contact_fields(glific_id, fields) with P-007 retry+DLQ:
GLIFIC_SYNC_MAX_RETRIES = 5(constants.py:250).- Retries are IMMEDIATE (no backoff — known limitation, follow-up task).
- DLQ log title:
SP Glific Sync DLQ — manual replay required. - DLQ payload is JSON with
student_id,pe_name,glific_id,fields,final_error,retries_attempted. - Double-fault handling: if the re-enqueue itself raises, the original error goes straight to DLQ.
update_contact_fields itself (glific_integration.py:133-251) uses a 3-step fetch-merge-update pattern: fetch existing fields → merge ours in → write back via updateContact. Without the fetch, Glific's blob-replace semantics would clobber fields set by other tools or flows.
11.3 Group write helpers
add_contact_to_group(contact_id, group_id)—updateGroupContacts(addContactIds=[contact_id], deleteContactIds=[]).glific_integration.py:705-754.remove_contact_from_group(contact_id, group_id)—updateGroupContacts(addContactIds=[], deleteContactIds=[contact_id]). CR-005 addition.glific_integration.py:614-676.create_group_if_missing(label, description)— looks up by label, creates if absent, returns Glific group ID. Used by_ensure_kind_keyed_pg_collectionsand the backfill patch.glific_integration.py:679-702.start_contact_flow(flow_id, contact_id, default_results)— single-contact flow trigger; fire-and-forget.glific_integration.py:392-432.start_group_flow(flow_id, group_id)— group-level flow trigger; used by the weekly Tuesday cron. Lives insummer_program/glific_extensions.py.
12. Flow callbacks (whitelisted endpoints)
summer_program/flow_callback.py is the bridge from Glific to the backend state machine. All endpoints comply with the L-008 flat-map contract.
12.1 The L-008 response helper
_response(pe, status_value, **extras) at flow_callback.py:56-89:
- Writes directly to
frappe.local.responseviaupdate(...)— nomessage:wrapper. - Always emits:
success,status,action(deprecated alias for one release cycle — task #73),resolved_flow_state,next_action_type,next_action_at,program_status. - All values are scalar strings (timestamps stringified, no nested objects, no arrays).
12.2 update_flow_status (flow_callback.py:92-149)
Whitelisted as allow_guest=False. Args: student_id, flow_name, status, metadata (optional). Dispatches by flow_name via _get_handler (flow_callback.py:152-173):
| Flow name | Status values | Handler | Resulting transition |
|---|---|---|---|
SP_Content_Delivery |
no_response / timeout |
_handle_content_delivery |
T1 — schedule first escalation |
SP_Content_Delivery |
completed |
_handle_content_delivery |
no transition; "delivery_confirmed" |
SP_Escalation |
any | _handle_escalation |
no transition; "escalation_confirmed" |
SP_Feedback_Delivery |
any | _handle_feedback_delivery |
T13 — feedback_ready → week_completed |
SP_Submission |
any | _handle_submission_flow |
no transition; "submission_flow_completed" |
SP_Paused_Binge |
any | _handle_binge_info |
no transition; "binge_info_delivered" |
SP_Program_Complete |
any | _handle_info_flow |
no transition; "info_delivered" |
Retired handlers per CR-003: _handle_grace_flow (SP_Grace_Entry), _handle_reengagement (SP_Paused_Reengagement), SP_Grace_Reminder. SP_Week_Summary removed per CR-002 v2.
12.3 save_submission
See §9.1. Whitelisted as allow_guest=True (Glific's flow webhook calls it directly). Returns the L-008 flat-map response with status ∈ {accepted, duplicate, terminal_state, no_active_enrollment, not_found, missing_param, insert_failed}.
12.4 award_bonus_quiz_points
quiz_points.award_bonus_quiz_points(student_id, points) (quiz_points.py:128-173) — decorated with @glific_response (so the helper handles the flat-map write). Awards independent bonus points (does not affect total_points, total_quiz_points, or weekly_quiz_points).
13. Scheduler topology
Wired in hooks.py:42-98. All times are UTC.
| Schedule | Path | Purpose |
|---|---|---|
daily |
tap_lms.tap_lms.page.onboarding_flow_trigger.onboarding_flow_trigger.update_incomplete_stages |
Legacy onboarding sweep (out of scope here) |
daily |
tap_lms.summer_program.scheduler.run_daily_actions |
SP daily housekeeping: per-active-BPR collection actions, program-complete trigger when week > total_weeks |
daily |
tap_lms.summer_program.batch_activation.check_auto_activate |
Auto-activate BPRs whose Batch.start_date <= today |
*/1 * * * * |
tap_lms.summer_program.pe_dispatcher.process_program_actions |
The per-PE dispatcher loop (1000-row batch × 4 workers ≈ 240K/hour) |
0 */2 * * * |
tap_lms.summer_program.escalation_runner.run_escalation_check |
6-hour bulk escalation sweep (legacy batcher; to be replaced when collection-mode rollout lands) |
0 0 * * 1 |
tap_lms.summer_program.batch_admin.auto_advance_batch_week |
Weekly Monday — bumps Batch.current_calendar_week; unblocks max_allowed_week on PEs |
30 3 * * 2 |
tap_lms.summer_program.scheduler.weekly_content_delivery_trigger |
CR-005 weekly content delivery — Tuesday 09:00 IST against each active BPR's main group |
0 * * * * |
tap_lms.summer_program.pre_launch.feedback_ready_watchdog |
Hourly log-only watchdog for stuck feedback_ready PEs (task #56, 2026-05-16) |
Doc events (hooks.py:14-40):
StudentContentLog.after_insert→summer_program.activity_points.handle_content_logStudentQuizAttempt.on_update→summer_program.quiz_points.handle_attempt_update- (Plus pre-existing onboarding hooks on School, Teacher, StudentStageProgress.)
14. Operational tooling
summer_program/pre_launch.py and summer_program/dev_tools.py.
14.1 audit_and_null_stale_next_action_at
pre_launch.audit_and_null_stale_next_action_at(dry_run=False, stale_days=7) (pre_launch.py:32-122). Pre-launch one-shot audit (also re-runnable). Identifies and nulls stale next_action_at values in two buckets:
- Overdue —
next_action_at < NOW() - stale_daysAND not terminal. - Terminal —
next_action_at IS NOT NULLANDresolved_flow_state IN ('program_completed', 'program_dropped').
Single atomic UPDATE covers both buckets. dry_run=True returns counts without writing. Returns {stale_overdue, stale_terminal, total_nulled, by_state}.
Invocation:
bench --site tap_lms.dev execute \
tap_lms.summer_program.pre_launch.audit_and_null_stale_next_action_at
14.2 feedback_ready_watchdog
See §9.5. Hourly cron, log-only, structured Error Log per stuck PE.
14.3 dev_tools — reset utilities
summer_program/dev_tools.py provides destructive reset helpers for dev/test sites:
list_pes_for_batch(batch_name)— read-only; prints PEs with key state fields.reset_pe_to_state_0(pe_name, dry_run=False)— resets a single PE to T0 state. Destroys Submission / StudentQuizAttempt / StudentContentLog / ProgramEventLog rows.reset_pes_for_batch(batch_name)— bulk reset across a batch.
Safety guard _assert_dev_site (dev_tools.py:59-80) refuses to run on sites whose name matches prod | live | production unless the caller passes i_know_this_is_destructive=True. Patches maintain_collections and _enqueue_contact_field_sync at module-level for stable test patching.
15. Resilience patterns
Named patterns referenced throughout the codebase.
P-001 — idempotency primitive (atomic UPDATE-WHERE-RETURNING)
Atomic UPDATE guarding on journey_label (or another precondition column) + RETURNING name. 0-row return = parallel actor moved state; skip gracefully. Lesson L-010.
Canonical example — the dispatcher claim (pe_dispatcher.py:139-152):
UPDATE "tabProgramEnrollment"
SET next_action_at = NULL
WHERE name = %s AND journey_label = %s AND next_action_at IS NOT NULL
RETURNING name
Other instances: _try_claim_primary (save_submission.py:298-307), _record_delivery_failure (pe_dispatcher.py:724-733).
P-002 — atomic counter (COALESCE-update)
SET col = COALESCE(col, 0) + %s so a counter survives races with a peer write or a weekly reset. Lesson L-011.
Instances: activity_points._sql_update_pe_columns (activity_points.py:157-176), quiz_points.award_quiz_points (quiz_points.py:97-106), _bump_member_count (collection_membership.py:204-220), _record_delivery_failure (pe_dispatcher.py:724-733), total_enrolled bump (enrollment.py:213-220).
P-003 — per-attempt audit lock
Write the audit field first (e.g. StudentQuizAttempt.points_earned), then do rollups. The audit field is the L-012 idempotency anchor for retries.
Instances: quiz_points.award_quiz_points writes points_earned at line 75-78 before the PE bump; activity_points.award_activity_points writes scl.points_awarded at line 179-182 AFTER the PE update (audit-after-PE so retries skip correctly). The pattern shape differs by whether the audit write itself is the side-effect anchor.
P-005 — write-once anchor
A column whose presence (points_earned > 0, points_awarded > 0) is checked at entry to short-circuit re-runs. Same field that L-012 protects.
P-006 — backward-compat parameter alias
Whitelisted parameter renames keep the old name as an alias for one release cycle and log a deprecation. Lesson L-009.
Instance: save_submission(content_id=) (deprecated) → assignment_id (save_submission.py:68-77). Aliased and logged via frappe.log_error titled SP API Deprecation.
P-007 — retry + DLQ
Every external write goes through this. On exception: increment retry_count, re-enqueue self via frappe.enqueue up to MAX_RETRIES; on exhaustion, log a structured DLQ entry to Frappe Error Log so operators can replay manually. Lesson L-015.
| Subsystem | Max retries | Constant | DLQ log title |
|---|---|---|---|
| Glific contact-field sync | 5 | GLIFIC_SYNC_MAX_RETRIES |
SP Glific Sync DLQ — manual replay required |
| Glific group writes (CR-005) | 5 | (shares GLIFIC_SYNC_MAX_RETRIES) |
SP Glific Sync DLQ — manual replay required |
| RabbitMQ feedback publish | 5 | FEEDBACK_PIPELINE_MAX_RETRIES |
SP Feedback Pipeline DLQ — manual replay required |
| Vocallabs parent call | 5 | VOCALLABS_MAX_RETRIES |
SP Vocallabs DLQ — manual replay required |
All currently use IMMEDIATE retries (no backoff). Exponential backoff is a deferred follow-up; the current scheme covers short Glific 502/503s and Redis hiccups, while a sustained outage (>~30s) drops to DLQ.
L-003 — FOR UPDATE SKIP LOCKED
PG-specific scheduler primitive. Lets N parallel workers each take a different slice of the candidate set without contention. Instance: process_program_actions SELECT (pe_dispatcher.py:104-122).
CR-005 — partial unique index (defense-in-depth)
A partial unique index on tabPGCollection (parent, kind) WHERE is_active = 1 enforces "at most one active row per (BPR, kind)" at the database level. Defense-in-depth against double-bootstrap; the bootstrap code itself is also idempotent (_ensure_kind_keyed_pg_collections at batch_activation.py:382-439).
16. Future work / explicitly deferred
These are NOT shipped today. Tracked in the task system; ADR-level commitments live in docs/adrs/.
- M-AUDIT-8 — Glific delivery-confirmation webhook. Wires
_trigger_flowfailures into_record_delivery_failure → T23. Currently_trigger_flowis fire-and-forget; the chain is plumbed but inactive. Phase 1. - Exponential backoff scheduler. Replace the immediate-retry P-007 implementation with a delay-scheduled backoff. Covers sustained Glific or RabbitMQ outages. Follow-up to the Phase 1 webhook work.
- TimescaleDB hypertable for ProgramEventLog. ADR-001. Not justified by current load.
- Redis hot-state cache for PE. ADR-002. Not justified by current load.
- Collection-mode batchers for escalation / feedback delivery. ADR-003 successors. The legacy
escalation_runner6-hourly sweep is the current implementation. - PE.last_dispatched_at column. Audit-trail field referenced in the dispatcher loop comments (
pe_dispatcher.py:135-138) but not yet added; filed for DocType-UI follow-up. - Drop deprecated
actionalias from_responseflat-map. After all Glific flows are audited to read@results.webhook.statusinstead of@results.webhook.action(task #73). - Cleanup CR for T6 stub. The
t6_escalation_to_remedialfunction instate_machine.py:464-481raises if called; will be deleted entirely once no caller imports it. - Cleanup CR for legacy archetype-keyed PGCollection rows. Currently
is_active=0post-CR-005 backfill; eventually delete.
17. Known gaps in this document
- Course content authoring (CourseLevel, VideoClass, Quiz, QuizQuestion configuration) is not covered. Lives outside
summer_program/and is a separate concern. - School / teacher onboarding (
onboarding_flow_trigger,Backend Student Onboarding) is referenced only where it touches the SP enrollment ingress. The full onboarding pipeline is owned by a different team and lives in thetap_lms/page/onboarding_flow_trigger/namespace. - The legacy
escalation_runner6-hourly sweep is registered but not described — it predates the per-PE dispatcher and is in the process of being replaced. - SP_Incoming_Router — the inbound-only re-engagement entry point referenced in CR-003. This is a Glific flow (not backend code); the only backend touchpoint is the standard
update_flow_statuscallback when the rejoin path completes. reactivate_student— mentioned instate_machine.py:12as a state-machine entry point but the implementation lives outside the files surveyed here.escalation_runner.run_escalation_checkinternals — referenced inhooks.py:77-78but its code was not opened for this doc.- Week-1 content for late activations — content delivery is now batch-driven via
weekly_content_delivery_trigger; week-1 students added to themaincollection mid-week will not get content until the following Tuesday. The admin's mitigation is to time activation Mon/Tue (batch_activation.py:223-227).