Global Supervisor — Operations
This content was migrated from `Documentation/GLOBAL_SUPERVISOR.md` and restructured into audience sections. Review for accuracy against the current codebase.
Deployment
The Global Supervisor is deployed as part of the Swisper backend application. It does not have a separate deployment unit — it runs within the backend API process.
Dependencies:
| Dependency | Purpose | Required |
|---|---|---|
| Redis | Primary state checkpointer — stores graph state after each node for HITL resume and crash recovery | Yes |
| PostgreSQL | Long-term state fallback, chat history storage, entity/fact persistence | Yes |
| LLM Provider | Intent classification, fact extraction, response generation (multiple LLM calls per turn) | Yes |
| Domain Agent Registry | Access to domain agents (Productivity, Wealth, Research, etc.) | Yes |
Startup: The GlobalSupervisor is instantiated per-request by OrchestrationService. There is no long-lived supervisor process — each message gets a fresh graph instance with state loaded from the Redis checkpointer.
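The per-request lifecycle described above can be pictured with a minimal sketch. All names here (`InMemoryCheckpointer`, `FreshSupervisor`, `handle_message`) are hypothetical stand-ins rather than the actual Swisper classes, and a plain dict stands in for the Redis checkpointer.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCheckpointer:
    """Hypothetical stand-in for the Redis checkpointer (a plain dict)."""
    _store: dict = field(default_factory=dict)

    def load(self, chat_id: str) -> dict:
        # Return a copy so each request works on its own state object.
        return dict(self._store.get(chat_id, {"turns": []}))

    def save(self, chat_id: str, state: dict) -> None:
        self._store[chat_id] = dict(state)


class FreshSupervisor:
    """Hypothetical supervisor: built per request, keeps nothing between requests."""

    def __init__(self, state: dict):
        self.state = state

    def run(self, message: str) -> dict:
        # A real graph would execute its nodes here; this sketch only records the turn.
        self.state["turns"] = self.state["turns"] + [message]
        return self.state


def handle_message(checkpointer: InMemoryCheckpointer, chat_id: str, message: str) -> dict:
    state = checkpointer.load(chat_id)      # state comes from the checkpointer
    supervisor = FreshSupervisor(state)     # new instance for every message
    new_state = supervisor.run(message)
    checkpointer.save(chat_id, new_state)   # persisted for the next request
    return new_state
```

Because nothing lives on the supervisor between calls, continuity across messages depends entirely on what the checkpointer returns for the chat.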
⚠ Gap: Specific deployment commands, rollback procedures, and deployment pipeline details could not be determined from the available inputs. These are managed at the platform level rather than per-module. Manual review needed.
Monitoring
| # | Metric | What It Measures | Normal Range | Alert Threshold |
|---|---|---|---|---|
| 1 | Token usage per turn | Total LLM tokens consumed across all nodes in a single turn | 5,000–20,000 tokens | >50,000 tokens (indicates runaway prompt or missing optimization flags) |
| 2 | Time-to-first-token | Latency from message receipt to first streamed response chunk | 200–500ms (simple), 1–3s (complex) | >5s for simple queries, >10s for complex queries |
| 3 | Summarization trigger rate | Percentage of turns that trigger conversation summarization | 5–15% of turns | >30% (may indicate summarization threshold too low or conversations not being compressed) |
| 4 | HITL interrupt rate | Percentage of turns that result in a disambiguation or clarification pause | 2–10% of turns | >25% (may indicate entity resolution is too aggressive) |
| 5 | Agent execution failures | Domain agent exceptions caught by the supervisor | 0–2% of agent calls | >5% (indicates agent instability or dependency issues) |
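As an illustration, the alert thresholds in the table could be encoded as a simple lookup and checked against observed values. This is a sketch only: the metric keys and the `breached_alerts` function are hypothetical names chosen for this example, not identifiers from the monitoring stack, and rates are assumed to be expressed as fractions.

```python
# Alert thresholds copied from the monitoring table above.
# The metric keys are hypothetical names chosen for this sketch.
ALERT_THRESHOLDS = {
    "tokens_per_turn": 50_000,
    "ttft_simple_s": 5.0,
    "ttft_complex_s": 10.0,
    "summarization_rate": 0.30,
    "hitl_interrupt_rate": 0.25,
    "agent_failure_rate": 0.05,
}


def breached_alerts(observed: dict[str, float]) -> list[str]:
    """Return the names of metrics whose observed value exceeds its threshold."""
    return sorted(
        name
        for name, value in observed.items()
        if name in ALERT_THRESHOLDS and value > ALERT_THRESHOLDS[name]
    )
```

For example, `breached_alerts({"tokens_per_turn": 62000, "hitl_interrupt_rate": 0.08})` would report only the token metric, since an 8% interrupt rate is within the normal range.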
Token Usage: The Swisper UI displays a per-user token usage counter in the sidebar. It tracks cumulative token consumption across the user's conversations and helps monitor per-user LLM consumption.
⚠ Gap: Specific monitoring dashboard URLs, alert configurations, and Prometheus/Grafana metric names could not be determined from the available inputs. Manual review needed.
Common Failure Modes
| # | Trigger | Symptoms | Impact | Resolution |
|---|---|---|---|---|
| 1 | Redis unavailable | State checkpoint fails; HITL interrupts cannot resume; `ConnectionError` in logs | Conversations cannot be paused for disambiguation; crash recovery unavailable. Text chat still works for single-turn messages | Verify Redis connectivity. Check Redis memory usage. If Redis is down, the PostgreSQL fallback should activate; verify the `StatePersistenceManager` fallback logic |
| 2 | LLM provider timeout | Intent classification or response generation hangs; request times out | User sees no response or a timeout error | Check LLM provider status. Review timeout configuration. If regional outage: switch to fallback LLM provider if configured |
| 3 | Entity resolution infinite loop | Disambiguation keeps triggering on the same entity across turns | User is repeatedly asked "which X?" even after answering | Check `disambiguation_resolution.py`: verify the resolved entity ID is being written to state correctly. Check that the Redis checkpoint is persisting the resolution |
| 4 | Summarization corrupts context | After summarization, the assistant loses important context from earlier in the conversation | Assistant gives responses that contradict earlier conversation | Review `SUMMARY_RECENCY_TURNS` (should be ≥2). Check summarization prompt quality. Increase `SUMMARY_TOKEN_THRESHOLD` to delay summarization |
| 5 | Domain agent hangs | `agent_execution` node never completes; planner→agent loop stalls | User sees no response; the turn hangs until timeout | Check `max_iterations` (default 10). Review the specific domain agent's health. Kill the hung request and investigate the agent's logs |
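When verifying the fallback behaviour mentioned for failure mode 1, it can help to have a mental model of what a primary/fallback load looks like. The sketch below is an assumption about the general pattern, not the actual `StatePersistenceManager` code: the function name, the loader callables, and the choice to fall back on an empty result are all hypothetical.

```python
from typing import Callable, Optional


def load_state_with_fallback(
    chat_id: str,
    load_from_redis: Callable[[str], Optional[dict]],
    load_from_postgres: Callable[[str], Optional[dict]],
) -> tuple[Optional[dict], str]:
    """Try the primary store first; use the fallback if it is unreachable or empty."""
    try:
        state = load_from_redis(chat_id)
        if state is not None:
            return state, "redis"
    except ConnectionError:
        # Builtin ConnectionError; a real Redis client may raise its own exception type.
        pass
    return load_from_postgres(chat_id), "postgres"
```

Returning the source alongside the state makes it easy to log which store served the request, which is useful when confirming that the fallback actually activated during a Redis outage.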
Runbooks
Runbook 1: HITL State Stuck in "Waiting"
Trigger: A user reports they cannot continue a conversation — the system keeps showing a disambiguation prompt even after they responded.
Steps:
- Get the `chat_id` from the user's session (use Developer Tools → Copy chat ID)
- Check Redis for the checkpoint state of that chat
- Inspect the `user_in_the_loop` field in the state; verify `is_waiting` is `true` and the `type` matches the expected interrupt
- If the state is corrupted or stuck, delete the checkpoint
- Ask the user to send a new message; the supervisor will start fresh without the stale interrupt
- Rollback: If deleting the checkpoint causes issues, the PostgreSQL fallback should have the last good state. Check the `state_persistence` table for the `chat_id`
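The Redis inspection and cleanup steps in this runbook could be scripted roughly as follows. This is a sketch under stated assumptions: the `checkpoint:<chat_id>` key layout and the JSON serialization are hypothetical (the real key format and encoding are not documented here), and `client` is any object exposing `get` and `delete` the way a redis-py client does.

```python
import json


def checkpoint_key(chat_id: str) -> str:
    # Hypothetical key layout; confirm the real format before using this.
    return f"checkpoint:{chat_id}"


def inspect_interrupt(client, chat_id: str):
    """Return the user_in_the_loop field of the stored state, or None if absent."""
    raw = client.get(checkpoint_key(chat_id))
    if raw is None:
        return None
    # Assumes the state is stored as JSON; a real checkpointer may use another encoding.
    return json.loads(raw).get("user_in_the_loop")


def clear_checkpoint(client, chat_id: str) -> bool:
    """Delete the checkpoint for one chat; True if a key was actually removed."""
    return client.delete(checkpoint_key(chat_id)) > 0
```

Inspection and deletion are kept separate on purpose, so the operator can review `is_waiting` and `type` before deciding that the checkpoint should be removed.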
Runbook 2: Token Budget Exceeded
Trigger: A user hits their token limit (e.g., 1500K) and cannot send more messages.
Steps:
- Verify the user's token usage in the admin panel
- Check if summarization is working — long conversations without summarization consume tokens faster
- Review recent conversations for abnormally high token consumption (possible causes: missing optimization flags, large file uploads being processed repeatedly)
- If the usage is legitimate, increase the user's token quota
- If the usage is abnormal, check for the specific conversation and review the node-level token tracking
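To review node-level token tracking for a suspect conversation, one option is to sum the per-node counts for each turn and flag turns above the 50,000-token alert threshold from the monitoring table. The record shape used here (`chat_id`, `turn`, `node`, `tokens`) and the function name are hypothetical; adapt them to however the tracking data is actually exported.

```python
from collections import defaultdict

# Alert threshold for a single turn, taken from the monitoring table.
TURN_TOKEN_ALERT = 50_000


def abnormal_turns(records: list[dict]) -> dict[tuple[str, int], int]:
    """Sum node-level token counts per (chat_id, turn) and keep turns above the threshold."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    for rec in records:
        totals[(rec["chat_id"], rec["turn"])] += rec["tokens"]
    return {key: total for key, total in totals.items() if total > TURN_TOKEN_ALERT}
```

A turn that shows up in the result is a candidate for the causes listed above, such as missing optimization flags or a large file being processed repeatedly.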
Escalation
⚠ Gap: Specific escalation contacts, on-call rotations, and communication channels could not be determined from the available inputs. Manual review needed.
| Criteria | Action |
|---|---|
| Redis down for >5 minutes | Escalate to infrastructure team |
| LLM provider outage | Escalate to platform team; consider enabling fallback provider |
| Persistent HITL state corruption | Escalate to backend team (supervisor module owner) |
| Token tracking discrepancy | Escalate to backend team |