Global Supervisor — Operations

This content was migrated from Documentation/GLOBAL_SUPERVISOR.md and restructured into audience sections. Review for accuracy against the current codebase.

Deployment

The Global Supervisor is deployed as part of the Swisper backend application. It does not have a separate deployment unit — it runs within the backend API process.

Dependencies:

| Dependency | Purpose | Required |
| --- | --- | --- |
| Redis | Primary state checkpointer; stores graph state after each node for HITL resume and crash recovery | Yes |
| PostgreSQL | Long-term state fallback, chat history storage, entity/fact persistence | Yes |
| LLM Provider | Intent classification, fact extraction, response generation (multiple LLM calls per turn) | Yes |
| Domain Agent Registry | Access to domain agents (Productivity, Wealth, Research, etc.) | Yes |

Startup: The GlobalSupervisor is instantiated per-request by OrchestrationService. There is no long-lived supervisor process — each message gets a fresh graph instance with state loaded from the Redis checkpointer.
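
The per-request model described above can be illustrated with a minimal sketch. Everything here is hypothetical: InMemoryCheckpointer stands in for the Redis checkpointer, SupervisorSketch stands in for GlobalSupervisor, and handle_message stands in for whatever OrchestrationService actually does. None of these names or signatures are taken from the codebase.

```python
from typing import Any, Dict, Optional


class InMemoryCheckpointer:
    """Stand-in for the Redis checkpointer (hypothetical interface)."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, Any]] = {}

    def load(self, chat_id: str) -> Optional[Dict[str, Any]]:
        return self._store.get(chat_id)

    def save(self, chat_id: str, state: Dict[str, Any]) -> None:
        self._store[chat_id] = dict(state)


class SupervisorSketch:
    """Hypothetical per-request supervisor; not the real GlobalSupervisor API."""

    def __init__(self, state: Dict[str, Any]) -> None:
        self.state = state

    def run(self, message: str) -> Dict[str, Any]:
        # A real graph would run many nodes; here we only append the message.
        messages = list(self.state.get("messages", []))
        messages.append(message)
        return {**self.state, "messages": messages}


def handle_message(
    checkpointer: InMemoryCheckpointer, chat_id: str, message: str
) -> Dict[str, Any]:
    # Load prior state (or start empty), build a fresh supervisor, then persist.
    state = checkpointer.load(chat_id) or {"messages": []}
    supervisor = SupervisorSketch(state)  # new instance for every message
    new_state = supervisor.run(message)
    checkpointer.save(chat_id, new_state)
    return new_state
```

Because no supervisor object survives between requests, continuity across turns depends entirely on the checkpointer, which is why Redis availability features so prominently in the failure modes listed later.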

Gap: Specific deployment commands, rollback procedures, and deployment pipeline details could not be determined from the available inputs. These are managed at the platform level rather than per-module. Manual review needed.

Monitoring

| # | Metric | What It Measures | Normal Range | Alert Threshold |
| --- | --- | --- | --- | --- |
| 1 | Token usage per turn | Total LLM tokens consumed across all nodes in a single turn | 5,000–20,000 tokens | >50,000 tokens (indicates runaway prompt or missing optimization flags) |
| 2 | Time-to-first-token | Latency from message receipt to first streamed response chunk | 200–500 ms (simple), 1–3 s (complex) | >5 s for simple queries, >10 s for complex queries |
| 3 | Summarization trigger rate | Percentage of turns that trigger conversation summarization | 5–15% of turns | >30% (may indicate summarization threshold too low or conversations not being compressed) |
| 4 | HITL interrupt rate | Percentage of turns that result in a disambiguation or clarification pause | 2–10% of turns | >25% (may indicate entity resolution is too aggressive) |
| 5 | Agent execution failures | Domain agent exceptions caught by the supervisor | 0–2% of agent calls | >5% (indicates agent instability or dependency issues) |
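
As a rough illustration of how the alert thresholds in the table could be evaluated, the sketch below checks one aggregated window of measurements against them. The WindowStats fields and the alert names are hypothetical; only the threshold values come from the table. Actual metric names and alerting configuration are not known (see the gap note below).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Aggregated measurements for one monitoring window (hypothetical shape)."""

    max_tokens_per_turn: int
    ttft_simple_s: float        # time-to-first-token, simple queries, seconds
    ttft_complex_s: float       # time-to-first-token, complex queries, seconds
    summarization_rate: float   # fraction of turns, 0.0 to 1.0
    hitl_interrupt_rate: float  # fraction of turns, 0.0 to 1.0
    agent_failure_rate: float   # fraction of agent calls, 0.0 to 1.0


def breached_alerts(stats: WindowStats) -> List[str]:
    """Return the names of all alert thresholds exceeded in this window."""
    checks = [
        ("token_usage_per_turn", stats.max_tokens_per_turn > 50_000),
        ("time_to_first_token_simple", stats.ttft_simple_s > 5.0),
        ("time_to_first_token_complex", stats.ttft_complex_s > 10.0),
        ("summarization_trigger_rate", stats.summarization_rate > 0.30),
        ("hitl_interrupt_rate", stats.hitl_interrupt_rate > 0.25),
        ("agent_execution_failures", stats.agent_failure_rate > 0.05),
    ]
    return [name for name, breached in checks if breached]
```

For example, a window with a 60,000-token turn and a 30% HITL interrupt rate would report both token_usage_per_turn and hitl_interrupt_rate, while a window inside the normal ranges returns an empty list.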

Token Usage: The Swisper UI displays a per-user token usage counter in the sidebar. This tracks cumulative token consumption across the user's conversations.

Figure: Token usage counter in the Swisper sidebar showing 0K / 1500K usage with a progress bar. The sidebar token usage counter helps monitor per-user LLM consumption.

Gap: Specific monitoring dashboard URLs, alert configurations, and Prometheus/Grafana metric names could not be determined from the available inputs. Manual review needed.

Common Failure Modes

| # | Trigger | Symptoms | Impact | Resolution |
| --- | --- | --- | --- | --- |
| 1 | Redis unavailable | State checkpoint fails; HITL interrupts cannot resume; ConnectionError in logs | Conversations cannot be paused for disambiguation; crash recovery unavailable. Text chat still works for single-turn messages | Verify Redis connectivity. Check Redis memory usage. If Redis is down, the PostgreSQL fallback should activate; verify StatePersistenceManager fallback logic |
| 2 | LLM provider timeout | Intent classification or response generation hangs; request times out | User sees no response or a timeout error | Check LLM provider status. Review timeout configuration. If regional outage: switch to fallback LLM provider if configured |
| 3 | Entity resolution infinite loop | Disambiguation keeps triggering on the same entity across turns | User is repeatedly asked "which X?" even after answering | Check disambiguation_resolution.py and verify the resolved entity ID is being written to state correctly. Check Redis checkpoint is persisting the resolution |
| 4 | Summarization corrupts context | After summarization, the assistant loses important context from earlier in the conversation | Assistant gives responses that contradict earlier conversation | Review SUMMARY_RECENCY_TURNS (should be ≥2). Check summarization prompt quality. Increase SUMMARY_TOKEN_THRESHOLD to delay summarization |
| 5 | Domain agent hangs | agent_execution node never completes; planner→agent loop stalls | User sees no response; the turn hangs until timeout | Check max_iterations (default 10). Review the specific domain agent's health. Kill the hung request and investigate the agent's logs |
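
To make the summarization settings in failure mode 4 more concrete, here is a minimal sketch of how a token threshold and a recency window might interact. It assumes summarization triggers once the conversation's total tokens exceed SUMMARY_TOKEN_THRESHOLD and that the last SUMMARY_RECENCY_TURNS turns are kept verbatim. The actual trigger logic and default values are not documented here, so the function and the numbers used are illustrative only.

```python
from typing import List, Tuple

# Illustrative values only; the real defaults are not known from this document.
SUMMARY_TOKEN_THRESHOLD = 8000
SUMMARY_RECENCY_TURNS = 2


def split_for_summarization(
    turn_token_counts: List[int],
    token_threshold: int = SUMMARY_TOKEN_THRESHOLD,
    recency_turns: int = SUMMARY_RECENCY_TURNS,
) -> Tuple[List[int], List[int]]:
    """Return (turn indices to summarize, turn indices kept verbatim)."""
    indices = list(range(len(turn_token_counts)))
    # Below the threshold nothing is summarized and every turn stays verbatim.
    if sum(turn_token_counts) <= token_threshold:
        return [], indices
    # Above it, only the most recent turns survive uncompressed.
    keep_from = max(len(indices) - recency_turns, 0)
    return indices[:keep_from], indices[keep_from:]
```

Under these assumptions, a recency window of 0 or 1 leaves almost no verbatim context after summarization, which matches the recommendation to keep SUMMARY_RECENCY_TURNS at 2 or higher. Raising the threshold postpones the split altogether.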

Runbooks

Runbook 1: HITL State Stuck in "Waiting"

Trigger: A user reports they cannot continue a conversation — the system keeps showing a disambiguation prompt even after they responded.

Steps:

  1. Get the chat_id from the user's session (use Developer Tools → Copy chat ID)
  2. Check Redis for the checkpoint state:
    redis-cli GET "checkpoint:{chat_id}"
    
  3. Inspect the user_in_the_loop field in the state — verify is_waiting is true and the type matches the expected interrupt
  4. If the state is corrupted or stuck:
    redis-cli DEL "checkpoint:{chat_id}"
    
  5. Ask the user to send a new message — the supervisor will start fresh without the stale interrupt
  6. Rollback: If deleting the checkpoint causes issues, the PostgreSQL fallback should have the last good state. Check the state_persistence table for the chat_id
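
Step 3 can be partly automated once the raw checkpoint value has been fetched. The sketch below assumes the checkpoint is stored as a JSON object with a user_in_the_loop field containing is_waiting and type. The field names come from the runbook, but the JSON serialization is an assumption, and describe_hitl_state is a hypothetical helper rather than an existing tool.

```python
import json
from typing import Optional


def describe_hitl_state(raw_checkpoint: Optional[str]) -> str:
    """Classify a raw checkpoint value as returned by a Redis GET."""
    if raw_checkpoint is None:
        return "missing"  # no checkpoint stored under this key
    try:
        state = json.loads(raw_checkpoint)
    except json.JSONDecodeError:
        return "corrupted"  # not valid JSON (assumed serialization)
    if not isinstance(state, dict):
        return "corrupted"
    hitl = state.get("user_in_the_loop")
    if not isinstance(hitl, dict):
        return "no_interrupt"  # no HITL section in the state
    if hitl.get("is_waiting") is True:
        return f"waiting:{hitl.get('type', 'unknown')}"
    return "not_waiting"
```

A result of waiting:disambiguation after the user has already answered, or corrupted, would point towards step 4. A result of not_waiting suggests the problem lies elsewhere and the checkpoint should be left in place.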

Runbook 2: Token Budget Exceeded

Trigger: A user hits their token limit (e.g., 1500K) and cannot send more messages.

Steps:

  1. Verify the user's token usage in the admin panel
  2. Check if summarization is working — long conversations without summarization consume tokens faster
  3. Review recent conversations for abnormally high token consumption (possible causes: missing optimization flags, large file uploads being processed repeatedly)
  4. If the usage is legitimate, increase the user's token quota
  5. If the usage is abnormal, identify the specific conversation and review its node-level token tracking
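
Steps 3 and 5 amount to finding turns with unusually high token consumption and seeing which node contributed most. The sketch below assumes node-level token records are available as (chat_id, turn_index, node, tokens) tuples. That record shape is hypothetical, as is the turns_over_budget helper. The default limit of 50,000 reuses the per-turn alert threshold from the Monitoring section.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (chat_id, turn_index, node, tokens); hypothetical record shape
TokenRecord = Tuple[str, int, str, int]


def turns_over_budget(
    records: Iterable[TokenRecord], per_turn_limit: int = 50_000
) -> List[Tuple[str, int, int, str]]:
    """Return (chat_id, turn_index, total_tokens, top_node) for heavy turns."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    by_node: Dict[Tuple[str, int], Dict[str, int]] = defaultdict(
        lambda: defaultdict(int)
    )
    for chat_id, turn_index, node, tokens in records:
        totals[(chat_id, turn_index)] += tokens
        by_node[(chat_id, turn_index)][node] += tokens

    flagged = []
    for (chat_id, turn_index), total in totals.items():
        if total > per_turn_limit:
            nodes = by_node[(chat_id, turn_index)]
            top_node = max(nodes, key=nodes.get)  # node with most tokens
            flagged.append((chat_id, turn_index, total, top_node))
    # Heaviest turns first, so the worst offenders are reviewed first.
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

The node reported for each flagged turn is a starting point for the review, for example a single node that repeatedly reprocesses a large file upload.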

Escalation

Gap: Specific escalation contacts, on-call rotations, and communication channels could not be determined from the available inputs. Manual review needed.

| Criteria | Action |
| --- | --- |
| Redis down for >5 minutes | Escalate to infrastructure team |
| LLM provider outage | Escalate to platform team; consider enabling fallback provider |
| Persistent HITL state corruption | Escalate to backend team (supervisor module owner) |
| Token tracking discrepancy | Escalate to backend team |