Global Supervisor: Operations

This content was migrated from Documentation/GLOBAL_SUPERVISOR.md and restructured into audience sections. Review for accuracy against the current codebase.

Deployment

The Global Supervisor is deployed as part of the Swisper backend application. It has no separate deployment unit and runs within the backend API process.

Dependencies:

| Dependency | Purpose | Required |
|------------|---------|----------|
| Redis | Primary state checkpointer; stores graph state after each node for HITL resume and crash recovery | Yes |
| PostgreSQL | Long-term state fallback, chat history storage, entity/fact persistence | Yes |
| LLM Provider | Intent classification, fact extraction, response generation (multiple LLM calls per turn) | Yes |
| Domain Agent Registry | Access to domain agents (Productivity, Wealth, Research, etc.) | Yes |

Startup: OrchestrationService instantiates the GlobalSupervisor once per request. There is no long-lived supervisor process: each message gets a fresh graph instance, with state loaded from the Redis checkpointer.
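
The per-request lifecycle can be pictured with a minimal sketch. This is not Swisper code: `InMemoryCheckpointer`, `SupervisorGraph`, and `handle_message` are hypothetical names, and a plain dict stands in for the Redis checkpointer. It only illustrates the load, run, save pattern that lets a fresh graph instance pick up where the previous turn stopped.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


class InMemoryCheckpointer:
    """Hypothetical stand-in for the Redis checkpointer (a dict keyed by chat_id)."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, Any]] = {}

    def load(self, chat_id: str) -> Dict[str, Any]:
        # Return a copy so a graph instance never replaces keys in the stored dict.
        return dict(self._store.get(chat_id, {"turns": []}))

    def save(self, chat_id: str, state: Dict[str, Any]) -> None:
        self._store[chat_id] = dict(state)


@dataclass
class SupervisorGraph:
    """Hypothetical per-request graph: built, run once, then discarded."""

    state: Dict[str, Any] = field(default_factory=dict)

    def run(self, message: str) -> Dict[str, Any]:
        # Build a new list instead of appending, so prior checkpoints stay untouched.
        self.state["turns"] = self.state.get("turns", []) + [message]
        return self.state


def handle_message(checkpointer: InMemoryCheckpointer, chat_id: str, message: str) -> Dict[str, Any]:
    """Create a fresh graph for this message, run it, and checkpoint the result."""
    graph = SupervisorGraph(state=checkpointer.load(chat_id))
    new_state = graph.run(message)
    checkpointer.save(chat_id, new_state)
    return new_state
```

Because no graph object survives between calls, all continuity in this sketch comes from the checkpointer, which is why checkpoint availability matters for multi-turn flows such as HITL resume.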

⚠ Gap: Specific deployment commands, rollback procedures, and deployment pipeline details could not be determined from the available inputs. These are managed at the platform level rather than per-module. Manual review needed.

Monitoring

| # | Metric | What It Measures | Normal Range | Alert Threshold |
|---|--------|------------------|--------------|-----------------|
| 1 | Token usage per turn | Total LLM tokens consumed across all nodes in a single turn | 5,000–20,000 tokens | >50,000 tokens (indicates runaway prompt or missing optimization flags) |
| 2 | Time-to-first-token | Latency from message receipt to first streamed response chunk | 200–500 ms (simple), 1–3 s (complex) | >5 s for simple queries, >10 s for complex queries |
| 3 | Summarization trigger rate | Percentage of turns that trigger conversation summarization | 5–15% of turns | >30% (may indicate summarization threshold too low or conversations not being compressed) |
| 4 | HITL interrupt rate | Percentage of turns that result in a disambiguation or clarification pause | 2–10% of turns | >25% (may indicate entity resolution is too aggressive) |
| 5 | Agent execution failures | Domain agent exceptions caught by the supervisor | 0–2% of agent calls | >5% (indicates agent instability or dependency issues) |
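
The alert thresholds in the table can be expressed as a small check. The following is a sketch only: the threshold values are copied from the table, but the metric keys (`tokens_per_turn`, `hitl_interrupt_rate`, and so on) and the function name are assumptions, since the real metric names are not documented here.

```python
from typing import Dict, List

# Alert thresholds copied from the monitoring table; the key names are hypothetical.
ALERT_THRESHOLDS: Dict[str, float] = {
    "tokens_per_turn": 50_000,
    "summarization_trigger_rate": 0.30,
    "hitl_interrupt_rate": 0.25,
    "agent_failure_rate": 0.05,
}

# Time-to-first-token thresholds depend on query complexity (seconds).
TTFT_THRESHOLDS_S: Dict[str, float] = {"simple": 5.0, "complex": 10.0}


def breached_alerts(metrics: Dict[str, float], ttft_s: float, complexity: str) -> List[str]:
    """Return the names of metrics whose observed value exceeds its alert threshold."""
    breached = [
        name
        for name, limit in ALERT_THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit  # metrics that were not reported count as 0
    ]
    if ttft_s > TTFT_THRESHOLDS_S[complexity]:
        breached.append("time_to_first_token")
    return breached
```

For example, `breached_alerts({"tokens_per_turn": 60_000}, ttft_s=0.4, complexity="simple")` flags only the token metric, because 0.4 s is well under the 5 s limit for simple queries.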

Token Usage: The Swisper UI displays a per-user token usage counter in the sidebar. This tracks cumulative token consumption across the user's conversations.

Figure: Token usage counter in the Swisper sidebar, showing 0K / 1500K usage with a progress bar. The sidebar token usage counter helps monitor per-user LLM consumption.

⚠ Gap: Specific monitoring dashboard URLs, alert configurations, and Prometheus/Grafana metric names could not be determined from the available inputs. Manual review needed.

Common Failure Modes

| # | Trigger | Symptoms | Impact | Resolution |
|---|---------|----------|--------|------------|
| 1 | Redis unavailable | State checkpoint fails; HITL interrupts cannot resume; `ConnectionError` in logs | Conversations cannot be paused for disambiguation; crash recovery unavailable. Text chat still works for single-turn messages | Verify Redis connectivity. Check Redis memory usage. If Redis is down, the PostgreSQL fallback should activate; verify the `StatePersistenceManager` fallback logic |
| 2 | LLM provider timeout | Intent classification or response generation hangs; request times out | User sees no response or a timeout error | Check LLM provider status. Review timeout configuration. For a regional outage, switch to the fallback LLM provider if one is configured |
| 3 | Entity resolution infinite loop | Disambiguation keeps triggering on the same entity across turns | User is repeatedly asked "which X?" even after answering | Check `disambiguation_resolution.py` and verify that the resolved entity ID is written to state correctly. Check that the Redis checkpoint persists the resolution |
| 4 | Summarization corrupts context | After summarization, the assistant loses important context from earlier in the conversation | Assistant gives responses that contradict earlier conversation | Review `SUMMARY_RECENCY_TURNS` (should be ≥2). Check summarization prompt quality. Increase `SUMMARY_TOKEN_THRESHOLD` to delay summarization |
| 5 | Domain agent hangs | `agent_execution` node never completes; planner→agent loop stalls | User sees no response; the turn hangs until timeout | Check `max_iterations` (default 10). Review the specific domain agent's health. Kill the hung request and investigate the agent's logs |
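
Failure mode 5 involves two separate guards: an iteration cap (`max_iterations`) and a timeout. A minimal sketch of how the two can combine is shown below, using the standard `asyncio.wait_for`. The `AgentStep` type, the `run_agent_loop` function, and the `step_timeout_s` parameter are hypothetical and do not describe the supervisor's actual loop.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature: one planner->agent step that returns True when the turn is done.
AgentStep = Callable[[int], Awaitable[bool]]


async def run_agent_loop(step: AgentStep, max_iterations: int = 10,
                         step_timeout_s: float = 30.0) -> str:
    """Run planner->agent steps until done, the iteration cap, or a per-step timeout."""
    for iteration in range(max_iterations):
        try:
            done = await asyncio.wait_for(step(iteration), timeout=step_timeout_s)
        except asyncio.TimeoutError:
            # A hung agent surfaces as an explicit failure instead of a stalled turn.
            return f"timeout at iteration {iteration}"
        if done:
            return f"completed after {iteration + 1} iteration(s)"
    return "max_iterations reached"
```

The iteration cap alone does not help when a single step never returns, which is why this sketch applies the timeout per step rather than relying on the cap.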

Runbooks

Runbook 1: HITL State Stuck in "Waiting"

Trigger: A user reports that they cannot continue a conversation because the system keeps showing a disambiguation prompt even after they have responded.

Steps:

1. Get the `chat_id` from the user's session (Developer Tools → Copy chat ID).
2. Check Redis for the checkpoint state:
   ```
   redis-cli GET "checkpoint:{chat_id}"
   ```
3. Inspect the `user_in_the_loop` field in the state: verify that `is_waiting` is true and that `type` matches the expected interrupt.
4. If the state is corrupted or stuck, delete the checkpoint:
   ```
   redis-cli DEL "checkpoint:{chat_id}"
   ```
5. Ask the user to send a new message. The supervisor will start fresh without the stale interrupt.

Rollback: If deleting the checkpoint causes issues, the PostgreSQL fallback should have the last good state. Check the `state_persistence` table for the `chat_id`.
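
The inspection step can be sketched as a small classifier over the raw value returned by `GET`. This assumes the checkpoint is stored as JSON, which is not confirmed by this page (the real serialization may differ); the function name and the returned labels are hypothetical. Only the `user_in_the_loop`, `is_waiting`, and `type` field names come from the runbook.

```python
import json
from typing import Optional


def diagnose_checkpoint(raw: Optional[str], expected_type: str) -> str:
    """Classify a raw checkpoint value (assumed JSON) for the stuck-HITL runbook."""
    if raw is None:
        return "missing"  # no checkpoint: the next message starts fresh
    try:
        state = json.loads(raw)
    except json.JSONDecodeError:
        return "corrupted"  # not parseable: candidate for deletion
    if not isinstance(state, dict):
        return "corrupted"
    hitl = state.get("user_in_the_loop")
    if not isinstance(hitl, dict) or not hitl.get("is_waiting"):
        return "not_waiting"  # the interrupt is not active, so look elsewhere
    if hitl.get("type") != expected_type:
        return "type_mismatch"  # waiting, but on a different interrupt than expected
    return "waiting_as_expected"
```

Keeping the classification separate from the Redis call makes it easy to run against a value pasted from `redis-cli` before deciding whether to delete the key.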

Runbook 2: Token Budget Exceeded

Trigger: A user hits their token limit (e.g., 1500K) and cannot send more messages.

Steps:

1. Verify the user's token usage in the admin panel.
2. Check whether summarization is working. Long conversations without summarization consume tokens faster.
3. Review recent conversations for abnormally high token consumption (possible causes: missing optimization flags, large file uploads being processed repeatedly).
4. If the usage is legitimate, increase the user's token quota.
5. If the usage is abnormal, identify the specific conversation and review its node-level token tracking.
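
Reviewing node-level token tracking amounts to summing tokens per turn and flagging outliers. The sketch below assumes a hypothetical record shape of `(chat_id, turn_index, node_name, tokens_used)`, since the actual tracking schema is not documented here. The default threshold reuses the 50,000-token alert value from the monitoring table.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (chat_id, turn_index, node_name, tokens_used).
TokenRecord = Tuple[str, int, str, int]


def turns_over_threshold(records: Iterable[TokenRecord],
                         threshold: int = 50_000) -> List[Tuple[str, int, int]]:
    """Sum node-level tokens per (chat_id, turn) and return turns above the threshold."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for chat_id, turn, _node, tokens in records:
        totals[(chat_id, turn)] += tokens
    flagged = [
        (chat, turn, total)
        for (chat, turn), total in totals.items()
        if total > threshold
    ]
    # Largest consumers first, so the worst turns are reviewed first.
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

Each result is a `(chat_id, turn_index, total_tokens)` tuple, which points directly at the conversation and turn to inspect.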

Escalation

⚠ Gap: Specific escalation contacts, on-call rotations, and communication channels could not be determined from the available inputs. Manual review needed.

| Criterion | Action |
|-----------|--------|
| Redis down for >5 minutes | Escalate to infrastructure team |
| LLM provider outage | Escalate to platform team; consider enabling fallback provider |
| Persistent HITL state corruption | Escalate to backend team (supervisor module owner) |
| Token tracking discrepancy | Escalate to backend team |