Global Supervisor: Architecture
This content was migrated from Documentation/GLOBAL_SUPERVISOR.md and restructured into audience sections. Review for accuracy against the current codebase.
Context and Purpose
The Global Supervisor exists because Swisper needs a single, deterministic orchestration layer that coordinates every aspect of a conversation turn — from intent classification through memory retrieval, agent delegation, and response generation.
The key driving requirements behind its design are:
- Deterministic flow control — Each user message must follow a predictable path through well-defined processing stages, making debugging and auditing possible
- Entity-first disambiguation — Facts must never be stored against the wrong entity. Entity resolution must complete before fact extraction proceeds
- Minimal time-to-first-token — Simple queries should skip expensive processing stages via optimization flags, while complex queries get full pipeline treatment
- State persistence for HITL — When the system needs to ask the user a clarification question (e.g., disambiguation), the entire conversation state must be checkpointed so execution can resume exactly where it left off
Architecture Overview
The Global Supervisor is implemented as a LangGraph StateGraph — a directed graph where each node is a processing step and edges define the flow between them. All nodes share a single GlobalSupervisorState (TypedDict) that accumulates data as the message progresses through the pipeline.
The graph is organized into seven logical stages:
flowchart TB
subgraph Entry["1. Entry & Session"]
START([User Message]) --> SI[Session Init]
SI --> SC[Summarization Check]
SC -->|needs summarization| SUM[Summarization]
SC -->|no| CL
SUM --> CL[Context Loader]
end
subgraph HITL["2. HITL Resume"]
CL --> UIL[HITL Handler]
UIL -->|has interrupt| RESUME{Resume Type}
UIL -->|no interrupt| IC
RESUME -->|disambiguation| IC
RESUME -->|agent HITL| AE
end
subgraph Classification["3. Intent Classification"]
IC[Intent Classification]
end
subgraph Memory["4. Memory Pipeline"]
IC --> RR{Retrieval Router}
RR -->|skip retrieval| MA
RR -->|semantic only| ER
RR -->|parallel retrieval| ER
ER[Entity Resolution] --> SR[Semantic Retrieval]
SR --> TR[Temporal Retrieval]
TR --> MA[Memory Assembly]
MA --> FE[Fact Extraction]
FE --> EM[Extraction Merge]
end
subgraph Disambiguation["5. Disambiguation"]
EM --> AMB{Entity Ambiguity?}
AMB -->|blocking| DB[Disambiguation Blocking]
AMB -->|non-blocking| DS[Disambiguation Simple]
AMB -->|none| ROUTE
DB --> MP
DS --> MP
end
subgraph Planning["6. Planning & Execution"]
ROUTE{Route}
ROUTE -->|simple chat| UIR[UI Router]
ROUTE -->|complex| GP[Global Planner]
GP --> AE[Agent Execution]
AE --> GP2{More agents?}
GP2 -->|yes| GP
GP2 -->|no| UIR
end
subgraph Response["7. Response Generation"]
UIR --> UINODES{Response Type}
UINODES -->|simple| ST[Simple Text]
UINODES -->|complex| CT[Complex Text]
UINODES -->|hitl| HT[HITL Text]
ST --> MP[Message Persist]
CT --> MP
HT --> MP
end
MP --> DONE([End / HITL Interrupt])
Flow summary: A message enters at Session Init, gets classified, passes through memory retrieval and entity resolution, optionally triggers disambiguation, gets routed to either direct response or planner-driven agent execution, and exits through a specialized UI response node.
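The routing idea in the flow summary can be illustrated without LangGraph. The sketch below is plain Python under stated assumptions: the node bodies, the word-count rule, and the `plan`, `response_type`, and `trace` keys are all hypothetical stand-ins, and only three of the graph's nodes are modelled. It shows nodes returning partial state updates and a conditional edge choosing the next node.

```python
from typing import Callable, Dict, Optional

State = dict  # stand-in for the shared GlobalSupervisorState


def intent_classification(state: State) -> dict:
    # Toy rule in place of the real classifier: short messages count as simple.
    words = state["user_message"].split()
    return {"route": "simple_chat" if len(words) < 4 else "complex"}


def global_planner(state: State) -> dict:
    # Hypothetical plan; the real planner decides which domain agents to invoke.
    return {"plan": ["agent_execution"]}


def ui_router(state: State) -> dict:
    kind = "simple" if state["route"] == "simple_chat" else "complex"
    return {"response_type": kind}


NODES: Dict[str, Callable[[State], dict]] = {
    "intent_classification": intent_classification,
    "global_planner": global_planner,
    "ui_router": ui_router,
}


def next_node(current: str, state: State) -> Optional[str]:
    # Conditional edge: simple chat skips the planner entirely.
    if current == "intent_classification":
        return "ui_router" if state["route"] == "simple_chat" else "global_planner"
    if current == "global_planner":
        return "ui_router"
    return None  # ui_router is the last node in this reduced sketch


def run(user_message: str) -> State:
    state: State = {"user_message": user_message, "trace": []}
    current: Optional[str] = "intent_classification"
    while current is not None:
        state["trace"].append(current)
        state.update(NODES[current](state))  # merge the node's partial update
        current = next_node(current, state)
    return state
```

Because the next node depends only on the current node and the accumulated state, the same input always produces the same `trace`, which is the property the deterministic flow control requirement asks for.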
Component Responsibilities
| Component | Responsibility |
|---|---|
| Session Init | Loads chat history, initializes token tracking, sets up turn context |
| Summarization Check / Summarization | Detects when conversation history exceeds thresholds (>20 messages or >4,000 tokens) and compresses it |
| Context Loader | Loads avatar configuration, presentation rules, and preloaded facts (parallel execution) |
| HITL Handler | Detects pending human-in-the-loop interrupts and routes to the correct resume point |
| Intent Classification | Classifies intent (simple/complex/greeting), extracts entities, and sets optimization flags (has_extractable_facts, has_preferences, needs_semantic_retrieval) |
| Retrieval Router | Decides which memory retrieval path to take based on optimization flags |
| Entity Resolution | Resolves mentioned entities against the user's contact database; detects ambiguity |
| Semantic Retrieval | Vector search for relevant facts based on message content |
| Temporal Retrieval | Time-based fact retrieval (e.g., upcoming events, recent changes) |
| Memory Assembly | Merges all retrieved facts into a unified context |
| Fact Extraction | Extracts new facts from the user's message (runs in parallel with entity resolution) |
| Extraction Merge | Links extracted facts to resolved entities and persists them to the database |
| Disambiguation (Blocking) | Pauses execution and asks the user which entity they mean — generates a HITL interrupt |
| Disambiguation (Simple) | Non-blocking disambiguation: answers the question and appends a "by the way, which X?" follow-up |
| Global Planner | Creates a multi-step execution plan determining which domain agents to invoke and in what order |
| Agent Execution | Executes domain agents (Productivity, Wealth, Research, etc.) and collects results |
| UI Response Nodes | Specialized response generators: Simple Text, Complex Text (with agent result synthesis), HITL Text |
| Message Persist | Saves the assistant's response and conversation metadata to the database |
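Two of the decisions in the table are simple enough to sketch directly: the summarization threshold check and the HITL resume routing. This is a minimal illustration, not the actual node code. The function names, the interrupt kind strings, and the idea of passing in a precomputed token count are assumptions; the thresholds and the resume targets come from the table and the flowchart.

```python
from typing import Optional

MAX_MESSAGES = 20   # summarize when history has more than 20 messages
MAX_TOKENS = 4_000  # or more than 4,000 tokens


def needs_summarization(message_count: int, token_count: int) -> bool:
    """Return True when either history threshold is exceeded."""
    return message_count > MAX_MESSAGES or token_count > MAX_TOKENS


def resume_target(pending_interrupt: Optional[str]) -> str:
    """Pick the node to continue at, given a pending HITL interrupt kind.

    The kind strings ("disambiguation", "agent_hitl") are hypothetical labels.
    """
    if pending_interrupt == "agent_hitl":
        return "agent_execution"
    # No interrupt, or a disambiguation answer: continue at classification.
    return "intent_classification"
```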
Data Model
GlobalSupervisorState
The shared state is a Python TypedDict with these key domains:
| Domain | Key Fields | Purpose |
|---|---|---|
| Conversation | user_message, messages_history, chat_id, user_id, avatar_id, model | Core conversation identity and history |
| Intent | intent_classification (route, entities, optimization flags) | Routing decision from classification |
| Memory | memory_domain (conversation context, facts), resolved_entities, extracted_facts, pending_facts | Retrieved and extracted knowledge |
| Disambiguation | entity_ambiguity, btw_disambiguation_resolved, blocking_disambiguation_resolved | Entity ambiguity tracking |
| Planning | global_planner_decision, current_agent_result, recent_agent_results | Execution plan and agent outputs |
| HITL | user_in_the_loop (UserInTheLoop model), hitl_user_response | Human-in-the-loop interrupt state |
| Response | user_interface_response, presentation_rules, modality | Generated response and display rules |
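A reduced sketch of how such a state could be typed is shown below. The field names come from the table above, but every type annotation, the `StateSketch` and `IntentSketch` class names, and the `merge_update` helper are assumptions made for illustration; the real GlobalSupervisorState has far more fields.

```python
from typing import Any, Dict, List, Optional, TypedDict


class IntentSketch(TypedDict, total=False):
    route: str
    entities: List[str]
    has_extractable_facts: bool
    has_preferences: bool
    needs_semantic_retrieval: bool


class StateSketch(TypedDict, total=False):
    # Conversation
    user_message: str
    messages_history: List[Dict[str, Any]]
    chat_id: str
    user_id: str
    # Intent
    intent_classification: IntentSketch
    # Memory
    resolved_entities: List[Dict[str, Any]]
    extracted_facts: List[Dict[str, Any]]
    pending_facts: List[Dict[str, Any]]
    # Disambiguation
    entity_ambiguity: Optional[Dict[str, Any]]
    # Response
    user_interface_response: Optional[str]


def merge_update(state: StateSketch, update: StateSketch) -> StateSketch:
    """Return a new state with the node's partial update applied on top."""
    merged: StateSketch = {**state, **update}
    return merged
```

Declaring the TypedDict with `total=False` lets each node return only the keys it changed, which matches the description of a state that accumulates data as the message moves through the pipeline.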
State Persistence
| Store | Role | Mechanism |
|---|---|---|
| Redis | Primary state checkpointer | Snapshots after each node via LangGraph checkpointer |
| PostgreSQL | Long-term fallback | Recovery when Redis state is evicted |
The checkpointer uses chat_id as the thread identifier, enabling resume-from-checkpoint for HITL interrupts and crash recovery.
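The two-tier lookup can be sketched with in-memory dictionaries standing in for Redis and PostgreSQL. This is an illustration only and does not use the LangGraph checkpointer protocol. The class name, the write-to-both-stores policy, and the re-warming of the primary store on a fallback hit are all assumptions; the source only states that Redis is primary and PostgreSQL is the recovery path.

```python
import json
from typing import Any, Dict, Optional


class TwoTierCheckpointStore:
    """Checkpoint store keyed by chat_id with a primary and a fallback tier."""

    def __init__(self) -> None:
        self.primary: Dict[str, str] = {}   # stand-in for Redis
        self.fallback: Dict[str, str] = {}  # stand-in for PostgreSQL

    def save(self, chat_id: str, state: Dict[str, Any]) -> None:
        payload = json.dumps(state)
        # Assumption: every snapshot is written to both tiers.
        self.primary[chat_id] = payload
        self.fallback[chat_id] = payload

    def load(self, chat_id: str) -> Optional[Dict[str, Any]]:
        payload = self.primary.get(chat_id)
        if payload is None:
            # Primary entry evicted or never written: try the long-term store.
            payload = self.fallback.get(chat_id)
            if payload is not None:
                self.primary[chat_id] = payload  # re-warm the primary tier
        return json.loads(payload) if payload is not None else None
```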
Key Design Decisions
1. Synchronous Entity Resolution Before Fact Storage
- Chosen: Entity resolution runs synchronously and blocks fact extraction from persisting until entities are resolved
- Rejected: Parallel entity resolution + fact extraction with post-hoc linking
- Rationale: When a user says "Thomas is traveling to Mallorca" and there are two contacts named Thomas, storing the fact before knowing which Thomas it belongs to leads to orphaned or misattributed data. The 500–1,500ms latency cost is acceptable to guarantee data correctness
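The gating rule can be sketched as a pure function: a fact is linked and released for persistence only when its entity resolves to exactly one contact. The `Fact` dataclass, the `link_facts` name, and the contact identifiers are hypothetical, and holding facts for unknown entities (zero matches) as pending is an assumption the source does not confirm.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    entity_name: str
    text: str


def link_facts(
    facts: List[Fact], candidates: Dict[str, List[str]]
) -> Tuple[List[Tuple[str, Fact]], List[Fact]]:
    """Split facts into (contact_id, fact) pairs safe to persist and pending facts."""
    persistable: List[Tuple[str, Fact]] = []
    pending: List[Fact] = []
    for fact in facts:
        matches = candidates.get(fact.entity_name, [])
        if len(matches) == 1:
            persistable.append((matches[0], fact))
        else:
            # Ambiguous (or unknown) entity: hold the fact until the user clarifies.
            pending.append(fact)
    return persistable, pending
```

With two contacts named Thomas, the Mallorca fact stays in the pending list rather than being attached to either contact, which is the misattribution the decision is meant to prevent.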
2. LangGraph StateGraph Over Custom Orchestration
- Chosen: LangGraph StateGraph with conditional edges and built-in checkpointing
- Rejected: Custom async pipeline, event-driven choreography
- Rationale: LangGraph provides deterministic execution, built-in state persistence (checkpointers), native HITL support via interrupt(), and visual debugging. The structured graph makes it straightforward to add, remove, or reorder nodes
3. Optimization Flags for Node Skipping
- Chosen: Intent classification sets boolean flags (has_extractable_facts, needs_semantic_retrieval, has_preferences) that downstream routing functions use to skip unnecessary nodes
- Rejected: Running all nodes for every message; lazy evaluation at each node
- Rationale: A greeting like "Hi" doesn't need fact extraction (~9,000 tokens) or semantic retrieval (~500ms). The flags save 2–3 seconds and ~15,000 tokens on simple messages while keeping the graph structure uniform
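A routing function driven by these flags might look like the sketch below. The flag names come from the text; the `nodes_to_run` function, the node name strings, and the exact mapping from flags to skipped nodes are assumptions. The has_preferences flag is left out because the source does not say which node it gates.

```python
from typing import Dict, List


def nodes_to_run(flags: Dict[str, bool]) -> List[str]:
    """Return the memory-pipeline nodes to execute for one message."""
    plan: List[str] = []
    if flags.get("needs_semantic_retrieval", False):
        plan.append("semantic_retrieval")
    # Memory assembly always runs, even when retrieval is skipped.
    plan.append("memory_assembly")
    if flags.get("has_extractable_facts", False):
        plan.extend(["fact_extraction", "extraction_merge"])
    return plan
```

For a greeting with every flag false, only memory assembly remains, so the expensive fact extraction and semantic retrieval steps are never reached.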
Interfaces and Contracts
| Interface | Direction | Format | Consumer |
|---|---|---|---|
| Orchestration Service → GlobalSupervisor | Inbound | GlobalSupervisor.run(user_message, model, chat_id, user_id, ...) | Orchestration Service (API layer) |
| GlobalSupervisor → Domain Agents | Outbound | DomainAgentInterface.execute(state) via Agent Registry | Productivity, Wealth, Research, and other domain agents |
| GlobalSupervisor → Memory System | Bidirectional | Repository pattern (read facts, write extracted facts) | Fact storage, entity resolution, semantic search |
| GlobalSupervisor → Redis/PostgreSQL | Outbound | LangGraph checkpointer protocol | State persistence for HITL and recovery |
| GlobalSupervisor → UI (streaming) | Outbound | SupervisorResponseChunkEvent (Server-Sent Events) | Frontend text/voice rendering |
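The domain agent contract in the table can be pictured as a structural interface plus a name-based registry. The sketch below is hypothetical throughout: `DomainAgent` is a stand-in for DomainAgentInterface, the synchronous signature and dictionary return type are assumptions (the real method may well be async), and `AgentRegistry` and `EchoAgent` are invented for illustration.

```python
from typing import Any, Dict, Protocol


class DomainAgent(Protocol):
    """Stand-in for DomainAgentInterface: one execute(state) entry point."""

    def execute(self, state: Dict[str, Any]) -> Dict[str, Any]:
        ...


class AgentRegistry:
    """Maps agent names to agent instances (hypothetical registry)."""

    def __init__(self) -> None:
        self._agents: Dict[str, DomainAgent] = {}

    def register(self, name: str, agent: DomainAgent) -> None:
        self._agents[name] = agent

    def get(self, name: str) -> DomainAgent:
        if name not in self._agents:
            raise KeyError(f"no agent registered under {name!r}")
        return self._agents[name]


class EchoAgent:
    """Toy agent that satisfies the protocol by echoing the user message."""

    def execute(self, state: Dict[str, Any]) -> Dict[str, Any]:
        return {"current_agent_result": f"echo: {state['user_message']}"}
```

Because the supervisor depends only on the `execute(state)` shape, a new domain agent can be added by registering it under a name, without changing the graph code that calls it.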
Known Trade-offs and Debt
| Item | Impact | Remediation |
|---|---|---|
| Large build_graph() method | The graph construction in agent.py is ~1,000 lines. Adding or reordering nodes requires reading the full method to understand edge dependencies | Potential refactor: extract stage-level subgraphs (entry, memory, planning, response) into separate builder functions |
| Legacy memory.py node | The old monolithic memory node still exists and is used only for the greeting flow. All other flows use the split memory pipeline (retrieval_router → semantic → temporal → assembly) | Remove legacy node once greeting flow is migrated to the split pipeline |
| State field sprawl | GlobalSupervisorState has grown to ~77 fields across multiple concerns, making it hard to know which fields are relevant at each node | Consider namespaced sub-states or a state schema validation layer |
| No retry on agent execution failures | If a domain agent raises an exception, the supervisor catches it and reports the failure but does not retry | Add configurable retry with exponential backoff for transient failures |
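The retry remediation in the last row could take roughly the following shape. This is a sketch of the proposed change, not existing behavior: `TransientAgentError`, `run_with_retry`, and the default attempt count and delay are all hypothetical, and deciding which failures count as transient is left open.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAgentError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retry(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientAgentError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the supervisor
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without waiting. Non-transient exceptions propagate immediately, which keeps the current catch-and-report behavior for permanent failures.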