Global Supervisor — Architecture

This content was migrated from Documentation/GLOBAL_SUPERVISOR.md and restructured into audience sections. Review for accuracy against the current codebase.

Context and Purpose

The Global Supervisor exists because Swisper needs a single, deterministic orchestration layer that coordinates every aspect of a conversation turn — from intent classification through memory retrieval, agent delegation, and response generation.

The key driving requirements behind its design are:

  • Deterministic flow control — Each user message must follow a predictable path through well-defined processing stages, making debugging and auditing possible
  • Entity-first disambiguation — Facts must never be stored against the wrong entity. Entity resolution must complete before fact extraction proceeds
  • Minimal time-to-first-token — Simple queries should skip expensive processing stages via optimization flags, while complex queries get full pipeline treatment
  • State persistence for HITL — When the system needs to ask the user a clarification question (e.g., disambiguation), the entire conversation state must be checkpointed so execution can resume exactly where it left off

Architecture Overview

The Global Supervisor is implemented as a LangGraph StateGraph — a directed graph where each node is a processing step and edges define the flow between them. All nodes share a single GlobalSupervisorState (TypedDict) that accumulates data as the message progresses through the pipeline.
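To make the shared-state pattern concrete, here is a minimal plain-Python sketch. It deliberately does not import LangGraph: `MiniState`, the two toy nodes, and the `run_pipeline` helper are hypothetical stand-ins for illustration only and are not taken from the Swisper codebase. It shows the one idea the paragraph above describes: each node reads the shared state and returns a partial update that is merged back in.

```python
from typing import Callable, TypedDict


class MiniState(TypedDict, total=False):
    """Hypothetical, heavily reduced stand-in for GlobalSupervisorState."""
    user_message: str
    route: str
    response: str


# A node reads the shared state and returns only the fields it wants to update.
Node = Callable[[MiniState], MiniState]


def classify(state: MiniState) -> MiniState:
    # Toy rule for illustration only: short messages count as simple chat.
    words = state["user_message"].split()
    return {"route": "simple" if len(words) <= 3 else "complex"}


def respond(state: MiniState) -> MiniState:
    return {"response": f"[{state['route']}] reply to: {state['user_message']}"}


def run_pipeline(nodes: list[Node], state: MiniState) -> MiniState:
    """Run nodes in order, merging each partial update into the shared state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


final_state = run_pipeline([classify, respond], {"user_message": "Hi there"})
```

A real StateGraph adds conditional edges and checkpointing on top of this merge-as-you-go model; the sketch only covers the linear case.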

The graph is organized into seven logical stages:

flowchart TB
    subgraph Entry["1. Entry & Session"]
        START([User Message]) --> SI[Session Init]
        SI --> SC[Summarization Check]
        SC -->|needs summarization| SUM[Summarization]
        SC -->|no| CL
        SUM --> CL[Context Loader]
    end

    subgraph HITL["2. HITL Resume"]
        CL --> UIL[HITL Handler]
        UIL -->|has interrupt| RESUME{Resume Type}
        UIL -->|no interrupt| IC
        RESUME -->|disambiguation| IC
        RESUME -->|agent HITL| AE
    end

    subgraph Classification["3. Intent Classification"]
        IC[Intent Classification]
    end

    subgraph Memory["4. Memory Pipeline"]
        IC --> RR{Retrieval Router}
        RR -->|skip retrieval| MA
        RR -->|semantic only| ER
        RR -->|parallel retrieval| ER
        ER[Entity Resolution] --> SR[Semantic Retrieval]
        SR --> TR[Temporal Retrieval]
        TR --> MA[Memory Assembly]
        MA --> FE[Fact Extraction]
        FE --> EM[Extraction Merge]
    end

    subgraph Disambiguation["5. Disambiguation"]
        EM --> AMB{Entity Ambiguity?}
        AMB -->|blocking| DB[Disambiguation Blocking]
        AMB -->|non-blocking| DS[Disambiguation Simple]
        AMB -->|none| ROUTE
        DB --> MP
        DS --> MP
    end

    subgraph Planning["6. Planning & Execution"]
        ROUTE{Route}
        ROUTE -->|simple chat| UIR[UI Router]
        ROUTE -->|complex| GP[Global Planner]
        GP --> AE[Agent Execution]
        AE --> GP2{More agents?}
        GP2 -->|yes| GP
        GP2 -->|no| UIR
    end

    subgraph Response["7. Response Generation"]
        UIR --> UINODES{Response Type}
        UINODES -->|simple| ST[Simple Text]
        UINODES -->|complex| CT[Complex Text]
        UINODES -->|hitl| HT[HITL Text]
        ST --> MP[Message Persist]
        CT --> MP
        HT --> MP
    end

    MP --> DONE([End / HITL Interrupt])

Flow summary: A message enters at Session Init, passes the HITL check, gets classified, moves through memory retrieval, entity resolution, and fact extraction, optionally triggers disambiguation, and is routed either to a direct response or to planner-driven agent execution. A specialized UI response node generates the reply, and Message Persist saves it before the turn ends.

Component Responsibilities

| Component | Responsibility |
|---|---|
| Session Init | Loads chat history, initializes token tracking, sets up turn context |
| Summarization Check / Summarization | Detects when conversation history exceeds thresholds (>20 messages or >4,000 tokens) and compresses it |
| Context Loader | Loads avatar configuration, presentation rules, and preloaded facts (parallel execution) |
| HITL Handler | Detects pending human-in-the-loop interrupts and routes to the correct resume point |
| Intent Classification | Classifies intent (simple/complex/greeting), extracts entities, and sets optimization flags (has_extractable_facts, has_preferences, needs_semantic_retrieval) |
| Retrieval Router | Decides which memory retrieval path to take based on optimization flags |
| Entity Resolution | Resolves mentioned entities against the user's contact database; detects ambiguity |
| Semantic Retrieval | Vector search for relevant facts based on message content |
| Temporal Retrieval | Time-based fact retrieval (e.g., upcoming events, recent changes) |
| Memory Assembly | Merges all retrieved facts into a unified context |
| Fact Extraction | Extracts new facts from the user's message (runs in parallel with entity resolution) |
| Extraction Merge | Links extracted facts to resolved entities and persists them to the database |
| Disambiguation (Blocking) | Pauses execution and asks the user which entity they mean; generates a HITL interrupt |
| Disambiguation (Simple) | Non-blocking disambiguation: answers the question and appends a "by the way, which X?" follow-up |
| Global Planner | Creates a multi-step execution plan determining which domain agents to invoke and in what order |
| Agent Execution | Executes domain agents (Productivity, Wealth, Research, etc.) and collects results |
| UI Response Nodes | Specialized response generators: Simple Text, Complex Text (with agent result synthesis), HITL Text |
| Message Persist | Saves the assistant's response and conversation metadata to the database |
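
The Summarization Check thresholds listed above can be expressed as a small predicate. This is a sketch under assumptions: the function names and the returned node labels are hypothetical, and the token count is passed in rather than computed, since the document does not say which tokenizer is used.

```python
MAX_MESSAGES = 20    # history length threshold quoted in the table
MAX_TOKENS = 4_000   # history token threshold quoted in the table


def needs_summarization(message_count: int, token_count: int) -> bool:
    """Return True when history strictly exceeds either threshold."""
    return message_count > MAX_MESSAGES or token_count > MAX_TOKENS


def route_after_summarization_check(message_count: int, token_count: int) -> str:
    """Pick the next node label (labels are illustrative, not the real node names)."""
    if needs_summarization(message_count, token_count):
        return "summarization"
    return "context_loader"
```

Both comparisons are strict, matching the ">" in the table: a history of exactly 20 messages and 4,000 tokens is not summarized.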

Data Model

GlobalSupervisorState

The shared state is a Python TypedDict with these key domains:

| Domain | Key Fields | Purpose |
|---|---|---|
| Conversation | user_message, messages_history, chat_id, user_id, avatar_id, model | Core conversation identity and history |
| Intent | intent_classification (route, entities, optimization flags) | Routing decision from classification |
| Memory | memory_domain (conversation context, facts), resolved_entities, extracted_facts, pending_facts | Retrieved and extracted knowledge |
| Disambiguation | entity_ambiguity, btw_disambiguation_resolved, blocking_disambiguation_resolved | Entity ambiguity tracking |
| Planning | global_planner_decision, current_agent_result, recent_agent_results | Execution plan and agent outputs |
| HITL | user_in_the_loop (UserInTheLoop model), hitl_user_response | Human-in-the-loop interrupt state |
| Response | user_interface_response, presentation_rules, modality | Generated response and display rules |
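
As a rough illustration of how these domains might sit in one TypedDict, the sketch below lists only the fields named in the table. The field names come from the table; every type annotation is an assumption (the real state reportedly has many more fields and uses richer models such as UserInTheLoop), and the class name is suffixed with `Sketch` to make clear it is not the actual definition.

```python
from typing import Any, Optional, TypedDict


class GlobalSupervisorStateSketch(TypedDict, total=False):
    # Conversation
    user_message: str
    messages_history: list[dict[str, Any]]
    chat_id: str
    user_id: str
    avatar_id: str
    model: str
    # Intent (route, entities, optimization flags)
    intent_classification: dict[str, Any]
    # Memory
    memory_domain: dict[str, Any]
    resolved_entities: list[dict[str, Any]]
    extracted_facts: list[dict[str, Any]]
    pending_facts: list[dict[str, Any]]
    # Disambiguation
    entity_ambiguity: Optional[dict[str, Any]]
    btw_disambiguation_resolved: bool
    blocking_disambiguation_resolved: bool
    # Planning
    global_planner_decision: Optional[dict[str, Any]]
    current_agent_result: Optional[dict[str, Any]]
    recent_agent_results: list[dict[str, Any]]
    # HITL (a plain dict stands in for the UserInTheLoop model)
    user_in_the_loop: Optional[dict[str, Any]]
    hitl_user_response: Optional[str]
    # Response
    user_interface_response: Optional[str]
    presentation_rules: dict[str, Any]
    modality: str


# total=False lets a turn start with only the identity fields populated.
state: GlobalSupervisorStateSketch = {
    "user_message": "Hi",
    "chat_id": "chat-1",
    "user_id": "user-1",
}
```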

State Persistence

| Store | Role | Mechanism |
|---|---|---|
| Redis | Primary state checkpointer | Snapshots after each node via LangGraph checkpointer |
| PostgreSQL | Long-term fallback | Recovery when Redis state is evicted |

The checkpointer uses chat_id as the thread identifier, enabling resume-from-checkpoint for HITL interrupts and crash recovery.
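
The primary-plus-fallback idea can be sketched with two in-memory dicts standing in for Redis and PostgreSQL. Everything here is hypothetical: the class, its methods, and in particular the choice to write every snapshot to both tiers, which the document does not specify. It is not the LangGraph checkpointer API, only an illustration of keying snapshots by chat_id and recovering after eviction.

```python
import copy
from typing import Any, Optional


class FallbackCheckpointStore:
    """Toy two-tier store: a fast primary tier with a durable fallback tier."""

    def __init__(self) -> None:
        self.primary: dict[str, dict[str, Any]] = {}   # stands in for Redis
        self.fallback: dict[str, dict[str, Any]] = {}  # stands in for PostgreSQL

    def save(self, chat_id: str, state: dict[str, Any]) -> None:
        # Assumption: every snapshot is written to both tiers.
        self.primary[chat_id] = copy.deepcopy(state)
        self.fallback[chat_id] = copy.deepcopy(state)

    def load(self, chat_id: str) -> Optional[dict[str, Any]]:
        # Prefer the primary tier; fall back when the entry was evicted.
        snapshot = self.primary.get(chat_id)
        if snapshot is None:
            snapshot = self.fallback.get(chat_id)
        return copy.deepcopy(snapshot) if snapshot is not None else None


store = FallbackCheckpointStore()
store.save("chat-42", {"pending_question": "Which Thomas do you mean?"})
store.primary.clear()  # simulate eviction from the primary tier
resumed = store.load("chat-42")
```

After the simulated eviction, `resumed` still holds the pending question, which is the property a HITL resume depends on.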

Key Design Decisions

1. Synchronous Entity Resolution Before Fact Storage

  • Chosen: Entity resolution runs synchronously, and extracted facts are not persisted until the entities they refer to are resolved
  • Rejected: Parallel entity resolution + fact extraction with post-hoc linking
  • Rationale: When a user says "Thomas is traveling to Mallorca" and there are two contacts named Thomas, storing the fact before knowing which Thomas it belongs to leads to orphaned or misattributed data. The 500–1,500 ms latency cost is acceptable to guarantee data correctness
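
The "two contacts named Thomas" case can be sketched as follows. This is an illustrative toy, not the actual resolver: the `Contact` type, the first-name matching rule, the status strings, and the sample contacts are all made up for the example. The point it demonstrates is that nothing is written to the fact store while the mention is ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    contact_id: str
    name: str


def resolve_entity(mention: str, contacts: list[Contact]) -> list[Contact]:
    """Return every contact whose first name matches the mention (case-insensitive)."""
    needle = mention.casefold()
    return [c for c in contacts if c.name.casefold().split()[:1] == [needle]]


def store_fact_if_unambiguous(
    mention: str,
    fact: str,
    contacts: list[Contact],
    fact_store: dict[str, list[str]],
) -> str:
    """Persist the fact only when exactly one contact matches the mention."""
    matches = resolve_entity(mention, contacts)
    if len(matches) == 1:
        fact_store.setdefault(matches[0].contact_id, []).append(fact)
        return "stored"
    if len(matches) > 1:
        return "needs_disambiguation"  # caller would ask the user before storing
    return "unknown_entity"


contacts = [
    Contact("c1", "Thomas Meier"),   # made-up sample data
    Contact("c2", "Thomas Keller"),
    Contact("c3", "Anna Frei"),
]
facts: dict[str, list[str]] = {}
status = store_fact_if_unambiguous("Thomas", "is traveling to Mallorca", contacts, facts)
```

Here `status` is `"needs_disambiguation"` and `facts` stays empty, so the fact cannot end up attached to the wrong Thomas.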

2. LangGraph StateGraph Over Custom Orchestration

  • Chosen: LangGraph StateGraph with conditional edges and built-in checkpointing
  • Rejected: Custom async pipeline, event-driven choreography
  • Rationale: LangGraph provides deterministic execution, built-in state persistence (checkpointers), native HITL support via interrupt(), and visual debugging. The structured graph makes it straightforward to add, remove, or reorder nodes

3. Optimization Flags for Node Skipping

  • Chosen: Intent classification sets boolean flags (has_extractable_facts, needs_semantic_retrieval, has_preferences) that downstream routing functions use to skip unnecessary nodes
  • Rejected: Running all nodes for every message; lazy evaluation at each node
  • Rationale: A greeting like "Hi" doesn't need fact extraction (~9,000 tokens) or semantic retrieval (~500 ms). The flags save 2–3 seconds and ~15,000 tokens on simple messages while keeping the graph structure uniform
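
A routing function driven by these flags might look like the sketch below. The three flag names come from the text; the `OptimizationFlags` container, the `nodes_to_run` helper, the node labels, and the specific mapping from flags to nodes (for example, treating has_preferences as a reason to run fact extraction) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationFlags:
    has_extractable_facts: bool = False
    has_preferences: bool = False
    needs_semantic_retrieval: bool = False


def nodes_to_run(flags: OptimizationFlags) -> list[str]:
    """Return the optional nodes a message should pass through, in order."""
    nodes: list[str] = []
    if flags.needs_semantic_retrieval:
        nodes.append("semantic_retrieval")
    # Assumption: either facts or preferences in the message warrant extraction.
    if flags.has_extractable_facts or flags.has_preferences:
        nodes.append("fact_extraction")
    return nodes


greeting_path = nodes_to_run(OptimizationFlags())
full_path = nodes_to_run(
    OptimizationFlags(has_extractable_facts=True, needs_semantic_retrieval=True)
)
```

With all flags unset, as for a greeting, the list is empty and both expensive nodes are skipped.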

Interfaces and Contracts

| Interface | Direction | Format | Consumer |
|---|---|---|---|
| Orchestration Service → GlobalSupervisor | Inbound | GlobalSupervisor.run(user_message, model, chat_id, user_id, ...) | Orchestration Service (API layer) |
| GlobalSupervisor → Domain Agents | Outbound | DomainAgentInterface.execute(state) via Agent Registry | Productivity, Wealth, Research, and other domain agents |
| GlobalSupervisor → Memory System | Bidirectional | Repository pattern (read facts, write extracted facts) | Fact storage, entity resolution, semantic search |
| GlobalSupervisor → Redis/PostgreSQL | Outbound | LangGraph checkpointer protocol | State persistence for HITL and recovery |
| GlobalSupervisor → UI (streaming) | Outbound | SupervisorResponseChunkEvent (Server-Sent Events) | Frontend text/voice rendering |
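
The agent contract in the second row can be illustrated with a structural interface and a small registry. Only the names DomainAgentInterface and execute(state) appear in the table; the synchronous signature, the dict-based state and result, the `AgentRegistry` methods, and `EchoAgent` are assumptions for this sketch, and the real interface may well be asynchronous.

```python
from typing import Any, Protocol


class DomainAgentInterface(Protocol):
    """Assumed shape of the contract: take the state, return a result."""

    def execute(self, state: dict[str, Any]) -> dict[str, Any]: ...


class EchoAgent:
    """Hypothetical agent that exists only for this example."""

    def execute(self, state: dict[str, Any]) -> dict[str, Any]:
        return {"agent": "echo", "output": state["user_message"].upper()}


class AgentRegistry:
    """Maps agent names to implementations so the supervisor can look them up."""

    def __init__(self) -> None:
        self._agents: dict[str, DomainAgentInterface] = {}

    def register(self, name: str, agent: DomainAgentInterface) -> None:
        self._agents[name] = agent

    def get(self, name: str) -> DomainAgentInterface:
        try:
            return self._agents[name]
        except KeyError:
            raise KeyError(f"no agent registered under {name!r}") from None


registry = AgentRegistry()
registry.register("echo", EchoAgent())
result = registry.get("echo").execute({"user_message": "hi"})
```

Using a Protocol keeps agents decoupled from the supervisor: any class with a matching execute method qualifies without inheriting from a base class.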

Known Trade-offs and Debt

| Item | Impact | Remediation |
|---|---|---|
| Large build_graph() method | The graph construction in agent.py is ~1,000 lines. Adding or reordering nodes requires reading the full method to understand edge dependencies | Potential refactor: extract stage-level subgraphs (entry, memory, planning, response) into separate builder functions |
| Legacy memory.py node | The old monolithic memory node still exists and is used only for the greeting flow. All other flows use the split memory pipeline (retrieval_router → semantic → temporal → assembly) | Remove legacy node once greeting flow is migrated to the split pipeline |
| State field sprawl | GlobalSupervisorState has grown to ~77 fields across multiple concerns, making it hard to know which fields are relevant at each node | Consider namespaced sub-states or a state schema validation layer |
| No retry on agent execution failures | If a domain agent raises an exception, the supervisor catches it and reports the failure but does not retry | Add configurable retry with exponential backoff for transient failures |
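
The retry remediation in the last row could take roughly the following shape. This is a sketch of one possible approach, not existing Swisper code: the helper name, its parameters, the default values, and the choice of which exception types count as transient are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying transient failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Exceptions outside `retry_on` propagate immediately, so a genuine bug in an agent is reported on the first failure rather than retried. The `sleep` parameter is injectable so the delays can be checked in tests without waiting.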