Global Supervisor — Overview

This content was migrated from Documentation/GLOBAL_SUPERVISOR.md and restructured into audience sections. Review for accuracy against the current codebase.

What This Module Does

The Global Supervisor is the central brain of Swisper. Every conversation starts here — whether a user types a message in the web app or speaks through the voice interface, the Global Supervisor receives it and manages the entire response process from start to finish.

It works like a skilled coordinator: it figures out what the user is asking for, recalls relevant details from past conversations ("you mentioned your son's birthday is in March"), decides which specialized agents need to be involved (a calendar agent, a research agent, etc.), and then assembles everything into a coherent, personalized response.

The supervisor also matches its effort to the request. Simple questions like "what's 2+2?" take a fast track that skips unnecessary processing, while complex requests that need multiple agents go through full planning and execution. This keeps response times low for easy questions without lowering answer quality on harder ones.
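The fast-track idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Swisper's implementation: the `Intent` values mirror the message classes named in this overview, while the stage names, `STAGES_BY_INTENT`, and `select_stages` are hypothetical. The overview does not describe the document-upload path, so the sketch sends it through the full pipeline as a placeholder.

```python
from enum import Enum


class Intent(Enum):
    SIMPLE_CHAT = "simple_chat"
    COMPLEX = "complex"
    GREETING = "greeting"
    DOCUMENT_UPLOAD = "document_upload"


# Hypothetical stage names; fast-track intents list only the work they need.
STAGES_BY_INTENT = {
    Intent.SIMPLE_CHAT: ["generate_response"],
    Intent.GREETING: ["load_preloaded_facts", "generate_response"],
}

# Assumed default for anything without a fast track (including uploads).
FULL_PIPELINE = ["load_memory", "plan", "run_agents", "generate_response"]


def select_stages(intent: Intent) -> list[str]:
    """Return the ordered stages to run for a classified message."""
    return list(STAGES_BY_INTENT.get(intent, FULL_PIPELINE))


print(select_stages(Intent.SIMPLE_CHAT))
print(select_stages(Intent.COMPLEX))
```

Keeping the routing in a lookup table means a new fast track can be added without touching the default path.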

Figure: Swisper greeting screen showing a personalized welcome message that references the user's upcoming flight and recent activity, with suggested follow-up actions.
The Global Supervisor powers the greeting experience: it recalls personal context (upcoming travel, recent events) and suggests relevant next actions.

Who It Serves

Persona	Need
End Users (Swisper customers)	Fast, accurate, personalized responses to their questions — whether simple chat or complex multi-step requests
Product Owners	Understanding of what drives response quality, personalization, and conversation speed
Support Staff	Insight into why a conversation took a particular path or produced a specific response

Key Capabilities

Intent understanding — Determines what the user wants and classifies messages as simple chat, complex multi-agent requests, greetings, or document uploads
Personalized memory — Recalls facts and preferences from past conversations to provide context-aware answers (e.g., remembering a user's travel plans or family members)
Smart delegation — Plans and routes complex requests to the right combination of domain agents (productivity, wealth management, research, etc.)
Ambiguity resolution — When a user mentions "Thomas" but has two contacts named Thomas, the system asks which one they mean before proceeding
Conversation management — Compresses long conversations to stay within token limits while preserving important context
Multi-modal support — Handles both text and voice conversations through the same orchestration pipeline
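As an illustration of the conversation management capability listed above, the sketch below keeps the most recent messages that fit a token budget and replaces everything older with a single summary. It is an assumption-laden toy, not the production logic: `count_tokens` approximates tokens by counting words, `compress_history` is a hypothetical name, and the `summarize` callback stands in for whatever summarization step the real system uses.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())


def compress_history(
    messages: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the newest messages that fit in `budget`; summarize the rest.

    The summary itself is not counted against the budget in this sketch.
    """
    kept: list[str] = []
    used = 0
    # Walk backwards so the most recent turns are preserved verbatim.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()

    older = messages[: len(messages) - len(kept)]
    if older:
        return [summarize(older)] + kept
    return kept


history = ["a b c", "d e", "f g h i", "j"]
compressed = compress_history(
    history,
    budget=5,
    summarize=lambda old: f"[summary of {len(old)} earlier messages]",
)
print(compressed)
```

With a budget of 5, the two newest messages (4 + 1 words) are kept and the two older ones collapse into one summary entry.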

How It Fits in the Platform

The Global Supervisor sits at the center of the Swisper platform, connecting the user-facing API layer to all backend intelligence:

Upstream: Receives messages from the Orchestration Service, which handles API requests from the web and voice frontends
Downstream: Delegates work to domain agents (Productivity Agent, Wealth Agent, Research Agent, etc.) and receives their results
Memory integration: Reads from and writes to the Memory System (fact storage, entity resolution, semantic search)
Response output: Hands off final responses to UI Response Nodes, which format them for text or voice delivery
State persistence: Saves conversation state to Redis (with PostgreSQL fallback) to support resuming after interruptions
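The state persistence item above describes a primary store with a fallback. The following sketch shows one way such a pattern can look. It deliberately avoids real Redis or PostgreSQL client calls: `InMemoryStore` is a stand-in for either backend, and `StateStore`, `FallbackStateStore`, and the use of `ConnectionError` to signal an outage are all assumptions made for illustration.

```python
import json
from typing import Optional, Protocol


class StateStore(Protocol):
    def save(self, key: str, value: str) -> None: ...
    def load(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """Stand-in backend; `available=False` simulates an outage."""

    def __init__(self, available: bool = True) -> None:
        self.available = available
        self._data: dict[str, str] = {}

    def save(self, key: str, value: str) -> None:
        if not self.available:
            raise ConnectionError("store unavailable")
        self._data[key] = value

    def load(self, key: str) -> Optional[str]:
        if not self.available:
            raise ConnectionError("store unavailable")
        return self._data.get(key)


class FallbackStateStore:
    """Try the primary store first; use the fallback if the primary fails."""

    def __init__(self, primary: StateStore, fallback: StateStore) -> None:
        self.primary = primary
        self.fallback = fallback

    def save(self, session_id: str, state: dict) -> str:
        payload = json.dumps(state)
        try:
            self.primary.save(session_id, payload)
            return "primary"
        except ConnectionError:
            self.fallback.save(session_id, payload)
            return "fallback"

    def load(self, session_id: str) -> Optional[dict]:
        # Check stores in priority order, skipping any that are down.
        for store in (self.primary, self.fallback):
            try:
                raw = store.load(session_id)
            except ConnectionError:
                continue
            if raw is not None:
                return json.loads(raw)
        return None


store = FallbackStateStore(InMemoryStore(available=False), InMemoryStore())
print(store.save("session-1", {"step": 2}))
print(store.load("session-1"))
```

The sketch does not reconcile the two stores: if the primary recovers while holding older state than the fallback, `load` returns the older copy. A real design would need a version or timestamp to choose between them.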

Limits and Edge Cases

Single-turn focus — The supervisor processes one user message at a time. It cannot handle true parallel conversations within the same chat session
Agent availability — If a domain agent is unavailable or returns an error, the supervisor will report the failure but cannot independently complete that agent's task
Memory retrieval limits — The system retrieves relevant facts from past conversations, but very old or infrequently accessed facts may not surface in every context
Disambiguation overhead — When entity ambiguity is detected (e.g., two contacts with the same name), the system pauses to ask the user for clarification. This adds a round-trip but prevents errors in personalization

FAQ

Q: What happens when I send a simple greeting like "Hi"?
A: The system detects it as a greeting and takes a fast track, skipping the full memory pipeline. It loads a small set of preloaded facts (like the user's name and recent activity) and generates a personalized greeting in about a second. After the first turn, these preloaded facts are cleared to save processing capacity.

Q: How does the supervisor decide which agents to use?
A: After classifying the user's intent, the Global Planner node creates an execution plan. For a question like "schedule a meeting with Thomas next Tuesday," it would plan to call the Productivity Agent. For "what's the latest news on AI and can you add it to my calendar," it would plan both the Research Agent and the Productivity Agent in sequence.
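To make the idea of an execution plan concrete, here is a toy planner that turns a message into an ordered list of agent steps. It is only a sketch: the overview does not say how the Global Planner node actually decides, and a keyword table like `AGENT_KEYWORDS` is a simplistic stand-in chosen so the example runs on its own. `PlanStep`, `build_plan`, and the agent identifiers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    agent: str
    task: str


# Hypothetical keyword rules, checked in insertion order. This is a
# placeholder for the real planning logic, which is not described here.
AGENT_KEYWORDS = {
    "research_agent": ("news", "latest", "look up"),
    "productivity_agent": ("calendar", "schedule", "meeting"),
}


def build_plan(message: str) -> list[PlanStep]:
    """Return one step per matching agent, in the order agents are listed."""
    text = message.lower()
    plan: list[PlanStep] = []
    for agent, keywords in AGENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            plan.append(PlanStep(agent=agent, task=message))
    return plan


for step in build_plan(
    "what's the latest news on AI and can you add it to my calendar"
):
    print(step.agent)
```

An empty plan (for example, for "what's 2+2?") corresponds to a request that needs no domain agent at all.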

Q: Can the system recover if something goes wrong mid-conversation?
A: Yes. The supervisor saves its state after each processing step. If a conversation is interrupted (by a clarification question, a network issue, or a system restart), it can resume from the last checkpoint rather than starting over.
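Checkpoint-and-resume can be sketched as a loop that records progress after every step and, on the next run, skips whatever already completed. The function name `run_with_checkpoints`, the shape of the saved record, and the use of a plain dictionary as the checkpoint store are assumptions for illustration only.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], dict]]


def run_with_checkpoints(
    steps: list[Step],
    state: dict,
    checkpoints: dict,
    session_id: str,
) -> dict:
    """Run steps in order, saving progress after each completed step.

    If a checkpoint exists for `session_id`, execution resumes from the
    saved step with the saved state instead of starting over.
    """
    saved = checkpoints.get(session_id)
    if saved is not None:
        start, state = saved["next_step"], saved["state"]
    else:
        start = 0

    for index in range(start, len(steps)):
        _name, func = steps[index]
        state = func(state)
        # Store a copy so later mutations do not alter the checkpoint.
        checkpoints[session_id] = {"next_step": index + 1, "state": dict(state)}
    return state


def classify(state: dict) -> dict:
    return {**state, "intent": "complex"}


def plan(state: dict) -> dict:
    return {**state, "plan": ["research_agent"]}


checkpoints: dict = {}
result = run_with_checkpoints(
    [("classify", classify), ("plan", plan)],
    {"message": "latest AI news"},
    checkpoints,
    "session-1",
)
print(result)
```

If a step raises partway through, the checkpoint still points at the first step that did not finish, so calling the function again with the same `session_id` picks up from there.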

Q: Why does Swisper sometimes ask "which Sophie do you mean?"
A: This is the disambiguation system at work. When the user mentions a name that matches multiple contacts in their personal database, the system needs to know exactly who is being referenced before it stores any facts or takes action. Asking is better than guessing wrong.

Figure: Disambiguation prompt showing Swisper asking "Which Sophie are you trying to reach?" with options for Sophie Müller (colleague), Sophie Weber (friend), Someone else, and Never mind.
When a name matches multiple contacts, the Global Supervisor pauses to ask the user which person they mean.
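The disambiguation behavior described above can be approximated with a small lookup: one match resolves immediately, several matches produce a clarification question. This is a simplified sketch. `Contact`, `Resolution`, and `resolve_contact` are hypothetical names, matching on first name only is an assumption, and "Thomas Keller" is made-up sample data; the two Sophie entries and the extra options come from the prompt shown in the figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:
    name: str
    relation: str


@dataclass
class Resolution:
    contact: Optional[Contact] = None
    question: Optional[str] = None
    options: list[str] = field(default_factory=list)


def resolve_contact(first_name: str, contacts: list[Contact]) -> Resolution:
    """Resolve a first name to a contact, or build a clarification prompt."""
    wanted = first_name.lower()
    matches = [
        c for c in contacts
        if (c.name.split() or [""])[0].lower() == wanted
    ]
    if len(matches) == 1:
        return Resolution(contact=matches[0])
    if not matches:
        return Resolution()
    # Ambiguous: pause and ask instead of guessing.
    options = [f"{c.name} ({c.relation})" for c in matches]
    options += ["Someone else", "Never mind"]
    return Resolution(question=f"Which {first_name} do you mean?", options=options)


contacts = [
    Contact("Sophie Müller", "colleague"),
    Contact("Sophie Weber", "friend"),
    Contact("Thomas Keller", "colleague"),  # made-up sample entry
]
result = resolve_contact("Sophie", contacts)
print(result.question)
print(result.options)
```

Returning the question and options as data, rather than printing them, lets the same result feed either a text or a voice prompt.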

Q: How long does a typical response take?
A: Simple questions (greetings, math, basic chat) take roughly 1–2 seconds. Complex requests involving domain agents typically take 2–5 seconds depending on how many agents are needed and the complexity of each agent's task. The system streams responses progressively, so the user sees the first words appear within about 200 milliseconds.