HITL System — Architecture
This content was migrated from Documentation/SWISPER_HUMAN_IN_THE_LOOP_ARCHITECTURE.md and
restructured into audience sections. Review for accuracy against
the current codebase.
Context and Purpose
The HITL System exists because AI agents sometimes need human input to proceed safely and correctly. Without it, agents would either guess (risking errors) or fail silently (losing user trust).
The driving requirements are:
- Agents never talk directly to users — All user-facing questions go through a centralized handler for consistent UX
- Indefinite pause/resume — The system must survive server restarts and arbitrarily long user pauses via LangGraph checkpoint persistence
- Multi-turn support — Complex tasks may require multiple clarification rounds before completion
- Blocking vs non-blocking — Entity disambiguation must distinguish between cases that block the answer and cases that are incidental
Architecture Overview
flowchart TB
subgraph Triggers["HITL Triggers"]
DA[Domain Agent] -->|WAITING_FOR_INPUT| AE[Agent Execution Node]
ER[Entity Resolution] -->|ambiguity detected| DB[Disambiguation Blocking]
end
subgraph Orchestration["HITL Orchestration"]
AE -->|user_in_the_loop.is_waiting| UIR[UI Router]
DB -->|streams question| UIR
UIR --> HT[HITL Text Node]
HT -->|streams question to frontend| HANDLER[HITL Handler]
HANDLER -->|"interrupt()"| PAUSE([Graph Pauses])
PAUSE -->|checkpoint to Redis| WAIT[Wait for User]
end
subgraph Resume["Resume Flow"]
WAIT -->|user responds| CMD["Command(resume=answer)"]
CMD --> HANDLER2[HITL Handler]
HANDLER2 -->|escaping?| NEW[Process New Intent]
HANDLER2 -->|answer received| ROUTE{Resume Route}
ROUTE -->|agent HITL| AE2[Agent Execution]
ROUTE -->|disambiguation| DR[Disambiguation Resolution]
end
subgraph Resolution["Disambiguation Resolution"]
DR -->|resolve entity| FACTS[Persist Pending Facts]
DR -->|create new entity| CNE[Create New Entity Node]
FACTS --> CONTINUE[Continue Graph]
CNE --> CONTINUE
end
Flow summary: A trigger (agent or disambiguation) creates a UserInTheLoop payload with is_waiting=True. The HITL Handler calls interrupt() to pause the graph and checkpoint state. When the user responds, Command(resume=answer) restores execution. The handler routes to either the original agent or disambiguation resolution based on the interrupt source.
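The resume step in the flow summary can be sketched as follows. This is a minimal illustration, not the project's handler: the dataclass is a stdlib stand-in for a subset of the UserInTheLoop Pydantic model, and the helper names (apply_answer, resume_route), the node-name strings, and the field defaults are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserInTheLoop:
    """Subset of the documented payload (dataclass stand-in, defaults assumed)."""
    question: str = ""
    answer: Optional[str] = None
    is_waiting: bool = False
    escaping: bool = False
    target_agent: Optional[str] = None
    source_node: Optional[str] = None
    previous_questions: list = field(default_factory=list)
    previous_answers: list = field(default_factory=list)


def apply_answer(hitl: UserInTheLoop, answer: str, escaping: bool = False) -> UserInTheLoop:
    """Record the resumed answer, extend the multi-turn history, clear the wait flag."""
    hitl.previous_questions.append(hitl.question)
    hitl.previous_answers.append(answer)
    hitl.answer = answer
    hitl.escaping = escaping
    hitl.is_waiting = False
    return hitl


def resume_route(hitl: UserInTheLoop) -> str:
    """Pick the next node after a resume. Node names here are hypothetical."""
    if hitl.escaping:
        # The user abandoned the pending question; treat the message as a new intent.
        return "process_new_intent"
    if hitl.source_node == "disambiguation_blocking":
        return "disambiguation_resolution"
    # Default: hand the answer back to the agent that asked.
    return "agent_execution"


pending = UserInTheLoop(
    question="Which Anna do you mean?",
    is_waiting=True,
    source_node="disambiguation_blocking",
)
apply_answer(pending, "Anna Meier")
next_node = resume_route(pending)
```

Checking escaping before the interrupt source means an abandoned question never reaches the agent or the resolution node, which matches the order of the two HANDLER2 edges in the diagram.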
Component Responsibilities
| Component | Responsibility |
| --- | --- |
| HITL Handler Node | Central orchestrator: calls interrupt(), processes the user response, detects escaping, routes to the resume target |
| Disambiguation Blocking Node | Generates ask-only questions for critical entity ambiguity. Streams the question and sets is_waiting=True |
| Disambiguation Resolution Node | Resolves the user's entity choice via a fast path (exact match) or LLM semantic matching. Persists pending facts with the correct entity |
| Create New Entity Node | Handles the "Someone else" flow: the LLM extracts role/context and a new Person record is created |
| HITL Text Node | Formats and streams HITL questions to the frontend (bypasses the LLM) |
| Agent Execution Node | Detects WAITING_FOR_INPUT status from domain agents and propagates UserInTheLoop to state |
Data Model
UserInTheLoop (Pydantic Model)
| Field | Type | Purpose |
| --- | --- | --- |
| question | str | Current question to display |
| answer | str | User's answer (populated on resume) |
| is_waiting | bool | True = graph is paused waiting for the user |
| escaping | bool | True = user wants to abandon the current task |
| target_agent | str | Which agent to resume (e.g., "productivity_agent") |
| source_node | str | Which node triggered the interrupt |
| needs_clarification | bool | Question-type flag: the agent is missing data it needs |
| needs_confirmation | bool | Question-type flag: a risky action needs approval |
| last_question_type | str | "clarification", "confirmation", or "disambiguation" |
| stored_data | dict | Context preserved across the interrupt (search results, drafts, entity options) |
| tool_results | dict | Tool execution results preserved across the interrupt |
| previous_questions / previous_answers | list[str] | Multi-turn history |
| modality | str | "text" or "voice" |
State Persistence
| Store | What's Saved | Mechanism |
| --- | --- | --- |
| Redis | Full graph state including UserInTheLoop, agent context, partial results | LangGraph checkpoint via interrupt() |
| PostgreSQL | Fallback state for crash recovery | StatePersistenceManager |
Key Design Decisions
1. LangGraph interrupt() Over Custom Polling
- Chosen: Native LangGraph interrupt() with Command(resume=...) for pause/resume
- Rejected: Custom polling loop, WebSocket-based waiting, database flag checking
- Rationale: interrupt() integrates directly with LangGraph's checkpoint system, providing automatic state persistence, deterministic resume, and multi-turn support without custom infrastructure
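To make the pause/resume contract concrete, here is a dependency-free toy that imitates it. It deliberately does not use LangGraph: MiniRuntime, GraphPaused, and hitl_handler are hypothetical names, and an in-memory dict stands in for the Redis checkpoint store. In LangGraph itself the corresponding pieces are interrupt() and Command(resume=...) from langgraph.types, used on a graph compiled with a checkpointer.

```python
from typing import Any, Callable, Optional

_NO_VALUE = object()  # sentinel: "no resume value supplied"


class GraphPaused(Exception):
    """Raised by the stand-in interrupt() when there is nothing to resume with."""

    def __init__(self, question: str) -> None:
        super().__init__(question)
        self.question = question


class MiniRuntime:
    """Toy runtime: checkpoints state per thread and re-runs the node on resume."""

    def __init__(self) -> None:
        self.checkpoints: dict = {}  # thread_id -> saved state (stand-in for Redis)
        self._resume: Any = _NO_VALUE

    def interrupt(self, question: str) -> Any:
        if self._resume is _NO_VALUE:
            raise GraphPaused(question)  # first pass: pause here
        value, self._resume = self._resume, _NO_VALUE
        return value  # resumed pass: hand back the user's answer

    def run(self, node: Callable, thread_id: str,
            state: Optional[dict] = None, resume: Any = _NO_VALUE) -> dict:
        if state is None:
            state = self.checkpoints[thread_id]  # restore the paused state
        self._resume = resume
        try:
            result = node(self, dict(state))
        except GraphPaused as pause:
            self.checkpoints[thread_id] = dict(state)  # persist before waiting
            return {"status": "paused", "question": pause.question}
        self.checkpoints.pop(thread_id, None)
        return {"status": "done", "state": result}


def hitl_handler(rt: MiniRuntime, state: dict) -> dict:
    # Runs from the top on every pass, so nothing before interrupt() should
    # have side effects that are unsafe to repeat.
    state["answer"] = rt.interrupt(state["question"])
    return state


rt = MiniRuntime()
first = rt.run(hitl_handler, "thread-1", state={"question": "Send the email now?"})
second = rt.run(hitl_handler, "thread-1", resume="yes")
```

The toy re-runs the node from its first line on resume rather than continuing mid-function. As far as I can tell this mirrors how LangGraph documents interrupt(), which is why the handler keeps the interrupt call ahead of any non-idempotent work.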
2. Centralized Handler Over Per-Agent Questions
- Chosen: All HITL requests flow through a single handler node that formats and delivers questions
- Rejected: Each agent formats its own user-facing questions
- Rationale: Consistent UX across all agents and channels. Single point for audit logging, A/B testing of question formats, and channel-specific adaptation (text vs voice)
3. Blocking vs Non-Blocking Disambiguation
- Chosen: Two separate flows — blocking (ask first) and non-blocking (answer first, ask "by the way")
- Rejected: Always blocking; always non-blocking; let the LLM decide
- Rationale: Blocking every disambiguation adds unnecessary latency for incidental mentions. Non-blocking every time risks incorrect answers when entity identity matters. The entity resolution node sets a relevance field ("blocking" vs "non_blocking") based on whether the entity affects the answer
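A routing function over the relevance field might look like the sketch below. Only the entity_ambiguity dict and the "blocking" / "non_blocking" values come from this document; the function name, the returned node names, and the choice to treat a missing relevance value as blocking are assumptions.

```python
from typing import Optional


def route_after_entity_resolution(entity_ambiguity: Optional[dict]) -> str:
    """Choose the next step based on the relevance field (node names hypothetical)."""
    if not entity_ambiguity:
        # Nothing ambiguous: continue the normal graph.
        return "continue"
    # Assumption: when relevance is missing, ask first rather than risk a wrong answer.
    relevance = entity_ambiguity.get("relevance", "blocking")
    if relevance == "non_blocking":
        # Answer first, then raise the question as a "by the way" follow-up.
        return "answer_then_ask"
    return "disambiguation_blocking"


incidental = {"mention": "Anna", "relevance": "non_blocking"}
critical = {"mention": "Anna", "relevance": "blocking"}
routes = [
    route_after_entity_resolution(None),
    route_after_entity_resolution(incidental),
    route_after_entity_resolution(critical),
]
```

Defaulting to the blocking path is the conservative choice here: it costs one extra question, while the opposite default could attach facts to the wrong entity.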
Interfaces and Contracts
| Interface | Direction | Format | Consumer |
| --- | --- | --- | --- |
| Domain Agents → HITL | Inbound | DomainAgentResult(status=WAITING_FOR_INPUT, user_in_the_loop=...) | Agent Execution Node |
| Entity Resolution → HITL | Inbound | entity_ambiguity dict in state with relevance field | Disambiguation Blocking Node |
| HITL → Frontend | Outbound | Streamed question via SupervisorResponseChunkEvent | Frontend chat UI |
| Frontend → HITL | Inbound | User message → Command(resume=user_message) | HITL Handler Node |
| HITL → Disambiguation Resolution | Outbound | UserInTheLoop.answer + entity_ambiguity context | Disambiguation Resolution Node |
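The first contract in the table (Domain Agents → HITL) can be illustrated with a small sketch of what the Agent Execution Node does with a result. DomainAgentResult, WAITING_FOR_INPUT, and user_in_the_loop are named in this document; the enum, the COMPLETED status, the output field, the agent_output state key, and propagate_hitl itself are hypothetical, and plain dicts stand in for the real state and payload types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentStatus(str, Enum):
    COMPLETED = "COMPLETED"                  # hypothetical non-HITL status
    WAITING_FOR_INPUT = "WAITING_FOR_INPUT"  # status named in the interface table


@dataclass
class DomainAgentResult:
    status: AgentStatus
    user_in_the_loop: Optional[dict] = None  # dict stand-in for UserInTheLoop
    output: Optional[str] = None             # hypothetical field


def propagate_hitl(state: dict, result: DomainAgentResult) -> dict:
    """Copy a waiting agent's HITL payload into graph state; otherwise pass output on."""
    new_state = dict(state)
    if result.status is AgentStatus.WAITING_FOR_INPUT:
        if result.user_in_the_loop is None:
            raise ValueError("WAITING_FOR_INPUT requires a user_in_the_loop payload")
        payload = dict(result.user_in_the_loop)
        payload["is_waiting"] = True  # the flag the router checks downstream
        new_state["user_in_the_loop"] = payload
    else:
        new_state["agent_output"] = result.output
    return new_state


waiting = DomainAgentResult(
    status=AgentStatus.WAITING_FOR_INPUT,
    user_in_the_loop={"question": "Which calendar?", "target_agent": "productivity_agent"},
)
state_after = propagate_hitl({"messages": []}, waiting)
```

Rejecting a WAITING_FOR_INPUT result that carries no payload keeps a malformed agent response from pausing the graph with no question to show.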
Known Trade-offs and Debt
| Item | Impact | Remediation |
| --- | --- | --- |
| Policy management not yet implemented | The legacy doc describes a full PolicyManager service for tool-level policies (email approval, transfer thresholds). This is designed but not built | Implement PolicyManager when domain agents need configurable approval rules |
| Single active interrupt | Only one HITL question at a time per conversation. Agents needing multiple inputs must ask sequentially | Could batch questions into a single structured form, but adds UI complexity |
| No HITL analytics | No tracking of approval rates, common clarification patterns, or user response times | Add audit trail and analytics when compliance reporting is needed |