HITL System: Architecture

This content was migrated from Documentation/SWISPER_HUMAN_IN_THE_LOOP_ARCHITECTURE.md and restructured into audience sections. Review for accuracy against the current codebase.

Context and Purpose

The Human-in-the-Loop (HITL) System exists because AI agents sometimes need human input to proceed safely and correctly. Without it, an agent would either guess (risking errors) or fail silently (losing user trust).

The driving requirements are:

Agents never talk directly to users: all user-facing questions go through a centralized handler for a consistent UX
Indefinite pause/resume: a paused conversation must survive server restarts and arbitrarily long user pauses, which LangGraph checkpoint persistence provides
Multi-turn support: complex tasks may require several clarification rounds before completion
Blocking vs non-blocking: entity disambiguation must distinguish ambiguities that have to be resolved before answering (blocking) from ambiguities that are incidental to the answer (non-blocking)

Architecture Overview

flowchart TB
    subgraph Triggers["HITL Triggers"]
        DA[Domain Agent] -->|WAITING_FOR_INPUT| AE[Agent Execution Node]
        ER[Entity Resolution] -->|ambiguity detected| DB[Disambiguation Blocking]
    end

    subgraph Orchestration["HITL Orchestration"]
        AE -->|user_in_the_loop.is_waiting| UIR[UI Router]
        DB -->|streams question| UIR
        UIR --> HT[HITL Text Node]
        HT -->|streams question to frontend| HANDLER[HITL Handler]
        HANDLER -->|"interrupt()"| PAUSE([Graph Pauses])
        PAUSE -->|checkpoint to Redis| WAIT[Wait for User]
    end

    subgraph Resume["Resume Flow"]
        WAIT -->|user responds| CMD["Command(resume=answer)"]
        CMD --> HANDLER2[HITL Handler]
        HANDLER2 -->|escaping?| NEW[Process New Intent]
        HANDLER2 -->|answer received| ROUTE{Resume Route}
        ROUTE -->|agent HITL| AE2[Agent Execution]
        ROUTE -->|disambiguation| DR[Disambiguation Resolution]
    end

    subgraph Resolution["Disambiguation Resolution"]
        DR -->|resolve entity| FACTS[Persist Pending Facts]
        DR -->|create new entity| CNE[Create New Entity Node]
        FACTS --> CONTINUE[Continue Graph]
        CNE --> CONTINUE
    end

Flow summary: A trigger (a domain agent or entity disambiguation) creates a `UserInTheLoop` payload with `is_waiting=True`. The HITL Handler calls `interrupt()`, which pauses the graph and checkpoints its state. When the user responds, `Command(resume=answer)` resumes execution. If the user is escaping (abandoning the current task), the handler processes the reply as a new intent. Otherwise it routes to either the original agent or disambiguation resolution, depending on which source triggered the interrupt.
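The resume-side routing described in the flow summary can be sketched as a pure function. This is a minimal illustration, not the project's handler: the returned node names are hypothetical, a plain dataclass stands in for the `UserInTheLoop` Pydantic model, and `last_question_type` is assumed to be a usable proxy for the interrupt source (the real handler may inspect `source_node` instead).

```python
from dataclasses import dataclass


@dataclass
class UserInTheLoop:
    """Stand-in for the real Pydantic model; only the fields routing reads."""
    answer: str = ""
    escaping: bool = False
    last_question_type: str = "clarification"
    target_agent: str = ""


def route_after_resume(hitl: UserInTheLoop) -> str:
    """Return the name of the next node (all node names are hypothetical)."""
    if hitl.escaping:
        # The user abandoned the pending task; treat the reply as a new request.
        return "process_new_intent"
    if hitl.last_question_type == "disambiguation":
        # The answer selects an entity, so it goes to disambiguation resolution.
        return "disambiguation_resolution"
    # Clarification and confirmation answers go back to the waiting agent.
    return "agent_execution"
```

Checking `escaping` first mirrors the diagram, where the escape branch leaves the resume flow before any resume route is chosen.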

Component Responsibilities

Component	Responsibility
HITL Handler Node	Central orchestrator — calls `interrupt()`, processes user response, detects escaping, routes to resume target
Disambiguation Blocking Node	Generates ask-only questions for critical entity ambiguity. Streams question, sets `is_waiting=True`
Disambiguation Resolution Node	Resolves user's entity choice — fast-path (exact match) or LLM semantic matching. Persists pending facts with correct entity
Create New Entity Node	Handles "Someone else" flow — LLM extracts role/context, creates new Person record
HITL Text Node	Formats and streams HITL questions to the frontend (bypasses LLM)
Agent Execution Node	Detects `WAITING_FOR_INPUT` status from domain agents and propagates `UserInTheLoop` to state

Data Model

UserInTheLoop (Pydantic Model)

Field	Type	Purpose
`question`	`str`	Current question to display
`answer`	`str`	User's answer (populated on resume)
`is_waiting`	`bool`	`True` = graph is paused waiting for user
`escaping`	`bool`	`True` = user wants to abandon current task
`target_agent`	`str`	Which agent to resume (e.g., `"productivity_agent"`)
`source_node`	`str`	Which node triggered the interrupt
`needs_clarification`	`bool`	`True` = the question asks for missing data
`needs_confirmation`	`bool`	`True` = the question asks for approval of a risky action
`last_question_type`	`str`	`"clarification"`, `"confirmation"`, or `"disambiguation"`
`stored_data`	`dict`	Context preserved across the interrupt (search results, drafts, entity options)
`tool_results`	`dict`	Tool execution results preserved across the interrupt
`previous_questions` / `previous_answers`	`list[str]`	Multi-turn history
`modality`	`str`	`"text"` or `"voice"`
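As a rough picture of how these fields fit together, the sketch below mirrors the table with a standard-library dataclass. The source describes a Pydantic model; the dataclass is used here only to keep the example dependency-free. Every default value and the `record_answer` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserInTheLoop:
    # Field names and types follow the table above; every default is an assumption.
    question: str = ""
    answer: str = ""
    is_waiting: bool = False
    escaping: bool = False
    target_agent: str = ""
    source_node: str = ""
    needs_clarification: bool = False
    needs_confirmation: bool = False
    last_question_type: str = ""
    stored_data: dict = field(default_factory=dict)
    tool_results: dict = field(default_factory=dict)
    previous_questions: list[str] = field(default_factory=list)
    previous_answers: list[str] = field(default_factory=list)
    modality: str = "text"

    def record_answer(self, answer: str) -> None:
        """Hypothetical helper: store the answer and archive the finished turn."""
        self.answer = answer
        self.is_waiting = False
        self.previous_questions.append(self.question)
        self.previous_answers.append(answer)


# One clarification round for a hypothetical calendar task.
pending = UserInTheLoop(
    question="Which calendar should I use?",
    is_waiting=True,
    target_agent="productivity_agent",
    needs_clarification=True,
    last_question_type="clarification",
    stored_data={"draft_title": "Team sync"},
)
pending.record_answer("Work")
```

After `record_answer`, `stored_data` is untouched, which is the point of the field: context such as drafts survives the interrupt while the question and answer move into the multi-turn history.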

State Persistence

Store	What's Saved	Mechanism
Redis	Full graph state including `UserInTheLoop`, agent context, partial results	LangGraph checkpoint via `interrupt()`
PostgreSQL	Fallback state for crash recovery	`StatePersistenceManager`

Key Design Decisions

1. LangGraph `interrupt()` Over Custom Polling

Chosen: Native LangGraph `interrupt()` with `Command(resume=...)` for pause/resume
Rejected: Custom polling loop, WebSocket-based waiting, database flag checking
Rationale: `interrupt()` integrates directly with LangGraph's checkpoint system, providing automatic state persistence, deterministic resume, and multi-turn support without custom infrastructure

2. Centralized Handler Over Per-Agent Questions

Chosen: All HITL requests flow through a single handler node that formats and delivers questions
Rejected: Each agent formats its own user-facing questions
Rationale: Consistent UX across all agents and channels. Single point for audit logging, A/B testing of question formats, and channel-specific adaptation (text vs voice)
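One benefit of a single delivery point is that channel adaptation lives in one function rather than in every agent. The sketch below illustrates that idea only. The function name, its parameters, and both formatting rules are hypothetical; the source does not specify how questions are rendered for text or voice.

```python
def format_question(question: str, options: list[str], modality: str = "text") -> str:
    """Render one HITL question for a channel (hypothetical formatting rules)."""
    if not options:
        return question
    if modality == "voice":
        # Spoken output: no list markup, options read out in one sentence.
        if len(options) == 1:
            spoken = options[0]
        else:
            spoken = ", ".join(options[:-1]) + " or " + options[-1]
        return f"{question} You can say: {spoken}."
    # Text output: numbered options, one per line.
    lines = [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    return "\n".join([question, *lines])
```

An agent would supply only the question and options; the rendering choice is driven by the `modality` value carried in `UserInTheLoop`.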

3. Blocking vs Non-Blocking Disambiguation

Chosen: Two separate flows — blocking (ask first) and non-blocking (answer first, ask "by the way")
Rejected: Always blocking; always non-blocking; let the LLM decide
Rationale: Blocking on every disambiguation adds unnecessary latency for incidental mentions, while never blocking risks incorrect answers when entity identity matters. The entity resolution node therefore sets a relevance field (`"blocking"` or `"non_blocking"`) according to whether the entity's identity affects the answer
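The two flows amount to a routing decision on the relevance value. A minimal sketch follows, assuming the ambiguity is stored under an `entity_ambiguity` key with a `relevance` entry. The exact key names and all three returned node names are assumptions, not taken from the codebase.

```python
def route_entity_ambiguity(state: dict) -> str:
    """Pick the next node from the ambiguity's relevance (node names are hypothetical)."""
    ambiguity = state.get("entity_ambiguity")
    if not ambiguity:
        # Nothing to disambiguate: go straight to answering.
        return "answer"
    if ambiguity.get("relevance") == "blocking":
        # The entity's identity changes the answer: ask first.
        return "disambiguation_blocking"
    # Incidental mention: answer first, ask "by the way" afterwards.
    return "answer_then_ask"
```

Treating any value other than `"blocking"` as non-blocking is a design choice of this sketch; it keeps an unexpected or missing relevance value from stalling the conversation.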

Interfaces and Contracts

Interface	Direction	Format	Consumer
Domain Agents → HITL	Inbound	`DomainAgentResult(status=WAITING_FOR_INPUT, user_in_the_loop=...)`	Agent Execution Node
Entity Resolution → HITL	Inbound	`entity_ambiguity` dict in state with relevance field	Disambiguation Blocking Node
HITL → Frontend	Outbound	Streamed question via `SupervisorResponseChunkEvent`	Frontend chat UI
Frontend → HITL	Inbound	User message → `Command(resume=user_message)`	HITL Handler Node
HITL → Disambiguation Resolution	Outbound	`UserInTheLoop.answer` + `entity_ambiguity` context	Disambiguation Resolution Node
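The first contract in the table, where the Agent Execution Node detects `WAITING_FOR_INPUT` and propagates `UserInTheLoop` into state, might look roughly like the sketch below. Only `DomainAgentResult`, `WAITING_FOR_INPUT`, `user_in_the_loop`, and `UserInTheLoop` come from the source. The `AgentStatus` enum, the `COMPLETED` member, the `output` field, the `agent_output` state key, and the function name are hypothetical, and dataclasses stand in for the real models.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentStatus(Enum):
    COMPLETED = "completed"  # hypothetical member
    WAITING_FOR_INPUT = "waiting_for_input"


@dataclass
class UserInTheLoop:
    question: str
    target_agent: str
    is_waiting: bool = False


@dataclass
class DomainAgentResult:
    status: AgentStatus
    output: str = ""  # hypothetical field
    user_in_the_loop: Optional[UserInTheLoop] = None


def propagate_agent_result(state: dict, result: DomainAgentResult) -> dict:
    """Return updated state; copy the agent's question into state when it is waiting."""
    if result.status is AgentStatus.WAITING_FOR_INPUT and result.user_in_the_loop:
        hitl = result.user_in_the_loop
        hitl.is_waiting = True  # downstream routing keys off this flag
        return {**state, "user_in_the_loop": hitl}
    return {**state, "agent_output": result.output}
```

The agent only describes what it needs; it never emits the question to the user itself, which keeps delivery inside the centralized handler.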

Known Trade-offs and Debt

Item	Impact	Remediation
Policy management not yet implemented	The legacy doc describes a full `PolicyManager` service for tool-level policies (email approval, transfer thresholds). This is designed but not built	Implement `PolicyManager` when domain agents need configurable approval rules
Single active interrupt	Only one HITL question at a time per conversation. Agents needing multiple inputs must ask sequentially	Could batch questions into a single structured form, but adds UI complexity
No HITL analytics	No tracking of approval rates, common clarification patterns, or user response times	Add audit trail and analytics when compliance reporting is needed