Summarization: Overview
This content was migrated from `Documentation/SUMMARIZATION_SYSTEM.md` and restructured into audience sections. Review for accuracy against the current codebase.
What This Module Does
The Summarization System compresses long conversations to keep Swisper fast and cost-efficient. When a conversation exceeds 20 messages or ~4,000 tokens, the system generates a concise summary of older messages while preserving the most recent exchanges verbatim. It also regenerates the chat title to reflect the evolved conversation topic.
Without summarization, a 30-message conversation would send ~22,000 tokens of context to every LLM call. With summarization, that drops to ~3,500 tokens: an 84% reduction in context tokens per call, which lowers cost by a similar share and noticeably improves latency.
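The reduction figure follows directly from the two token counts quoted above. A quick check, using only those approximate numbers:

```python
# Approximate token counts quoted in this section.
tokens_without_summary = 22_000
tokens_with_summary = 3_500

# Fraction of context tokens saved on each LLM call.
reduction = 1 - tokens_with_summary / tokens_without_summary
print(f"{reduction:.0%}")  # prints "84%"
```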
Who It Serves
| Persona | Need |
|---|---|
| End Users | Fast responses even in long conversations, with chat titles that reflect what the conversation is actually about |
| Product Owners | Control over token costs and response latency as conversations grow |
| Operations | Predictable token usage patterns regardless of conversation length |
Key Capabilities
- Threshold-based triggering — Summarization runs only when needed: >20 messages or >4,000 estimated tokens
- Smart loading — When a summary exists, only the summary + last 4 messages are loaded from the database, avoiding unnecessary reads
- Iterative summarization — New summaries incorporate the previous summary, so context accumulates without losing earlier decisions
- Title regeneration — After summarizing, the chat title is updated to reflect the current conversation topic (e.g., "Hello" → "MCP Integration Project")
- Graceful degradation — If the LLM call fails, the system continues with truncated messages rather than breaking the conversation
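The capabilities above can be sketched in a few functions. This is a minimal illustration, not the actual Swisper implementation: the `Message` shape, the function names, and the `call_llm` callable are all hypothetical, and only the thresholds (20 messages, 4,000 tokens, last 4 messages, ~4 characters per token) come from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Values taken from this page.
MESSAGE_THRESHOLD = 20
TOKEN_THRESHOLD = 4_000
RECENT_MESSAGES_KEPT = 4
CHARS_PER_TOKEN = 4  # rough heuristic


@dataclass
class Message:  # hypothetical shape, for illustration only
    role: str
    content: str


def should_summarize(messages: list[Message]) -> bool:
    """Threshold-based triggering: either limit being exceeded is enough."""
    estimated_tokens = sum(len(m.content) for m in messages) // CHARS_PER_TOKEN
    return len(messages) > MESSAGE_THRESHOLD or estimated_tokens > TOKEN_THRESHOLD


def summarize(
    messages: list[Message],
    previous_summary: Optional[str],
    call_llm: Callable[[str], str],  # hypothetical LLM client wrapper
) -> tuple[Optional[str], list[Message]]:
    """Return (summary, messages kept verbatim)."""
    older = messages[:-RECENT_MESSAGES_KEPT]
    recent = messages[-RECENT_MESSAGES_KEPT:]

    # Iterative summarization: the previous summary is part of the input.
    parts = []
    if previous_summary:
        parts.append(f"Previous summary:\n{previous_summary}")
    parts.append("\n".join(f"{m.role}: {m.content}" for m in older))

    try:
        summary = call_llm("\n\n".join(parts))
    except Exception:
        # Graceful degradation: continue with the truncated history
        # and whatever summary already existed.
        return previous_summary, recent
    return summary, recent


# Usage with a stand-in LLM that returns a fixed string.
history = [
    Message("user" if i % 2 == 0 else "assistant", f"message {i}")
    for i in range(22)
]
if should_summarize(history):
    summary, kept = summarize(history, None, lambda prompt: "short summary")
    print(summary, len(kept))  # prints "short summary 4"
```

Returning the kept messages alongside the summary mirrors the smart-loading idea: later steps only need the summary plus the last few messages, not the full history.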
How It Fits in the Platform
- Upstream: Runs after `session_init` (which loads chat history) and before `context_loader`
- Trigger: The `summarization_check` node evaluates message count and token estimates; the routing function then either directs to the summarization node or skips to context loading
- Downstream: The summary is used by all subsequent nodes as conversation context, reducing token usage in intent classification, fact extraction, and response generation
- Persistence: The summarization node is computation-only; all database writes (summary, title) happen atomically in the `message_persist_node` at the end of the turn
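The flow described above, with routing after the check, a computation-only summarization step, and a single write at the end of the turn, might look roughly like this. The state keys (`needs_summarization`, `pending_summary`, `pending_title`), the `"summarization"` route label, and the dict standing in for the database are assumptions made for illustration; only `context_loader` and `message_persist_node` are names taken from this page.

```python
from typing import Any, Callable

State = dict[str, Any]  # hypothetical state shape


def route_after_summarization_check(state: State) -> str:
    """Choose the next node from the flag set by the check node."""
    return "summarization" if state.get("needs_summarization") else "context_loader"


def summarization_node(
    state: State,
    summarize: Callable[[list, Any], tuple[str, str]],  # hypothetical helper
) -> State:
    """Computation only: produce new values, write nothing to the database."""
    summary, title = summarize(state["messages"], state.get("summary"))
    return {**state, "pending_summary": summary, "pending_title": title}


def message_persist_node(state: State, db: dict) -> State:
    """End of turn: apply all pending writes in one place.

    A real implementation would wrap this in a database transaction;
    the single dict update here is only a stand-in for that.
    """
    updates = {}
    if "pending_summary" in state:
        updates["summary"] = state["pending_summary"]
    if "pending_title" in state:
        updates["title"] = state["pending_title"]
    db.update(updates)
    return state
```

Keeping the summarization step free of side effects means a failure mid-turn leaves the stored summary and title untouched, since nothing is written until the persist step runs.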
Limits and Edge Cases
- One-time latency cost — Summarization adds ~1–2 seconds when it triggers, but saves ~200ms on every subsequent turn
- Information loss — Summarization necessarily discards detail. Key decisions and facts are preserved, but nuances of earlier turns may be lost
- Token estimation is approximate — The trigger uses ~4 chars per token as a heuristic, which may be inaccurate for non-Latin languages
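To make the last point concrete, here is a small sketch of a character-count estimator. The function name and the sample strings are illustrative assumptions; only the ~4 characters per token ratio comes from this page.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Character-count heuristic: roughly one token per 4 characters."""
    return len(text) // chars_per_token


english = "Hello, how are you today?"  # 25 characters
japanese = "こんにちは、元気ですか"  # 11 characters

print(estimate_tokens(english))   # prints 6
print(estimate_tokens(japanese))  # prints 2
```

The heuristic only looks at character count. Tokenizers used with LLMs often emit about one token or more per CJK character, so the real count for the second string is likely well above 2, and the token threshold would be reached later than intended for such conversations.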
FAQ
Q: How often does summarization run?
A: It triggers when a conversation exceeds 20 messages or ~4,000 tokens. After summarizing, messages accumulate again until a threshold is hit. In a typical conversation, it runs every ~8–10 turns after the first trigger.
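The cadence in this answer can be reproduced with a short simulation of the message-count threshold alone. It assumes each turn adds two messages (one user message and one assistant reply) and that exactly 4 messages remain after a summarization; the token threshold, which can trigger earlier in conversations with long messages, is ignored here.

```python
MESSAGE_THRESHOLD = 20    # from this page
RECENT_MESSAGES_KEPT = 4  # from this page
MESSAGES_PER_TURN = 2     # assumption: one user message + one assistant reply

count = RECENT_MESSAGES_KEPT  # messages left right after a summarization
turns = 0
while count <= MESSAGE_THRESHOLD:  # trigger requires count > threshold
    count += MESSAGES_PER_TURN
    turns += 1

print(turns)  # prints 9
```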
Q: Will I notice when summarization happens?
A: You may see the chat title update in the sidebar. The summarization itself adds ~1–2 seconds to that turn, but subsequent turns are faster.
Q: Can summarization lose important information?
A: The system preserves key decisions, facts, and unresolved items. The last 4 messages (2 turns) are always kept verbatim. Very early conversational nuances may be compressed, but critical facts are also captured independently by the Fact System.