Summarization: Overview

This content was migrated from Documentation/SUMMARIZATION_SYSTEM.md and restructured into audience sections. Review for accuracy against the current codebase.

What This Module Does

The Summarization System compresses long conversations to keep Swisper fast and cost-efficient. When a conversation exceeds 20 messages or ~4,000 tokens, the system generates a concise summary of older messages while preserving the most recent exchanges verbatim. It also regenerates the chat title to reflect the evolved conversation topic.
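The trigger rule and the split between summarized and verbatim messages can be sketched roughly as follows. This is a minimal illustration, not Swisper's actual code: the function and constant names are hypothetical, messages are simplified to plain strings, and the verbatim window of 4 messages and the ~4 characters-per-token estimate are taken from figures quoted elsewhere on this page.

```python
from typing import List, Tuple

# Thresholds as described on this page; the constant names are hypothetical.
MAX_MESSAGES = 20
MAX_TOKENS = 4_000
KEEP_RECENT = 4       # last 4 messages (2 turns) stay verbatim
CHARS_PER_TOKEN = 4   # rough heuristic, see "Limits and Edge Cases"


def estimate_tokens(messages: List[str]) -> int:
    """Estimate tokens from total character count (approximate)."""
    return sum(len(m) for m in messages) // CHARS_PER_TOKEN


def should_summarize(messages: List[str]) -> bool:
    """True when either threshold is exceeded."""
    return (
        len(messages) > MAX_MESSAGES
        or estimate_tokens(messages) > MAX_TOKENS
    )


def split_for_summary(messages: List[str]) -> Tuple[List[str], List[str]]:
    """Return (older messages to summarize, recent messages kept verbatim)."""
    return messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
```

With 21 short messages, `should_summarize` returns `True`, and `split_for_summary` hands the first 17 to the summarizer while the last 4 pass through unchanged.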

Without summarization, a 30-message conversation would send ~22,000 tokens of context to every LLM call. With summarization, that drops to ~3,500 tokens: an 84% reduction in context tokens per call, which lowers input-token cost by about the same proportion and significantly improves latency.
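The 84% figure follows directly from the two token counts quoted above (both of which are approximations):

```latex
\[
\text{reduction}
  = \frac{22{,}000 - 3{,}500}{22{,}000}
  = \frac{18{,}500}{22{,}000}
  \approx 0.841
  \approx 84\%
\]
```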

Who It Serves

Persona	Need
End Users	Fast responses even in long conversations, with chat titles that reflect what the conversation is actually about
Product Owners	Control over token costs and response latency as conversations grow
Operations	Predictable token usage patterns regardless of conversation length

Key Capabilities

Threshold-based triggering — Summarization runs only when needed: >20 messages or >4,000 estimated tokens
Smart loading — When a summary exists, only the summary + last 4 messages are loaded from the database, avoiding unnecessary reads
Iterative summarization — New summaries incorporate the previous summary, so context accumulates without losing earlier decisions
Title regeneration — After summarizing, the chat title is updated to reflect the current conversation topic (e.g., "Hello" → "MCP Integration Project")
Graceful degradation — If the LLM call fails, the system continues with truncated messages rather than breaking the conversation
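Iterative summarization and graceful degradation could be combined along these lines. This is a sketch under stated assumptions: `summarize_iteratively`, the prompt wording, and the `llm` callable are all hypothetical, and "truncated messages" is interpreted here as falling back to the most recent messages plus any earlier summary, which may differ from the real behavior.

```python
from typing import Callable, List, Optional, Tuple

KEEP_RECENT = 4  # messages kept verbatim, per this page


def summarize_iteratively(
    previous_summary: Optional[str],
    messages: List[str],
    llm: Callable[[str], str],
) -> Tuple[Optional[str], List[str]]:
    """Return (summary, recent_messages) to use as conversation context."""
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    if not older:
        # Nothing to compress yet.
        return previous_summary, recent

    parts = []
    if previous_summary:
        # Feed the prior summary back in so earlier decisions carry forward.
        parts.append("Previous summary:\n" + previous_summary)
    parts.append("New messages:\n" + "\n".join(older))
    parts.append("Write an updated summary that keeps earlier decisions.")

    try:
        new_summary = llm("\n\n".join(parts))
    except Exception:
        # Graceful degradation (assumed form): keep the old summary, if any,
        # and continue with only the recent messages.
        return previous_summary, recent
    return new_summary, recent
```

Passing the model call in as a plain callable keeps the sketch testable without a real LLM client; any function that takes a prompt string and returns text will do.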

How It Fits in the Platform

Upstream: Runs after session_init (which loads chat history) and before context_loader
Trigger: The summarization_check node evaluates message count and token estimates; the routing function directs to either the summarization node or skips to context loading
Downstream: The summary is used by all subsequent nodes as conversation context, reducing token usage in intent classification, fact extraction, and response generation
Persistence: The summarization node is computation-only — all database writes (summary, title) happen atomically in the message_persist_node at the end of the turn
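The check, routing, and computation-only pattern described above might look something like this in plain Python. The node names mirror those on this page, but the state shape, the field names (`needs_summarization`, `pending_summary`), and the `summarize` callable are hypothetical, and no particular graph framework is assumed.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def summarization_check(state: State) -> State:
    """Evaluate message count and estimated tokens; record the decision."""
    messages: List[str] = state.get("messages", [])
    estimated_tokens = sum(len(m) for m in messages) // 4
    needs = len(messages) > 20 or estimated_tokens > 4_000
    return {**state, "needs_summarization": needs}


def route_after_check(state: State) -> str:
    """Routing function: name of the next node to run."""
    if state.get("needs_summarization"):
        return "summarization"
    return "context_loader"


def summarization_node(
    state: State, summarize: Callable[[List[str]], str]
) -> State:
    """Computation only: no database writes happen here.

    The result is stashed in the state so that a later persistence step
    (message_persist_node on this page) can write it at the end of the turn.
    """
    older = state["messages"][:-4]
    return {**state, "pending_summary": summarize(older)}
```

Keeping the node free of side effects means a failure later in the turn cannot leave a half-written summary or title behind.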

Limits and Edge Cases

Per-trigger latency cost: Summarization adds ~1–2 seconds to each turn on which it triggers, but saves ~200 ms on every subsequent turn
Information loss: Summarization necessarily discards detail. Key decisions and facts are preserved, but nuances of earlier turns may be lost
Token estimation is approximate: The trigger uses ~4 characters per token as a heuristic, which may be inaccurate for text in non-Latin scripts
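To see why a character-count heuristic can mislead for non-Latin text, consider this small sketch. The function name and sample strings are illustrative only, and no real tokenizer is used, so it shows the estimate rather than true token counts.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Character-count heuristic: roughly one token per 4 characters."""
    return len(text) // chars_per_token


english = "Summarize the conversation so far."
japanese = "これまでの会話を要約してください。"

for label, text in (("english", english), ("japanese", japanese)):
    print(
        label,
        "chars:", len(text),
        "utf8_bytes:", len(text.encode("utf-8")),
        "estimated_tokens:", estimate_tokens(text),
    )
```

The Japanese sentence has fewer characters than the English one, so the heuristic assigns it fewer tokens, even though it occupies more UTF-8 bytes. Many subword tokenizers spend more tokens per character on such scripts, so the estimate is likely to undercount and the token threshold may fire later than intended.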

FAQ

Q: How often does summarization run?
A: It triggers when a conversation exceeds 20 messages or ~4,000 tokens. After summarizing, messages accumulate again until the threshold is hit. In a typical conversation, it runs every ~8–10 turns after the first trigger.
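The "every ~8–10 turns" figure can be sanity-checked with a small simulation. It assumes that 4 messages remain after a summary, that each turn adds 2 messages (one user, one assistant), and that the message-count threshold is hit before the token threshold; the function name and parameters are hypothetical.

```python
def turns_until_next_summary(
    kept_messages: int = 4,
    messages_per_turn: int = 2,
    max_messages: int = 20,
) -> int:
    """Count turns until the message count exceeds the threshold again."""
    count = kept_messages
    turns = 0
    while count <= max_messages:
        count += messages_per_turn
        turns += 1
    return turns


print(turns_until_next_summary())                 # summary not counted as a message
print(turns_until_next_summary(kept_messages=5))  # summary counted as one message
```

Under these assumptions the two calls give 9 and 8 turns respectively, consistent with the range quoted above. Long messages would hit the token threshold earlier and shorten the interval.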

Q: Will I notice when summarization happens?
A: You may see the chat title update in the sidebar. The summarization itself adds ~1–2 seconds to that turn, but subsequent turns are faster.

Q: Can summarization lose important information?
A: The system preserves key decisions, facts, and unresolved items. The last 4 messages (2 turns) are always kept verbatim. Very early conversational nuances may be compressed, but critical facts are also captured by the Fact System independently.