Document Intelligence & RAG: Architecture
Audience: Architects, tech leads, and senior engineers evaluating the v2 document agent and its delta against the v1 DocumentSearchAgent.
Context and Purpose
The document agent is Swisper's interface between the planner and the user's document corpus — uploaded files, ingested email attachments, and Swisper-generated documents. It must do three things at once:
- Stay in the FC loop. Document operations have to interleave freely with memory recall, calendar lookups, and other tools in a single planner turn. "Find the insurance policy Thomas sent and remind me when it expires" must not need two agent hand-offs.
- Honour workspace boundaries. Every retrieval is scoped to a workspace and avatar. There is no cross-workspace leakage path, even on a misconfigured planner call.
- Keep retrieval honest. Vector-only cosine similarity misses exact matches (policy numbers, names, dates). Hybrid retrieval and a Discovery Engine reranker close that gap, with all components individually feature-flagged for safe rollout.
The v1 DocumentSearchAgent (Pattern A delegation, reactive LangGraph subgraph) cannot satisfy any of these without architectural changes. The v2 agent is a clean break.
v1 vs v2: What Changed
| Concern | v1 (agents/doc/) | v2 (agents/document/) |
|---|---|---|
| Integration model | DomainAgentInterface registered with DomainAgentRegistry. Planner emits delegate_to_document_search_agent; supervisor delegates the entire turn. | Pattern 2: individual BaseTool subclasses registered with the planner's FC loop. Tools interleave with memory tools in one plan. |
| Internal control flow | LangGraph subgraph: doc_indexing_node → document_search_planner_node ↔ doc_tool_execution_node (loop, max 5 iterations). | Each tool is a thin async method on a service. No internal LangGraph. |
| Scoping | user_id only. Hard-coded limit=20 documents fetched up-front. | workspace_id + avatar_id enforced at the service layer (DocumentSearchService). No up-front fetch. |
| Retrieval | Cosine distance only (RAGService.retrieve). | Three configurable fusion strategies over DocumentChunk + AttachmentChunk: dense_only, weighted_blend (alpha-blend), and rrf (Reciprocal Rank Fusion, k=60). |
| Reranking | None. | Optional Vertex AI Discovery Engine semantic-ranker-default-004, runs on top 3 × top_k candidates after fusion. Graceful degradation on failure. |
| Embedding | Single embedding config (gemini-embedding-001, 2000 dim). | Two distinct lanes: embedding (gemini-embedding-001) for facts and document_embedding (gemini-embedding-2-preview) for documents. Hard rule: queries are embedded by the same lane as the corpus. |
| Tool surface | semantic_search, document_summary. | search_documents, get_document, list_documents, create_document. (analyze_document from the spec is not yet a Pattern 2 tool; see Spec Coverage.) |
| Document creation | Not supported. | create_document writes Markdown through the standard parse/chunk/embed pipeline; source_type=GENERATED provenance. |
| Doc → memory linkage | None. Documents and facts are isolated graphs. | ExtractDocumentFactsJob extracts facts from indexed documents and persists them with source_ref_type='document' provenance. |
| Failure modes | Fan-out exceptions, max-iterations error states without explanation. | Graceful degradation per layer (reranker fail → fusion order; encryption fail → search succeeds; fact extraction fail → document still searchable). |
| Rollout | All-or-nothing: the agent is in the DomainAgentRegistry or it isn't. | Feature-flagged: DOCUMENT_SEARCH_V2_ENABLED switches between the two paths at the supervisor layer. Both can coexist during migration. |
The v1 path is preserved on feature/voice-v2-gemini-live-and-friends so today's UAT stays unbroken; v2 is selected in the supervisor when the flag is on.
v2 Architecture Overview
```mermaid
graph TD
    subgraph Planner["Agentic Supervisor (FC loop)"]
        P["global_planner_node"]
        TN["tool_node"]
    end
    subgraph DocTools["Document tools (Pattern 2: agents/document/)"]
        SD["search_documents"]
        GD["get_document"]
        LD["list_documents"]
        CD["create_document"]
    end
    subgraph Services["Domain services"]
        DSS["DocumentSearchService<br/>(services/document_search.py)"]
        IDX["IndexingService<br/>(services/file_indexing.py)"]
        RR["RerankerService<br/>(services/reranker.py)"]
        FE["FactAndEntityExtraction<br/>Service"]
    end
    subgraph Gateways["Gateways / external"]
        LLM["SwisperLLMAdapter<br/>(embedding lanes)"]
        DocAI["Document AI<br/>(layout parser)"]
        DE["Vertex Discovery Engine<br/>(reranker)"]
    end
    subgraph Storage["Persistence"]
        PG[("AlloyDB:<br/>documents, document_chunks,<br/>attachment_chunks (pgvector + tsvector)")]
        S3[("MinIO / GCS<br/>(encrypted blobs)")]
    end
    subgraph Jobs["Background"]
        EDFJ["ExtractDocumentFactsJob"]
        REEMB["ReembedDocumentChunksJob"]
    end
    P -->|tool_calls| TN
    TN --> SD & GD & LD & CD
    SD --> DSS
    GD & LD --> PG
    CD --> IDX
    DSS --> LLM
    DSS --> PG
    DSS --> RR --> DE
    IDX --> DocAI & LLM & PG & S3
    IDX -. IndexingCompleteEvent .-> EDFJ --> FE --> PG
    REEMB --> LLM --> PG
```
Wiring: agents/agentic_supervisor/agent.py builds the tool list per request via _get_document_tools() (lines ~290–325). The factory agents/document/factory.py::build_document_tools() returns the four BaseTool instances, each capturing the request's workspace_id, avatar_id, and db_session so the planner can call them without passing context.
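The request-scoped binding described above can be illustrated with a minimal sketch. It assumes plain dataclasses instead of the real BaseTool subclasses; `ToolContext`, `BoundTool`, and `build_document_tools_sketch` are hypothetical names used only for illustration, and the real factory signature may differ.

```python
from dataclasses import dataclass
from typing import Any
from uuid import UUID, uuid4


@dataclass(frozen=True)
class ToolContext:
    """Hypothetical holder for the per-request values the factory captures."""
    workspace_id: UUID
    avatar_id: UUID
    db_session: Any  # stand-in for the real database session object


@dataclass
class BoundTool:
    """Hypothetical stand-in for a BaseTool subclass bound to one request."""
    name: str
    ctx: ToolContext

    def describe_call(self, **planner_args: Any) -> dict:
        # The planner supplies only tool arguments; scoping comes from ctx,
        # so a planner call cannot choose a different workspace.
        return {
            "tool": self.name,
            "workspace_id": self.ctx.workspace_id,
            "avatar_id": self.ctx.avatar_id,
            "args": planner_args,
        }


def build_document_tools_sketch(ctx: ToolContext) -> list:
    """Return one bound tool per planner-facing document tool name."""
    names = ("search_documents", "get_document", "list_documents", "create_document")
    return [BoundTool(name=n, ctx=ctx) for n in names]
```

Because the context is captured at construction time, the tool input schema seen by the planner does not need a workspace_id field at all in this sketch, which is one way to keep a misconfigured planner call from crossing workspaces.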
Components and Responsibilities
| Component | File | Responsibility |
|---|---|---|
| build_document_tools(...) | agents/document/factory.py | Request-scoped tool factory. Returns [SearchDocumentsTool, GetDocumentTool, ListDocumentsTool, CreateDocumentTool] bound to one workspace/avatar/session. |
| SearchDocumentsTool | agents/document/tools/search_documents.py | Async-only BaseTool. Calls DocumentSearchService.search(), formats EvidenceResults (heading path, page range, parent-email metadata, score) for the planner. |
| GetDocumentTool | agents/document/tools/get_document.py | Sync metadata fetch by document UUID. Workspace check on Document.workspace_id. |
| ListDocumentsTool | agents/document/tools/list_documents.py | Workspace-scoped listing ordered by created_at desc, max 100. |
| CreateDocumentTool | agents/document/tools/create_document.py | Writes Markdown bytes through IndexingService.process_and_index_document() with source_type=GENERATED. Same parse → chunk → embed path as uploads. |
| DocumentSearchService | services/document_search.py | Owns embedding-lane selection (document_embedding), fusion strategy (dense_only / weighted_blend / rrf), AttachmentChunk join, and reranker invocation. Returns list[EvidenceResult]. |
| IndexingService | services/file_indexing.py | Upload → store (S3) → parse (Document AI Layout Parser, with format-specific fallbacks) → chunk → embed (document_embedding) → persist. Emits IndexingCompleteEvent. Populates search_vector against plaintext before PGPString encryption. |
| RerankerService | services/reranker.py | Discovery Engine adapter. semantic-ranker-default-004 over the top 3 × top_k candidates. Logs and falls back on failure. |
| ExtractDocumentFactsJob | jobs/extract_document_facts_job.py | Background job. Picks documents with fact_extraction_status='queued', allowlists by classification, calls FactAndEntityExtractionService with source_ref_type='document'. Embeds facts on the embedding lane (not document_embedding). |
| ReembedDocumentChunksJob | jobs/reembed_document_chunks_job.py | One-shot migration job: re-embeds legacy chunks into the document_embedding lane (gemini-embedding-2-preview). |
Tool Catalogue (Planner-Facing)
The planner sees four tools when DOCUMENT_SEARCH_V2_ENABLED=True and a workspace context exists. Tool descriptions are loaded from each BaseTool.description; the planner picks among them based on user intent.
search_documents: primary retrieval tool
Returns a numbered list of evidence entries. Each entry contains source kind (document | email_attachment), source UUID, optional Section: heading path, optional Pages:, optional parent email subject/sender, relevance score, and the chunk text.
Internal flow (DocumentSearchService.search):
1. Embed query on the document_embedding lane.
2. (Optional) apply EncryptionContext.transform_query for encrypted-at-rest corpora.
3. Run the configured fusion strategy on DocumentChunk (workspace-scoped). Always run dense search on AttachmentChunk (no tsvector column).
4. Merge, sort by distance, keep top 3 × top_k candidates.
5. If DOCUMENT_RERANKER_ENABLED=True, rerank via Discovery Engine; on failure return fusion order.
6. Return top top_k.
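Steps 4 to 6 can be sketched as follows. This is an illustration under assumptions, not the service code: candidates are represented as plain strings already in fusion order, `select_evidence` and `OVERFETCH_FACTOR` are hypothetical names, and the reranker is modelled as any callable that returns the window in a new order.

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)

# The "3 x top_k" candidate window from step 4 (hypothetical constant name).
OVERFETCH_FACTOR = 3


def select_evidence(
    fused: Sequence[str],
    top_k: int,
    reranker: Optional[Callable[[List[str]], List[str]]] = None,
) -> List[str]:
    """Keep the over-fetched window, optionally rerank it, return top_k."""
    window = list(fused[: OVERFETCH_FACTOR * top_k])
    if reranker is not None:
        try:
            window = reranker(window)
        except Exception:
            # Degrade gracefully: log and keep the fusion order untouched.
            logger.warning("reranker failed; returning fusion order")
    return window[:top_k]
```

With `top_k=2` the window holds six candidates; a reranker that raises leaves the first two fused results as the answer, which mirrors the fallback in step 5.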
get_document
Returns title, filename, type, size, created timestamp, summary, and processing status. Returns a "not found in this workspace" message if the workspace check fails — the result is identical to "does not exist" so the planner cannot probe across workspaces.
list_documents
Returns a header line plus one row per document with ID, title, filename, type, size, and created timestamp. Workspace-scoped; ordered by created_at desc.
create_document
Writes the content as <sanitised_title>.md through IndexingService.process_and_index_document() with source_type=GENERATED. The document is searchable on the next search_documents call. Errors return a string starting with "Failed to create document"; success returns the new document_id.
How the planner uses them
The planner is encouraged (via tool descriptions) to:
- Reach for search_documents whenever the user asks about uploaded content, attachments, or generated notes.
- Compose with memory tools, e.g. recall("Thomas insurance") followed by search_documents("home insurance Thomas"), and use the union to answer.
- Use get_document after search_documents only when the user wants metadata (title, summary, status) about a hit.
- Use list_documents for "what do I have on file?" or as a fallback when search returns nothing.
- Use create_document only when the user explicitly asks Swisper to save a structured artefact (meeting notes, summary, report), not for chat replies.
Embedding Lanes
Two cohorts, two lanes — enforced at adapter level via agent_configuration rows.
| Lane | Adapter call | Model | Dim | Used by |
|---|---|---|---|---|
| embedding | `SwisperLLMAdapter("__service__", "embedding")` | gemini-embedding-001 | 2000 | Facts (UserFact, ExtractedFact), system facts, fact extraction (incl. ExtractDocumentFactsJob outputs). |
| document_embedding | `SwisperLLMAdapter("__service__", "document_embedding")` | gemini-embedding-2-preview | 2000 (MRL-truncated from 3072) | DocumentChunk, AttachmentChunk, query-time embedding inside DocumentSearchService. |
Hard rule (Spec §6.9): A query MUST be embedded on the same lane as the corpus it targets. DocumentSearchService._embed_query is the only place a document query is embedded; it always uses document_embedding. Don't paste `SwisperLLMAdapter("__service__", "embedding")` into search code paths.
The Vector(2000) column width is unchanged across both lanes — the migration from Azure embeddings to gemini-embedding-2-preview is content-only and runs through ReembedDocumentChunksJob.
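The "MRL-truncated from 3072" note in the lane table can be illustrated with a short sketch. It assumes the usual Matryoshka-style approach of keeping the leading components and re-normalising to unit length so cosine comparisons stay meaningful; whether the adapter or the embedding API performs this step is not stated here, and `mrl_truncate` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def mrl_truncate(vector: Sequence[float], dim: int = 2000) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm.

    Assumption: truncation is followed by re-normalisation. This is a
    common convention for Matryoshka-style embeddings, not a confirmed
    detail of the document_embedding lane.
    """
    if len(vector) < dim:
        raise ValueError(f"expected at least {dim} components, got {len(vector)}")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]
```

The output always has 2000 components in this sketch, so it fits the unchanged Vector(2000) column regardless of the source model's native width.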
Hybrid Retrieval
Three fusion strategies, selected via DOCUMENT_FUSION_STRATEGY (with DOCUMENT_HYBRID_RETRIEVAL_ENABLED=False forcing dense_only regardless):
| Strategy | Score formula | When it wins |
|---|---|---|
| dense_only | cosine_distance(query_embedding, chunk.embedding) | Pure semantic recall, no exact-match signal needed. Default before Phase-1 hybrid was on. |
| weighted_blend | α · dense_sim + (1−α) · ts_rank (default α = 0.7 via DOCUMENT_HYBRID_RETRIEVAL_ALPHA) | Mixed corpora where you want a tunable knob between meaning and keywords. Sensitive to score-scale drift between dense and lexical. |
| rrf | Σ 1 / (k + rank_i) over the dense ranking and the BM25 ranking, k=60 | Recommended production default. Score-scale-independent: only positions matter. Robust across query types. |
Lexical signal: AlloyDB tsvector column on DocumentChunk.search_vector, populated against plaintext before PGPString encrypts chunk_text. Attachments don't carry a tsvector and always run dense.
After fusion, the top 3 × top_k candidates flow to the reranker (when enabled). The reranker score replaces relevance_score in the returned EvidenceResults.
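The two hybrid formulas in the strategy table can be written out as a small sketch. It assumes 1-based ranks for RRF and that both inputs to the blend are already on comparable scales; `rrf_fuse` and `weighted_blend_score` are hypothetical helper names, not the service's own functions.

```python
from typing import Dict, Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank).

    Assumption: ranks start at 1. Ties are broken by chunk id so the
    output order is deterministic.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def weighted_blend_score(dense_sim: float, lexical: float, alpha: float = 0.7) -> float:
    """alpha * dense_sim + (1 - alpha) * lexical, with alpha weighting dense."""
    return alpha * dense_sim + (1.0 - alpha) * lexical


dense_ranking = ["a", "b", "c"]    # ordered by vector similarity
lexical_ranking = ["c", "a", "d"]  # ordered by the lexical score
fused = rrf_fuse([dense_ranking, lexical_ranking])
```

In this example "a" is ranked 1st and 2nd, "c" is ranked 3rd and 1st, so "a" edges ahead (1/61 + 1/62 versus 1/63 + 1/61), and chunks that appear in only one ranking trail behind. Only positions enter the RRF score, which is why it is insensitive to the score-scale drift that affects the blend.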
Workspace + Avatar Scoping
Every read and every write goes through a workspace boundary:
- Search: DocumentSearchService adds WHERE workspace_id = :workspace_id on every chunk query (both document and attachment).
- Get/List: tools check Document.workspace_id against the bound workspace ID. Mismatches return "not found", not a permission error (no information disclosure).
- Create: IndexingService writes Document.workspace_id, Document.avatar_id, and propagates workspace_id to every DocumentChunk.
- Background jobs: ExtractDocumentFactsJob reads Document.workspace_id and stores facts under the same workspace.
Indices: ix_documents_workspace_id and ix_document_chunks_workspace_id make these filters cheap.
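The "not found, not a permission error" behaviour for Get/List can be sketched with an in-memory stand-in for the documents table. `DocumentRow`, `get_document_scoped`, and the exact message wording are hypothetical; the point is only that a missing document and a document in another workspace produce the same result.

```python
from dataclasses import dataclass
from typing import Dict
from uuid import UUID, uuid4

# Hypothetical message text; the real tool's wording may differ.
NOT_FOUND = "Document not found in this workspace."


@dataclass(frozen=True)
class DocumentRow:
    """Minimal stand-in for a row of the documents table."""
    id: UUID
    workspace_id: UUID
    title: str


def get_document_scoped(
    store: Dict[UUID, DocumentRow],
    document_id: UUID,
    bound_workspace_id: UUID,
) -> str:
    """Return metadata only when the row belongs to the bound workspace."""
    row = store.get(document_id)
    # Same branch for "does not exist" and "wrong workspace", so the
    # caller cannot distinguish the two cases.
    if row is None or row.workspace_id != bound_workspace_id:
        return NOT_FOUND
    return f"Title: {row.title}"
```

A planner bound to one workspace therefore learns nothing about document IDs that exist elsewhere, even if it guesses a valid UUID.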
Document → Fact Pipeline
After indexing, a document with fact_extraction_status='queued' and a classification on the configurable allowlist becomes input for ExtractDocumentFactsJob:
- Pull up to MAX_CHUNK_PASSAGES = 5 chunks, each truncated to MAX_PASSAGE_CHARS = 800.
- Hand the joined text to FactAndEntityExtractionService.extract_and_store() with the document's workspace_id / avatar_id and source_ref_type='document', source_ref_id=<document_id> provenance.
- Facts are deduplicated semantically (0.92 threshold) against existing facts, embedded on the embedding lane (intentionally different from the document lane), and linked to Person records by entity disambiguation.
Failure isolation: extraction failure does not block document availability. The document remains searchable via search_documents regardless of fact extraction state.
Feature flag: DOCUMENT_FACT_EXTRACTION_ENABLED gates the whole job. While off, documents stay in queued; flipping the flag picks them up on the next run.
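The passage budget from the first pipeline step can be sketched as below. The two constants come from the text; taking the first chunks in stored order and joining them with a blank line are assumptions, and `build_extraction_input` is a hypothetical helper name.

```python
from typing import Sequence

MAX_CHUNK_PASSAGES = 5   # from the pipeline description
MAX_PASSAGE_CHARS = 800  # from the pipeline description


def build_extraction_input(chunk_texts: Sequence[str], separator: str = "\n\n") -> str:
    """Truncate each of the first few chunks and join them into one text.

    Assumptions: chunks are taken in the order given, and the separator
    is a blank line. The real job may select or join chunks differently.
    """
    passages = [text[:MAX_PASSAGE_CHARS] for text in chunk_texts[:MAX_CHUNK_PASSAGES]]
    return separator.join(passages)
```

Under these assumptions the extraction input is bounded at 5 × 800 characters plus separators, independent of document length, which keeps the cost of each extraction call predictable.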
Data Model: What's New in v2
Document (table documents):
| Field | Type | v2 purpose |
|---|---|---|
| workspace_id | UUID | Tenancy boundary. Indexed. |
| avatar_id | UUID | Owner avatar within the workspace. |
| source_type | DocumentSourceType | upload / generated / image_capture / voice_memo / email_attachment. |
| processing_status | DocumentProcessingStatus | uploaded / extracting / indexed / queued / failed. |
| fact_extraction_status | FactExtractionStatus | not_applicable / queued / completed / failed. Drives ExtractDocumentFactsJob. |
DocumentChunk (table document_chunks):
| Field | Type | v2 purpose |
|---|---|---|
| workspace_id | UUID | Tenancy filter on every search. Indexed. |
| search_vector | tsvector | BM25 / lexical search. Populated against plaintext before PGPString encryption. |
| heading_path | encrypted str | Layout-parser ancestor heading chain (e.g. § Coverage > Exclusions). Surfaced to the planner in evidence. |
| page_start / page_end | int | Page range surfaced as Pages: p.4-5 in evidence. |
| kind | ChunkKind | paragraph / table / list / ocr_text / image_caption / transcript / generated_text. |
| embedding | Vector(2000) | Now sourced from the document_embedding lane. Same column, new content. |
AttachmentChunk keeps the same shape and joins to Attachment → Email for parent-email provenance in evidence rows.
Configuration & Feature Flags
All keys are reachable via ConfigurationService (DB-backed) with environment-variable fallbacks defined in swisper/core/config.py.
| Key / Env var | Default | Effect |
|---|---|---|
| DOCUMENT_SEARCH_V2_ENABLED | False | When True AND a workspace/avatar/db session is available, the supervisor passes the v2 tool list to the planner. When False, falls back to v1 delegation. |
| DOCUMENT_HYBRID_RETRIEVAL_ENABLED | False | Master switch for hybrid (lexical + dense). When False, DocumentSearchService forces dense_only regardless of DOCUMENT_FUSION_STRATEGY. |
| DOCUMENT_FUSION_STRATEGY | weighted_blend | One of dense_only / weighted_blend / rrf. Recommended production: rrf. |
| DOCUMENT_HYBRID_RETRIEVAL_ALPHA | 0.7 | Weight for dense in weighted_blend. 1.0 = pure dense; 0.0 = pure lexical. |
| RRF_K | 60 | Rank-fusion constant. Lower values favour top results more aggressively. |
| DOCUMENT_RERANKER_ENABLED | True | Run Discovery Engine reranker over the top 3 × top_k candidates. Adds ~50–150 ms; falls back to fusion order on failure. |
| DOCUMENT_FACT_EXTRACTION_ENABLED | False | Gates ExtractDocumentFactsJob. While off, documents accumulate in fact_extraction_status='queued'. |
Recommended rollout sequence: turn on DOCUMENT_SEARCH_V2_ENABLED first (verify tool selection in traces), then DOCUMENT_HYBRID_RETRIEVAL_ENABLED with rrf, then DOCUMENT_FACT_EXTRACTION_ENABLED once chunk re-embedding has completed.
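The interaction between the master switch and the strategy key can be sketched as a small resolver. `effective_fusion_strategy` is a hypothetical helper, and raising on an unknown strategy value is an assumption; the text does not say how the service treats invalid configuration.

```python
VALID_STRATEGIES = frozenset({"dense_only", "weighted_blend", "rrf"})


def effective_fusion_strategy(hybrid_enabled: bool, configured: str = "weighted_blend") -> str:
    """Resolve the strategy that actually runs for a search request.

    hybrid_enabled mirrors DOCUMENT_HYBRID_RETRIEVAL_ENABLED and
    configured mirrors DOCUMENT_FUSION_STRATEGY (default weighted_blend).
    """
    if not hybrid_enabled:
        # Master switch off: dense_only regardless of the configured value.
        return "dense_only"
    if configured not in VALID_STRATEGIES:
        # Assumption: fail loudly rather than silently picking a default.
        raise ValueError(f"unknown fusion strategy: {configured!r}")
    return configured
```

With the documented defaults (hybrid off, strategy weighted_blend), this resolves to dense_only, so setting DOCUMENT_FUSION_STRATEGY to rrf has no effect until the master switch is turned on.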
How to Use It
From the planner
You don't call this module directly — the supervisor exposes it. Verify in a request trace that Built N document tools for workspace=<id> is logged (see agentic_supervisor/agent.py:320) and that the planner emits a tool_call for one of the four document tools instead of delegate_to_document_search_agent.
From a service / job
Read paths go through DocumentSearchService. Construct it with a Session and call search(query, workspace_id, avatar_id, top_k); the service handles embedding, fusion, and reranking. Do not instantiate RAGService for new code — it is the v1 implementation and is being retired with the agents/doc/ agent.
Write paths go through IndexingService.process_and_index_document(file_content, filename, workspace_id, avatar_id, user_id, source_type). Pass the right DocumentSourceType so downstream consumers (e.g. fact extraction) can apply type-aware policies.
From a test
Real-LLM tests should embed against the document_embedding lane to match the corpus. Fixtures should populate workspace_id, avatar_id, and (for hybrid) the search_vector column. The plaintext-first ordering in IndexingService._populate_search_vectors is the only correct way — manual fixture rows that bypass it will not match to_tsvector queries.
From the UI
The frontend keeps calling the existing /api/v1/documents/... REST endpoints (see api/routes/documents.py). Those routes still drive the v1 RAG service today. Migration to v2 is tracked separately and does not block the planner-side cutover; the chat-driven path described above is the v2 critical path.
Failure Modes & Degradation
| Failure | Behaviour |
|---|---|
| Reranker times out / errors | _rerank logs and returns the fusion order untouched. Search still returns results. |
| EncryptionContext.transform_query fails | The query embedding is used as-is. (Caller's responsibility to provide a working context.) |
| DOCUMENT_SEARCH_V2_ENABLED=True but missing workspace/avatar/session | _get_document_tools logs a warning and returns None. The supervisor proceeds with no document tools, so the planner cannot call them: a graceful "feature not available." |
| create_document indexing fails | Tool returns an error string starting with Failed to create document. Nothing is written to the chunk table for that document. |
| Fact extraction fails for a document | Document remains searchable. fact_extraction_status flips to failed. Job continues with the next document. |
| Embedding lane misconfigured | SwisperLLMAdapter raises early. Caller (search or indexing) propagates. No silent dimension mismatch. |
Spec Coverage / Gap Analysis
Comparing what's shipped on feature/voice-v2-gemini-live (and integration sub-branches) against spec_FEAT_ALL_document_intelligence_v1.md:
| Spec capability | Status | Notes |
|---|---|---|
| Pattern 2 tools registered with FC loop | ✅ | build_document_tools() + supervisor wiring + feature flag. |
| search_documents tool | ✅ | Async BaseTool, formats evidence with provenance. |
| get_document tool | ⚠️ partial | Implemented, but does not yet expose download_url, entity_names, or fact_count from the spec. |
| list_documents tool | ⚠️ partial | Implemented, but spec also asks for document_type / date_range / offset filters. Currently only limit. |
| create_document tool | ✅ | Markdown-only path; same indexing pipeline as uploads; source_type=GENERATED. |
| analyze_document tool | ❌ not implemented as tool | Logic exists in RAGService.analyze_document (v1). No Pattern 2 wrapper yet. The planner cannot ask for deep single-document analysis. |
| Self-correcting retrieval (retrieve → grade → rewrite → retry, max 2 cycles) | ❌ not implemented | SearchDocumentsTool._arun calls DocumentSearchService.search and returns. Reranking compensates somewhat; LLM-as-judge grading + query rewrite loop is not present. |
| Hybrid search (vector + BM25) | ✅ | weighted_blend and rrf strategies, both with workspace scoping. |
| RRF (k=60) | ✅ | _search_document_chunks_rrf. |
| Reranking via Vertex AI | ✅ | RerankerService over semantic-ranker-default-004. Graceful degradation. |
| Check Grounding API integration | ❌ | No GroundingGateway. Citations come from chunk metadata, not from a post-generation grounding check. |
| Workspace + avatar scoping | ✅ | Enforced at DocumentSearchService, Document, and DocumentChunk levels. |
| Two-lane embedding (embedding + document_embedding) | ✅ | Even cleaner than the spec: facts and documents are physically separated by adapter call site. |
| gemini-embedding-2-preview for documents | ✅ | Default for document_embedding lane. |
| Layout-aware parsing (Document AI Layout Parser) | ✅ | services/document_parsing/parsers/layout_parser.py + factory wiring. |
| Document-type-aware chunk sizing | ⚠️ partial | Chunking pipeline + ChunkKind taxonomy exist; per-type token targets from spec §5.5 are not all wired through configuration yet. |
| Heading-path enrichment | ✅ | DocumentChunk.heading_path populated and surfaced to the planner. |
| Multimodal ingestion (image + voice) | ⚠️ in progress | DocumentSourceType.IMAGE_CAPTURE and VOICE_MEMO exist; parser hooks present; end-to-end ingestion paths and embedding-via-multimodal are not yet a single tested flow. |
| MinIO/S3 physical storage | ✅ | gateways/storage/providers/minio.py + IndexingService._store_file. |
| Document → Fact pipeline | ✅ | ExtractDocumentFactsJob with provenance, allowlist, semantic dedup. |
| Embedding migration job (Azure → Gemini) | ✅ | ReembedDocumentChunksJob. |
| Deprecation of DocumentSearchAgent | ⚠️ in progress | v1 path still registered. v2 selected via feature flag. Deletion is gated on flag rollout completing. |
The two non-trivial gaps that change the planner's behaviour are: (1) the missing analyze_document tool, which leaves the planner no way to request deep single-document analysis without falling back to v1, and (2) the absent self-correcting loop, which means the results of a poor first-cycle query are returned as-is. Both are tractable extensions of the current shape (no architectural rework needed).
Known Trade-offs and Debt
| Item | Impact | Remediation |
|---|---|---|
| analyze_document is missing as a Pattern 2 tool | Planner cannot do deep single-document Q&A in the v2 path. | Wrap RAGService.analyze_document (or rebuild it on DocumentSearchService + chunk-by-document fetch) as AnalyzeDocumentTool. |
| No grade/rewrite loop in search_documents | Bad first-cycle queries return weak results. Reranker mitigates but doesn't fix. | Add an LLM-as-judge step after rerank; on <2 chunks above threshold, rewrite the query and retry once. |
| get_document and list_documents lack the spec's filter set | Planner cannot filter by document type, date range, entity names, or paginate beyond limit. | Extend the input schemas and the underlying queries; index Document.created_at for range queries. |
| v1 path still in the repo | Two implementations to maintain during migration. | After v2 reaches 100% on DOCUMENT_SEARCH_V2_ENABLED, remove agents/doc/ and the delegate_to_document_search_agent registration in one PR. |
| REST endpoints still drive v1 | Frontend search bar and document Q&A use /api/v1/documents/... and the v1 RAGService. | Re-point to DocumentSearchService (read) and IndexingService (write); deprecate v1 route bodies. |
| Document-type-aware chunk sizing not fully wired | Default sizes are used regardless of document type; spec §5.5 targets are unenforced. | Add a sizing policy keyed on DocumentSourceType / detected classification; surface in services/document_chunking.py. |
| No Check Grounding step | Cited answers rely on chunk metadata; there's no post-generation factuality check. | Add a GroundingGateway; gate behind a flag; apply when the answer is sourced from documents. |