Document Intelligence & RAG: Overview
Audience: Business stakeholders, product owners, analysts, new team members.
What This Module Does
Swisper turns every file a user shares — uploaded documents, email attachments, and notes Swisper writes for them — into searchable knowledge. The system:
- Stores the original file encrypted in MinIO/GCS, so it can be retrieved or shared later.
- Parses it with a layout-aware parser (Google Document AI), extracting structure (sections, tables, headings, page ranges).
- Chunks the content with structure-first cuts, falling back to character-based splitting only when needed.
- Embeds every chunk on the dedicated `document_embedding` lane (`gemini-embedding-2-preview`, 2000 dim).
- Indexes chunks in AlloyDB with both `pgvector` (semantic) and `tsvector` (lexical / BM25) so exact terms and meaning both score.
- Retrieves by hybrid search, using Reciprocal Rank Fusion (RRF) across the two indexes, and reranks the top candidates via Vertex AI Discovery Engine.
- Surfaces evidence to the planner with full provenance: source kind, document/attachment ID, heading path, page range, parent-email metadata.
- Feeds the memory graph: a background job extracts facts from indexed documents (e.g. "Home insurance expires June 2027") and stores them with `source_ref_type='document'` provenance.
This is the v2 document agent, which replaces the v1 LangGraph `DocumentSearchAgent` delegation flow. v2 is feature-flagged (`DOCUMENT_SEARCH_V2_ENABLED`), and the agentic supervisor selects it when the flag is on.
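The chunking step in the list above (structure-first cuts, with character-based splitting only as a fallback) can be sketched roughly as follows. This is a minimal illustration, not the code in `services/document_chunking.py`: the `Chunk` class, the `chunk_sections` function, the `(heading_path, text)` input shape, and the `max_chars` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    heading_path: tuple[str, ...]  # hypothetical evidence metadata


def chunk_sections(sections, max_chars=1200):
    """Cut on section boundaries first; split by characters only when a
    single section is larger than max_chars (an assumed limit)."""
    chunks = []
    for heading_path, text in sections:
        if len(text) <= max_chars:
            # The section fits: keep the structural boundary as the cut.
            chunks.append(Chunk(text, tuple(heading_path)))
            continue
        # Fallback: fixed-size character windows inside the oversized section.
        for start in range(0, len(text), max_chars):
            window = text[start:start + max_chars]
            chunks.append(Chunk(window, tuple(heading_path)))
    return chunks
```

Every chunk keeps the heading path of the section it came from, so a fallback split does not lose its provenance.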
Who It Serves
| Persona | Need |
|---|---|
| End users | Ask questions about uploaded files and attachments; have Swisper draft and save documents; have document content quietly improve Swisper's memory. |
| Product owners | Understand what's searchable, how scoping works (workspace + avatar), and which capabilities are MLP vs. roadmap. |
| Backend developers | Add a new format parser, tune fusion / reranker, or extend the planner-facing tool surface. |
Key v2 Capabilities
- Planner-native tools: `search_documents`, `get_document`, `list_documents`, `create_document` are individual `BaseTool` subclasses registered with the agentic supervisor's FC loop. The planner can interleave them with memory tools in one plan ("recall person → search their documents → summarise").
- Workspace + avatar scoping: every read and write is scoped at the service layer. No cross-workspace leakage path.
- Hybrid retrieval: three configurable strategies (`dense_only`, `weighted_blend`, `rrf`) on `DocumentChunk`, plus dense over `AttachmentChunk`. Recommended production: RRF with k=60.
- Semantic reranker: Vertex AI Discovery Engine `semantic-ranker-default-004` over the top `3 × top_k` candidates. Graceful degradation when the API is unavailable.
- Two embedding lanes: facts ride on `gemini-embedding-001`; documents ride on `gemini-embedding-2-preview`. Queries are embedded by the same lane as the corpus.
- Layout-aware parsing: Google Document AI Layout Parser produces structured sections with heading hierarchy. Chunks carry their `heading_path` and page range as evidence metadata.
- Document creation: `create_document` writes Markdown through the same parse/chunk/embed pipeline as uploads, with `source_type=GENERATED` provenance.
- Document → Fact pipeline: `ExtractDocumentFactsJob` extracts facts from allowlisted documents, embeds them on the `embedding` lane, and links them to existing Person records. Runs async; never blocks document availability.
- Email attachment unification: `AttachmentChunk` is searched alongside `DocumentChunk`; results carry parent-email metadata (subject, sender) for traceability.
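For readers new to Reciprocal Rank Fusion, the `rrf` strategy with k=60 can be illustrated with a short sketch. It shows the standard RRF formula (each list contributes `1 / (k + rank)` per item) and is not Swisper's implementation; the function name `rrf_fuse`, the chunk IDs, and the two example rankings are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    score(chunk) = sum over lists of 1 / (k + rank), with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


dense = ["c1", "c2", "c3"]    # hypothetical semantic (pgvector) ranking
lexical = ["c3", "c1", "c4"]  # hypothetical lexical (tsvector) ranking
fused = rrf_fuse([dense, lexical])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists (`c1`, `c3`) outrank chunks found by only one index, which is the behaviour that lets exact terms and meaning both contribute. RRF uses ranks only, so the raw scores of the two indexes never need to be normalised against each other.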
Supported File Formats
Today's parsers (`services/document_parsing/parsers/`):
| Format | Extensions | Parser |
|---|---|---|
| PDF (layout-aware) | `.pdf` | Document AI Layout Parser → fallback to `pdf.py` |
| Markdown | `.md` | `markdown.py` (header-aware) |
| Office | `.docx`, `.doc` | `office.py` |
| Spreadsheet | `.xlsx`, `.xls` | `spreadsheet.py` |
| Plain text | `.txt` | `text.py` |
| Images | `.jpg`, `.jpeg`, `.png` | `image.py` (OCR via Document AI / Gemini Vision) |
Voice-memo ingestion (`source_type=VOICE_MEMO`) is wired into the data model but does not yet have an end-to-end tested flow; see Architecture → Spec Coverage.
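The format table can be read as an extension-to-parser lookup. The sketch below shows that idea only; the mapping values are module names taken from the table, while `PARSER_BY_EXTENSION` and `pick_parser` are hypothetical names, and the real dispatch code may be organised differently.

```python
from pathlib import Path

# Extension -> parser module name, copied from the format table.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf",  # tried after the Document AI Layout Parser, as a fallback
    ".md": "markdown",
    ".docx": "office",
    ".doc": "office",
    ".xlsx": "spreadsheet",
    ".xls": "spreadsheet",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def pick_parser(filename):
    """Return the parser module name for a file, or raise ValueError
    when the extension is not in the supported-formats table."""
    extension = Path(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"unsupported file format: {filename!r}") from None
```

Matching is case-insensitive here, so `Report.PDF` resolves to the same parser as `report.pdf`.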
How It Fits in the Platform
- Agentic Supervisor: Builds the document tool list per request via `_get_document_tools()` and passes it to `global_planner_node`. Tools are scoped to the call's workspace/avatar/session.
- Memory System: `ExtractDocumentFactsJob` calls `FactAndEntityExtractionService.extract_and_store()` with document provenance. Facts persist in the same memory graph as conversational facts.
- LLM Adapter: Embeddings flow through `SwisperLLMAdapter` on two lanes (`embedding` for facts, `document_embedding` for docs). One configuration change rolls a model swap to all callers.
- Storage Gateway: `MinIOProvider` (S3-compatible) handles encrypted blob storage; metadata stays in AlloyDB.
- Background Jobs: `ExtractDocumentFactsJob` (continuous), `ReembedDocumentChunksJob` (one-shot migration from legacy embeddings).
Limits and Edge Cases
- Hard scope at workspace. A misconfigured planner call (no workspace/avatar) yields zero tools: the planner sees no document surface, not an error.
- Reranker is optional and bounded. When `DOCUMENT_RERANKER_ENABLED=False` or Discovery Engine is down, results fall back to fusion order.
- No cross-workspace search, even for the same user across multiple avatars.
- `gemini-embedding-2-preview` is Public Preview. Expect non-EU data residency until GA. Lanes can be reconfigured via `agent_configuration`.
- `analyze_document` is not yet a v2 tool. Deep single-document Q&A still goes through the v1 RAG service via `/api/v1/documents/...` REST routes.
- Self-correcting retrieval loop is not yet implemented. The results of a poor first-cycle query are returned as-is; reranking compensates partially.
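The reranker's bounded candidate pool and its fallback to fusion order can be pictured with the sketch below. It is an illustration under assumptions, not the production code: `rerank_with_fallback` and `rerank_fn` are hypothetical names, and `rerank_fn` stands in for whatever client wraps the Discovery Engine ranking call.

```python
def rerank_with_fallback(query, fused_ids, top_k, rerank_fn=None, enabled=True):
    """Rerank the top 3 * top_k fused candidates; on any reranker problem,
    return the same candidates in their original fusion order.

    rerank_fn is a hypothetical callable: (query, candidates) -> reordered list.
    """
    candidates = fused_ids[: 3 * top_k]  # bounded candidate pool
    if not enabled or rerank_fn is None:
        return candidates[:top_k]
    try:
        return rerank_fn(query, candidates)[:top_k]
    except Exception:
        # Reranker unavailable: degrade to fusion order instead of failing.
        return candidates[:top_k]
```

The caller always gets up to `top_k` results back, whether or not the reranker ran, so an outage changes ordering quality but not availability.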
Frequently Asked Questions
Q: How is v2 turned on?
A: Set `DOCUMENT_SEARCH_V2_ENABLED=True` (env or `agent_configuration` row). The supervisor will register the four document tools per request as long as a workspace and avatar are present.
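The two conditions in that answer (flag on, scope present) amount to a small gate. The sketch below uses a hypothetical helper, `document_tool_names`, to show the decision only; it does not reproduce `_get_document_tools()` or the `BaseTool` classes, and the parameter names are assumptions.

```python
DOCUMENT_TOOL_NAMES = (
    "search_documents",
    "get_document",
    "list_documents",
    "create_document",
)


def document_tool_names(v2_enabled, workspace_id, avatar_id):
    """Return the document tool names to register for one request."""
    if not v2_enabled:
        return []  # flag off: the v2 tools are not registered
    if not workspace_id or not avatar_id:
        return []  # missing scope: zero tools, not an error
    return list(DOCUMENT_TOOL_NAMES)
```

Returning an empty list for a missing workspace or avatar mirrors the documented behaviour that the planner sees no document surface instead of receiving an error.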
Q: Which embedding model does what?
A: `gemini-embedding-001` for facts (the `embedding` lane); `gemini-embedding-2-preview` for documents (the `document_embedding` lane). Hard rule: queries use the same lane as the corpus.
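The "same lane as the corpus" rule is easiest to keep when indexing and querying resolve the lane through one lookup. The sketch below assumes a simple dictionary layout; the lane and model names come from this page, while `CORPUS_LANES`, `LANE_MODELS`, and `embedding_target` are hypothetical and not part of `SwisperLLMAdapter`.

```python
# Lane -> model, as described on this page.
LANE_MODELS = {
    "embedding": "gemini-embedding-001",
    "document_embedding": "gemini-embedding-2-preview",
}

# Corpus -> lane (assumed corpus labels for illustration).
CORPUS_LANES = {
    "facts": "embedding",
    "documents": "document_embedding",
}


def embedding_target(corpus):
    """Return the (lane, model) pair used for both indexing and querying
    the given corpus, so query and corpus vectors share one model."""
    lane = CORPUS_LANES[corpus]
    return lane, LANE_MODELS[lane]
```

Vectors produced by different embedding models are not comparable, which is why a query embedded on the wrong lane would score meaninglessly against the corpus.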
Q: Is the v1 `DocumentSearchAgent` removed?
A: Not yet. It is still registered via `DomainAgentRegistry` and selected when `DOCUMENT_SEARCH_V2_ENABLED=False`. Removal is gated on the flag reaching 100% in production.
Q: How do I add a new document type with custom chunking?
A: Extend the parser in `services/document_parsing/parsers/`, then thread a sizing policy through `services/document_chunking.py`. A `DocumentSourceType`-aware sizing layer is on the roadmap.
Q: Can the planner combine document search with memory recall?
A: Yes. That is the headline reason for the v2 redesign. `recall("Thomas insurance")` + `search_documents("home insurance Thomas")` in one planner turn is the canonical cross-domain pattern.
Q: How do facts get extracted from documents?
A: After indexing, eligible documents flip to `fact_extraction_status='queued'`. `ExtractDocumentFactsJob` (when enabled) picks them up, runs `FactAndEntityExtractionService` on a chunk excerpt, and stores facts with `source_ref_type='document'`. Failures don't block document availability.