Document Intelligence & RAG: Overview

Audience: Business stakeholders, product owners, analysts, new team members.

What This Module Does

Swisper turns every file a user shares — uploaded documents, email attachments, and notes Swisper writes for them — into searchable knowledge. The system:

Stores the original file encrypted in MinIO/GCS, so it can be retrieved or shared later.
Parses it with a layout-aware parser (Google Document AI), extracting structure (sections, tables, headings, page ranges).
Chunks the content with structure-first cuts, falling back to character-based splitting only when needed.
Embeds every chunk on the dedicated document_embedding lane (gemini-embedding-2-preview, 2000 dimensions).
Indexes chunks in AlloyDB with both pgvector (semantic) and tsvector (lexical / BM25) so exact terms and meaning both score.
Retrieves by hybrid search — Reciprocal Rank Fusion (RRF) across the two indexes — and reranks the top candidates via Vertex AI Discovery Engine.
Surfaces evidence to the planner with full provenance: source kind, document/attachment ID, heading path, page range, parent-email metadata.
Feeds the memory graph — a background job extracts facts from indexed documents (e.g. "Home insurance expires June 2027") and stores them with source_ref_type='document' provenance.
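
The structure-first chunking step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Swisper's implementation: it treats Markdown-style headings as section boundaries (production sections come from the layout-aware parser), and the names chunk_structure_first and max_chars are hypothetical.

```python
import re

# Assumption: a line starting with 1-6 '#' characters and a space opens a section.
HEADING = re.compile(r"#{1,6} ")


def chunk_structure_first(text: str, max_chars: int = 200) -> list[str]:
    """Cut on headings first; split by characters only when a section is too long."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A heading line closes the section collected so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)  # the structural cut fits as a single chunk
        else:
            # Fallback: fixed-size character windows for oversized sections.
            chunks.extend(
                section[i : i + max_chars] for i in range(0, len(section), max_chars)
            )
    return chunks
```

Short sections survive intact with their heading attached, which is what lets a chunk carry heading metadata as evidence; only sections longer than max_chars are cut without regard to structure.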

This is the v2 document agent, which replaces the v1 LangGraph DocumentSearchAgent delegation flow. v2 is gated by the DOCUMENT_SEARCH_V2_ENABLED feature flag, and the choice between v1 and v2 is made in the agentic supervisor.

Who It Serves

Persona	Need
End users	Ask questions about uploaded files and attachments; have Swisper draft and save documents; have document content quietly improve Swisper's memory.
Product owners	Understand what's searchable, how scoping works (workspace + avatar), and which capabilities are MLP vs. roadmap.
Backend developers	Add a new format parser, tune fusion / reranker, or extend the planner-facing tool surface.

Key v2 Capabilities

Planner-native tools — search_documents, get_document, list_documents, create_document are individual BaseTool subclasses registered with the agentic supervisor's FC loop. The planner can interleave them with memory tools in one plan ("recall person → search their documents → summarise").
Workspace + avatar scoping — every read and write is scoped at the service layer. No cross-workspace leakage path.
Hybrid retrieval — three configurable strategies (dense_only, weighted_blend, rrf) on DocumentChunk, plus dense over AttachmentChunk. Recommended production: RRF with k=60.
Semantic reranker — Vertex AI Discovery Engine semantic-ranker-default-004 over the top 3 × top_k candidates. Graceful degradation when the API is unavailable.
Two embedding lanes — facts ride on gemini-embedding-001; documents ride on gemini-embedding-2-preview. Queries are embedded by the same lane as the corpus.
Layout-aware parsing — Google Document AI Layout Parser produces structured sections with heading hierarchy. Chunks carry their heading_path and page range as evidence metadata.
Document creation — create_document writes Markdown through the same parse/chunk/embed pipeline as uploads, with source_type=GENERATED provenance.
Document → Fact pipeline — ExtractDocumentFactsJob extracts facts from allowlisted documents, embeds them on the embedding lane, and links them to existing Person records. Runs async; never blocks document availability.
Email attachment unification — AttachmentChunk is searched alongside DocumentChunk; results carry parent-email metadata (subject, sender) for traceability.
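
The rrf strategy named under hybrid retrieval can be illustrated with the standard Reciprocal Rank Fusion formula, where each chunk scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below is a generic version with k=60, not the module's actual code; the function name rrf_fuse and the string chunk IDs are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in a list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]    # hypothetical semantic (vector) ranking
lexical_hits = ["b", "d", "a"]  # hypothetical lexical (full-text) ranking
fused = rrf_fuse([dense_hits, lexical_hits])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Because RRF uses only rank positions, it needs no score normalisation between the dense and lexical indexes, and chunks that appear in both lists rise above chunks that appear in one.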

Supported File Formats

Today's parsers (services/document_parsing/parsers/):

Format	Extensions	Parser
PDF (layout-aware)	`.pdf`	Document AI Layout Parser → fallback to `pdf.py`
Markdown	`.md`	`markdown.py` (header-aware)
Office	`.docx`, `.doc`	`office.py`
Spreadsheet	`.xlsx`, `.xls`	`spreadsheet.py`
Plain text	`.txt`	`text.py`
Images	`.jpg`, `.jpeg`, `.png`	`image.py` (OCR via Document AI / Gemini Vision)
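
The table above amounts to a lookup from file extension to an ordered list of parsers. A minimal sketch of such a lookup is shown below; the dispatch function, the dictionary, and the label document_ai_layout_parser are hypothetical, and only the extensions and parser module names come from the table.

```python
from pathlib import Path

# Extension -> parser candidates in fallback order (module names from the table).
# "document_ai_layout_parser" is a hypothetical label for the Document AI path.
PARSERS_BY_EXTENSION: dict[str, tuple[str, ...]] = {
    ".pdf": ("document_ai_layout_parser", "pdf"),
    ".md": ("markdown",),
    ".docx": ("office",),
    ".doc": ("office",),
    ".xlsx": ("spreadsheet",),
    ".xls": ("spreadsheet",),
    ".txt": ("text",),
    ".jpg": ("image",),
    ".jpeg": ("image",),
    ".png": ("image",),
}


def parser_candidates(filename: str) -> tuple[str, ...]:
    """Return the parsers to try for a file, in order; reject unknown formats."""
    extension = Path(filename).suffix.lower()  # case-insensitive match
    if extension not in PARSERS_BY_EXTENSION:
        raise ValueError(f"Unsupported file format: {filename!r}")
    return PARSERS_BY_EXTENSION[extension]
```

For example, parser_candidates("Report.PDF") returns the layout parser first and pdf second, mirroring the fallback shown in the PDF row.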

Voice-memo ingestion (source_type=VOICE_MEMO) is wired into the data model but does not yet have a tested end-to-end flow; see Architecture → Spec Coverage.

How It Fits in the Platform

Agentic Supervisor — Builds the document tool list per request via _get_document_tools() and passes it to global_planner_node. Tools are scoped to the call's workspace/avatar/session.
Memory System — ExtractDocumentFactsJob calls FactAndEntityExtractionService.extract_and_store() with document provenance. Facts persist in the same memory graph as conversational facts.
LLM Adapter — Embeddings flow through SwisperLLMAdapter on two lanes (embedding for facts, document_embedding for docs). One configuration change rolls a model swap to all callers.
Storage Gateway — MinIOProvider (S3-compatible) handles encrypted blob storage; metadata stays in AlloyDB.
Background Jobs — ExtractDocumentFactsJob (continuous), ReembedDocumentChunksJob (one-shot migration from legacy embeddings).

Limits and Edge Cases

Scoping is enforced at the workspace level. A misconfigured planner call (missing workspace or avatar) yields zero tools: the planner sees no document surface rather than an error.
Reranker is optional and bounded. When DOCUMENT_RERANKER_ENABLED=False or Discovery Engine is down, results fall back to fusion order.
No cross-workspace search, even for the same user across multiple avatars.
gemini-embedding-2-preview is Public Preview. Expect non-EU data residency until GA. Lanes can be reconfigured via agent_configuration.
analyze_document is not yet a v2 tool. Deep single-document Q&A still goes through the v1 RAG service via /api/v1/documents/... REST routes.
A self-correcting retrieval loop is not yet implemented. Results from a poor first-cycle query are returned as-is; reranking compensates only partially.
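
The reranker limit listed above (optional, bounded, and falling back to fusion order) can be sketched as a thin wrapper around whatever reranking call is in use. This is an illustrative pattern only: rerank_with_fallback, its parameters, and the callable signature are assumptions, and the real Discovery Engine client is deliberately not shown.

```python
from typing import Callable, Optional, Sequence

# Hypothetical reranker signature: (query, candidates) -> reordered candidates.
Reranker = Callable[[str, Sequence[str]], Sequence[str]]


def rerank_with_fallback(
    query: str,
    fused: Sequence[str],
    top_k: int,
    reranker: Optional[Reranker],
    enabled: bool = True,
    pool_multiplier: int = 3,
) -> list[str]:
    """Rerank the top pool_multiplier * top_k candidates; keep fusion order on failure."""
    pool = list(fused[: pool_multiplier * top_k])  # bounded candidate pool
    if not enabled or reranker is None:
        return pool[:top_k]  # flag off: fusion order
    try:
        return list(reranker(query, pool))[:top_k]
    except Exception:
        # Reranker unavailable: degrade gracefully to fusion order.
        return pool[:top_k]
```

Bounding the pool keeps the reranking call's cost proportional to top_k, and catching failures at this single point means callers never need to know whether reranking actually ran.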

Frequently Asked Questions

Q: How is v2 turned on? A: Set DOCUMENT_SEARCH_V2_ENABLED=True (env or agent_configuration row). The supervisor will register the four document tools per request as long as a workspace and avatar are present.
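
The gating rule in this answer can be sketched as a small function. It is a hypothetical stand-in for the supervisor's _get_document_tools(), returning tool names instead of BaseTool instances; the parameter names are assumptions.

```python
from typing import Optional

DOCUMENT_TOOL_NAMES = (
    "search_documents",
    "get_document",
    "list_documents",
    "create_document",
)


def get_document_tool_names(
    v2_enabled: bool,
    workspace_id: Optional[str],
    avatar_id: Optional[str],
) -> list[str]:
    """Return the document tools to register for one request."""
    if not v2_enabled:
        return []  # flag off: no v2 document tools are registered
    if not workspace_id or not avatar_id:
        return []  # missing scope: no document surface, and no error
    return list(DOCUMENT_TOOL_NAMES)
```

Returning an empty list rather than raising matches the behaviour described under limits: a call without a workspace or avatar simply sees no document tools.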

Q: Which embedding model does what? A: gemini-embedding-001 for facts (the embedding lane); gemini-embedding-2-preview for documents (the document_embedding lane). Hard rule: queries use the same lane as the corpus.
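
One way to make the same-lane rule hard to violate is to derive the query lane from the corpus being searched, so callers never pick a model directly. The sketch below assumes a hypothetical Corpus enum and lookup tables; only the lane and model names come from this page.

```python
from enum import Enum


class Corpus(Enum):
    FACTS = "facts"
    DOCUMENTS = "documents"


# Lane per corpus and model per lane, as described in the answer above.
LANE_BY_CORPUS = {
    Corpus.FACTS: "embedding",
    Corpus.DOCUMENTS: "document_embedding",
}
MODEL_BY_LANE = {
    "embedding": "gemini-embedding-001",
    "document_embedding": "gemini-embedding-2-preview",
}


def query_embedding_model(corpus: Corpus) -> str:
    """Pick the query model from the corpus so query and index vectors match."""
    lane = LANE_BY_CORPUS[corpus]
    return MODEL_BY_LANE[lane]
```

Vectors from different embedding models are not comparable, so a query embedded on the wrong lane would produce meaningless similarity scores rather than an obvious error.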

Q: Is the v1 DocumentSearchAgent removed? A: Not yet. It is still registered via DomainAgentRegistry and selected when DOCUMENT_SEARCH_V2_ENABLED=False. Removal is gated on the flag reaching 100% in production.

Q: How do I add a new document type with custom chunking? A: Extend the parser in services/document_parsing/parsers/, then thread a sizing policy through services/document_chunking.py. A DocumentSourceType-aware sizing layer is on the roadmap.

Q: Can the planner combine document search with memory recall? A: Yes — that is the headline reason for the v2 redesign. recall("Thomas insurance") + search_documents("home insurance Thomas") in one planner turn is the canonical cross-domain pattern.

Q: How do facts get extracted from documents? A: After indexing, eligible documents flip to fact_extraction_status='queued'. ExtractDocumentFactsJob (when enabled) picks them up, runs FactAndEntityExtractionService on a chunk excerpt, and stores facts with source_ref_type='document'. Failures don't block document availability.
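
The non-blocking behaviour in this answer can be sketched as a loop over queued documents. This is a simplified in-memory illustration, not the real ExtractDocumentFactsJob: the Doc dataclass, the extract callable, and the status values other than 'queued' are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Doc:
    doc_id: str
    searchable: bool = True  # set at indexing time, before extraction runs
    fact_extraction_status: str = "queued"
    facts: list[str] = field(default_factory=list)


def run_fact_extraction(docs: Iterable[Doc], extract: Callable[[Doc], list[str]]) -> None:
    """Process queued documents; a failure marks the document but never hides it."""
    for doc in docs:
        if doc.fact_extraction_status != "queued":
            continue  # only eligible documents are picked up
        try:
            doc.facts = extract(doc)
            doc.fact_extraction_status = "completed"  # hypothetical status value
        except Exception:
            doc.fact_extraction_status = "failed"  # hypothetical status value
        # doc.searchable is never touched here: availability is independent.
```

Keeping extraction state in its own field, separate from searchability, is what allows the job to fail or lag without affecting document search.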