
Document Intelligence & RAG — Overview

Audience: Business stakeholders, product owners, analysts, new team members.


What This Module Does

Swisper turns every file a user shares — uploaded documents, email attachments, and notes Swisper writes for them — into searchable knowledge. The system:

  1. Stores the original file encrypted in MinIO/GCS, so it can be retrieved or shared later.
  2. Parses it with a layout-aware parser (Google Document AI), extracting structure (sections, tables, headings, page ranges).
  3. Chunks the content with structure-first cuts, falling back to character-based splitting only when needed.
  4. Embeds every chunk on the dedicated document_embedding lane (gemini-embedding-2-preview, 2000 dim).
  5. Indexes chunks in AlloyDB with both pgvector (semantic) and tsvector (lexical / BM25) so exact terms and meaning both score.
  6. Retrieves by hybrid search — Reciprocal Rank Fusion (RRF) across the two indexes — and reranks the top candidates via Vertex AI Discovery Engine.
  7. Surfaces evidence to the planner with full provenance: source kind, document/attachment ID, heading path, page range, parent-email metadata.
  8. Feeds the memory graph — a background job extracts facts from indexed documents (e.g. "Home insurance expires June 2027") and stores them with source_ref_type='document' provenance.
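
Step 3 can be illustrated with a minimal sketch. It assumes a hypothetical Section record (heading path plus text) and an arbitrary max_chars budget. Neither name comes from the Swisper codebase, and the real chunker may size and split differently.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical parsed section; field names are illustrative only."""
    heading_path: list[str]
    text: str


def chunk_sections(sections: list[Section], max_chars: int = 1200) -> list[dict]:
    """Cut on section boundaries first; fall back to character windows
    only when a single section is longer than max_chars."""
    chunks = []
    for section in sections:
        if not section.text.strip():
            continue  # nothing to index in an empty section
        if len(section.text) <= max_chars:
            pieces = [section.text]
        else:
            # Fallback: fixed-size character windows for oversized sections.
            pieces = [
                section.text[i:i + max_chars]
                for i in range(0, len(section.text), max_chars)
            ]
        for piece in pieces:
            # Every chunk keeps its heading path as evidence metadata.
            chunks.append({"heading_path": section.heading_path, "text": piece})
    return chunks
```

In this sketch a section that fits the budget becomes exactly one chunk, so heading boundaries are never crossed; only oversized sections are split by character count.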

This is the v2 document agent, which replaces the v1 LangGraph DocumentSearchAgent delegation flow. v2 is gated behind the DOCUMENT_SEARCH_V2_ENABLED feature flag, and the agentic supervisor selects it when the flag is on.


Who It Serves

| Persona | Need |
|---|---|
| End users | Ask questions about uploaded files and attachments; have Swisper draft and save documents; have document content quietly improve Swisper's memory. |
| Product owners | Understand what's searchable, how scoping works (workspace + avatar), and which capabilities are MLP vs. roadmap. |
| Backend developers | Add a new format parser, tune fusion / reranker, or extend the planner-facing tool surface. |

Key v2 Capabilities

  • Planner-native tools: search_documents, get_document, list_documents, create_document are individual BaseTool subclasses registered with the agentic supervisor's FC loop. The planner can interleave them with memory tools in one plan ("recall person → search their documents → summarise").
  • Workspace + avatar scoping: every read and write is scoped at the service layer. No cross-workspace leakage path.
  • Hybrid retrieval: three configurable strategies (dense_only, weighted_blend, rrf) on DocumentChunk, plus dense over AttachmentChunk. Recommended production: RRF with k=60.
  • Semantic reranker: Vertex AI Discovery Engine semantic-ranker-default-004 over the top 3 × top_k candidates. Graceful degradation when the API is unavailable.
  • Two embedding lanes: facts ride on gemini-embedding-001; documents ride on gemini-embedding-2-preview. Queries are embedded by the same lane as the corpus.
  • Layout-aware parsing: Google Document AI Layout Parser produces structured sections with heading hierarchy. Chunks carry their heading_path and page range as evidence metadata.
  • Document creation: create_document writes Markdown through the same parse/chunk/embed pipeline as uploads, with source_type=GENERATED provenance.
  • Document → Fact pipeline: ExtractDocumentFactsJob extracts facts from allowlisted documents, embeds them on the embedding lane, and links them to existing Person records. It runs asynchronously and never blocks document availability.
  • Email attachment unification: AttachmentChunk is searched alongside DocumentChunk; results carry parent-email metadata (subject, sender) for traceability.
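
The rrf strategy and the reranker window mentioned above can be sketched as follows. This is the standard Reciprocal Rank Fusion formula, score(d) = Σ 1 / (k + rank), with the recommended k=60. The function names and the use of plain chunk IDs are assumptions for illustration, not the service's actual interface.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over several ranked lists of chunk IDs.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank_window(fused: list[str], top_k: int) -> list[str]:
    """Candidates handed to the reranker: the top 3 x top_k fused results."""
    return fused[: 3 * top_k]
```

A chunk that appears in both the dense and the lexical list accumulates two contributions, so it outranks a chunk that is first in only one list. Because RRF uses ranks rather than raw scores, the two indexes do not need comparable score scales.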

Supported File Formats

Today's parsers (services/document_parsing/parsers/):

| Format | Extensions | Parser |
|---|---|---|
| PDF (layout-aware) | .pdf | Document AI Layout Parser → fallback to pdf.py |
| Markdown | .md | markdown.py (header-aware) |
| Office | .docx, .doc | office.py |
| Spreadsheet | .xlsx, .xls | spreadsheet.py |
| Plain text | .txt | text.py |
| Images | .jpg, .jpeg, .png | image.py (OCR via Document AI / Gemini Vision) |
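
As an illustration of the table above, the extension-to-parser mapping could be expressed as a lookup. The module names are taken from the table; the PARSER_BY_EXTENSION dict and the parser_for function are hypothetical and do not reflect how the package actually dispatches.

```python
from pathlib import Path

# Extension -> parser module name, copied from the format table.
# Hypothetical structure: the real package may dispatch differently.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf",  # fallback path; Document AI Layout Parser is tried first
    ".md": "markdown",
    ".docx": "office",
    ".doc": "office",
    ".xlsx": "spreadsheet",
    ".xls": "spreadsheet",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def parser_for(filename: str) -> str:
    """Return the parser module name for a file, matching case-insensitively."""
    ext = Path(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext or filename!r}") from None
```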

Voice-memo ingestion (source_type=VOICE_MEMO) is wired into the data model but does not yet have an end-to-end tested flow; see Architecture → Spec Coverage.


How It Fits in the Platform

  • Agentic Supervisor: builds the document tool list per request via _get_document_tools() and passes it to global_planner_node. Tools are scoped to the call's workspace/avatar/session.
  • Memory System: ExtractDocumentFactsJob calls FactAndEntityExtractionService.extract_and_store() with document provenance. Facts persist in the same memory graph as conversational facts.
  • LLM Adapter: embeddings flow through SwisperLLMAdapter on two lanes (embedding for facts, document_embedding for docs). One configuration change rolls a model swap to all callers.
  • Storage Gateway: MinIOProvider (S3-compatible) handles encrypted blob storage; metadata stays in AlloyDB.
  • Background Jobs: ExtractDocumentFactsJob (continuous), ReembedDocumentChunksJob (one-shot migration from legacy embeddings).

Limits and Edge Cases

  • Hard scope at workspace. A misconfigured planner call (no workspace/avatar) yields zero tools — the planner sees no document surface, not an error.
  • Reranker is optional and bounded. When DOCUMENT_RERANKER_ENABLED=False or Discovery Engine is down, results fall back to fusion order.
  • No cross-workspace search, even for the same user across multiple avatars.
  • gemini-embedding-2-preview is Public Preview. Expect non-EU data residency until GA. Lanes can be reconfigured via agent_configuration.
  • analyze_document is not yet a v2 tool. Deep single-document Q&A still goes through the v1 RAG service via /api/v1/documents/... REST routes.
  • Self-correcting retrieval loop is not yet implemented. A poor first-cycle query is returned as-is; reranking compensates partially.
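
The reranker fallback described above can be sketched as below. This is a minimal illustration, assuming the reranker is any callable that reorders a candidate list. The function name, signature, and broad exception handling are assumptions, not the service's actual code.

```python
from typing import Callable, Optional, Sequence


def rank_candidates(
    fused: Sequence[str],
    top_k: int,
    reranker: Optional[Callable[[Sequence[str]], list[str]]],
    reranker_enabled: bool = True,
) -> list[str]:
    """Rerank the top 3 x top_k fused candidates, or keep fusion order.

    Fusion order is returned when the reranker is disabled, missing,
    or raises (for example because the remote API is down).
    """
    candidates = list(fused[: 3 * top_k])
    if reranker_enabled and reranker is not None:
        try:
            return reranker(candidates)[:top_k]
        except Exception:
            # Illustrative catch-all; real code would catch narrower errors
            # and log the failure before degrading.
            pass
    return candidates[:top_k]
```

The design point is that a reranker outage changes result ordering but never turns a successful retrieval into an error.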

Frequently Asked Questions

Q: How is v2 turned on? A: Set DOCUMENT_SEARCH_V2_ENABLED=True (as an environment variable or an agent_configuration row). The supervisor then registers the four document tools on each request, provided a workspace and avatar are present.
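
The gating rule in this answer can be sketched as a small function. The tool names come from this page; the function itself is a hypothetical stand-in for _get_document_tools() and returns names rather than BaseTool instances.

```python
from typing import Optional

DOCUMENT_TOOL_NAMES = [
    "search_documents",
    "get_document",
    "list_documents",
    "create_document",
]


def document_tool_names(
    v2_enabled: bool,
    workspace_id: Optional[str],
    avatar_id: Optional[str],
) -> list[str]:
    """Return the v2 document tools for a request, or none at all.

    A missing workspace or avatar yields an empty list rather than an
    error, so the planner simply sees no document surface.
    """
    if not v2_enabled:
        return []
    if not workspace_id or not avatar_id:
        return []
    return list(DOCUMENT_TOOL_NAMES)
```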

Q: Which embedding model does what? A: gemini-embedding-001 for facts (the embedding lane); gemini-embedding-2-preview for documents (the document_embedding lane). Hard rule: queries use the same lane as the corpus.
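
The same-lane rule can be expressed as a lookup, sketched here. The lane and model names are from this page; the dictionaries, corpus keys, and query_model_for function are illustrative assumptions, not the adapter's configuration format.

```python
# Lane -> model, as described on this page.
LANE_MODELS = {
    "embedding": "gemini-embedding-001",
    "document_embedding": "gemini-embedding-2-preview",
}

# Corpus -> lane. The corpus keys are hypothetical labels.
CORPUS_LANES = {
    "facts": "embedding",
    "documents": "document_embedding",
}


def query_model_for(corpus: str) -> str:
    """Pick the query embedding model from the corpus being searched,
    so query and corpus vectors always come from the same lane."""
    lane = CORPUS_LANES[corpus]
    return LANE_MODELS[lane]
```

Resolving the model through the corpus, rather than letting callers pass a model name, is one way to make a cross-lane query impossible by construction.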

Q: Is the v1 DocumentSearchAgent removed? A: Not yet. It is still registered via DomainAgentRegistry and selected when DOCUMENT_SEARCH_V2_ENABLED=False. Removal is gated on the flag reaching 100% in production.

Q: How do I add a new document type with custom chunking? A: Extend the parser in services/document_parsing/parsers/, then thread a sizing policy through services/document_chunking.py. A DocumentSourceType-aware sizing layer is on the roadmap.

Q: Can the planner combine document search with memory recall? A: Yes — that is the headline reason for the v2 redesign. recall("Thomas insurance") + search_documents("home insurance Thomas") in one planner turn is the canonical cross-domain pattern.

Q: How do facts get extracted from documents? A: After indexing, eligible documents flip to fact_extraction_status='queued'. ExtractDocumentFactsJob (when enabled) picks them up, runs FactAndEntityExtractionService on a chunk excerpt, and stores facts with source_ref_type='document'. Failures don't block document availability.