
Document Intelligence & RAG — Architecture

Audience: Architects, tech leads, senior engineers evaluating the v2 document agent and its delta against the v1 DocumentSearchAgent.


Context and Purpose

The document agent is Swisper's interface between the planner and the user's document corpus — uploaded files, ingested email attachments, and Swisper-generated documents. It must do three things at once:

  1. Stay in the FC loop. Document operations have to interleave freely with memory recall, calendar lookups, and other tools in a single planner turn. "Find the insurance policy Thomas sent and remind me when it expires" must not need two agent hand-offs.
  2. Honour workspace boundaries. Every retrieval is scoped to a workspace and avatar. There is no cross-workspace leakage path, even on a misconfigured planner call.
  3. Keep retrieval honest. Vector-only cosine similarity misses exact matches (policy numbers, names, dates). Hybrid retrieval and a Discovery Engine reranker close that gap, with all components individually feature-flagged for safe rollout.

The v1 DocumentSearchAgent (Pattern A delegation, reactive LangGraph subgraph) cannot satisfy any of these without architectural changes. The v2 agent is a clean break.


v1 vs v2 — What Changed

| Concern | v1 (agents/doc/) | v2 (agents/document/) |
| --- | --- | --- |
| Integration model | DomainAgentInterface registered with DomainAgentRegistry. Planner emits delegate_to_document_search_agent; supervisor delegates the entire turn. | Pattern 2: individual BaseTool subclasses registered with the planner's FC loop. Tools interleave with memory tools in one plan. |
| Internal control flow | LangGraph subgraph: doc_indexing_node → document_search_planner_node → doc_tool_execution_node (loop, max 5 iterations). | Each tool is a thin async method on a service. No internal LangGraph. |
| Scoping | user_id only. Hard-coded limit=20 documents fetched up-front. | workspace_id + avatar_id enforced at the service layer (DocumentSearchService). No up-front fetch. |
| Retrieval | Cosine distance only (RAGService.retrieve). | Three configurable fusion strategies: dense_only, weighted_blend (alpha-blend), rrf (Reciprocal Rank Fusion, k=60). Fusion applies to DocumentChunk; AttachmentChunk hits always come from dense search and are merged in. |
| Reranking | None. | Optional Vertex AI Discovery Engine semantic-ranker-default-004, runs on top 3 × top_k candidates after fusion. Graceful degradation on failure. |
| Embedding | Single embedding config (gemini-embedding-001, 2000 dim). | Two distinct lanes: embedding (gemini-embedding-001) for facts and document_embedding (gemini-embedding-2-preview) for documents. Hard rule: queries are embedded by the same lane as the corpus. |
| Tool surface | semantic_search, document_summary. | search_documents, get_document, list_documents, create_document. (analyze_document from the spec is not yet a Pattern 2 tool; see Spec Coverage.) |
| Document creation | Not supported. | create_document writes Markdown through the standard parse/chunk/embed pipeline; source_type=GENERATED provenance. |
| Doc → memory linkage | None. Documents and facts are isolated graphs. | ExtractDocumentFactsJob extracts facts from indexed documents and persists them with source_ref_type='document' provenance. |
| Failure modes | Fan-out exceptions, max-iterations error states without explanation. | Graceful degradation per layer (reranker fail → fusion order; encryption fail → search succeeds; fact extraction fail → document still searchable). |
| Rollout | All-or-nothing: the agent is in the DomainAgentRegistry or it isn't. | Feature-flagged: DOCUMENT_SEARCH_V2_ENABLED switches between the two paths at the supervisor layer. Both can coexist during migration. |

The v1 path is preserved on feature/voice-v2-gemini-live-and-friends so today's UAT stays unbroken; v2 is selected in the supervisor when the flag is on.


v2 Architecture Overview

graph TD
    subgraph Planner["Agentic Supervisor (FC loop)"]
        P["global_planner_node"]
        TN["tool_node"]
    end

    subgraph DocTools["Document tools (Pattern 2 — agents/document/)"]
        SD["search_documents"]
        GD["get_document"]
        LD["list_documents"]
        CD["create_document"]
    end

    subgraph Services["Domain services"]
        DSS["DocumentSearchService<br/>(services/document_search.py)"]
        IDX["IndexingService<br/>(services/file_indexing.py)"]
        RR["RerankerService<br/>(services/reranker.py)"]
        FE["FactAndEntityExtraction<br/>Service"]
    end

    subgraph Gateways["Gateways / external"]
        LLM["SwisperLLMAdapter<br/>(embedding lanes)"]
        DocAI["Document AI<br/>(layout parser)"]
        DE["Vertex Discovery Engine<br/>(reranker)"]
    end

    subgraph Storage["Persistence"]
        PG[("AlloyDB:<br/>documents, document_chunks,<br/>attachment_chunks (pgvector + tsvector)")]
        S3[("MinIO / GCS<br/>(encrypted blobs)")]
    end

    subgraph Jobs["Background"]
        EDFJ["ExtractDocumentFactsJob"]
        REEMB["ReembedDocumentChunksJob"]
    end

    P -->|tool_calls| TN
    TN --> SD & GD & LD & CD

    SD --> DSS
    GD & LD --> PG
    CD --> IDX

    DSS --> LLM
    DSS --> PG
    DSS --> RR --> DE

    IDX --> DocAI & LLM & PG & S3
    IDX -. IndexingCompleteEvent .-> EDFJ --> FE --> PG
    REEMB --> LLM --> PG

Wiring: agents/agentic_supervisor/agent.py builds the tool list per request via _get_document_tools() (lines ~290–325). The factory agents/document/factory.py::build_document_tools() returns the four BaseTool instances, each capturing the request's workspace_id, avatar_id, and db_session so the planner can call them without passing context.
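
The request-scoped binding described above can be illustrated with a minimal sketch. It does not reproduce the real factory: RequestContext, ListDocumentsSketch, and build_document_tools_sketch are hypothetical stand-ins, and the real tools subclass BaseTool rather than a plain class. The point is only that scope is captured at construction time, so the planner never supplies it.

```python
from dataclasses import dataclass
from typing import Any
from uuid import UUID, uuid4


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical holder for the per-request values the factory captures."""
    workspace_id: UUID
    avatar_id: UUID
    db_session: Any  # stand-in for the real database session type


class ListDocumentsSketch:
    """Stand-in for a BaseTool subclass; only the binding pattern matters."""
    name = "list_documents"

    def __init__(self, ctx: RequestContext) -> None:
        self._ctx = ctx  # captured once, never supplied by the planner

    def run(self, limit: int = 20) -> str:
        # The planner passes only `limit`; scope comes from the bound context.
        return f"listing up to {limit} documents in workspace {self._ctx.workspace_id}"


def build_document_tools_sketch(ctx: RequestContext) -> list[ListDocumentsSketch]:
    """Build a fresh tool list per request so scope is never shared."""
    return [ListDocumentsSketch(ctx)]


ctx = RequestContext(workspace_id=uuid4(), avatar_id=uuid4(), db_session=None)
tools = build_document_tools_sketch(ctx)
result = tools[0].run(limit=5)
```

Because the context is frozen and bound per request, a planner call with unexpected arguments still cannot widen the scope.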


Components and Responsibilities

Component File Responsibility
build_document_tools(...) agents/document/factory.py Request-scoped tool factory. Returns [SearchDocumentsTool, GetDocumentTool, ListDocumentsTool, CreateDocumentTool] bound to one workspace/avatar/session.
SearchDocumentsTool agents/document/tools/search_documents.py Async-only BaseTool. Calls DocumentSearchService.search(), formats EvidenceResults (heading path, page range, parent-email metadata, score) for the planner.
GetDocumentTool agents/document/tools/get_document.py Sync metadata fetch by document UUID. Workspace check on Document.workspace_id.
ListDocumentsTool agents/document/tools/list_documents.py Workspace-scoped listing ordered by created_at desc, max 100.
CreateDocumentTool agents/document/tools/create_document.py Writes Markdown bytes through IndexingService.process_and_index_document() with source_type=GENERATED. Same parse → chunk → embed path as uploads.
DocumentSearchService services/document_search.py Owns embedding-lane selection (document_embedding), fusion strategy (dense_only / weighted_blend / rrf), AttachmentChunk join, and reranker invocation. Returns list[EvidenceResult].
IndexingService services/file_indexing.py Upload → store (S3) → parse (Document AI Layout Parser, with format-specific fallbacks) → chunk → embed (document_embedding) → persist. Emits IndexingCompleteEvent. Populates search_vector against plaintext before PGPString encryption.
RerankerService services/reranker.py Discovery Engine adapter. semantic-ranker-default-004 over the top 3 × top_k candidates. Logs and falls back on failure.
ExtractDocumentFactsJob jobs/extract_document_facts_job.py Background job. Picks documents with fact_extraction_status='queued', allowlists by classification, calls FactAndEntityExtractionService with source_ref_type='document'. Embeds facts on the embedding lane (not document_embedding).
ReembedDocumentChunksJob jobs/reembed_document_chunks_job.py One-shot migration job: re-embeds legacy chunks into the document_embedding lane (gemini-embedding-2-preview).

Tool Catalogue (Planner-Facing)

The planner sees four tools when DOCUMENT_SEARCH_V2_ENABLED=True and a workspace context exists. Tool descriptions are loaded from each BaseTool.description; the planner picks among them based on user intent.

search_documents — primary retrieval tool

SearchDocumentsInput:
    query: str                  # natural-language query
    top_k: int = 5              # 1..20

Returns a numbered list of evidence entries. Each entry contains source kind (document | email_attachment), source UUID, optional Section: heading path, optional Pages:, optional parent email subject/sender, relevance score, and the chunk text.

Internal flow (DocumentSearchService.search):

  1. Embed query on the document_embedding lane.
  2. (Optional) apply EncryptionContext.transform_query for encrypted-at-rest corpora.
  3. Run the configured fusion strategy on DocumentChunk (workspace-scoped). Always run dense search on AttachmentChunk (no tsvector column).
  4. Merge, sort by distance, keep top 3 × top_k candidates.
  5. If DOCUMENT_RERANKER_ENABLED=True, rerank via Discovery Engine; on failure return fusion order.
  6. Return top top_k.
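
The merge, candidate-pool, and rerank-fallback steps of this flow can be sketched as follows. This is an illustration under stated assumptions, not the service code: Evidence, select_evidence, and broken_reranker are hypothetical names, embedding and fusion are assumed to have already produced per-hit distances, and the reranker is modelled as a plain callable that may raise.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Evidence:
    source_id: str
    distance: float  # lower means closer to the query


Reranker = Callable[[str, Sequence[Evidence]], list[Evidence]]


def select_evidence(
    query: str,
    document_hits: Sequence[Evidence],
    attachment_hits: Sequence[Evidence],
    top_k: int,
    reranker: Optional[Reranker] = None,
) -> list[Evidence]:
    # Merge both sources and sort by distance (closest first).
    merged = sorted([*document_hits, *attachment_hits], key=lambda e: e.distance)
    # Keep a pool of 3 x top_k candidates for the optional reranker.
    candidates = merged[: 3 * top_k]
    if reranker is not None:
        try:
            candidates = reranker(query, candidates)
        except Exception:
            # Graceful degradation: keep the fusion order on any reranker failure.
            pass
    return candidates[:top_k]


def broken_reranker(query: str, candidates: Sequence[Evidence]) -> list[Evidence]:
    raise TimeoutError("simulated reranker outage")


docs = [Evidence("doc-1", 0.10), Evidence("doc-2", 0.40)]
attachments = [Evidence("att-1", 0.25)]
top = select_evidence("policy number", docs, attachments, top_k=2, reranker=broken_reranker)
```

With the simulated outage, the result is simply the two closest hits in fusion order.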

get_document

GetDocumentInput:
    document_id: str            # document UUID

Returns title, filename, type, size, created timestamp, summary, and processing status. Returns a "not found in this workspace" message if the workspace check fails — the result is identical to "does not exist" so the planner cannot probe across workspaces.
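
A minimal sketch of the indistinguishable "not found" behaviour, assuming an in-memory dictionary in place of the database. The message text, the store layout, and the function name get_document_sketch are illustrative assumptions, not the tool's actual strings or schema.

```python
NOT_FOUND = "Document not found in this workspace."


def get_document_sketch(store: dict, document_id: str, bound_workspace_id: str) -> str:
    """Return metadata only if the document belongs to the bound workspace."""
    doc = store.get(document_id)
    # A missing row and a row in another workspace produce the same answer,
    # so the caller learns nothing about documents outside its scope.
    if doc is None or doc["workspace_id"] != bound_workspace_id:
        return NOT_FOUND
    return f"Title: {doc['title']}"


store = {
    "d1": {"workspace_id": "ws-a", "title": "Home insurance policy"},
    "d2": {"workspace_id": "ws-b", "title": "Payslip"},
}
own = get_document_sketch(store, "d1", "ws-a")
foreign = get_document_sketch(store, "d2", "ws-a")
missing = get_document_sketch(store, "d9", "ws-a")
```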

list_documents

ListDocumentsInput:
    limit: int = 20             # 1..100

Returns a header line plus one row per document with ID, title, filename, type, size, and created timestamp. Workspace-scoped; ordered by created_at desc.

create_document

CreateDocumentInput:
    title: str                  # 1..500 chars
    content: str                # Markdown, min 1 char

Writes the content as <sanitised_title>.md through IndexingService.process_and_index_document() with source_type=GENERATED. The document is searchable on the next search_documents call. Errors return a string starting with "Failed to create document"; success returns the new document_id.

How the planner uses them

The planner is encouraged (via tool descriptions) to:

  • Reach for search_documents whenever the user asks about uploaded content, attachments, or generated notes.
  • Compose with memory tools — e.g. recall("Thomas insurance") followed by search_documents("home insurance Thomas") and use the union to answer.
  • Use get_document after search_documents only when the user wants metadata (title, summary, status) about a hit.
  • Use list_documents for "what do I have on file?" or as a fallback when search returns nothing.
  • Use create_document only when the user explicitly asks Swisper to save a structured artefact (meeting notes, summary, report) — not for chat replies.

Embedding Lanes

Two cohorts, two lanes — enforced at adapter level via agent_configuration rows.

| Lane | Adapter call | Model | Dim | Used by |
| --- | --- | --- | --- | --- |
| embedding | SwisperLLMAdapter("__service__", "embedding") | gemini-embedding-001 | 2000 | Facts (UserFact, ExtractedFact), system facts, fact extraction (incl. ExtractDocumentFactsJob outputs). |
| document_embedding | SwisperLLMAdapter("__service__", "document_embedding") | gemini-embedding-2-preview | 2000 (MRL-truncated from 3072) | DocumentChunk, AttachmentChunk, query-time embedding inside DocumentSearchService. |

Hard rule (Spec §6.9): A query MUST be embedded on the same lane as the corpus it targets. DocumentSearchService._embed_query is the only place a document query is embedded; it always uses document_embedding. Don't paste SwisperLLMAdapter("__service__", "embedding") into search code paths.

The Vector(2000) column width is unchanged across both lanes — the migration from Azure embeddings to gemini-embedding-2-preview is content-only and runs through ReembedDocumentChunksJob.
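
To make the "MRL-truncated from 3072" note concrete, here is a sketch of Matryoshka-style truncation: keep the leading dimensions and re-normalise to unit length. Whether the adapter or the embedding provider performs this step is not stated in this document, and truncate_and_normalise is a hypothetical helper.

```python
import math


def truncate_and_normalise(vector: list[float], dim: int = 2000) -> list[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if len(vector) < dim:
        raise ValueError(f"expected at least {dim} dimensions, got {len(vector)}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in head]


full = [1.0] * 3072          # placeholder for a 3072-dimensional embedding
stored = truncate_and_normalise(full)
```

The truncated vector fits the unchanged Vector(2000) column; query vectors would need the same treatment so that both sides of the distance computation live in the same space.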


Hybrid Retrieval

Three fusion strategies, selected via DOCUMENT_FUSION_STRATEGY (with DOCUMENT_HYBRID_RETRIEVAL_ENABLED=False forcing dense_only regardless):

| Strategy | Score formula | When it wins |
| --- | --- | --- |
| dense_only | cosine_distance(query_embedding, chunk.embedding) | Pure semantic recall, no exact-match signal needed. Default before Phase-1 hybrid was on. |
| weighted_blend | α · dense_sim + (1−α) · ts_rank (default α = 0.7 via DOCUMENT_HYBRID_RETRIEVAL_ALPHA) | Mixed corpora where you want a tunable knob between meaning and keywords. Sensitive to score-scale drift between dense and lexical. |
| rrf | Σ 1 / (k + rank_i) over the dense ranking and the BM25 ranking, k=60 | Recommended production default. Score-scale-independent: only positions matter. Robust across query types. |
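
The two hybrid formulas in the table can be written out directly. This is a standalone sketch over in-memory rankings, not the SQL the service runs; rrf_fuse and weighted_blend are illustrative names, and ranks are assumed to start at 1.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def weighted_blend(dense_sim: float, ts_rank: float, alpha: float = 0.7) -> float:
    """alpha * dense_sim + (1 - alpha) * ts_rank, as in the table."""
    return alpha * dense_sim + (1.0 - alpha) * ts_rank


dense_ranking = ["a", "b", "c"]      # best dense hit first
lexical_ranking = ["c", "a", "d"]    # best lexical hit first
fused = rrf_fuse([dense_ranking, lexical_ranking])
fused_order = [chunk_id for chunk_id, _ in fused]
```

Chunk "a" wins because it ranks first and second; "c" follows with ranks three and one. Only positions enter the RRF score, which is why it is insensitive to the scale mismatch that weighted_blend has to live with.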

Lexical signal: AlloyDB tsvector column on DocumentChunk.search_vector, populated against plaintext before PGPString encrypts chunk_text. Attachments don't carry a tsvector and always run dense.

After fusion, the top 3 × top_k candidates flow to the reranker (when enabled). The reranker score replaces relevance_score in the returned EvidenceResults.


Workspace + Avatar Scoping

Every read and every write goes through a workspace boundary:

  • Search: DocumentSearchService adds WHERE workspace_id = :workspace_id on every chunk query (both document and attachment).
  • Get/List: tools check Document.workspace_id against the bound workspace ID. Mismatches return "not found", not a permission error — no information disclosure.
  • Create: IndexingService writes Document.workspace_id, Document.avatar_id, and propagates workspace_id to every DocumentChunk.
  • Background jobs: ExtractDocumentFactsJob reads Document.workspace_id and stores facts under the same workspace.

Indices: ix_documents_workspace_id and ix_document_chunks_workspace_id make these filters cheap.


Document → Fact Pipeline

After indexing, a document with fact_extraction_status='queued' and a classification on the configurable allowlist becomes input for ExtractDocumentFactsJob:

  1. Pull up to MAX_CHUNK_PASSAGES = 5 chunks, each truncated to MAX_PASSAGE_CHARS = 800.
  2. Hand the joined text to FactAndEntityExtractionService.extract_and_store() with the document's workspace_id/avatar_id and source_ref_type='document', source_ref_id=<document_id> provenance.
  3. Facts are deduplicated semantically (0.92 threshold) against existing facts, embedded on the embedding lane (intentionally different from the document lane), and linked to Person records by entity disambiguation.
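
The input preparation and the dedup check from the steps above can be sketched like this. The constants come from the text; the passage separator, taking the first chunks in order, and treating a similarity equal to the threshold as a duplicate are assumptions, as are the function names.

```python
import math

MAX_CHUNK_PASSAGES = 5
MAX_PASSAGE_CHARS = 800
DEDUP_THRESHOLD = 0.92


def build_extraction_text(chunks: list[str]) -> str:
    """Take at most five chunks, truncate each to 800 chars, and join them."""
    passages = [chunk[:MAX_PASSAGE_CHARS] for chunk in chunks[:MAX_CHUNK_PASSAGES]]
    return "\n\n".join(passages)  # separator is an assumption


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(candidate: list[float], existing: list[list[float]]) -> bool:
    """A new fact is a duplicate if any existing fact is at least 0.92 similar."""
    return any(cosine_similarity(candidate, e) >= DEDUP_THRESHOLD for e in existing)


text = build_extraction_text(["x" * 1000] * 7)
near_copy = is_duplicate([1.0, 0.0], [[1.0, 0.1]])
unrelated = is_duplicate([0.0, 1.0], [[1.0, 0.1]])
```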

Failure isolation: extraction failure does not block document availability. The document remains searchable via search_documents regardless of fact extraction state.

Feature flag: DOCUMENT_FACT_EXTRACTION_ENABLED gates the whole job. While off, documents stay in queued; flipping the flag picks them up on the next run.


Data Model — What's New in v2

Document (table documents):

| Field | Type | v2 purpose |
| --- | --- | --- |
| workspace_id | UUID | Tenancy boundary. Indexed. |
| avatar_id | UUID | Owner avatar within the workspace. |
| source_type | DocumentSourceType | upload / generated / image_capture / voice_memo / email_attachment. |
| processing_status | DocumentProcessingStatus | uploaded / extracting / indexed / queued / failed. |
| fact_extraction_status | FactExtractionStatus | not_applicable / queued / completed / failed. Drives ExtractDocumentFactsJob. |

DocumentChunk (table document_chunks):

| Field | Type | v2 purpose |
| --- | --- | --- |
| workspace_id | UUID | Tenancy filter on every search. Indexed. |
| search_vector | tsvector | BM25 / lexical search. Populated against plaintext before PGPString encryption. |
| heading_path | encrypted str | Layout-parser ancestor heading chain (e.g. § Coverage > Exclusions). Surfaced to the planner in evidence. |
| page_start / page_end | int | Page range surfaced as Pages: p.4-5 in evidence. |
| kind | ChunkKind | paragraph / table / list / ocr_text / image_caption / transcript / generated_text. |
| embedding | Vector(2000) | Now sourced from the document_embedding lane. Same column, new content. |

AttachmentChunk keeps the same shape and joins to Attachment → Email for parent-email provenance in evidence rows.


Configuration & Feature Flags

All keys are reachable via ConfigurationService (DB-backed) with environment-variable fallbacks defined in swisper/core/config.py.

| Key / Env var | Default | Effect |
| --- | --- | --- |
| DOCUMENT_SEARCH_V2_ENABLED | False | When True AND a workspace/avatar/db session is available, the supervisor passes the v2 tool list to the planner. When False, falls back to v1 delegation. |
| DOCUMENT_HYBRID_RETRIEVAL_ENABLED | False | Master switch for hybrid (lexical + dense). When False, DocumentSearchService forces dense_only regardless of DOCUMENT_FUSION_STRATEGY. |
| DOCUMENT_FUSION_STRATEGY | weighted_blend | One of dense_only / weighted_blend / rrf. Recommended production: rrf. |
| DOCUMENT_HYBRID_RETRIEVAL_ALPHA | 0.7 | Weight for dense in weighted_blend. 1.0 = pure dense; 0.0 = pure lexical. |
| RRF_K | 60 | Rank-fusion constant. Lower values favour top results more aggressively. |
| DOCUMENT_RERANKER_ENABLED | True | Run Discovery Engine reranker over the top 3 × top_k candidates. Adds ~50–150 ms; falls back to fusion order on failure. |
| DOCUMENT_FACT_EXTRACTION_ENABLED | False | Gates ExtractDocumentFactsJob. While off, documents accumulate in fact_extraction_status='queued'. |

Recommended rollout sequence: turn on DOCUMENT_SEARCH_V2_ENABLED first (verify tool selection in traces), then DOCUMENT_HYBRID_RETRIEVAL_ENABLED with rrf, then DOCUMENT_FACT_EXTRACTION_ENABLED once chunk re-embedding has completed.
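
How the master switch interacts with the strategy key can be sketched as below. This is not ConfigurationService itself: the dictionary stands in for the DB-backed values, and read_setting, parse_bool, and effective_fusion_strategy are hypothetical helpers. The lookup order (DB value first, then environment variable, then default) follows the description above.

```python
import os

VALID_STRATEGIES = ("dense_only", "weighted_blend", "rrf")


def read_setting(key: str, db_values: dict[str, str], default: str) -> str:
    """DB-backed value wins; otherwise fall back to the environment, then the default."""
    if key in db_values:
        return db_values[key]
    return os.environ.get(key, default)


def parse_bool(value: str) -> bool:
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def effective_fusion_strategy(db_values: dict[str, str]) -> str:
    hybrid_on = parse_bool(read_setting("DOCUMENT_HYBRID_RETRIEVAL_ENABLED", db_values, "False"))
    if not hybrid_on:
        # The master switch overrides whatever strategy is configured.
        return "dense_only"
    strategy = read_setting("DOCUMENT_FUSION_STRATEGY", db_values, "weighted_blend")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown fusion strategy: {strategy!r}")
    return strategy


hybrid_off = effective_fusion_strategy(
    {"DOCUMENT_HYBRID_RETRIEVAL_ENABLED": "False", "DOCUMENT_FUSION_STRATEGY": "rrf"}
)
hybrid_on = effective_fusion_strategy(
    {"DOCUMENT_HYBRID_RETRIEVAL_ENABLED": "True", "DOCUMENT_FUSION_STRATEGY": "rrf"}
)
```

Setting DOCUMENT_FUSION_STRATEGY to rrf has no effect until the master switch is on, which is why the rollout sequence enables the two together.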


How to Use It

From the planner

You don't call this module directly — the supervisor exposes it. Verify in a request trace that Built N document tools for workspace=<id> is logged (see agentic_supervisor/agent.py:320) and that the planner emits a tool_call for one of the four document tools instead of delegate_to_document_search_agent.

From a service / job

Read paths go through DocumentSearchService. Construct it with a Session and call search(query, workspace_id, avatar_id, top_k); the service handles embedding, fusion, and reranking. Do not instantiate RAGService for new code — it is the v1 implementation and is being retired with the agents/doc/ agent.

Write paths go through IndexingService.process_and_index_document(file_content, filename, workspace_id, avatar_id, user_id, source_type). Pass the right DocumentSourceType so downstream consumers (e.g. fact extraction) can apply type-aware policies.

From a test

Real-LLM tests should embed against the document_embedding lane to match the corpus. Fixtures should populate workspace_id, avatar_id, and (for hybrid) the search_vector column. The plaintext-first ordering in IndexingService._populate_search_vectors is the only correct way to fill search_vector; manual fixture rows that bypass it will not match to_tsvector queries.

From the UI

The frontend keeps calling the existing /api/v1/documents/... REST endpoints (see api/routes/documents.py). Those routes still drive the v1 RAG service today. Migration to v2 is tracked separately and does not block the planner-side cutover; the chat-driven path described above is the v2 critical path.


Failure Modes & Degradation

| Failure | Behaviour |
| --- | --- |
| Reranker times out / errors | _rerank logs and returns the fusion order untouched. Search still returns results. |
| EncryptionContext.transform_query fails | The query embedding is used as-is. (Caller's responsibility to provide a working context.) |
| DOCUMENT_SEARCH_V2_ENABLED=True but missing workspace/avatar/session | _get_document_tools logs a warning and returns None. The supervisor proceeds with no document tools. The planner cannot call them: graceful "feature not available." |
| create_document indexing fails | Tool returns an error string starting with Failed to create document. Nothing is written to the chunk table for that document. |
| Fact extraction fails for a document | Document remains searchable. fact_extraction_status flips to failed. Job continues with the next document. |
| Embedding lane misconfigured | SwisperLLMAdapter raises early. Caller (search or indexing) propagates. No silent dimension mismatch. |
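
The missing-context behaviour can be sketched as a small guard. The real _get_document_tools lives in the supervisor and its signature is not shown in this document, so get_document_tools_sketch, its parameters, and the injected build_tools callable are all assumptions; returning None when the flag is off is likewise a simplification of the v1 fallback.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("supervisor_sketch")


def get_document_tools_sketch(
    v2_enabled: bool,
    workspace_id: Optional[str],
    avatar_id: Optional[str],
    db_session: Any,
    build_tools: Callable[[str, str, Any], list],
) -> Optional[list]:
    """Return the v2 tool list, or None when v2 cannot be offered for this request."""
    if not v2_enabled:
        return None  # assumed: the caller then takes the v1 delegation path
    if workspace_id is None or avatar_id is None or db_session is None:
        # Degrade instead of failing the whole planner turn.
        logger.warning("document tools unavailable: missing workspace, avatar, or session")
        return None
    return build_tools(workspace_id, avatar_id, db_session)


def fake_builder(workspace_id: str, avatar_id: str, db_session: Any) -> list:
    return ["search_documents", "get_document", "list_documents", "create_document"]


available = get_document_tools_sketch(True, "ws-a", "av-1", object(), fake_builder)
degraded = get_document_tools_sketch(True, None, "av-1", object(), fake_builder)
```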

Spec Coverage / Gap Analysis

Comparing what's shipped on feature/voice-v2-gemini-live (and integration sub-branches) against spec_FEAT_ALL_document_intelligence_v1.md:

| Spec capability | Status | Notes |
| --- | --- | --- |
| Pattern 2 tools registered with FC loop | ✅ implemented | build_document_tools() + supervisor wiring + feature flag. |
| search_documents tool | ✅ implemented | Async BaseTool, formats evidence with provenance. |
| get_document tool | ⚠️ partial | Implemented, but does not yet expose download_url, entity_names, or fact_count from the spec. |
| list_documents tool | ⚠️ partial | Implemented, but spec also asks for document_type / date_range / offset filters. Currently only limit. |
| create_document tool | ✅ implemented | Markdown-only path; same indexing pipeline as uploads; source_type=GENERATED. |
| analyze_document tool | ❌ not implemented as tool | Logic exists in RAGService.analyze_document (v1). No Pattern 2 wrapper yet. The planner cannot ask for deep single-document analysis. |
| Self-correcting retrieval (retrieve → grade → rewrite → retry, max 2 cycles) | ❌ not implemented | SearchDocumentsTool._arun calls DocumentSearchService.search and returns. Reranking compensates somewhat; LLM-as-judge grading + query rewrite loop is not present. |
| Hybrid search (vector + BM25) | ✅ implemented | weighted_blend and rrf strategies, both with workspace scoping. |
| RRF (k=60) | ✅ implemented | _search_document_chunks_rrf. |
| Reranking via Vertex AI | ✅ implemented | RerankerService over semantic-ranker-default-004. Graceful degradation. |
| Check Grounding API integration | ❌ not implemented | No GroundingGateway. Citations come from chunk metadata, not from a post-generation grounding check. |
| Workspace + avatar scoping | ✅ implemented | Enforced at DocumentSearchService, Document, and DocumentChunk levels. |
| Two-lane embedding (embedding + document_embedding) | ✅ implemented | Even cleaner than the spec: facts and documents are physically separated by adapter call site. |
| gemini-embedding-2-preview for documents | ✅ implemented | Default for document_embedding lane. |
| Layout-aware parsing (Document AI Layout Parser) | ✅ implemented | services/document_parsing/parsers/layout_parser.py + factory wiring. |
| Document-type-aware chunk sizing | ⚠️ partial | Chunking pipeline + ChunkKind taxonomy exist; per-type token targets from spec §5.5 are not all wired through configuration yet. |
| Heading-path enrichment | ✅ implemented | DocumentChunk.heading_path populated and surfaced to the planner. |
| Multimodal ingestion (image + voice) | ⚠️ in progress | DocumentSourceType.IMAGE_CAPTURE and VOICE_MEMO exist; parser hooks present; end-to-end ingestion paths and embedding-via-multimodal are not yet a single tested flow. |
| MinIO/S3 physical storage | ✅ implemented | gateways/storage/providers/minio.py + IndexingService._store_file. |
| Document → Fact pipeline | ✅ implemented | ExtractDocumentFactsJob with provenance, allowlist, semantic dedup. |
| Embedding migration job (Azure → Gemini) | ✅ implemented | ReembedDocumentChunksJob. |
| Deprecation of DocumentSearchAgent | ⚠️ in progress | v1 path still registered. v2 selected via feature flag. Deletion is gated on flag rollout completing. |

The two non-trivial gaps that change the planner's behaviour are: (1) the missing analyze_document tool, which leaves the planner no way to request deep single-document analysis without falling back to v1, and (2) the absent self-correcting loop, which means the results of a poor first-cycle query are returned as-is. Both are tractable extensions of the current shape (no architectural rework needed).


Known Trade-offs and Debt

| Item | Impact | Remediation |
| --- | --- | --- |
| analyze_document is missing as a Pattern 2 tool | Planner cannot do deep single-document Q&A in the v2 path. | Wrap RAGService.analyze_document (or rebuild it on DocumentSearchService + chunk-by-document fetch) as AnalyzeDocumentTool. |
| No grade/rewrite loop in search_documents | Bad first-cycle queries return weak results. Reranker mitigates but doesn't fix. | Add an LLM-as-judge step after rerank; on <2 chunks above threshold, rewrite the query and retry once. |
| get_document and list_documents lack the spec's filter set | Planner cannot filter by document type, date range, entity names, or paginate beyond limit. | Extend the input schemas and the underlying queries; index Document.created_at for range queries. |
| v1 path still in the repo | Two implementations to maintain during migration. | After v2 reaches 100% on DOCUMENT_SEARCH_V2_ENABLED, remove agents/doc/ and the delegate_to_document_search_agent registration in one PR. |
| REST endpoints still drive v1 | Frontend search bar and document Q&A use /api/v1/documents/... and the v1 RAGService. | Re-point to DocumentSearchService (read) and IndexingService (write); deprecate v1 route bodies. |
| Document-type-aware chunk sizing not fully wired | Default sizes are used regardless of document type; spec §5.5 targets are unenforced. | Add a sizing policy keyed on DocumentSourceType / detected classification; surface in services/document_chunking.py. |
| No Check Grounding step | Cited answers rely on chunk metadata; there's no post-generation factuality check. | Add a GroundingGateway; gate behind a flag; apply when the answer is sourced from documents. |
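
The grade/rewrite remediation listed above could take roughly the following shape. Nothing like this exists in the codebase yet, so every name here is hypothetical: search, grade, and rewrite are injected callables standing in for DocumentSearchService, an LLM-as-judge, and a query rewriter, and the 0.5 threshold is an arbitrary placeholder.

```python
from typing import Callable


def search_with_one_retry(
    query: str,
    search: Callable[[str], list[str]],
    grade: Callable[[str, str], float],
    rewrite: Callable[[str], str],
    threshold: float = 0.5,
    min_good: int = 2,
) -> list[str]:
    """Search once; if fewer than `min_good` chunks pass the grader, rewrite and retry once."""
    results = search(query)
    good = sum(1 for chunk in results if grade(query, chunk) >= threshold)
    if good >= min_good:
        return results
    # Single retry only, matching the "retry once" remediation.
    return search(rewrite(query))


calls: list[str] = []
corpus = {
    "insurance thing": ["misc note"],
    "insurance policy": ["policy A", "policy B"],
}


def fake_search(q: str) -> list[str]:
    calls.append(q)
    return corpus.get(q, [])


answer = search_with_one_retry(
    "insurance thing",
    search=fake_search,
    grade=lambda q, chunk: 1.0 if "policy" in chunk else 0.0,
    rewrite=lambda q: "insurance policy",
)
```

Bounding the loop at one retry keeps worst-case latency at two searches plus the grading calls, which matters inside a single planner turn.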