Document Intelligence & RAG: Architecture

Audience: Architects, tech leads, senior engineers evaluating the v2 document agent and its delta against the v1 DocumentSearchAgent.

Context and Purpose

The document agent is Swisper's interface between the planner and the user's document corpus — uploaded files, ingested email attachments, and Swisper-generated documents. It must do three things at once:

Stay in the FC loop. Document operations have to interleave freely with memory recall, calendar lookups, and other tools in a single planner turn. "Find the insurance policy Thomas sent and remind me when it expires" must not need two agent hand-offs.
Honour workspace boundaries. Every retrieval is scoped to a workspace and avatar. There is no cross-workspace leakage path, even on a misconfigured planner call.
Keep retrieval honest. Vector-only cosine similarity misses exact matches (policy numbers, names, dates). Hybrid retrieval and a Discovery Engine reranker close that gap, with all components individually feature-flagged for safe rollout.

The v1 DocumentSearchAgent (Pattern A delegation, reactive LangGraph subgraph) cannot satisfy any of these without architectural changes. The v2 agent is a clean break.

v1 vs v2: What Changed

Concern	v1 (`agents/doc/`)	v2 (`agents/document/`)
Integration model	`DomainAgentInterface` registered with `DomainAgentRegistry`. Planner emits `delegate_to_document_search_agent`; supervisor delegates the entire turn.	Pattern 2: individual `BaseTool` subclasses registered with the planner's FC loop. Tools interleave with memory tools in one plan.
Internal control flow	LangGraph subgraph: `doc_indexing_node` → `document_search_planner_node` ↔ `doc_tool_execution_node` (loop, max 5 iterations).	Each tool is a thin async method on a service. No internal LangGraph.
Scoping	`user_id` only. Hard-coded `limit=20` documents fetched up-front.	`workspace_id` + `avatar_id` enforced at the service layer (`DocumentSearchService`). No up-front fetch.
Retrieval	Cosine distance only (`RAGService.retrieve`).	Three configurable fusion strategies — `dense_only`, `weighted_blend` (alpha-blend), `rrf` (Reciprocal Rank Fusion, k=60) — over `DocumentChunk` + `AttachmentChunk`.
Reranking	None.	Optional Vertex AI Discovery Engine `semantic-ranker-default-004`, runs on top `3 × top_k` candidates after fusion. Graceful degradation on failure.
Embedding	Single `embedding` config (`gemini-embedding-001`, 2000 dim).	Two distinct lanes: `embedding` (`gemini-embedding-001`) for facts and `document_embedding` (`gemini-embedding-2-preview`) for documents. Hard rule: queries are embedded by the same lane as the corpus.
Tool surface	`semantic_search`, `document_summary`.	`search_documents`, `get_document`, `list_documents`, `create_document`. (`analyze_document` from the spec is not yet a Pattern 2 tool — see Spec Coverage.)
Document creation	Not supported.	`create_document` writes Markdown through the standard parse/chunk/embed pipeline; `source_type=GENERATED` provenance.
Doc → memory linkage	None. Documents and facts are isolated graphs.	`ExtractDocumentFactsJob` extracts facts from indexed documents and persists them with `source_ref_type='document'` provenance.
Failure modes	Fan-out exceptions, max-iterations error states without explanation.	Graceful degradation per layer (reranker fail → fusion order; encryption fail → search succeeds; fact extraction fail → document still searchable).
Rollout	All-or-nothing — the agent is in the `DomainAgentRegistry` or it isn't.	Feature-flagged: `DOCUMENT_SEARCH_V2_ENABLED` switches between the two paths at the supervisor layer. Both can coexist during migration.

The v1 path is preserved on feature/voice-v2-gemini-live-and-friends so today's UAT stays unbroken; v2 is selected in the supervisor when the flag is on.

v2 Architecture Overview

graph TD
    subgraph Planner["Agentic Supervisor (FC loop)"]
        P["global_planner_node"]
        TN["tool_node"]
    end

    subgraph DocTools["Document tools (Pattern 2 — agents/document/)"]
        SD["search_documents"]
        GD["get_document"]
        LD["list_documents"]
        CD["create_document"]
    end

    subgraph Services["Domain services"]
        DSS["DocumentSearchService<br/>(services/document_search.py)"]
        IDX["IndexingService<br/>(services/file_indexing.py)"]
        RR["RerankerService<br/>(services/reranker.py)"]
        FE["FactAndEntityExtraction<br/>Service"]
    end

    subgraph Gateways["Gateways / external"]
        LLM["SwisperLLMAdapter<br/>(embedding lanes)"]
        DocAI["Document AI<br/>(layout parser)"]
        DE["Vertex Discovery Engine<br/>(reranker)"]
    end

    subgraph Storage["Persistence"]
        PG[("AlloyDB:<br/>documents, document_chunks,<br/>attachment_chunks (pgvector + tsvector)")]
        S3[("MinIO / GCS<br/>(encrypted blobs)")]
    end

    subgraph Jobs["Background"]
        EDFJ["ExtractDocumentFactsJob"]
        REEMB["ReembedDocumentChunksJob"]
    end

    P -->|tool_calls| TN
    TN --> SD & GD & LD & CD

    SD --> DSS
    GD & LD --> PG
    CD --> IDX

    DSS --> LLM
    DSS --> PG
    DSS --> RR --> DE

    IDX --> DocAI & LLM & PG & S3
    IDX -. IndexingCompleteEvent .-> EDFJ --> FE --> PG
    REEMB --> LLM & PG

Wiring: agents/agentic_supervisor/agent.py builds the tool list per request via _get_document_tools() (lines ~290–325). The factory agents/document/factory.py::build_document_tools() returns the four BaseTool instances, each capturing the request's workspace_id, avatar_id, and db_session so the planner can call them without passing context.
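
As a rough illustration of the request-scoped binding described above, the sketch below builds four tool objects that each capture one request's context. `RequestContext`, `BoundTool`, and `build_document_tools_sketch` are hypothetical stand-ins, not the real `BaseTool` subclasses or the actual factory signature. The `None` return on incomplete context mirrors what the failure-mode table says about `_get_document_tools`.

```python
from dataclasses import dataclass
from typing import Any, Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical bundle of the per-request values each tool captures."""
    workspace_id: UUID
    avatar_id: UUID
    db_session: Any  # stand-in for the real database session object


@dataclass(frozen=True)
class BoundTool:
    """Stand-in for a BaseTool subclass: a name plus its captured context."""
    name: str
    ctx: RequestContext


TOOL_NAMES = ("search_documents", "get_document", "list_documents", "create_document")


def build_document_tools_sketch(
    workspace_id: Optional[UUID],
    avatar_id: Optional[UUID],
    db_session: Any,
) -> Optional[list[BoundTool]]:
    """Return four tools bound to one request, or None if context is incomplete."""
    if workspace_id is None or avatar_id is None or db_session is None:
        return None
    ctx = RequestContext(workspace_id, avatar_id, db_session)
    return [BoundTool(name, ctx) for name in TOOL_NAMES]
```

Because the context lives on the tool instance, the planner-facing arguments stay limited to what the user intent needs (a query, a document ID), and a planner call has no parameter through which it could name another workspace.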

Components and Responsibilities

Component	File	Responsibility
`build_document_tools(...)`	`agents/document/factory.py`	Request-scoped tool factory. Returns `[SearchDocumentsTool, GetDocumentTool, ListDocumentsTool, CreateDocumentTool]` bound to one workspace/avatar/session.
`SearchDocumentsTool`	`agents/document/tools/search_documents.py`	Async-only `BaseTool`. Calls `DocumentSearchService.search()`, formats `EvidenceResult`s (heading path, page range, parent-email metadata, score) for the planner.
`GetDocumentTool`	`agents/document/tools/get_document.py`	Sync metadata fetch by document UUID. Workspace check on `Document.workspace_id`.
`ListDocumentsTool`	`agents/document/tools/list_documents.py`	Workspace-scoped listing ordered by `created_at` desc, max 100.
`CreateDocumentTool`	`agents/document/tools/create_document.py`	Writes Markdown bytes through `IndexingService.process_and_index_document()` with `source_type=GENERATED`. Same parse → chunk → embed path as uploads.
`DocumentSearchService`	`services/document_search.py`	Owns embedding-lane selection (`document_embedding`), fusion strategy (`dense_only` / `weighted_blend` / `rrf`), `AttachmentChunk` join, and reranker invocation. Returns `list[EvidenceResult]`.
`IndexingService`	`services/file_indexing.py`	Upload → store (S3) → parse (Document AI Layout Parser, with format-specific fallbacks) → chunk → embed (`document_embedding`) → persist. Emits `IndexingCompleteEvent`. Populates `search_vector` against plaintext before `PGPString` encryption.
`RerankerService`	`services/reranker.py`	Discovery Engine adapter. `semantic-ranker-default-004` over the top `3 × top_k` candidates. Logs and falls back on failure.
`ExtractDocumentFactsJob`	`jobs/extract_document_facts_job.py`	Background job. Picks documents with `fact_extraction_status='queued'`, allowlists by classification, calls `FactAndEntityExtractionService` with `source_ref_type='document'`. Embeds facts on the `embedding` lane (not `document_embedding`).
`ReembedDocumentChunksJob`	`jobs/reembed_document_chunks_job.py`	One-shot migration job: re-embeds legacy chunks into the `document_embedding` lane (`gemini-embedding-2-preview`).

Tool Catalogue (Planner-Facing)

The planner sees four tools when DOCUMENT_SEARCH_V2_ENABLED=True and a workspace context exists. Tool descriptions are loaded from each BaseTool.description; the planner picks among them based on user intent.

`search_documents`: primary retrieval tool

SearchDocumentsInput:
    query: str                  # natural-language query
    top_k: int = 5              # 1..20

Returns a numbered list of evidence entries. Each entry contains source kind (document | email_attachment), source UUID, optional Section: heading path, optional Pages:, optional parent email subject/sender, relevance score, and the chunk text.

Internal flow (DocumentSearchService.search):

1. Embed query on the document_embedding lane.
2. (Optional) apply EncryptionContext.transform_query for encrypted-at-rest corpora.
3. Run the configured fusion strategy on DocumentChunk (workspace-scoped). Always run dense search on AttachmentChunk (no tsvector column).
4. Merge, sort by distance, keep top 3 × top_k candidates.
5. If DOCUMENT_RERANKER_ENABLED=True, rerank via Discovery Engine; on failure return fusion order.
6. Return top top_k.
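
The merge, candidate-pool, and rerank-fallback steps of this flow can be sketched as follows. This is a simplified, synchronous illustration and not the service's code: `Candidate`, `select_evidence`, and the `reranker` callable are hypothetical names, and embedding and fusion are assumed to have already produced the two hit lists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    chunk_id: str
    distance: float  # smaller is better, matching "sort by distance"


# Hypothetical reranker signature: takes the query and the pool, returns a reordered pool.
Reranker = Callable[[str, list[Candidate]], list[Candidate]]


def select_evidence(
    query: str,
    document_hits: list[Candidate],
    attachment_hits: list[Candidate],
    top_k: int = 5,
    reranker: Optional[Reranker] = None,
) -> list[Candidate]:
    """Merge both sources, keep 3 * top_k, optionally rerank, return top_k."""
    pool = sorted(document_hits + attachment_hits, key=lambda c: c.distance)
    pool = pool[: 3 * top_k]
    if reranker is not None:
        try:
            pool = reranker(query, pool)
        except Exception:
            # Degrade gracefully: keep the fusion order when reranking fails.
            pass
    return pool[:top_k]
```

Keeping three times `top_k` candidates gives the reranker room to promote a chunk that fusion ranked outside the final cut, while the `try`/`except` keeps a reranker outage from turning into a failed search.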

`get_document`

GetDocumentInput:
    document_id: str            # document UUID

Returns title, filename, type, size, created timestamp, summary, and processing status. Returns a "not found in this workspace" message if the workspace check fails — the result is identical to "does not exist" so the planner cannot probe across workspaces.
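
A minimal sketch of that indistinguishable "not found" behaviour, assuming an in-memory dictionary in place of the database. `DocumentRow`, `describe_document`, and the exact message text are illustrative, not taken from the tool.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

NOT_FOUND = "Document not found in this workspace."  # hypothetical wording


@dataclass
class DocumentRow:
    id: UUID
    workspace_id: UUID
    title: str


def describe_document(
    store: dict[UUID, DocumentRow], document_id: UUID, bound_workspace: UUID
) -> str:
    row: Optional[DocumentRow] = store.get(document_id)
    # A missing row and a row owned by another workspace take the same branch,
    # so the caller cannot tell the two cases apart.
    if row is None or row.workspace_id != bound_workspace:
        return NOT_FOUND
    return f"Title: {row.title}"
```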

`list_documents`

ListDocumentsInput:
    limit: int = 20             # 1..100

Returns a header line plus one row per document with ID, title, filename, type, size, and created timestamp. Workspace-scoped; ordered by created_at desc.
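
The listing rules above (workspace filter, newest first, `limit` between 1 and 100) can be illustrated with a small in-memory sketch. `Doc` and `list_documents_sketch` are hypothetical, and raising on an out-of-range `limit` is an assumption; the real tool may validate through its input schema instead.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Doc:
    title: str
    workspace_id: str
    created_at: datetime


def list_documents_sketch(docs: list[Doc], workspace_id: str, limit: int = 20) -> list[Doc]:
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    # Scope to the bound workspace first, then order newest first.
    scoped = [d for d in docs if d.workspace_id == workspace_id]
    scoped.sort(key=lambda d: d.created_at, reverse=True)
    return scoped[:limit]
```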

`create_document`

CreateDocumentInput:
    title: str                  # 1..500 chars
    content: str                # Markdown, min 1 char

Writes the content as <sanitised_title>.md through IndexingService.process_and_index_document() with source_type=GENERATED. The document is searchable on the next search_documents call. Errors return a string starting with "Failed to create document"; success returns the new document_id.
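
The input bounds and the `<sanitised_title>.md` naming can be sketched as below. The sanitisation rule shown here is an assumption made for illustration; the source does not say which characters the real tool keeps or replaces.

```python
import re


def sanitise_title(title: str) -> str:
    """Hypothetical rule: keep letters, digits, '-' and '_'; collapse the rest to '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_")
    return cleaned or "document"


def markdown_filename(title: str, content: str) -> str:
    # Bounds taken from CreateDocumentInput: title 1..500 chars, content at least 1 char.
    if not 1 <= len(title) <= 500:
        raise ValueError("title must be 1..500 characters")
    if len(content) < 1:
        raise ValueError("content must not be empty")
    return f"{sanitise_title(title)}.md"
```

Under this assumed rule, `markdown_filename("Meeting notes: Q3 / budget", "# Notes")` yields `Meeting_notes_Q3_budget.md`.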

How the planner uses them

The planner is encouraged (via tool descriptions) to:

Reach for search_documents whenever the user asks about uploaded content, attachments, or generated notes.
Compose with memory tools — e.g. recall("Thomas insurance") followed by search_documents("home insurance Thomas") and use the union to answer.
Use get_document after search_documents only when the user wants metadata (title, summary, status) about a hit.
Use list_documents for "what do I have on file?" or as a fallback when search returns nothing.
Use create_document only when the user explicitly asks Swisper to save a structured artefact (meeting notes, summary, report) — not for chat replies.

Embedding Lanes

Two cohorts, two lanes — enforced at adapter level via agent_configuration rows.

Lane	Adapter call	Model	Dim	Used by
`embedding`	`SwisperLLMAdapter("__service__", "embedding")`	`gemini-embedding-001`	2000	Facts (`UserFact`, `ExtractedFact`), system facts, fact extraction (incl. `ExtractDocumentFactsJob` outputs).
`document_embedding`	`SwisperLLMAdapter("__service__", "document_embedding")`	`gemini-embedding-2-preview`	2000 (MRL-truncated from 3072)	`DocumentChunk`, `AttachmentChunk`, query-time embedding inside `DocumentSearchService`.

Hard rule (Spec §6.9): A query MUST be embedded on the same lane as the corpus it targets. DocumentSearchService._embed_query is the only place a document query is embedded; it always uses document_embedding. Don't paste SwisperLLMAdapter("__service__", "embedding") into search code paths.

The Vector(2000) column width is unchanged across both lanes — the migration from Azure embeddings to gemini-embedding-2-preview is content-only and runs through ReembedDocumentChunksJob.
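
One way to make the same-lane rule checkable in tests is a small lookup plus a guard, sketched below. The corpus labels, the `Lane` enum, and both helper functions are hypothetical; only the two lane names come from this document.

```python
from enum import Enum


class Lane(str, Enum):
    EMBEDDING = "embedding"
    DOCUMENT_EMBEDDING = "document_embedding"


# Corpus names are illustrative labels, not identifiers from the codebase.
CORPUS_LANE = {
    "facts": Lane.EMBEDDING,
    "document_chunks": Lane.DOCUMENT_EMBEDDING,
    "attachment_chunks": Lane.DOCUMENT_EMBEDDING,
}


def lane_for_query(corpus: str) -> Lane:
    """Return the lane a query must be embedded on to target this corpus."""
    return CORPUS_LANE[corpus]


def assert_same_lane(query_lane: Lane, corpus: str) -> None:
    expected = CORPUS_LANE[corpus]
    if query_lane is not expected:
        raise ValueError(
            f"query embedded on {query_lane.value!r}, "
            f"but corpus {corpus!r} uses {expected.value!r}"
        )
```

Both lanes store 2000-dimensional vectors, so a cross-lane query would not fail on shape; it would return poorly ranked results. An explicit guard of this kind catches the mistake that the column type cannot.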

Hybrid Retrieval

Three fusion strategies, selected via DOCUMENT_FUSION_STRATEGY (with DOCUMENT_HYBRID_RETRIEVAL_ENABLED=False forcing dense_only regardless):

Strategy	Score formula	When it wins
`dense_only`	`cosine_distance(query_embedding, chunk.embedding)`	Pure semantic recall, no exact-match signal needed. Default before Phase-1 hybrid was on.
`weighted_blend`	`α · dense_sim + (1−α) · ts_rank` (default α = 0.7 via `DOCUMENT_HYBRID_RETRIEVAL_ALPHA`)	Mixed corpora where you want a tunable knob between meaning and keywords. Sensitive to score-scale drift between dense and lexical.
`rrf`	`Σ 1 / (k + rank_i)` over the dense ranking and the lexical ranking, k=60	Recommended for production (the shipped default for `DOCUMENT_FUSION_STRATEGY` is `weighted_blend`). Score-scale-independent: only positions matter. Robust across query types.
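
The two fusion formulas in the table can be written out in a few lines. This is a sketch over plain lists of chunk IDs, not the SQL the service runs. It assumes ranks start at 1 and breaks ties by chunk ID to keep the output deterministic; the function names are illustrative.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking a chunk appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


def fuse_rrf(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Order chunk IDs by descending RRF score, ties broken by ID."""
    scores = rrf_scores([dense, lexical], k)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


def weighted_blend(dense_sim: float, ts_rank: float, alpha: float = 0.7) -> float:
    """alpha * dense_sim + (1 - alpha) * ts_rank, as in the weighted_blend row."""
    return alpha * dense_sim + (1.0 - alpha) * ts_rank
```

With `dense = ["a", "b", "c"]` and `lexical = ["c", "a", "d"]`, `fuse_rrf` returns `["a", "c", "b", "d"]`: chunks present in both rankings rise to the top regardless of how the raw scores were scaled, which is the property that makes `rrf` less sensitive to score drift than `weighted_blend`.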

Lexical signal: AlloyDB tsvector column on DocumentChunk.search_vector, populated against plaintext before PGPString encrypts chunk_text. Attachments don't carry a tsvector and always run dense.

After fusion, the top 3 × top_k candidates flow to the reranker (when enabled). The reranker score replaces relevance_score in the returned EvidenceResults.

Workspace + Avatar Scoping

Every read and every write goes through a workspace boundary:

Search: DocumentSearchService adds WHERE workspace_id = :workspace_id on every chunk query (both document and attachment).
Get/List: tools check Document.workspace_id against the bound workspace ID. Mismatches return "not found", not a permission error — no information disclosure.
Create: IndexingService writes Document.workspace_id, Document.avatar_id, and propagates workspace_id to every DocumentChunk.
Background jobs: ExtractDocumentFactsJob reads Document.workspace_id and stores facts under the same workspace.

Indices: ix_documents_workspace_id and ix_document_chunks_workspace_id make these filters cheap.
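
The effect of the mandatory workspace filter can be demonstrated with an in-memory SQLite table. This is only an illustration of the query shape: the production store is AlloyDB, the column set is reduced to the minimum, and `search_chunks` with its `LIKE` match is a hypothetical stand-in for the real vector and lexical search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_chunks ("
    "id INTEGER PRIMARY KEY, workspace_id TEXT NOT NULL, chunk_text TEXT NOT NULL)"
)
conn.execute(
    "CREATE INDEX ix_document_chunks_workspace_id ON document_chunks (workspace_id)"
)
conn.executemany(
    "INSERT INTO document_chunks (workspace_id, chunk_text) VALUES (?, ?)",
    [
        ("ws-a", "policy 4711 expires in March"),
        ("ws-b", "policy 4711 belongs to someone else"),
    ],
)


def search_chunks(workspace_id: str, needle: str) -> list[str]:
    # The workspace predicate is part of the statement itself, not an optional argument.
    rows = conn.execute(
        "SELECT chunk_text FROM document_chunks "
        "WHERE workspace_id = :workspace_id AND chunk_text LIKE :pattern "
        "ORDER BY id",
        {"workspace_id": workspace_id, "pattern": f"%{needle}%"},
    )
    return [text for (text,) in rows]
```

Both rows match the search term, yet a caller bound to `ws-a` only ever sees the first one, and a caller bound to an unknown workspace sees nothing.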

Document → Fact Pipeline

After indexing, a document with fact_extraction_status='queued' and a classification on the configurable allowlist becomes input for ExtractDocumentFactsJob:

Pull up to MAX_CHUNK_PASSAGES = 5 chunks, each truncated to MAX_PASSAGE_CHARS = 800.
Hand the joined text to FactAndEntityExtractionService.extract_and_store() with the document's workspace_id/avatar_id and source_ref_type='document', source_ref_id=<document_id> provenance.
Facts are deduplicated semantically (0.92 threshold) against existing facts, embedded on the embedding lane (intentionally different from the document lane), and linked to Person records by entity disambiguation.
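
The passage budget and the deduplication threshold from the steps above can be sketched in plain Python. Several details are assumptions made for illustration: that the first five chunks are taken in order, that passages are joined with a blank line, that the similarity measure is cosine similarity, and that a score equal to the threshold counts as a duplicate.

```python
import math

MAX_CHUNK_PASSAGES = 5
MAX_PASSAGE_CHARS = 800
DEDUP_THRESHOLD = 0.92


def build_extraction_input(chunks: list[str]) -> str:
    """Take at most 5 chunks, truncate each to 800 characters, and join them."""
    passages = [text[:MAX_PASSAGE_CHARS] for text in chunks[:MAX_CHUNK_PASSAGES]]
    return "\n\n".join(passages)  # separator is an assumption


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_duplicate(candidate: list[float], existing: list[list[float]]) -> bool:
    """True when the candidate fact embedding is close enough to any stored one."""
    return any(cosine_similarity(candidate, e) >= DEDUP_THRESHOLD for e in existing)
```

The budget caps the extraction input at roughly 4,000 characters per document, which bounds the cost of each job run no matter how long the source file is.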

Failure isolation: extraction failure does not block document availability. The document remains searchable via search_documents regardless of fact extraction state.

Feature flag: DOCUMENT_FACT_EXTRACTION_ENABLED gates the whole job. While off, documents stay in queued; flipping the flag picks them up on the next run.

Data Model: What's New in v2

Document (table documents):

Field	Type	v2 purpose
`workspace_id`	UUID	Tenancy boundary. Indexed.
`avatar_id`	UUID	Owner avatar within the workspace.
`source_type`	`DocumentSourceType`	`upload` / `generated` / `image_capture` / `voice_memo` / `email_attachment`.
`processing_status`	`DocumentProcessingStatus`	`uploaded` / `extracting` / `indexed` / `queued` / `failed`.
`fact_extraction_status`	`FactExtractionStatus`	`not_applicable` / `queued` / `completed` / `failed`. Drives `ExtractDocumentFactsJob`.

DocumentChunk (table document_chunks):

Field	Type	v2 purpose
`workspace_id`	UUID	Tenancy filter on every search. Indexed.
`search_vector`	`tsvector`	BM25 / lexical search. Populated against plaintext before `PGPString` encryption.
`heading_path`	encrypted str	Layout-parser ancestor heading chain (e.g. `§ Coverage > Exclusions`). Surfaced to the planner in evidence.
`page_start` / `page_end`	int	Page range surfaced as `Pages: p.4-5` in evidence.
`kind`	`ChunkKind`	`paragraph` / `table` / `list` / `ocr_text` / `image_caption` / `transcript` / `generated_text`.
`embedding`	`Vector(2000)`	Now sourced from the `document_embedding` lane. Same column, new content.

AttachmentChunk keeps the same shape and joins to Attachment → Email for parent-email provenance in evidence rows.

Configuration & Feature Flags

All keys are reachable via ConfigurationService (DB-backed) with environment-variable fallbacks defined in swisper/core/config.py.

Key / Env var	Default	Effect
`DOCUMENT_SEARCH_V2_ENABLED`	`False`	When `True` AND a workspace/avatar/db session is available, the supervisor passes the v2 tool list to the planner. When `False`, falls back to v1 delegation.
`DOCUMENT_HYBRID_RETRIEVAL_ENABLED`	`False`	Master switch for hybrid (lexical + dense). When `False`, `DocumentSearchService` forces `dense_only` regardless of `DOCUMENT_FUSION_STRATEGY`.
`DOCUMENT_FUSION_STRATEGY`	`weighted_blend`	One of `dense_only` / `weighted_blend` / `rrf`. Recommended production: `rrf`.
`DOCUMENT_HYBRID_RETRIEVAL_ALPHA`	`0.7`	Weight for dense in `weighted_blend`. `1.0` = pure dense; `0.0` = pure lexical.
`RRF_K`	`60`	Rank-fusion constant. Lower values favour top results more aggressively.
`DOCUMENT_RERANKER_ENABLED`	`True`	Run Discovery Engine reranker over the top `3 × top_k` candidates. Adds ~50–150 ms; falls back to fusion order on failure.
`DOCUMENT_FACT_EXTRACTION_ENABLED`	`False`	Gates `ExtractDocumentFactsJob`. While off, documents accumulate in `fact_extraction_status='queued'`.

Recommended rollout sequence: turn on DOCUMENT_SEARCH_V2_ENABLED first (verify tool selection in traces), then DOCUMENT_HYBRID_RETRIEVAL_ENABLED with rrf, then DOCUMENT_FACT_EXTRACTION_ENABLED once chunk re-embedding has completed.
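
The interaction between the master switch and the strategy key can be sketched as a small resolver. This version reads only from an environment-style mapping, so it ignores the DB-backed `ConfigurationService` layer. The helper names and the accepted truthy strings are assumptions.

```python
import os
from typing import Mapping, Optional

VALID_STRATEGIES = {"dense_only", "weighted_blend", "rrf"}


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}  # assumed truthy spellings


def effective_fusion_strategy(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the strategy that would actually run for the given settings."""
    env = os.environ if env is None else env
    # The master switch wins: with hybrid off, the strategy key is ignored.
    if not _flag(env, "DOCUMENT_HYBRID_RETRIEVAL_ENABLED", False):
        return "dense_only"
    strategy = env.get("DOCUMENT_FUSION_STRATEGY", "weighted_blend")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown fusion strategy: {strategy!r}")
    return strategy
```

Setting `DOCUMENT_FUSION_STRATEGY=rrf` alone therefore changes nothing until `DOCUMENT_HYBRID_RETRIEVAL_ENABLED` is also turned on, which is worth checking first when a rollout appears to have no effect.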

How to Use It

From the planner

You don't call this module directly — the supervisor exposes it. Verify in a request trace that Built N document tools for workspace=<id> is logged (see agentic_supervisor/agent.py:320) and that the planner emits a tool_call for one of the four document tools instead of delegate_to_document_search_agent.

From a service / job

Read paths go through DocumentSearchService. Construct it with a Session and call search(query, workspace_id, avatar_id, top_k); the service handles embedding, fusion, and reranking. Do not instantiate RAGService for new code — it is the v1 implementation and is being retired with the agents/doc/ agent.

Write paths go through IndexingService.process_and_index_document(file_content, filename, workspace_id, avatar_id, user_id, source_type). Pass the right DocumentSourceType so downstream consumers (e.g. fact extraction) can apply type-aware policies.

From a test

Real-LLM tests should embed against the document_embedding lane to match the corpus. Fixtures should populate workspace_id, avatar_id, and (for hybrid) the search_vector column. Populate search_vector the way IndexingService._populate_search_vectors does, from plaintext before encryption; fixture rows that bypass that ordering will not match to_tsvector queries.

From the UI

The frontend keeps calling the existing /api/v1/documents/... REST endpoints (see api/routes/documents.py). Those routes still drive the v1 RAG service today. Migration to v2 is tracked separately and does not block the planner-side cutover; the chat-driven path described above is the v2 critical path.

Failure Modes & Degradation

Failure	Behaviour
Reranker times out / errors	`_rerank` logs and returns the fusion order untouched. Search still returns results.
`EncryptionContext.transform_query` fails	The query embedding is used as-is. (Caller's responsibility to provide a working context.)
`DOCUMENT_SEARCH_V2_ENABLED=True` but missing workspace/avatar/session	`_get_document_tools` logs a warning and returns `None`. The supervisor proceeds with no document tools. The planner cannot call them — graceful "feature not available."
`create_document` indexing fails	Tool returns an error string starting with `Failed to create document`. Nothing is written to the chunk table for that document.
Fact extraction fails for a document	Document remains searchable. `fact_extraction_status` flips to `failed`. Job continues with the next document.
Embedding lane misconfigured	`SwisperLLMAdapter` raises early. Caller (search or indexing) propagates. No silent dimension mismatch.

Spec Coverage / Gap Analysis

Comparing what's shipped on feature/voice-v2-gemini-live (and integration sub-branches) against spec_FEAT_ALL_document_intelligence_v1.md:

Spec capability	Status	Notes
Pattern 2 tools registered with FC loop	✅	`build_document_tools()` + supervisor wiring + feature flag.
`search_documents` tool	✅	Async `BaseTool`, formats evidence with provenance.
`get_document` tool	⚠️ partial	Implemented, but does not yet expose `download_url`, `entity_names`, or `fact_count` from the spec.
`list_documents` tool	⚠️ partial	Implemented, but spec also asks for `document_type` / `date_range` / `offset` filters. Currently only `limit`.
`create_document` tool	✅	Markdown-only path; same indexing pipeline as uploads; `source_type=GENERATED`.
`analyze_document` tool	❌ not implemented as tool	Logic exists in `RAGService.analyze_document` (v1). No Pattern 2 wrapper yet. The planner cannot ask for deep single-document analysis.
Self-correcting retrieval (retrieve → grade → rewrite → retry, max 2 cycles)	❌ not implemented	`SearchDocumentsTool._arun` calls `DocumentSearchService.search` and returns. Reranking compensates somewhat; LLM-as-judge grading + query rewrite loop is not present.
Hybrid search (vector + BM25)	✅	`weighted_blend` and `rrf` strategies, both with workspace scoping.
RRF (k=60)	✅	`_search_document_chunks_rrf`.
Reranking via Vertex AI	✅	`RerankerService` over `semantic-ranker-default-004`. Graceful degradation.
Check Grounding API integration	❌	No `GroundingGateway`. Citations come from chunk metadata, not from a post-generation grounding check.
Workspace + avatar scoping	✅	Enforced at `DocumentSearchService`, `Document`, and `DocumentChunk` levels.
Two-lane embedding (`embedding` + `document_embedding`)	✅	Even cleaner than the spec — facts and documents are physically separated by adapter call site.
`gemini-embedding-2-preview` for documents	✅	Default for `document_embedding` lane.
Layout-aware parsing (Document AI Layout Parser)	✅	`services/document_parsing/parsers/layout_parser.py` + factory wiring.
Document-type-aware chunk sizing	⚠️ partial	Chunking pipeline + `ChunkKind` taxonomy exist; per-type token targets from spec §5.5 are not all wired through configuration yet.
Heading-path enrichment	✅	`DocumentChunk.heading_path` populated and surfaced to the planner.
Multimodal ingestion (image + voice)	⚠️ in progress	`DocumentSourceType.IMAGE_CAPTURE` and `VOICE_MEMO` exist; parser hooks present; end-to-end ingestion paths and embedding-via-multimodal are not yet a single tested flow.
MinIO/S3 physical storage	✅	`gateways/storage/providers/minio.py` + `IndexingService._store_file`.
Document → Fact pipeline	✅	`ExtractDocumentFactsJob` with provenance, allowlist, semantic dedup.
Embedding migration job (Azure → Gemini)	✅	`ReembedDocumentChunksJob`.
Deprecation of `DocumentSearchAgent`	⚠️ in progress	v1 path still registered. v2 selected via feature flag. Deletion is gated on flag rollout completing.

The two non-trivial gaps that change the planner's behaviour are: (1) the missing analyze_document tool — the planner has no way to request deep single-document analysis without falling back to v1 — and (2) the absent self-correcting loop, which means a poor first-cycle query is returned as-is. Both are tractable extensions of the current shape (no architectural rework needed).

Known Trade-offs and Debt

Item	Impact	Remediation
`analyze_document` is missing as a Pattern 2 tool	Planner cannot do deep single-document Q&A in the v2 path.	Wrap `RAGService.analyze_document` (or rebuild it on `DocumentSearchService` + chunk-by-document fetch) as `AnalyzeDocumentTool`.
No grade/rewrite loop in `search_documents`	Bad first-cycle queries return weak results. Reranker mitigates but doesn't fix.	Add an LLM-as-judge step after rerank; on `<2` chunks above threshold, rewrite the query and retry once.
`get_document` and `list_documents` lack the spec's filter set	Planner cannot filter by document type, date range, entity names, or paginate beyond `limit`.	Extend the input schemas and the underlying queries; index `Document.created_at` for range queries.
v1 path still in the repo	Two implementations to maintain during migration.	After v2 reaches 100% on `DOCUMENT_SEARCH_V2_ENABLED`, remove `agents/doc/` and the `delegate_to_document_search_agent` registration in one PR.
REST endpoints still drive v1	Frontend search bar and document Q&A use `/api/v1/documents/...` and the v1 `RAGService`.	Re-point to `DocumentSearchService` (read) and `IndexingService` (write); deprecate v1 route bodies.
Document-type-aware chunk sizing not fully wired	Default sizes are used regardless of document type; spec §5.5 targets are unenforced.	Add a sizing policy keyed on `DocumentSourceType` / detected classification; surface in `services/document_chunking.py`.
No Check Grounding step	Cited answers rely on chunk metadata; there's no post-generation factuality check.	Add a `GroundingGateway`; gate behind a flag; apply when the answer is sourced from documents.
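
The grade-and-rewrite remediation proposed in the table could take roughly the following shape. No such loop exists in the shipped tool, so this is a sketch of the proposal only. The `search` and `rewrite` callables stand in for the graded retrieval call and an LLM rewrite, the `0.5` threshold is a placeholder, and the choice to keep whichever attempt has more chunks above the threshold is an assumption.

```python
from typing import Callable

Hit = tuple[str, float]  # (chunk text, judged relevance in 0..1)


def search_with_one_retry(
    query: str,
    search: Callable[[str], list[Hit]],
    rewrite: Callable[[str], str],
    threshold: float = 0.5,  # hypothetical cut-off
    min_good: int = 2,
) -> list[Hit]:
    def good(hits: list[Hit]) -> int:
        return sum(1 for _, score in hits if score >= threshold)

    first = search(query)
    if good(first) >= min_good:
        return first
    # Fewer than min_good chunks cleared the threshold: rewrite and retry once.
    second = search(rewrite(query))
    # Keep whichever attempt produced more chunks above the threshold.
    return second if good(second) > good(first) else first
```

Capping the loop at a single retry keeps the worst-case latency at two retrieval passes plus one rewrite call.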