TDR-004: Prism Indexing Must Move to a Queue-Based Worker

Status: Identified
Priority: High (before GA / multi-tenant launch)
Estimated Effort: 3–5 days
Date Identified: 2026-02-27
Identified By: Dev lead (during EPC_012 Console UAT)

Description

What: The Prism Gateway currently runs full repo indexing as an inline asyncio.create_task on the HTTP server process. This works for a single-tenant alpha but is unsafe at any meaningful scale.

Current State:

# prism/gateway/console_routes.py
asyncio.create_task(
    _run_tier3_full_reindex(
        storage=storage, embedder=embedder,
        repo_full_name=..., clone_url=..., branch=..., tenant_id=...,
    )
)
return JSONResponse({"status": "queued"}, status_code=202)

Each POST /api/v1/repos/{id}/index spawns a background coroutine that:

1. Clones the full repo into /tmp via git clone --depth=1
2. Walks all files and generates Vertex AI embeddings (one embed() call per file)
3. Writes chunks + embeddings to pgvector

Why it is problematic:

Scenario	Failure mode
Single large repo (>300 MB)	OOM — Vertex AI SDK alone uses ~400 MB baseline
2+ repos indexed concurrently on the same instance	Additive memory pressure → OOM
Cloud Run scale-to-zero during indexing	Container killed mid-index, job silently lost
Indexing timeout (large repo > Cloud Run 3600s request timeout)	Job killed without status update
Retry on failure	No retry logic; failed jobs are silently dropped

The memory crash was observed during EPC_012 UAT at 531 MiB for a 69 MB repo (Fintama/helvetiq) — before the clone even started, just from loading the Vertex AI SDK. Memory was increased to 2 GiB as a short-term fix.

Desired State:

POST /api/v1/repos/{id}/index enqueues a Cloud Tasks task and returns 202 immediately. A dedicated indexer worker (separate Cloud Run service or Cloud Run Job) picks up the task, clones, embeds, and writes results. The gateway HTTP server has no indexing logic.

Console API ──► Gateway /index ──► Cloud Tasks queue ──► Indexer worker
                    202 ◄──────────────────────────────   (dedicated service)
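
A minimal sketch of the enqueue step, assuming the gateway builds a Cloud Tasks HTTP task and passes it to the google-cloud-tasks client. The helper name build_index_task, the payload fields, and the queue path and worker URL in the usage note are hypothetical. The client call itself is left as a comment so the snippet runs without GCP credentials. The task name is derived from the tenant, repo, and branch so that a duplicate trigger maps to the same name.

```python
import hashlib
import json


def build_index_task(queue_path: str, worker_url: str, tenant_id: str,
                     repo_full_name: str, branch: str) -> dict:
    """Build a Cloud Tasks HTTP task as a plain dict (hypothetical helper)."""
    # Note: no clone URL or token in the payload; the worker resolves credentials.
    payload = {
        "tenant_id": tenant_id,
        "repo_full_name": repo_full_name,
        "branch": branch,
    }
    # Deterministic task ID: the same tenant/repo/branch always hashes to the
    # same name, so Cloud Tasks can reject a duplicate while the first exists.
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()[:32]
    return {
        "name": f"{queue_path}/tasks/reindex-{digest}",
        "http_request": {
            # Real code would likely use tasks_v2.HttpMethod.POST here.
            "http_method": "POST",
            "url": worker_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),
        },
    }


# In the gateway handler (not executed here):
#   from google.cloud import tasks_v2
#   tasks_v2.CloudTasksClient().create_task(parent=queue_path, task=task)
#   return JSONResponse({"status": "queued"}, status_code=202)
```

For example, build_index_task("projects/p/locations/l/queues/q", "https://prism-indexer.example/internal/run", "t1", "org/repo", "main") yields the same task name on every call. One caveat of name-based deduplication: Cloud Tasks also refuses to reuse a task name for some time after the task completes, so a legitimate re-index shortly after a successful run may need a different naming scheme (for instance, including the commit SHA). Check the Cloud Tasks documentation for the current window.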

Impact

Reliability

❌ Silent failures: OOM or timeout during indexing leaves the repo in "indexing" or "pending" state with no error surfaced to the user
❌ No retry: Failed jobs are lost; user must manually trigger again
❌ Race condition: Two simultaneous reindex triggers for the same repo produce duplicate work and potentially corrupt chunk data

Scalability

❌ Memory: Each concurrent index job adds ~150–400 MB; 4 concurrent jobs on a single 2 GiB instance will OOM
❌ CPU: Embedding generation is CPU-intensive; running it on the gateway degrades HTTP response latency for all other requests (MCP tools, auth)
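
The memory claim can be checked with rough arithmetic. The sketch below assumes the 531 MiB footprint observed during UAT is the steady baseline of the gateway process and treats the quoted MB figures as MiB; both are simplifications, not measurements.

```python
INSTANCE_LIMIT_MIB = 2048   # 2 GiB Cloud Run instance (current alpha setting)
BASELINE_MIB = 531          # assumption: UAT footprint before any clone starts
PER_JOB_MIB = (150, 400)    # per-job range quoted above, MB treated as MiB


def peak_mib(jobs: int, per_job_mib: int) -> int:
    """Estimated peak memory with `jobs` index jobs running at once."""
    return BASELINE_MIB + jobs * per_job_mib


def max_safe_jobs(per_job_mib: int) -> int:
    """Largest job count that stays within the instance limit."""
    return (INSTANCE_LIMIT_MIB - BASELINE_MIB) // per_job_mib


print(peak_mib(4, PER_JOB_MIB[1]))     # 2131, above the 2048 MiB limit
print(max_safe_jobs(PER_JOB_MIB[1]))   # 3 jobs at the worst-case cost
print(max_safe_jobs(PER_JOB_MIB[0]))   # 10 jobs at the best-case cost
```

Under these assumptions, four worst-case jobs exceed the limit by roughly 80 MiB, which is consistent with the OOM estimate above. With the lower ~400 MB SDK-only baseline, the same four jobs would sit just under the limit, so the margin is thin either way.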

Observability

❌ No progress visibility: Status is updated via direct DB writes from the background task; if the task is killed the status never updates to "failed"
❌ No job history: No record of past index runs, durations, or errors

Security

⚠️ Token in clone URL: GitHub access token is currently embedded in the clone URL (https://TOKEN@github.com/...) and passed through the gateway request body. This is a short-term pragmatic choice. A proper solution uses GitHub App installation tokens generated at job dispatch time.
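
As an illustration of the direction described above, the sketch below builds the authenticated clone URL inside the worker at dispatch time and redacts credentials before a URL is logged. Both helper names are hypothetical. It relies on GitHub accepting an installation token as the password for the user x-access-token; how the installation token is minted is out of scope here.

```python
from urllib.parse import urlsplit, urlunsplit


def build_clone_url(repo_full_name: str, installation_token: str) -> str:
    """Build an HTTPS clone URL from a short-lived installation token."""
    return (
        f"https://x-access-token:{installation_token}"
        f"@github.com/{repo_full_name}.git"
    )


def redact_clone_url(url: str) -> str:
    """Replace any userinfo in the URL with '***' so it is safe to log."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # nothing to hide
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment)
    )
```

With this split, the gateway request body only needs the repo name, and anything that logs the clone URL can pass it through redact_clone_url first.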

Remediation Plan

Option A — Cloud Tasks + Separate Indexer Service (Recommended)

Create a prism-indexer Cloud Run service, separate from the gateway: same Docker image, different entry point, and no public API beyond the internal endpoint that Cloud Tasks calls
Gateway /index endpoint calls Cloud Tasks create_task pointing at prism-indexer/internal/run
Indexer handles one job at a time per instance; Cloud Tasks handles retries, deduplication (via task name), and timeout management
Job status written to prism.index_jobs table; gateway /status endpoint reads from there

Effort: ~4 days
Requires: Cloud Tasks queue (1 queue, free tier covers alpha load)
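
A minimal sketch of how the indexer endpoint could record status and cooperate with Cloud Tasks retries. It uses an in-memory-capable sqlite3 table as a stand-in for prism.index_jobs; the column names, status values, and the handle_task helper are assumptions. Cloud Tasks treats a 2xx response as success and retries on anything else, and it reports the retry number in the X-CloudTasks-TaskRetryCount header, which is passed here as retry_count.

```python
import sqlite3
from typing import Callable

MAX_ATTEMPTS = 3  # mirrors the "up to 3 attempts" success criterion


def init_db(conn: sqlite3.Connection) -> None:
    """Create a stand-in for the prism.index_jobs table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, "
        "attempts INTEGER NOT NULL, error TEXT)"
    )


def handle_task(conn: sqlite3.Connection, job_id: str, retry_count: int,
                run_index: Callable[[], None]) -> int:
    """Run one index job and return the HTTP status for Cloud Tasks."""
    attempt = retry_count + 1
    conn.execute(
        "INSERT OR REPLACE INTO index_jobs (job_id, status, attempts, error) "
        "VALUES (?, 'running', ?, NULL)",
        (job_id, attempt),
    )
    conn.commit()
    try:
        run_index()  # clone, embed, write chunks
    except Exception as exc:
        final = attempt >= MAX_ATTEMPTS
        conn.execute(
            "UPDATE index_jobs SET status = ?, error = ? WHERE job_id = ?",
            ("failed" if final else "retrying", str(exc), job_id),
        )
        conn.commit()
        # Non-2xx asks Cloud Tasks to retry; 2xx on the last attempt stops it.
        return 200 if final else 500
    conn.execute(
        "UPDATE index_jobs SET status = 'succeeded' WHERE job_id = ?", (job_id,)
    )
    conn.commit()
    return 200
```

Because the row is written before the work starts, the gateway /status endpoint can read it without touching the worker. An OOM kill skips the except branch, so the row stays at "running" until the next retry overwrites it; detecting rows that stay "running" after the final attempt would need a separate staleness check, which this sketch does not cover. In practice the attempt cap would also be set on the queue's retry configuration.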

Option B — Cloud Run Jobs (Simpler, Less Observability)

Each /index trigger creates a Cloud Run Job execution
Job runs to completion in an isolated container; no shared state
Status surfaced via Cloud Run Jobs API

Effort: ~2 days
Limitation: No built-in retry policy; harder to query job status from the gateway

Short-Term Mitigations Already Applied (Alpha)

Mitigation	Location	Removes risk?
Gateway memory → 2 GiB	Cloud Run config	Partially (OOM for single job)
`asyncio.Semaphore(2)`	`console_routes.py`	Partially (caps concurrency)

Note: The semaphore has NOT been implemented yet. It is listed here because this TDR recommends it as an interim measure before GA.

Prerequisites Before GA

[ ] Implement asyncio.Semaphore(2) in console_routes.py as an interim guard
[ ] Implement Option A (Cloud Tasks) before onboarding the second tenant
[ ] Replace token-in-URL with GitHub App installation tokens (separate TDR or ADR)
[ ] Add prism.index_jobs table to track job history and status

Success Criteria

Indexing a repo does not affect gateway HTTP response latency
OOM during indexing does not affect the gateway process
Failed index jobs automatically retry (up to 3 attempts)
Job status is queryable independently of the gateway process lifecycle
Two simultaneous reindex triggers for the same repo are deduplicated

Related Items

EPC_012: Console build where this was first observed in UAT
ADR-005: Developer token architecture (opaque tokens vs IDP tokens)
Code References:
apps/prism/prism/gateway/console_routes.py — _run_tier3_full_reindex()
apps/prism/prism/tier3/handler.py — Tier3Handler.full_reindex()
apps/prism/prism/tier3/clone.py — shallow_clone()
apps/prism-console-api/console_api/gateways/prism_gateway_client.py — trigger_index()

Status Updates

2026-02-27: Identified during EPC_012 Console UAT. Memory OOM observed at 531 MiB for a 69 MB private repo. Short-term fix: increased Cloud Run memory to 2 GiB. Queue-based solution deferred to pre-GA.