TDR-004: Prism Indexing Must Move to a Queue-Based Worker
- Status: Identified
- Priority: High (before GA / multi-tenant launch)
- Estimated Effort: 3–5 days
- Date Identified: 2026-02-27
- Identified By: Dev lead (during EPC_012 Console UAT)
Description
What: The Prism Gateway currently runs full repo indexing as a fire-and-forget
`asyncio.create_task` inside the HTTP server process. This works for a single-tenant
alpha, but it is unsafe as soon as repos grow large or several index jobs run at once.
Current State:
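```python
# prism/gateway/console_routes.py (excerpt; "..." marks elided arguments)
# Fire-and-forget: no reference to the task is kept (the event loop holds only
# weak references to tasks), and nothing awaits it or handles its failure.
asyncio.create_task(
    _run_tier3_full_reindex(
        storage=storage, embedder=embedder,
        repo_full_name=..., clone_url=..., branch=..., tenant_id=...,
    )
)
# The client gets 202 immediately, before any indexing work has happened.
return JSONResponse({"status": "queued"}, status_code=202)
```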
Each POST /api/v1/repos/{id}/index spawns a background coroutine that:
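1. Shallow-clones the repo into /tmp via `git clone --depth=1` (on Cloud Run, /tmp is an in-memory filesystem, so the clone counts against the instance memory limit)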
2. Walks all files and generates Vertex AI embeddings (one embed() call per file)
3. Writes chunks + embeddings to pgvector
Why it is problematic:
| Scenario | Failure mode |
|---|---|
| Single large repo (>300 MB) | OOM — Vertex AI SDK alone uses ~400 MB baseline |
| 2+ repos indexed concurrently on the same instance | Additive memory pressure → OOM |
| Cloud Run scale-to-zero during indexing | Container killed mid-index, job silently lost |
| Indexing timeout (large repo > Cloud Run 3600s request timeout) | Job killed without status update |
| Retry on failure | No retry logic; failed jobs are silently dropped |
The memory crash was observed during EPC_012 UAT at 531 MiB for a 69 MB repo
(Fintama/helvetiq) — before the clone even started, just from loading the
Vertex AI SDK. Memory was increased to 2 GiB as a short-term fix.
Desired State:
POST /api/v1/repos/{id}/index enqueues a Cloud Tasks task and returns 202
immediately. A dedicated indexer worker (separate Cloud Run service or Cloud Run
Job) picks up the task, clones, embeds, and writes results. The gateway HTTP
server has no indexing logic.
```text
Console API ──► Gateway /index ──► Cloud Tasks queue ──► Indexer worker
     ◄── 202 ───────┘                                    (dedicated service)
```
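The enqueue step described above can be sketched as follows, assuming the `google-cloud-tasks` v2 Python client (`tasks_v2.CloudTasksClient.create_task`, which accepts the task as a dict). The helper names, the 5-minute deduplication window, and the OIDC service account are illustrative assumptions, not existing Prism code. The task name is a hash of tenant, repo, and time window, so Cloud Tasks rejects a second trigger for the same repo in the same window with `ALREADY_EXISTS`. The window is part of the key because a task name stays reserved for some time after the task has run, which would otherwise block a legitimate later re-index. The payload carries no credentials.

```python
import hashlib
import json
from datetime import datetime, timedelta
from typing import Tuple

# Illustrative settings (assumptions, not existing Prism configuration).
DEDUP_WINDOW_SECONDS = 300                 # triggers in the same 5-minute window share one task name
DISPATCH_DEADLINE = timedelta(minutes=30)  # maximum Cloud Tasks allows for HTTP targets


def index_task_id(tenant_id: str, repo_full_name: str, now: datetime) -> str:
    """Deterministic task ID: same tenant + repo + time window -> same ID."""
    window = int(now.timestamp()) // DEDUP_WINDOW_SECONDS
    key = f"{tenant_id}:{repo_full_name}:{window}"
    # A hex digest satisfies the [A-Za-z0-9_-] ID rule and avoids sequential prefixes.
    return hashlib.sha256(key.encode()).hexdigest()


def build_index_task(*, project: str, location: str, queue: str, worker_url: str,
                     invoker_sa: str, tenant_id: str, repo_full_name: str,
                     branch: str, now: datetime) -> Tuple[str, dict]:
    """Return (parent, task) in the dict shape accepted by CloudTasksClient.create_task."""
    parent = f"projects/{project}/locations/{location}/queues/{queue}"
    payload = {"tenant_id": tenant_id, "repo_full_name": repo_full_name, "branch": branch}
    task = {
        "name": f"{parent}/tasks/{index_task_id(tenant_id, repo_full_name, now)}",
        "dispatch_deadline": DISPATCH_DEADLINE,
        "http_request": {  # http_method defaults to POST
            "url": worker_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),
            # Cloud Tasks attaches an OIDC token for this account, so the worker can stay private.
            "oidc_token": {"service_account_email": invoker_sa},
        },
    }
    return parent, task


def enqueue_index(parent: str, task: dict) -> bool:
    """Create the task; return False if this repo is already queued in this window."""
    from google.api_core.exceptions import AlreadyExists  # requires google-cloud-tasks
    from google.cloud import tasks_v2

    try:
        tasks_v2.CloudTasksClient().create_task(parent=parent, task=task)
    except AlreadyExists:
        return False
    return True
```

The gateway can return 202 whether `enqueue_index` returns True or False, since in both cases a job for that repo is pending. Because Cloud Tasks caps the dispatch deadline for HTTP targets at 30 minutes, an index run that takes longer is treated as failed and redelivered while the first attempt may still be running, so the worker should also refuse to start a second concurrent run for the same repo.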
Impact
Reliability
- ❌ Silent failures: OOM or timeout during indexing leaves the repo in "indexing" or "pending" state with no error surfaced to the user
- ❌ No retry: Failed jobs are lost; user must manually trigger again
- ❌ Race condition: Two simultaneous reindex triggers for the same repo produce duplicate work and potentially corrupt chunk data
Scalability
- ❌ Memory: Each concurrent index job adds ~150–400 MB; 4 concurrent jobs on a single 2 GiB instance will OOM
- ❌ CPU: Indexing (file walking, chunking, and embedding calls) runs on the gateway's own CPU and event loop, degrading HTTP response latency for all other requests (MCP tools, auth)
Observability
- ❌ No progress visibility: Status is updated via direct DB writes from the background task; if the task is killed the status never updates to "failed"
- ❌ No job history: No record of past index runs, durations, or errors
Security
- ⚠️ Token in clone URL: The GitHub access token is currently embedded in the clone URL (`https://TOKEN@github.com/...`) and passed through the gateway request body. This is a short-term pragmatic choice. A proper solution uses GitHub App installation tokens generated at job dispatch time.
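Independently of how the token is minted, it can be kept out of the clone URL, and therefore out of the cloned repo's `.git/config` and the process argument list. A minimal sketch, assuming Git 2.31 or later (for the `GIT_CONFIG_COUNT` / `GIT_CONFIG_KEY_n` / `GIT_CONFIG_VALUE_n` environment variables) and GitHub's `x-access-token` username for installation tokens over HTTPS; `build_clone_command` is a hypothetical helper, not the existing `shallow_clone()`, and it does not mint the installation token itself.

```python
import base64
import os
from typing import Dict, List, Tuple

GITHUB_HOST = "https://github.com/"


def build_clone_command(repo_full_name: str, branch: str, dest: str,
                        token: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a shallow clone with credentials passed via the environment."""
    url = f"{GITHUB_HOST}{repo_full_name}.git"  # no credentials in the URL
    basic = base64.b64encode(f"x-access-token:{token}".encode()).decode()
    env = dict(os.environ)
    # Git reads these as one-off config entries; nothing is written to .git/config
    # and the token never appears in argv.
    env.update({
        "GIT_CONFIG_COUNT": "1",
        "GIT_CONFIG_KEY_0": f"http.{GITHUB_HOST}.extraheader",
        "GIT_CONFIG_VALUE_0": f"AUTHORIZATION: basic {basic}",
        "GIT_TERMINAL_PROMPT": "0",  # fail fast instead of prompting if auth is rejected
    })
    argv = ["git", "clone", "--depth=1", "--branch", branch, url, dest]
    return argv, env
```

The result would be run with `subprocess.run(argv, env=env, check=True)`; the same approach works for a GitHub App installation token once that replaces the current token.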
Remediation Plan
Option A: Cloud Tasks + Separate Indexer Service (Recommended)
- Create a `prism-indexer` Cloud Run service, separate from the gateway: same Docker image, different entry point, and no public API routes (it exposes only the internal endpoint that Cloud Tasks calls)
- The gateway `/index` endpoint calls Cloud Tasks `create_task` with an HTTP target pointing at the `prism-indexer` endpoint `/internal/run`
- The indexer handles one job at a time per instance; Cloud Tasks handles retries, deduplication (via task name), and timeout management
- Job status is written to the `prism.index_jobs` table; the gateway `/status` endpoint reads from there

Effort: ~4 days
Requires: Cloud Tasks queue (1 queue; the free tier covers alpha load)
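The worker side of Option A might look like the following sketch. It assumes the service runs one request per instance, the queue is configured with `max_attempts=3` (matching the success criteria), and the handler receives the `X-CloudTasks-TaskName` and `X-CloudTasks-TaskRetryCount` headers that Cloud Tasks adds to HTTP-target requests. `JobStore` is a hypothetical in-memory stand-in for `prism.index_jobs`, and `run_index` stands for the existing clone, embed, and write steps. Cloud Tasks retries any non-2xx response, so the handler records the error before returning 500 and marks the job `failed` only on the last attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping

MAX_ATTEMPTS = 3  # assumption: mirrors the queue's retry_config.max_attempts


@dataclass
class JobStore:
    """Hypothetical in-memory stand-in for the prism.index_jobs table."""
    status: Dict[str, str] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def handle_index_task(headers: Mapping[str, str], payload: dict, store: JobStore,
                      run_index: Callable[[dict], None]) -> int:
    """Process one Cloud Tasks delivery and return the HTTP status for the response."""
    # Exact-case keys here; web frameworks usually expose case-insensitive headers.
    job_id = headers["X-CloudTasks-TaskName"]
    attempt = int(headers.get("X-CloudTasks-TaskRetryCount", "0")) + 1
    if store.status.get(job_id) == "succeeded":
        return 200  # duplicate delivery of a finished job: acknowledge, don't redo work
    store.status[job_id] = "running"
    try:
        run_index(payload)  # clone, embed, write chunks
    except Exception as exc:
        store.errors[job_id] = repr(exc)
        store.status[job_id] = "failed" if attempt >= MAX_ATTEMPTS else "retrying"
        return 500  # non-2xx: Cloud Tasks schedules another attempt if any remain
    store.status[job_id] = "succeeded"
    store.errors.pop(job_id, None)
    return 200
```

If the instance is killed mid-run (for example by an OOM), no response is sent and the row stays `running`; Cloud Tasks still counts the attempt as failed and redelivers, and the gateway's `/status` endpoint can treat a `running` row older than the dispatch deadline as failed.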
Option B: Cloud Run Jobs (Simpler, Less Observability)
- Each `/index` trigger creates a Cloud Run Job execution
- The job runs to completion in an isolated container; no shared state
- Status is surfaced via the Cloud Run Jobs API

Effort: ~2 days
Limitation: Retries are limited to the job's per-task max-retries setting, with no task-name deduplication as in Cloud Tasks; harder to query job status from the gateway
Short-Term Mitigations Already Applied (Alpha)
| Mitigation | Location | Removes risk? |
|---|---|---|
| Gateway memory → 2 GiB | Cloud Run config | Partially (addresses single-job OOM only) |
| `asyncio.Semaphore(2)` (not yet implemented) | `console_routes.py` | Partially (caps concurrency) |
Note: The semaphore has NOT been implemented yet; this TDR recommends it as an interim measure before GA.
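The interim semaphore guard could look like this minimal sketch (`ReindexScheduler` and its callbacks are hypothetical names). Besides capping concurrency with `asyncio.Semaphore`, it keeps a strong reference to each task, because asyncio holds only weak references to running tasks, and it routes exceptions to an explicit failure callback so a repo does not stay in "indexing". It does not make jobs durable: waiting and running jobs are still lost if the instance is killed.

```python
import asyncio
from typing import Awaitable, Callable, Set


class ReindexScheduler:
    """Interim guard: at most `limit` reindex jobs run at once in this process."""

    def __init__(self, limit: int = 2) -> None:
        self._semaphore = asyncio.Semaphore(limit)
        # asyncio keeps only weak references to tasks; holding them here prevents
        # a fire-and-forget task from being garbage-collected mid-run.
        self._tasks: Set["asyncio.Task[None]"] = set()

    def schedule(self, job: Callable[[], Awaitable[None]],
                 on_failure: Callable[[Exception], Awaitable[None]]) -> "asyncio.Task[None]":
        task = asyncio.create_task(self._run(job, on_failure))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)  # drop the reference once finished
        return task

    async def _run(self, job: Callable[[], Awaitable[None]],
                   on_failure: Callable[[Exception], Awaitable[None]]) -> None:
        async with self._semaphore:  # jobs beyond the limit wait here, in memory only
            try:
                await job()
            except Exception as exc:  # CancelledError is a BaseException and still propagates
                await on_failure(exc)  # e.g. set the repo status to "failed"
```

In the route handler, the existing `asyncio.create_task(...)` call would become `scheduler.schedule(lambda: _run_tier3_full_reindex(...), on_failure)`, with one scheduler created at application startup. Jobs beyond the limit queue rather than being rejected; returning 429 instead would be an equally valid interim choice.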
Prerequisites Before GA
- [ ] Implement `asyncio.Semaphore(2)` in `console_routes.py` as an interim guard
- [ ] Implement Option A (Cloud Tasks) before onboarding the second tenant
- [ ] Replace token-in-URL with GitHub App installation tokens (separate TDR or ADR)
- [ ] Add `prism.index_jobs` table to track job history and status
Success Criteria
- Indexing a repo does not affect gateway HTTP response latency
- OOM during indexing does not affect the gateway process
- Failed index jobs automatically retry (up to 3 attempts)
- Job status is queryable independently of the gateway process lifecycle
- Two simultaneous reindex triggers for the same repo are deduplicated
Related
- EPC_012: Console build where this was first observed in UAT
- ADR-005: Developer token architecture (opaque tokens vs IDP tokens)
- Code References:
  - `apps/prism/prism/gateway/console_routes.py`: `_run_tier3_full_reindex()`
  - `apps/prism/prism/tier3/handler.py`: `Tier3Handler.full_reindex()`
  - `apps/prism/prism/tier3/clone.py`: `shallow_clone()`
  - `apps/prism-console-api/console_api/gateways/prism_gateway_client.py`: `trigger_index()`
Status Updates
- 2026-02-27: Identified during EPC_012 Console UAT. Memory OOM observed at 531 MiB for a 69 MB private repo. Short-term fix: increased Cloud Run memory to 2 GiB. Queue-based solution deferred to pre-GA.