Lab — Architecture
Audience: Architects, tech leads, senior engineers evaluating design decisions and cross-module impact. This document answers "how is the Lab designed, and why?" Assumes technical fluency but explains domain-specific decisions.
Context and Purpose
The Lab exists as a separate module to isolate the experimentation and evaluation concern from SwisperStudio's observability pipeline. Tracing captures what happened in production; the Lab answers "what should happen next" by providing a structured workflow for comparing alternative LLM configurations. This separation ensures that experimentation state (scenarios, setups, batch results) does not pollute the observability data model, and that batch execution — which generates many LLM calls — operates independently from the tracing consumer.
The driving constraints: (1) each Lab is scoped to a single node, because prompts, state shapes, and output contracts differ across nodes, making cross-node comparison meaningless; (2) scenarios and setups are immutable after first use to guarantee reproducibility; (3) the system must integrate bidirectionally with the Config Service — reading available prompt templates and writing winning configurations back as deployments.
Architecture Overview
graph TD
subgraph Frontend ["Frontend (React)"]
LP["Lab Pages\n(Home, Experiment)"]
PW["Playground\nWorkbench"]
PE["Prompt Editor\n(CodeMirror + Jinja2)"]
BW["Batch Create\nWizard"]
BM["Batch Matrix\n& Comparison"]
end
subgraph Backend ["Studio Backend (FastAPI)"]
LR["Lab Router\n(8 sub-routers)"]
BE["Batch Execution\nService"]
AS["Assessment\nService"]
RP["Render Proxy"]
end
subgraph External ["External Services"]
CS["Config Service\n(prompts, deploy)"]
LLM["LLM Providers\n(OpenAI, Anthropic,\nVertex AI, Azure)"]
end
subgraph DB ["PostgreSQL"]
SC["scenarios"]
SU["setups"]
EX["experiments +\nversions + runs"]
BA["batches +\nresults + aggregations"]
end
LP --> LR
PW --> PE
PW --> LR
BW --> LR
BM --> LR
LR --> SC
LR --> SU
LR --> EX
LR --> BA
BE --> LLM
BE --> BA
AS --> LLM
AS --> BA
RP --> CS
LR --> CS
PE -.->|"preview"| RP
The Lab follows a standard three-tier architecture: React frontend, FastAPI
backend, PostgreSQL storage. The frontend is organized as a feature module under
frontendV2/src/features/lab/ with pages, components, hooks, and API client
functions. The backend routes live in backend/app/api/routes/lab/ as 8
sub-routers mounted under /projects/{project_id}/labs. Two external integrations
drive the system: the Config Service (for prompt discovery, Jinja2 preview
rendering, and deployment) and LLM providers (for experiment runs and batch
execution).
Component Responsibilities
| Component | Location | Responsibility |
|---|---|---|
| Lab Router | backend/app/api/routes/lab/ | Aggregates 8 sub-routers (discovery, scenarios, setups, experiments, batches, deployments, add-to-lab, prompts) under the lab API prefix |
| Batch Execution Service | backend/app/services/batch_execution_service.py | Orchestrates N×M×R LLM calls for batch evaluation with parallelism, retry, and rate-limit handling |
| Assessment Service | backend/app/services/assessment_service.py | Evaluates batch results against scenario expectations using exact match, structured match, or LLM-as-judge |
| Config Service Client | backend/app/services/config_service_client.py | Adapter for Config Service: prompt discovery, node listing, and render proxy |
| Render Proxy | backend/app/api/routes/lab/prompts.py | Forwards Jinja2 template render requests to Config Service for production-identical preview |
| Playground Workbench | frontendV2/.../PlaygroundWorkbench.tsx | Interactive prompt editing UI: model selection, template editing, single-run execution, results display |
| Prompt Editor | frontendV2/.../PromptEditor.tsx | CodeMirror 6 editor with Jinja2 syntax highlighting, autocomplete, inline validation, and insertAtCursor for variable insertion |
| Template Section | frontendV2/.../TemplateSection.tsx | Editor/preview toggle: shows CodeMirror editor or server-rendered Jinja2 preview with font size controls and variable panel |
| Batch Matrix | frontendV2/.../BatchMatrix.tsx | Renders the scenarios × setups comparison grid with pass/fail, latency, and cost per cell |
| Lab Hooks | frontendV2/.../hooks/ | 8 React Query hooks managing data fetching, mutations, and caching for all Lab entities |
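
The Assessment Service row names three strategies. The two deterministic ones could look roughly like the sketch below. It assumes that exact match compares whitespace-trimmed strings and that structured match treats the expected JSON as a subset the response must contain. Those semantics, the type names `"exact_match"` and `"structured_match"`, and all function names are illustrative assumptions, not the service's actual code. LLM-as-judge is left out because it needs a provider call.

```python
import json
from typing import Any


def exact_match(expected: str, actual: str) -> bool:
    """Assumed semantics: equal after trimming surrounding whitespace."""
    return expected.strip() == actual.strip()


def structured_match(expected: Any, actual: Any) -> bool:
    """Assumed semantics: every key/value in `expected` appears in `actual`.

    Dicts are compared recursively as subsets, lists must match element by
    element, and everything else must be equal.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and structured_match(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(expected) == len(actual)
            and all(structured_match(e, a) for e, a in zip(expected, actual))
        )
    return expected == actual


def assess(assessment_type: str, expected_output: Any, response: str) -> bool:
    """Dispatch on the scenario's assessment_type (type names are hypothetical)."""
    if assessment_type == "exact_match":
        return exact_match(str(expected_output), response)
    if assessment_type == "structured_match":
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return False  # a non-JSON response cannot satisfy a structured check
        return structured_match(expected_output, parsed)
    raise ValueError(f"unsupported assessment type: {assessment_type}")
```

Under these assumptions, extra keys in the response do not fail a structured check, which tolerates models that add fields such as a confidence score.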
Data Model
All Lab entities are stored in PostgreSQL. The Lab does not have its own table —
Lab identity is the tuple (project_id, agent_name, node_name).
erDiagram
Project ||--o{ Scenario : has
Project ||--o{ Setup : has
Project ||--o{ Experiment : has
Project ||--o{ Batch : has
Scenario {
string id PK
string project_id FK
string agent_name
string node_name
string name
json state
string assessment_type
json expected_output
}
Setup {
string id PK
string project_id FK
string agent_name
string node_name
string name
string model_name_snapshot
json params
string prompt_template_id FK
}
Experiment {
string id PK
string project_id FK
string name
json test_state
string custom_prompt
}
Experiment ||--o{ ExperimentVersion : has
Experiment ||--o{ ExperimentRun : has
ExperimentVersion {
string id PK
string experiment_id FK
int version_number
string template_content
}
ExperimentRun {
string id PK
string experiment_id FK
string model_name_snapshot
string response
int latency_ms
decimal calculated_cost
}
Batch ||--o{ BatchResult : produces
Batch {
string id PK
string project_id FK
string agent_name
string node_name
string status
int total_runs
string selected_winner_setup_id
}
BatchResult {
string id PK
string batch_id FK
string scenario_id FK
string setup_id FK
int repetition_number
string status
boolean assessment_pass
}
| Entity | Table | Purpose |
|---|---|---|
| Scenario | scenarios | Frozen test case: node input state + expected result |
| Setup | setups | Named recipe: prompt snapshot + model + parameters |
| Experiment | experiments | Interactive playground session |
| ExperimentVersion | experiment_versions | Versioned prompt snapshot within an experiment |
| ExperimentRun | experiment_runs | Single LLM execution with results and metrics |
| Batch | experiment_batches | Systematic N×M×R evaluation job |
| BatchResult | batch_results | Per (scenario × setup × repetition) execution result |
| AggregatedResult | aggregated_results | Per (scenario × setup) rollup of batch results |
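
To make the N×M×R shape concrete, the sketch below expands a batch into its individual runs and rolls results back up to one value per (scenario × setup) cell. `PlannedRun`, `plan_runs`, and `pass_rates` are hypothetical names. Repetition numbers starting at 1 and pass rate as the rollup metric are also assumptions, since the source does not state what the aggregation stores.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PlannedRun:
    """One (scenario x setup x repetition) cell of a batch (hypothetical type)."""
    scenario_id: str
    setup_id: str
    repetition_number: int


def plan_runs(scenario_ids, setup_ids, repetitions):
    """Expand N scenarios x M setups x R repetitions into individual runs."""
    return [
        PlannedRun(scenario_id, setup_id, rep)
        for scenario_id, setup_id, rep in product(
            scenario_ids, setup_ids, range(1, repetitions + 1)
        )
    ]


def pass_rates(results):
    """Roll (run, passed) pairs up to one pass rate per (scenario, setup) cell."""
    cells = defaultdict(list)
    for run, passed in results:
        cells[(run.scenario_id, run.setup_id)].append(passed)
    return {cell: sum(flags) / len(flags) for cell, flags in cells.items()}
```

With 50 scenarios, 10 setups, and 5 repetitions, `plan_runs` yields 2,500 entries, which would correspond to the batch's `total_runs`.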
Key Design Decisions
Decision: No labs table — Lab identity is (project_id, agent_name, node_name)
- Chose: Virtual Lab scoped by the node tuple, no dedicated table
- Rejected: Explicit labs table with CRUD lifecycle
- Rationale: A Lab maps 1:1 to a node. A separate table would add indirection
(create/delete/rename operations) without value. The node identity is stable
because it comes from the agent's graph structure. All Lab entities carry
agent_name + node_name columns for scoping.
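
A virtual Lab can be modelled as a plain value object instead of a database row. The sketch below is one way to express that. `LabKey` and `rows_for_lab` are hypothetical names, and rows are shown as dicts carrying the three scoping columns rather than real ORM objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabKey:
    """Hypothetical value object for the (project_id, agent_name, node_name) tuple."""
    project_id: str
    agent_name: str
    node_name: str


def rows_for_lab(rows, key):
    """Keep only rows (dicts with the three scoping columns) that belong to one Lab."""
    return [
        row
        for row in rows
        if LabKey(row["project_id"], row["agent_name"], row["node_name"]) == key
    ]
```

Because the dataclass is frozen, two keys with the same fields compare equal and hash equal, so a Lab needs no create, rename, or delete lifecycle of its own.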
Decision: Atomic nodes, with one LLM call and one template per node (ADR-009)
- Chose: Each LangGraph node makes exactly one LLM call with one self-contained template
- Rejected: Multi-LLM nodes with fragment assembly
- Rationale: Atomic nodes enable per-node experimentation: each prompt can be tested, versioned, and deployed independently. The trade-off is more nodes in the graph and some template duplication, but clarity and testability outweigh DRY.
- Related: ADR-009: Prompt Architecture Simplification
Decision: Immutability after first use (spec D4)
- Chose: Scenarios and Setups lock their core fields after first use in a batch
- Rejected: Allow unrestricted editing with version tracking
- Rationale: Reproducibility. If a scenario's expected output changes after a batch, historical results become meaningless. The clone-to-modify pattern preserves audit integrity while still allowing iteration.
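
The lock-then-clone behaviour can be sketched as follows. The `locked` flag, `LockedError`, and both helper functions are hypothetical. The source says only that core fields lock after first use in a batch and that changes go through a clone.

```python
import copy
from dataclasses import dataclass, replace
from uuid import uuid4


class LockedError(Exception):
    """Raised when a locked scenario's core fields are edited."""


@dataclass
class Scenario:
    id: str
    name: str
    state: dict
    expected_output: dict
    locked: bool = False  # hypothetical flag, set on first use in a batch


def update_expected_output(scenario, new_expected):
    """Edit in place only while the scenario has never been used in a batch."""
    if scenario.locked:
        raise LockedError(f"scenario {scenario.id} is locked; clone it instead")
    scenario.expected_output = new_expected


def clone_scenario(scenario, **changes):
    """Clone-to-modify: deep copy with a fresh id, unlocked, plus the given changes."""
    return replace(copy.deepcopy(scenario), id=str(uuid4()), locked=False, **changes)
```

The deep copy matters here: without it, the clone would share its `state` dict with the locked original, and editing the clone would silently change the frozen test case.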
Decision: CodeMirror 6 for prompt editing (replacing Monaco)
- Chose: CodeMirror 6 with @codemirror/lang-jinja for Jinja2-native editing
- Rejected: Continue with Monaco in markdown mode
- Rationale: Monaco has no Jinja2 language support. CodeMirror's first-party
lang-jinja provides syntax highlighting, autocomplete, and auto-close for all
Jinja2 constructs. The editor also provides inline syntax validation that
prevents saving invalid templates.
Interfaces and Contracts
| Interface | Direction | Consumer | Contract |
|---|---|---|---|
| GET /projects/{pid}/labs | Inbound | Frontend Lab Index | Returns LabSummary[]: merged Config Service nodes + local DB counts |
| POST .../labs/{agent}/{node}/experiments | Inbound | Frontend Playground | Creates or resolves playground experiment for the node |
| POST .../labs/{agent}/{node}/batches | Inbound | Frontend Batch Wizard | Creates a batch with scenario IDs, setup IDs, repetitions |
| POST .../labs/{agent}/{node}/batches/{id}/start | Inbound | Frontend | Triggers async batch execution |
| POST /prompts/{name}/render | Outbound (proxy) | Config Service | Forwards Jinja2 render request, returns rendered string |
| POST .../labs/{agent}/{node}/deployments | Outbound (proxy) | Config Service | Promotes winning setup to target environment |
| Config Service GET /api/v1/prompts | Outbound | Lab Discovery | Fetches available prompt nodes for Lab listing |
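
The Lab listing contract describes a merge of Config Service nodes with local database counts. A minimal sketch of such a merge is shown below. The `LabSummary` fields and the input shapes (a list of `(agent, node)` pairs and count dicts keyed by the same pair) are assumptions made for illustration, not the actual response schema.

```python
from dataclasses import dataclass


@dataclass
class LabSummary:
    """Hypothetical shape of one entry in the Lab listing."""
    agent_name: str
    node_name: str
    scenario_count: int = 0
    setup_count: int = 0


def merge_lab_summaries(config_nodes, scenario_counts, setup_counts):
    """Join Config Service nodes with local counts keyed by (agent, node).

    Nodes with no local data still appear, with zero counts.
    """
    return [
        LabSummary(
            agent_name=agent,
            node_name=node,
            scenario_count=scenario_counts.get((agent, node), 0),
            setup_count=setup_counts.get((agent, node), 0),
        )
        for agent, node in config_nodes
    ]
```

Driving the result from the Config Service list fits the virtual-Lab decision: a node is listed as a Lab even before any scenario or setup exists for it.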
Breaking change note: The Lab API uses camelCase JSON serialization (via
Pydantic alias generators). Frontend types in frontendV2/src/features/lab/types.ts
must match. Schema changes require coordinated backend + frontend updates.
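
To show the wire format the frontend types must mirror, the sketch below converts snake_case field names to camelCase by hand. It is a dependency-free stand-in for what an alias generator does, not the backend's Pydantic configuration, and `to_camel` and `to_wire` are hypothetical helper names.

```python
def to_camel(snake: str) -> str:
    """Convert snake_case to camelCase, e.g. model_name_snapshot -> modelNameSnapshot."""
    head, *tail = snake.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_wire(row: dict) -> dict:
    """Rename top-level field names only; free-form JSON values pass through."""
    return {to_camel(key): value for key, value in row.items()}
```

In this sketch only field names are renamed. Keys inside free-form JSON values such as a scenario's `state` are left exactly as stored, on the assumption that they are node input data and not part of the API schema.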
Known Trade-offs and Debt
- Monaco still bundled. CodeMirror replaced Monaco for prompt editing, but Monaco remains in the bundle for JSON editors in the Tracing feature. Two editor libraries coexist, adding ~80KB to the initial bundle. Future work: migrate remaining Monaco uses to CodeMirror or lazy-load Monaco.
- No lazy-loading for Lab pages. Lab pages are eagerly imported in App.tsx alongside all other routes, so CodeMirror and all Lab components load on the initial page visit. Route-level code splitting would reduce initial bundle size.
- Config Service is a hard dependency for preview and deployment. When it is unreachable, the Jinja2 preview falls back to raw template text and deployment is unavailable. No local Jinja2 rendering fallback exists in the Studio backend.
- Batch execution has no cost budget. Large batches (e.g., 50 scenarios × 10 setups × 5 repetitions = 2,500 LLM calls) execute without a spending limit. A cost-cap feature is deferred.
- Assessment calibration not exposed. The LLM-as-judge assessment uses a fixed rubric. Calibration of the judge model against human ratings is not yet implemented.
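
The preview fallback described in the Config Service item above can be sketched as a small wrapper around the remote render call. `ConfigServiceUnavailable`, `preview_prompt`, and the injected `render_remote` callable are hypothetical. The source does not say which layer implements the fallback, so this shows only the behaviour.

```python
from typing import Callable


class ConfigServiceUnavailable(Exception):
    """Hypothetical error raised by the client when the service is unreachable."""


def preview_prompt(
    template: str,
    variables: dict,
    render_remote: Callable[[str, dict], str],
) -> tuple[str, bool]:
    """Return (text, rendered); fall back to the raw template when rendering fails."""
    try:
        return render_remote(template, variables), True
    except ConfigServiceUnavailable:
        # No local Jinja2 rendering exists, so show the unrendered template.
        return template, False
```

Returning the `rendered` flag alongside the text lets a caller label the preview as unrendered instead of presenting raw template text as if it were production output.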