Lab — Architecture

Audience: Architects, tech leads, and senior engineers evaluating design decisions and cross-module impact. This document answers "how is the Lab designed, and why?" It assumes technical fluency but explains domain-specific decisions.


Context and Purpose

The Lab exists as a separate module to isolate the experimentation and evaluation concern from SwisperStudio's observability pipeline. Tracing captures what happened in production; the Lab answers "what should happen next" by providing a structured workflow for comparing alternative LLM configurations. This separation ensures that experimentation state (scenarios, setups, batch results) does not pollute the observability data model, and that batch execution — which generates many LLM calls — operates independently from the tracing consumer.

The driving constraints: (1) each Lab is scoped to a single node, because prompts, state shapes, and output contracts differ across nodes, making cross-node comparison meaningless; (2) scenarios and setups are immutable after first use to guarantee reproducibility; (3) the system must integrate bidirectionally with the Config Service — reading available prompt templates and writing winning configurations back as deployments.


Architecture Overview

graph TD
    subgraph Frontend ["Frontend (React)"]
        LP["Lab Pages\n(Home, Experiment)"]
        PW["Playground\nWorkbench"]
        PE["Prompt Editor\n(CodeMirror + Jinja2)"]
        BW["Batch Create\nWizard"]
        BM["Batch Matrix\n& Comparison"]
    end

    subgraph Backend ["Studio Backend (FastAPI)"]
        LR["Lab Router\n(8 sub-routers)"]
        BE["Batch Execution\nService"]
        AS["Assessment\nService"]
        RP["Render Proxy"]
    end

    subgraph External ["External Services"]
        CS["Config Service\n(prompts, deploy)"]
        LLM["LLM Providers\n(OpenAI, Anthropic,\nVertex AI, Azure)"]
    end

    subgraph DB ["PostgreSQL"]
        SC["scenarios"]
        SU["setups"]
        EX["experiments +\nversions + runs"]
        BA["batches +\nresults + aggregations"]
    end

    LP --> LR
    PW --> PE
    PW --> LR
    BW --> LR
    BM --> LR

    LR --> SC
    LR --> SU
    LR --> EX
    LR --> BA

    BE --> LLM
    BE --> BA
    AS --> LLM
    AS --> BA

    RP --> CS
    LR --> CS
    PE -.->|"preview"| RP

The Lab follows a standard three-tier architecture: React frontend, FastAPI backend, PostgreSQL storage. The frontend is organized as a feature module under frontendV2/src/features/lab/ with pages, components, hooks, and API client functions. The backend routes live in backend/app/api/routes/lab/ as 8 sub-routers mounted under /projects/{project_id}/labs. Two external integrations drive the system: the Config Service (for prompt discovery, Jinja2 preview rendering, and deployment) and LLM providers (for experiment runs and batch execution).


Component Responsibilities

| Component | Location | Responsibility |
|---|---|---|
| Lab Router | backend/app/api/routes/lab/ | Aggregates 8 sub-routers (discovery, scenarios, setups, experiments, batches, deployments, add-to-lab, prompts) under the lab API prefix |
| Batch Execution Service | backend/app/services/batch_execution_service.py | Orchestrates N×M×R LLM calls for batch evaluation with parallelism, retry, and rate-limit handling |
| Assessment Service | backend/app/services/assessment_service.py | Evaluates batch results against scenario expectations using exact match, structured match, or LLM-as-judge |
| Config Service Client | backend/app/services/config_service_client.py | Adapter for Config Service: prompt discovery, node listing, and render proxy |
| Render Proxy | backend/app/api/routes/lab/prompts.py | Forwards Jinja2 template render requests to Config Service for production-identical preview |
| Playground Workbench | frontendV2/.../PlaygroundWorkbench.tsx | Interactive prompt editing UI: model selection, template editing, single-run execution, results display |
| Prompt Editor | frontendV2/.../PromptEditor.tsx | CodeMirror 6 editor with Jinja2 syntax highlighting, autocomplete, inline validation, and insertAtCursor for variable insertion |
| Template Section | frontendV2/.../TemplateSection.tsx | Editor/preview toggle: shows CodeMirror editor or server-rendered Jinja2 preview with font size controls and variable panel |
| Batch Matrix | frontendV2/.../BatchMatrix.tsx | Renders the scenarios × setups comparison grid with pass/fail, latency, and cost per cell |
| Lab Hooks | frontendV2/.../hooks/ | 8 React Query hooks managing data fetching, mutations, and caching for all Lab entities |
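
The Assessment Service entry names three strategies. The two deterministic ones can be sketched as follows. This is a minimal illustration, not the service's actual code: the function names, the `assessment_type` string values, and the semantics of "structured match" (every expected field must be present with an equal value, extra fields ignored) are all assumptions. LLM-as-judge is omitted because it requires a provider call.

```python
import json
from typing import Any


def exact_match(actual: str, expected: str) -> bool:
    """Pass only if both strings are identical after trimming outer whitespace."""
    return actual.strip() == expected.strip()


def structured_match(actual: Any, expected: Any) -> bool:
    """Pass if every field in `expected` appears in `actual` with an equal value.

    Extra keys in `actual` are ignored (assumed semantics, not from the source).
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and structured_match(actual[key], value)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(actual) == len(expected)
            and all(structured_match(a, e) for a, e in zip(actual, expected))
        )
    return actual == expected


def assess(assessment_type: str, response: str, expected_output: Any) -> bool:
    """Dispatch on the scenario's assessment type (hypothetical type names)."""
    if assessment_type == "exact_match":
        return exact_match(response, str(expected_output))
    if assessment_type == "structured_match":
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return False  # a non-JSON response cannot satisfy a structured check
        return structured_match(parsed, expected_output)
    raise ValueError(f"unsupported assessment type: {assessment_type}")
```

Treating unparseable output as a failed assessment rather than an error keeps a malformed LLM response from aborting the rest of a batch.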

Data Model

All Lab entities are stored in PostgreSQL. The Lab does not have its own table — Lab identity is the tuple (project_id, agent_name, node_name).
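
Because there is no labs table, anything that needs "a Lab" derives it from the scoping columns on each entity. The sketch below shows one way to model that tuple as a value object and group rows by it. `LabKey` and `count_by_lab` are hypothetical names, and the rows are plain dictionaries standing in for database records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class LabKey:
    """Virtual Lab identity: no row of its own, just the scoping tuple."""
    project_id: str
    agent_name: str
    node_name: str


def count_by_lab(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count entities (e.g. scenarios) per Lab using their scoping columns."""
    return Counter(
        LabKey(row["project_id"], row["agent_name"], row["node_name"])
        for row in rows
    )
```

A frozen dataclass is hashable, so the key can be used directly in dictionaries and counters when merging local counts with the node list from the Config Service.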

erDiagram
    Project ||--o{ Scenario : has
    Project ||--o{ Setup : has
    Project ||--o{ Experiment : has
    Project ||--o{ Batch : has

    Scenario {
        string id PK
        string project_id FK
        string agent_name
        string node_name
        string name
        json state
        string assessment_type
        json expected_output
    }

    Setup {
        string id PK
        string project_id FK
        string agent_name
        string node_name
        string name
        string model_name_snapshot
        json params
        string prompt_template_id FK
    }

    Experiment {
        string id PK
        string project_id FK
        string name
        json test_state
        string custom_prompt
    }

    Experiment ||--o{ ExperimentVersion : has
    Experiment ||--o{ ExperimentRun : has

    ExperimentVersion {
        string id PK
        string experiment_id FK
        int version_number
        string template_content
    }

    ExperimentRun {
        string id PK
        string experiment_id FK
        string model_name_snapshot
        string response
        int latency_ms
        decimal calculated_cost
    }

    Batch ||--o{ BatchResult : produces
    Batch {
        string id PK
        string project_id FK
        string agent_name
        string node_name
        string status
        int total_runs
        string selected_winner_setup_id
    }

    BatchResult {
        string id PK
        string batch_id FK
        string scenario_id FK
        string setup_id FK
        int repetition_number
        string status
        boolean assessment_pass
    }

| Entity | Table | Purpose |
|---|---|---|
| Scenario | scenarios | Frozen test case: node input state + expected result |
| Setup | setups | Named recipe: prompt snapshot + model + parameters |
| Experiment | experiments | Interactive playground session |
| ExperimentVersion | experiment_versions | Versioned prompt snapshot within an experiment |
| ExperimentRun | experiment_runs | Single LLM execution with results and metrics |
| Batch | experiment_batches | Systematic N×M×R evaluation job |
| BatchResult | batch_results | Per (scenario × setup × repetition) execution result |
| AggregatedResult | aggregated_results | Per (scenario × setup) rollup of batch results |
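
The relationship between batch results and their per-cell rollup can be illustrated with a small grouping function. This is a sketch under assumptions: the rollup fields (`runs`, `passed`, `pass_rate`) are hypothetical, since the source does not list the columns of aggregated_results, and the `BatchResult` class here carries only the fields needed for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class BatchResult:
    scenario_id: str
    setup_id: str
    repetition_number: int
    assessment_pass: bool


def aggregate(results: Iterable[BatchResult]) -> Dict[Tuple[str, str], dict]:
    """Collapse repetitions into one rollup per (scenario, setup) cell."""
    grouped = defaultdict(list)
    for result in results:
        grouped[(result.scenario_id, result.setup_id)].append(result.assessment_pass)
    return {
        cell: {
            "runs": len(passes),
            "passed": sum(passes),  # True counts as 1
            "pass_rate": sum(passes) / len(passes),
        }
        for cell, passes in grouped.items()
    }
```

Each key in the returned dictionary corresponds to one cell of the scenarios × setups comparison grid.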

Key Design Decisions

Decision: No labs table; Lab identity is (project_id, agent_name, node_name)
  • Chose: Virtual Lab scoped by the node tuple, no dedicated table
  • Rejected: Explicit labs table with CRUD lifecycle
  • Rationale: A Lab maps 1:1 to a node. A separate table would add indirection (create/delete/rename operations) without value. The node identity is stable because it comes from the agent's graph structure. All Lab entities carry agent_name + node_name columns for scoping.

Decision: Atomic nodes, with one LLM call per node and one template per node (ADR-009)
  • Chose: Each LangGraph node makes exactly one LLM call with one self-contained template
  • Rejected: Multi-LLM nodes with fragment assembly
  • Rationale: Atomic nodes enable per-node experimentation: each prompt can be tested, versioned, and deployed independently. The trade-off is more nodes in the graph and some template duplication, but clarity and testability outweigh DRY.
  • Related: ADR-009: Prompt Architecture Simplification

Decision: Immutability after first use (spec D4)
  • Chose: Scenarios and Setups lock their core fields after first use in a batch
  • Rejected: Allow unrestricted editing with version tracking
  • Rationale: Reproducibility. If a scenario's expected output changes after a batch, historical results become meaningless. The clone-to-modify pattern preserves audit integrity while still allowing iteration.
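
The lock-then-clone behaviour described in this decision might look like the sketch below. Everything here is illustrative: the `used_in_batch` flag, the choice of `state` and `expected_output` as the locked "core" fields, and the function names are assumptions, not the Lab's actual schema or API.

```python
import uuid
from dataclasses import dataclass, replace


class LockedError(Exception):
    """Raised when a locked core field is edited after first batch use."""


@dataclass(frozen=True)
class Scenario:
    id: str
    name: str
    state: dict
    expected_output: dict
    used_in_batch: bool = False  # hypothetical lock marker


# Assumed set of core fields; the source does not enumerate them.
CORE_FIELDS = {"state", "expected_output"}


def update_scenario(scenario: Scenario, **changes) -> Scenario:
    """Apply edits, refusing changes to core fields once the scenario is locked."""
    locked = CORE_FIELDS.intersection(changes)
    if scenario.used_in_batch and locked:
        raise LockedError(f"locked fields: {sorted(locked)}; clone to modify")
    return replace(scenario, **changes)


def clone_scenario(scenario: Scenario, **changes) -> Scenario:
    """Clone-to-modify: a new, unlocked scenario that leaves the original intact."""
    fields = {
        "id": str(uuid.uuid4()),
        "name": f"{scenario.name} (copy)",
        "used_in_batch": False,
        **changes,  # caller-supplied values override the defaults above
    }
    return replace(scenario, **fields)
```

Non-core fields such as the name stay editable in this sketch, so renaming a scenario does not invalidate historical batch results.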

Decision: CodeMirror 6 for prompt editing (replacing Monaco)
  • Chose: CodeMirror 6 with @codemirror/lang-jinja for Jinja2-native editing
  • Rejected: Continue with Monaco in markdown mode
  • Rationale: Monaco has no Jinja2 language support. CodeMirror's first-party lang-jinja provides syntax highlighting, autocomplete, and auto-close for all Jinja2 constructs. The editor also provides inline syntax validation that prevents saving invalid templates.


Interfaces and Contracts

| Interface | Direction | Consumer | Contract |
|---|---|---|---|
| GET /projects/{pid}/labs | Inbound | Frontend Lab Index | Returns LabSummary[]: merged Config Service nodes + local DB counts |
| POST .../labs/{agent}/{node}/experiments | Inbound | Frontend Playground | Creates or resolves playground experiment for the node |
| POST .../labs/{agent}/{node}/batches | Inbound | Frontend Batch Wizard | Creates a batch with scenario IDs, setup IDs, repetitions |
| POST .../labs/{agent}/{node}/batches/{id}/start | Inbound | Frontend | Triggers async batch execution |
| POST /prompts/{name}/render | Outbound (proxy) | Config Service | Forwards Jinja2 render request, returns rendered string |
| POST .../labs/{agent}/{node}/deployments | Outbound (proxy) | Config Service | Promotes winning setup to target environment |
| Config Service GET /api/v1/prompts | Outbound | Lab Discovery | Fetches available prompt nodes for Lab listing |
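
What the batch start endpoint triggers can be pictured as an expansion of scenario IDs × setup IDs × repetitions into individual calls, run with bounded parallelism and retry. The sketch below is one plausible shape, not the Batch Execution Service itself: `call_llm` is a hypothetical injected coroutine, the status strings are assumed, and the retry policy (exponential backoff on any exception) is a stand-in for real rate-limit handling.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, List, Sequence

# Hypothetical signature: (scenario_id, setup_id, repetition) -> response text
LlmCall = Callable[[str, str, int], Awaitable[str]]


async def run_batch(
    scenario_ids: Sequence[str],
    setup_ids: Sequence[str],
    repetitions: int,
    call_llm: LlmCall,
    max_parallel: int = 4,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> List[dict]:
    """Run every (scenario, setup, repetition) cell with a concurrency cap."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_cell(scenario_id: str, setup_id: str, rep: int) -> dict:
        cell = {"scenario_id": scenario_id, "setup_id": setup_id,
                "repetition_number": rep}
        async with semaphore:  # at most max_parallel cells in flight
            for attempt in range(1, max_attempts + 1):
                try:
                    response = await call_llm(scenario_id, setup_id, rep)
                    return {**cell, "status": "completed", "response": response}
                except Exception as exc:
                    if attempt == max_attempts:
                        return {**cell, "status": "failed", "error": str(exc)}
                    # Exponential backoff before the next attempt.
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))

    cells = itertools.product(scenario_ids, setup_ids, range(1, repetitions + 1))
    return await asyncio.gather(*(run_cell(*cell) for cell in cells))
```

A failed cell is recorded rather than raised, so one misbehaving setup does not cancel the remaining calls in the batch.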

Breaking change note: The Lab API uses camelCase JSON serialization (via Pydantic alias generators). Frontend types in frontendV2/src/features/lab/types.ts must match. Schema changes require coordinated backend + frontend updates.
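
To make the coupling concrete, the sketch below shows the kind of key transformation an alias generator applies when a snake_case backend field is serialized for the frontend. It is a hand-rolled stand-in written with the standard library only, not the Lab's Pydantic configuration, and it converts top-level keys only.

```python
def to_camel(snake: str) -> str:
    """Convert snake_case to camelCase, e.g. latency_ms -> latencyMs."""
    head, *rest = snake.split("_")
    return head + "".join(part.capitalize() for part in rest)


def serialize(record: dict) -> dict:
    """Rename top-level keys the way a camelCase alias generator would."""
    return {to_camel(key): value for key, value in record.items()}
```

For example, a backend field named `model_name_snapshot` reaches the frontend as `modelNameSnapshot`, so renaming the backend field without updating types.ts would break the frontend's expectations.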


Known Trade-offs and Debt

  • Monaco still bundled. CodeMirror replaced Monaco for prompt editing, but Monaco remains in the bundle for JSON editors in the Tracing feature. Two editor libraries coexist, adding ~80KB to the initial bundle. Future work: migrate remaining Monaco uses to CodeMirror or lazy-load Monaco.
  • No lazy-loading for Lab pages. Lab pages are eagerly imported in App.tsx alongside all other routes. CodeMirror and all Lab components load on initial page visit. Route-level code splitting would reduce initial bundle size.
  • Config Service is a hard dependency for preview and deployment. When the Config Service is unreachable, the Jinja2 preview falls back to raw template text and deployment is unavailable. No local Jinja2 rendering fallback exists in the Studio backend.
  • Batch execution has no cost budget. Large batches (e.g., 50 scenarios × 10 setups × 5 repetitions = 2,500 LLM calls) execute without a spending limit. A cost-cap feature is deferred.
  • Assessment calibration not exposed. The LLM-as-judge assessment uses a fixed rubric. Calibration of the judge model against human ratings is not yet implemented.
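
The batch-size arithmetic in the cost budget item above is simple enough to check before a batch starts. The sketch below shows one possible shape for such a pre-flight check. It is purely illustrative, since the cost-cap feature is deferred: `plan_batch`, `BudgetExceeded`, and the per-call cost estimate are hypothetical.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when the projected batch cost is above the allowed budget."""


def plan_batch(
    n_scenarios: int,
    n_setups: int,
    repetitions: int,
    est_cost_per_call: Decimal,
    budget: Decimal,
) -> int:
    """Return the number of LLM calls, or raise if the projection exceeds budget."""
    calls = n_scenarios * n_setups * repetitions
    projected = calls * est_cost_per_call
    if projected > budget:
        raise BudgetExceeded(f"{calls} calls projected at {projected}, budget {budget}")
    return calls
```

With 50 scenarios, 10 setups, and 5 repetitions the function counts 2,500 calls, matching the example in the list above.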