Lab — Overview

Audience: Business stakeholders, product owners, analysts, and new team members. This document answers "what does the Lab do and why does it matter?" in plain language, without unexplained technical jargon.

What This Module Does

The Lab is SwisperStudio's experimentation workspace for optimizing how each AI node in a Swisper agent works. When a developer or prompt engineer needs to find the best combination of system prompt, AI model, and model settings for a specific node, the Lab provides a structured way to test, compare, and deploy the winning configuration — replacing guesswork with data-driven decisions.

Without the Lab, finding the right recipe for a node means manually editing prompts, running a trace, eyeballing the output, and repeating. There is no way to systematically compare alternatives, no persistent test cases to reuse, and no audit trail showing why a particular configuration was chosen. The Lab turns this ad-hoc process into a repeatable workflow: create test cases from real production data, define configurations to compare, run a structured evaluation, review the results in a comparison matrix, and deploy the winner to production.

Who It Serves

Prompt engineers who need to iterate on system prompts and see how changes affect output quality, latency, and cost across realistic test cases.
AI developers who want to compare different models (e.g., GPT-4 vs Gemini vs Claude) on the same test suite to find the best quality-cost trade-off for each node.
QA engineers who import test scenarios from test-driven development (TDD) test files or create them from real traces to build regression test suites for AI behavior.
Product owners who need confidence that changes to AI prompts have been tested against real scenarios before reaching production.

Key Capabilities

Create test cases from real traces. Right-click on any node observation in the Tracing UI and select "Add to Lab" to instantly create a test case with the exact inputs and outputs from production.
Build a library of reusable scenarios. Each test case captures the full input state and expected result, so it can be replayed against any configuration without re-running the original workflow.
Define named setups for comparison. Combine a prompt template, a model, and model settings into a named "setup" that can be evaluated systematically.
Run structured batch evaluations. Select multiple scenarios and multiple setups, run them with configurable repetitions, and get a comparison matrix showing pass/fail, latency, token usage, and cost for every combination.
Edit prompts with Jinja2-aware syntax highlighting. The Playground editor understands Jinja2 template syntax, provides variable autocomplete, validates syntax in real time, and shows a server-rendered preview that matches production output.
Deploy the winning configuration. Once the best setup is identified, promote it directly to a staging or production environment from within the Lab.
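
To make the batch evaluation idea concrete, here is a minimal Python sketch of how scenarios, setups, and repetitions could expand into individual runs and then collapse into a comparison matrix. It is an illustration only, not the Lab's actual implementation: the names `Scenario`, `Setup`, `RunResult`, `run_batch`, and `fake_execute` are hypothetical, the exact-match pass criterion is an assumption, and the stand-in executor replaces what would be a real LLM call.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Scenario:  # hypothetical shape: frozen input state plus expected result
    name: str
    input_state: dict
    expected: str


@dataclass(frozen=True)
class Setup:  # hypothetical shape: prompt template + model (settings omitted)
    name: str
    prompt: str
    model: str


@dataclass(frozen=True)
class RunResult:
    output: str
    latency_ms: float
    tokens: int
    cost_usd: float


def run_batch(
    scenarios: list[Scenario],
    setups: list[Setup],
    repetitions: int,
    execute: Callable[[Scenario, Setup], RunResult],
) -> dict[tuple[str, str], dict[str, float]]:
    """Run every scenario against every setup and summarize each cell."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    matrix = {}
    for scenario, setup in product(scenarios, setups):
        runs = [execute(scenario, setup) for _ in range(repetitions)]
        matrix[(scenario.name, setup.name)] = {
            # Assumption: a run passes when the output equals the expected result.
            "pass_rate": mean(1.0 if r.output == scenario.expected else 0.0 for r in runs),
            "avg_latency_ms": mean(r.latency_ms for r in runs),
            "total_tokens": sum(r.tokens for r in runs),
            "total_cost_usd": sum(r.cost_usd for r in runs),
        }
    return matrix


def fake_execute(scenario: Scenario, setup: Setup) -> RunResult:
    """Stand-in for a real LLM call so the sketch runs offline."""
    output = scenario.expected if setup.model == "model-a" else "unexpected"
    return RunResult(output=output, latency_ms=120.0, tokens=50, cost_usd=0.001)


if __name__ == "__main__":
    scenarios = [Scenario("greeting", {"user": "hi"}, "hello")]
    setups = [Setup("baseline", "prompt v1", "model-a"), Setup("candidate", "prompt v2", "model-b")]
    for cell, summary in run_batch(scenarios, setups, 3, fake_execute).items():
        print(cell, summary)
```

Each key of the returned dictionary is one cell of the comparison matrix (one scenario paired with one setup), which is why the number of runs grows multiplicatively with scenarios, setups, and repetitions.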

How It Fits in the Platform

The Lab sits between the Tracing module (which captures real production data) and the Config Service (which manages deployed configurations). Traces provide the raw material for test scenarios — the Lab imports frozen node state from trace observations. When the user identifies a winning configuration, the Lab pushes it to the Config Service for deployment to a target environment.

The Lab also depends on the Provider Configuration module, which supplies the available AI models and their credentials. Batch execution uses these configured providers to run LLM calls during evaluation.

For a detailed view of component relationships and data flows, see the Architecture document.

Limits and Edge Cases

Node-scoped only. The Lab optimizes one node at a time. It cannot run end-to-end agent workflows or evaluate how changes to one node affect downstream nodes.
Config Service required for preview, prompt discovery, and deployment. The Jinja2 preview, prompt discovery (listing which nodes are available), and deployment all require the Config Service to be running. If the Config Service is unreachable, the editor still works, but the preview shows raw template text and deployment is unavailable.
Scenarios and setups become immutable after first use. Once a scenario or setup has been used in a batch evaluation, its core content (inputs, expected output, prompt, model, parameters) cannot be changed. To modify, the user must create a clone.
No built-in model cost budgets. Batch evaluations run all requested combinations. Large batches (many scenarios × many setups × many repetitions) can incur significant LLM costs without a built-in budget limit.
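
Because the Lab has no built-in budget limit, a team may want to estimate a batch's size and cost before launching it. The following Python sketch shows one way to do that outside the Lab. The function names, the average-tokens-per-call figure, and the per-1,000-token price are all assumptions made for illustration; real prices depend on the models selected.

```python
def planned_calls(scenarios: int, setups: int, repetitions: int) -> int:
    """Every scenario runs against every setup, repeated `repetitions` times."""
    return scenarios * setups * repetitions


def estimate_cost_usd(calls: int, avg_tokens_per_call: float, usd_per_1k_tokens: float) -> float:
    """Rough estimate; assumes one blended token price for the whole batch."""
    return calls * avg_tokens_per_call / 1000 * usd_per_1k_tokens


def check_budget(
    scenarios: int,
    setups: int,
    repetitions: int,
    avg_tokens_per_call: float,
    usd_per_1k_tokens: float,
    budget_usd: float,
) -> tuple[int, float, bool]:
    """Return (call count, estimated cost, whether the estimate fits the budget)."""
    calls = planned_calls(scenarios, setups, repetitions)
    cost = estimate_cost_usd(calls, avg_tokens_per_call, usd_per_1k_tokens)
    return calls, cost, cost <= budget_usd


if __name__ == "__main__":
    # Illustrative numbers only: 2,000 tokens per call at $0.01 per 1,000 tokens.
    calls, cost, ok = check_budget(10, 3, 3, 2000, 0.01, budget_usd=5.0)
    print(f"{calls} calls, about ${cost:.2f}, within budget: {ok}")
```

A check like this is only as good as its token and price assumptions, so it works best as a sanity check on batch size rather than a precise forecast.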

FAQ

Q: Where do test scenarios come from? A: Scenarios can be created in three ways: from real production traces (the most common path — right-click a node observation and select "Add to Lab"), manually by entering input state and expected output, or by importing from TDD test files in the codebase.

Q: How do I know which setup is best? A: Run a batch evaluation with your test scenarios and the setups you want to compare. The Lab produces a comparison matrix showing each setup's pass rate, average latency, token usage, and total cost across all scenarios. A recommendation score combines these factors to suggest a winner.
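
This document does not specify how the recommendation score is calculated. Purely to illustrate the idea of combining several factors into one number, here is a hypothetical Python sketch that uses a weighted sum of pass rate, latency, and cost. The function name `recommend`, the weights, and the normalization are assumptions, not the Lab's formula, and token usage is left out for brevity.

```python
def recommend(
    summaries: dict[str, dict[str, float]],
    weights: tuple[float, float, float] = (0.6, 0.2, 0.2),
) -> tuple[str, dict[str, float]]:
    """Score each setup; higher is better. Returns (best setup name, all scores)."""
    w_quality, w_latency, w_cost = weights
    # Normalize latency and cost against the worst setup so each term is in [0, 1].
    max_latency = max(s["avg_latency_ms"] for s in summaries.values()) or 1.0
    max_cost = max(s["total_cost_usd"] for s in summaries.values()) or 1.0
    scores = {}
    for name, s in summaries.items():
        scores[name] = (
            w_quality * s["pass_rate"]
            + w_latency * (1 - s["avg_latency_ms"] / max_latency)
            + w_cost * (1 - s["total_cost_usd"] / max_cost)
        )
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    summaries = {
        "baseline": {"pass_rate": 0.9, "avg_latency_ms": 200.0, "total_cost_usd": 1.0},
        "candidate": {"pass_rate": 0.5, "avg_latency_ms": 100.0, "total_cost_usd": 0.5},
    }
    best, scores = recommend(summaries)
    print(best, scores)
```

With these illustrative weights, quality dominates: a setup that is cheaper and faster but passes far fewer scenarios does not win. Different weights would express a different quality-cost trade-off.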

Q: Does running a batch cost money? A: Yes. Each batch runs real LLM calls against the configured providers. For example, a batch of 10 scenarios × 3 setups × 3 repetitions makes 90 LLM calls. Costs depend on the models selected and their token pricing.

Q: What happens when I deploy a setup? A: The Lab creates a configuration snapshot with the winning setup's prompt and model settings, then releases it to the target environment (e.g., staging or production) via the Config Service. The agent running in that environment picks up the new configuration on its next request.

Q: Can I use the Lab without the Config Service? A: Partially. The Playground editor, scenario management, and setup management all work independently. However, the Jinja2 template preview, prompt discovery (which nodes are available), and deployment to environments require the Config Service to be running.