Experiments

Run controlled experiments to compare AI configurations — models, schemas, prompts, parameters — against the same test data. Find the best setup before deploying to production.

How It Works

An experiment compares multiple variants (different configurations) against a dataset (versioned test data). Each variant runs against every item in the dataset, producing results you can compare side by side.

The workflow is: create dataset → create experiment → add variants → run → compare results.

Supported experiment types

The current release supports three experiment types: extraction (document → structured JSON), llm (freeform prompt → response), and custom (untyped — bring your own input/output shape).

Core Concepts

Variants

A variant is one specific configuration you want to test. For extraction experiments, a variant specifies a model, schema, and optionally a schema version. For LLM experiments, it specifies a model, system prompt, temperature, etc.

The experiment type determines what configuration fields are available. All configuration is stored as JSON — the experiment engine is type-agnostic.

Runs

When you start an experiment, the engine creates one run per variant. Each run processes every item in the linked dataset through the variant's configuration and stores the results. Runs execute asynchronously — you can poll for progress.

Results

Each run produces a result per dataset item, containing the AI output, duration, token usage, and estimated cost. Configured evaluators also score each result — grounding scores measure how well extracted values match the source document, and exact-match scores compare against ground-truth expectedOutput when provided.

Creating an Experiment

Create an experiment within your organization, optionally linking it to a dataset version:

Request

// POST /v1/experiments
{
  "name": "Invoice Model Comparison",
  "description": "Compare GPT-4 vs Claude on invoices",
  "type": "extraction",
  "datasetVersionId": "dsv_789"
}

Response

{
  "id": "exp_456",
  "name": "Invoice Model Comparison",
  "status": "DRAFT",
  "type": "extraction",
  "datasetVersionId": "dsv_789",
  "createdAt": "2026-04-08T10:10:00Z"
}
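
A minimal sketch of building this request in Python with only the standard library. The host `https://api.example.com`, the `API_TOKEN` placeholder, and the bearer-token `Authorization` header are assumptions for illustration; this page does not specify the host or the auth scheme. The request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical values: the real host and auth scheme are not given on this page.
BASE_URL = "https://api.example.com"
API_TOKEN = "YOUR_API_TOKEN"


def build_create_experiment_request(name, description, exp_type, dataset_version_id=None):
    """Build (but do not send) a POST /v1/experiments request."""
    payload = {"name": name, "description": description, "type": exp_type}
    # Linking a dataset version is optional at creation time.
    if dataset_version_id is not None:
        payload["datasetVersionId"] = dataset_version_id
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/experiments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_create_experiment_request(
    "Invoice Model Comparison",
    "Compare GPT-4 vs Claude on invoices",
    "extraction",
    dataset_version_id="dsv_789",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```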

Adding Variants

Add one or more variants to compare. Each variant specifies a taskType and a JSON configuration:

Request

// POST /v1/experiments/{experimentId}/variants
{
  "name": "GPT-4.1",
  "taskType": "extraction",
  "configuration": {
    "schemaId": "sch_789",
    "model": "gpt-4.1"
  }
}

Response

{
  "id": "var_101",
  "experimentId": "exp_456",
  "name": "GPT-4.1",
  "taskType": "extraction",
  "configuration": {
    "schemaId": "sch_789",
    "model": "gpt-4.1"
  },
  "sortOrder": 1
}
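
When several variants differ only by model, the payloads can be generated rather than written by hand. The sketch below assumes the variant name is simply the model identifier; `build_variant_payloads` is a hypothetical helper and `another-model` is a placeholder, not a model name confirmed by this page.

```python
def build_variant_payloads(task_type, shared_config, models):
    """Return one variant payload per model, all sharing the same base configuration."""
    payloads = []
    for model in models:
        config = dict(shared_config)  # copy so variants do not share one mutable dict
        config["model"] = model
        payloads.append({"name": model, "taskType": task_type, "configuration": config})
    return payloads


# "another-model" is a placeholder for whichever second model you want to compare.
variants = build_variant_payloads(
    "extraction", {"schemaId": "sch_789"}, ["gpt-4.1", "another-model"]
)
# Each payload would be POSTed to /v1/experiments/{experimentId}/variants.
```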

Running an Experiment

Start the experiment to execute all variants against the dataset:

Request

// POST /v1/experiments/{experimentId}/runs

Returns 202 Accepted. Each variant gets its own run that executes asynchronously.

Response

{
  "id": "run_201",
  "experimentId": "exp_456",
  "variantId": "var_101",
  "status": "PENDING",
  "itemsTotal": 0,
  "itemsCompleted": 0,
  "itemsFailed": 0
}

Checking Progress

Poll the runs endpoint to check status. Runs transition through: PENDING → RUNNING → COMPLETED (or FAILED).

GET /v1/experiments/{experimentId}/runs

The experiment's own status is derived from its runs — there is no separate field that can get out of sync. As runs change state, the experiment reflects the aggregate (see the lifecycle table below).
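
A polling loop for the endpoint above might look like the following sketch. It assumes the caller supplies `fetch_runs`, a function that performs the GET and returns a plain list of run objects (unwrapping any pagination envelope is left to that function). The interval and timeout defaults are arbitrary, and `sleep` and `clock` are injectable so the loop can be exercised without real waiting.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_runs(fetch_runs, interval_s=2.0, timeout_s=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until every run is COMPLETED or FAILED, then return the final run list."""
    deadline = clock() + timeout_s
    while True:
        runs = fetch_runs()
        if runs and all(run["status"] in TERMINAL_STATUSES for run in runs):
            return runs
        if clock() >= deadline:
            raise TimeoutError("runs did not finish before the timeout")
        sleep(interval_s)


# Stand-in for real HTTP calls: three successive snapshots of one run.
snapshots = iter([
    [{"id": "run_201", "status": "PENDING"}],
    [{"id": "run_201", "status": "RUNNING"}],
    [{"id": "run_201", "status": "COMPLETED"}],
])
final_runs = wait_for_runs(lambda: next(snapshots), sleep=lambda _seconds: None)
```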

Viewing Results

Retrieve results for a specific run:

GET /v1/experiments/{experimentId}/runs/{runId}/results

Response

{
  "content": [
    {
      "id": "res_301",
      "runId": "run_201",
      "datasetItemId": "dsi_101",
      "output": {
        "invoiceNumber": "INV-2026-001",
        "vendor": "Acme Corp",
        "total": 1250.00
      },
      "durationMs": 2340,
      "inputTokens": 156,
      "outputTokens": 89,
      "estimatedCost": 0.004200,
      "error": null
    }
  ]
}

When a run fails on an item (e.g. the model rejects the request or the provider returns an error), output is null and error carries a JSON object with type and message describing the failure. The run's itemsFailed counter increments for that result.
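
As a rough illustration of handling both cases on the client side, the sketch below splits a page of results into successes and failures and totals tokens and cost. The error `type` value `PROVIDER_ERROR` and the null metrics on the failed item are assumptions made for the sample data, not values documented here.

```python
def summarize_results(results):
    """Split a run's results into successes and failures and total tokens and cost."""
    succeeded = [r for r in results if r.get("error") is None]
    failed = [r for r in results if r.get("error") is not None]
    return {
        "succeeded": succeeded,
        "failed": failed,
        # Treat missing or null metrics as zero so failed items do not break the sums.
        "totalTokens": sum((r.get("inputTokens") or 0) + (r.get("outputTokens") or 0)
                           for r in results),
        "totalCost": sum(r.get("estimatedCost") or 0.0 for r in results),
    }


sample_results = [
    {"id": "res_301", "datasetItemId": "dsi_101", "output": {"total": 1250.00},
     "inputTokens": 156, "outputTokens": 89, "estimatedCost": 0.0042, "error": None},
    # Hypothetical failed item: the error type and null metrics are illustrative only.
    {"id": "res_302", "datasetItemId": "dsi_102", "output": None,
     "inputTokens": None, "outputTokens": None, "estimatedCost": None,
     "error": {"type": "PROVIDER_ERROR", "message": "upstream request failed"}},
]
summary = summarize_results(sample_results)
```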

Comparing Results

The experiment's results page bundles a variant × item comparison matrix — the primary way to read an experiment:

Metric-only cells — each cell shows the duration, total tokens, and cost for that variant on that item. Per-row winners (fastest, fewest tokens, cheapest) are highlighted so you can scan a dataset for performance patterns at a glance.
Divergence detection — rows where variants produced different outputs are flagged with an amber indicator. A summary banner at the top counts how many rows diverge; rows where all variants agree stay neutral.
Outlier highlighting — within a divergent row, cells that don't match the row's majority answer get a subtle amber tint. If there's no clear majority, every cell is flagged.
Aggregate row — averaged duration, averaged token count, and total cost per variant across the whole dataset, with overall winners highlighted the same way.
Side-by-side diff — clicking a row opens a diff view comparing the outputs of two variants. With two variants the pair is fixed; with three or more, dropdowns let you pick which pair to compare.
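
The per-row winners and the aggregate row can be approximated with a few lines of Python. This is a sketch under assumptions: the cell field names (`durationMs`, `totalTokens`, `cost`) are illustrative, lower is treated as better for every metric, and ties go to the first variant encountered, which may differ from how the UI breaks ties.

```python
METRICS = ("durationMs", "totalTokens", "cost")


def row_winners(cells):
    """cells maps variant name -> metric dict for one dataset item."""
    return {metric: min(cells, key=lambda variant: cells[variant][metric])
            for metric in METRICS}


def aggregate_row(matrix):
    """matrix maps datasetItemId -> cells. Returns per-variant averages and total cost."""
    per_variant = {}
    for cells in matrix.values():
        for variant, cell in cells.items():
            per_variant.setdefault(variant, []).append(cell)
    return {
        variant: {
            "avgDurationMs": sum(c["durationMs"] for c in items) / len(items),
            "avgTotalTokens": sum(c["totalTokens"] for c in items) / len(items),
            "totalCost": sum(c["cost"] for c in items),
        }
        for variant, items in per_variant.items()
    }


matrix = {
    "dsi_101": {"A": {"durationMs": 2000, "totalTokens": 200, "cost": 0.004},
                "B": {"durationMs": 1000, "totalTokens": 300, "cost": 0.002}},
    "dsi_102": {"A": {"durationMs": 3000, "totalTokens": 400, "cost": 0.006},
                "B": {"durationMs": 1500, "totalTokens": 500, "cost": 0.003}},
}
winners = row_winners(matrix["dsi_101"])
totals = aggregate_row(matrix)
```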

All of this is built on the public GET …/runs/{runId}/results endpoint — if you're building your own tooling, join on datasetItemId across the results of all completed runs for an experiment and you can reproduce the same views.
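
The join described above, plus divergence and outlier flagging, can be sketched as follows. Two assumptions are worth stating: outputs are compared by exact JSON equality after sorting keys, and "clear majority" is read as one answer being strictly more common than any other. The product may apply different rules.

```python
import json
from collections import Counter


def build_output_matrix(results_by_variant):
    """Join results on datasetItemId: {datasetItemId: {variant: output}}."""
    matrix = {}
    for variant, results in results_by_variant.items():
        for result in results:
            matrix.setdefault(result["datasetItemId"], {})[variant] = result["output"]
    return matrix


def canonical(output):
    # Serialize with sorted keys so key order does not count as a difference.
    return json.dumps(output, sort_keys=True)


def analyze_row(outputs_by_variant):
    """Return (diverges, outlier_variants) for one dataset item."""
    keys = {variant: canonical(out) for variant, out in outputs_by_variant.items()}
    counts = Counter(keys.values())
    if len(counts) <= 1:
        return False, []  # every variant produced the same output
    top = counts.most_common(2)
    if top[0][1] == top[1][1]:
        return True, sorted(keys)  # no clear majority: flag every cell
    majority = top[0][0]
    return True, sorted(v for v, k in keys.items() if k != majority)


results_by_variant = {
    "A": [{"datasetItemId": "dsi_101", "output": {"total": 1250.0}},
          {"datasetItemId": "dsi_102", "output": {"total": 10}}],
    "B": [{"datasetItemId": "dsi_101", "output": {"total": 1250.0}},
          {"datasetItemId": "dsi_102", "output": {"total": 11}}],
    "C": [{"datasetItemId": "dsi_101", "output": {"total": 1250.0}},
          {"datasetItemId": "dsi_102", "output": {"total": 10}}],
}
output_matrix = build_output_matrix(results_by_variant)
```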

Regression Detection

Iterating on a pipeline means tweaking one thing and hoping nothing else quietly broke — a prompt change that improves invoices might break receipts, a model upgrade that helps accuracy might make latency regress. Regression detection compares any run against a reference and tells you exactly which dataset items improved, which regressed, and by how much.

Score-driven, modality-agnostic

Regression detection operates on evaluator scores, not raw outputs. As long as your variants have evaluators attached, it works the same for extraction, LLM, and custom experiment types.

Baselines

The reference a run is compared against is called the baseline. Baselines are scoped per variant — each variant compares its own runs against its own reference — because re-running a single variant (see POST …/runs?variantId=…) is a first-class operation, and forcing a cross-variant baseline would fight that workflow.

Every regression comparison resolves a baseline in this order:

1. Explicit: a baselineRunId query parameter on the request. Overrides everything.
2. Marked baseline: if any completed run on the same variant has been flagged via PUT …/runs/{runId}/baseline, it's used.
3. Prior run (auto): the most recent older completed run on the same variant. Filtered strictly by createdAt so looking back from an older run never picks a newer one.
4. None: if nothing qualifies, no report is produced. The UI shows "No prior run" in place of counters.
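
The resolution order above can be sketched as a small function. Only `MARKED_BASELINE` appears as a `baselineSource` value on this page; the labels `EXPLICIT` and `PRIOR_RUN` are hypothetical. The sketch also assumes `createdAt` values are ISO 8601 UTC strings in one consistent format (so string comparison matches time order), that the current run is never its own baseline, and that an unknown explicit ID yields no baseline, where the real API may return an error instead.

```python
def resolve_baseline(current_run, variant_runs, baseline_run_id=None):
    """Return (baseline_run, source), or (None, None) when nothing qualifies.

    variant_runs: all runs on the same variant, as dicts with id, status,
    createdAt and an optional baseline flag.
    """
    completed = [r for r in variant_runs
                 if r["status"] == "COMPLETED" and r["id"] != current_run["id"]]

    # 1. An explicit baselineRunId overrides everything.
    if baseline_run_id is not None:
        for run in variant_runs:
            if run["id"] == baseline_run_id:
                return run, "EXPLICIT"
        return None, None

    # 2. A completed run flagged as the variant's baseline.
    for run in completed:
        if run.get("baseline"):
            return run, "MARKED_BASELINE"

    # 3. The most recent completed run strictly older than the current one.
    older = [r for r in completed if r["createdAt"] < current_run["createdAt"]]
    if older:
        return max(older, key=lambda r: r["createdAt"]), "PRIOR_RUN"

    # 4. Nothing qualifies: no report.
    return None, None


runs = [
    {"id": "run_1", "status": "COMPLETED", "createdAt": "2026-04-10T09:00:00Z"},
    {"id": "run_2", "status": "COMPLETED", "createdAt": "2026-04-12T09:00:00Z"},
    {"id": "run_3", "status": "COMPLETED", "createdAt": "2026-04-14T09:00:00Z"},
]
baseline, source = resolve_baseline(runs[2], runs)
```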

The UI's regression panel makes the resolved source explicit (the small vs … label on each row) so users never have to guess what a delta number is measured against.

Mark a run as its variant's baseline:

PUT /v1/experiments/{experimentId}/runs/{runId}/baseline

Marking a run also clears any prior baseline on the same variant, and the swap is atomic: the partial unique index on experiment_run(variant_id) WHERE baseline = true guarantees at most one baseline per variant at any time.

Response

{
  "id": "run_201",
  "variantId": "var_101",
  "status": "COMPLETED",
  "baseline": true
}
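
To illustrate how a partial unique index enforces one baseline per variant, here is a sketch using SQLite from the Python standard library. It is not the service's actual schema or database: the `id` column, the index name, the integer `baseline` flag, and the two-statement update are assumptions. Only the table name, `variant_id`, and the `baseline` condition come from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment_run (
        id TEXT PRIMARY KEY,
        variant_id TEXT NOT NULL,
        baseline INTEGER NOT NULL DEFAULT 0
    );
    -- At most one row per variant may have baseline = 1.
    CREATE UNIQUE INDEX one_baseline_per_variant
        ON experiment_run(variant_id) WHERE baseline = 1;
    INSERT INTO experiment_run (id, variant_id) VALUES
        ('run_201', 'var_101'), ('run_202', 'var_101');
""")


def mark_baseline(conn, run_id):
    """Clear the variant's current baseline and flag run_id in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE experiment_run SET baseline = 0 "
            "WHERE baseline = 1 AND variant_id = "
            "(SELECT variant_id FROM experiment_run WHERE id = ?)",
            (run_id,),
        )
        conn.execute("UPDATE experiment_run SET baseline = 1 WHERE id = ?", (run_id,))


mark_baseline(conn, "run_201")
mark_baseline(conn, "run_202")  # moves the flag; run_201 is cleared first
```

Setting the flag on a second run without clearing the first violates the index and raises `sqlite3.IntegrityError`, which is the guarantee the index provides.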

The Report

Fetch a regression report for any completed run:

GET /v1/experiments/{experimentId}/runs/{runId}/regression

Optional query params:

baselineRunId — compare against an explicit run instead of the resolved default
threshold — the absolute score delta above which an item is considered IMPROVED or REGRESSED. Default 0.05.
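
A small sketch of assembling the report URL with these optional parameters. The `base_url` argument is a placeholder, since this page does not name the API host; parameters left unset are omitted so the server-side defaults apply.

```python
from urllib.parse import urlencode


def regression_url(base_url, experiment_id, run_id, baseline_run_id=None, threshold=None):
    """Build the regression report URL, including only the parameters that were set."""
    params = {}
    if baseline_run_id is not None:
        params["baselineRunId"] = baseline_run_id
    if threshold is not None:
        params["threshold"] = threshold
    url = f"{base_url}/v1/experiments/{experiment_id}/runs/{run_id}/regression"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical host and run IDs, for illustration only.
default_url = regression_url("https://api.example.com", "exp_456", "run_201")
explicit_url = regression_url("https://api.example.com", "exp_456", "run_201",
                              baseline_run_id="run_200", threshold=0.1)
```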

Response

{
  "runId": "run_current",
  "baselineRunId": "run_baseline",
  "baselineRunCreatedAt": "2026-04-14T09:14:00Z",
  "baselineSource": "MARKED_BASELINE",
  "threshold": 0.05,
  "summary": {
    "comparedItems": 12,
    "improved": 6,
    "regressed": 3,
    "unchanged": 3,
    "baselineMean": 0.725,
    "currentMean": 0.810,
    "meanDelta": 0.085,
    "netDelta": 1.020
  },
  "regressed": [
    {
      "datasetItemId": "item-5",
      "baselineScore": 0.90,
      "currentScore": 0.40,
      "delta": -0.50,
      "classification": "REGRESSED"
    }
  ],
  "improved": [ /* ... */ ]
}

For each dataset item that exists in both runs, the item's evaluator scores are averaged into a single aggregate score per run, and delta = currentScore − baselineScore. The item is then classified by comparing delta to the threshold:

Condition	Classification
`delta > +threshold`	`IMPROVED`
`delta < −threshold`	`REGRESSED`
otherwise	`UNCHANGED`

Items that exist in only one of the two runs (e.g., the dataset version changed) are excluded from the comparison — there's no sensible numeric delta to report.
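
The classification and summary can be reproduced client-side with a sketch like the one below. It takes one averaged score per dataset item for each run. Treating `netDelta` as the sum of per-item deltas is an assumption inferred from the sample response (12 items × 0.085 = 1.020); the function names and the flat `items` list are illustrative and do not mirror the response shape exactly.

```python
def classify(delta, threshold=0.05):
    if delta > threshold:
        return "IMPROVED"
    if delta < -threshold:
        return "REGRESSED"
    return "UNCHANGED"


def regression_report(baseline_scores, current_scores, threshold=0.05):
    """Both arguments map datasetItemId -> that run's averaged evaluator score."""
    # Only items present in both runs are compared.
    shared = sorted(baseline_scores.keys() & current_scores.keys())
    items = []
    for item_id in shared:
        delta = current_scores[item_id] - baseline_scores[item_id]
        items.append({
            "datasetItemId": item_id,
            "baselineScore": baseline_scores[item_id],
            "currentScore": current_scores[item_id],
            "delta": delta,
            "classification": classify(delta, threshold),
        })
    n = len(items)
    net_delta = sum(i["delta"] for i in items)  # assumed definition of netDelta
    summary = {
        "comparedItems": n,
        "improved": sum(i["classification"] == "IMPROVED" for i in items),
        "regressed": sum(i["classification"] == "REGRESSED" for i in items),
        "unchanged": sum(i["classification"] == "UNCHANGED" for i in items),
        "baselineMean": sum(i["baselineScore"] for i in items) / n if n else None,
        "currentMean": sum(i["currentScore"] for i in items) / n if n else None,
        "meanDelta": net_delta / n if n else None,
        "netDelta": net_delta,
    }
    return {"threshold": threshold, "summary": summary, "items": items}


report = regression_report(
    {"item-1": 0.70, "item-2": 0.80, "item-5": 0.90, "item-9": 0.50},
    {"item-1": 0.90, "item-2": 0.82, "item-5": 0.40, "item-10": 0.60},
)
```

In this example `item-9` and `item-10` each appear in only one run, so three items are compared. Because the comparison uses floating-point arithmetic, deltas that land exactly on the threshold may classify either way unless scores are rounded first.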

Reading the UI

The experiment detail page surfaces regression inline in the Runs table: every completed non-baseline row shows three counters (N improved · N regressed · N unchanged) plus a "Details ›" affordance. Clicking opens a side panel with:

A verdict bar: stacked proportional segments for improved, regressed, and unchanged items (emerald / destructive / muted), so the shape of the change is visible at a glance.
A hero Mean Δ — the single most telling number, toned by sign.
Supporting metrics — Net Δ, baseline mean, current mean.
Regressed and Improved sections — each item on its own line with a dumbbell plot (baseline dot → current dot on a 0–1 axis) and the signed delta.

Clicking any item row deep-links into the experiment's Results page with the dataset item pre-selected, so you can see the actual input and per-variant outputs side-by-side in the existing diff view.

Experiment Lifecycle

The experiment's status field is derived from its runs at read time, so it can't drift out of sync with the runs themselves:

Status	Meaning
`DRAFT`	No runs yet
`RUNNING`	At least one run is `PENDING` or `RUNNING`
`COMPLETED`	All runs finished successfully
`PARTIAL_SUCCESS`	At least one run succeeded and at least one failed
`FAILED`	All runs failed
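
The table above translates directly into a derivation function. This is a sketch of the rule as documented, not the service's implementation; it assumes runs only ever report the four statuses listed under Checking Progress.

```python
def derive_experiment_status(run_statuses):
    """Derive the experiment status from the statuses of its runs."""
    statuses = list(run_statuses)
    if not statuses:
        return "DRAFT"  # no runs yet
    if any(s in ("PENDING", "RUNNING") for s in statuses):
        return "RUNNING"  # at least one run is still in flight
    if all(s == "COMPLETED" for s in statuses):
        return "COMPLETED"
    if all(s == "FAILED" for s in statuses):
        return "FAILED"
    return "PARTIAL_SUCCESS"  # a mix of COMPLETED and FAILED runs


status = derive_experiment_status(["COMPLETED", "FAILED"])
```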

API Reference

Method	Endpoint	Description
`POST`	`/v1/experiments`	Create an experiment
`GET`	`/v1/experiments`	List experiments (paginated)
`GET`	`/v1/experiments/{id}`	Get an experiment
`PUT`	`/v1/experiments/{id}`	Update an experiment
`DELETE`	`/v1/experiments/{id}`	Delete an experiment
`POST`	`.../experiments/{id}/variants`	Add a variant
`PUT`	`.../experiments/{id}/variants/{variantId}`	Update a variant
`DELETE`	`.../experiments/{id}/variants/{variantId}`	Delete a variant
`GET`	`.../experiments/{id}/variants`	List variants
`POST`	`.../experiments/{id}/runs`	Start experiment runs
`GET`	`.../experiments/{id}/runs`	List runs (paginated)
`GET`	`.../experiments/{id}/runs/{runId}`	Get run details
`GET`	`.../experiments/{id}/runs/{runId}/results`	List run results (paginated)
`PUT`	`.../experiments/{id}/runs/{runId}/baseline`	Mark run as its variant's baseline
`DELETE`	`.../experiments/{id}/runs/{runId}/baseline`	Clear the baseline flag
`GET`	`.../experiments/{id}/runs/{runId}/regression`	Get regression report for a run