Evaluators

Evaluators turn raw experiment outputs into comparable scores. Instead of eyeballing which variant's output looks better, you get a number per item and an average per variant — so you can pick a winner with confidence.

Why Evaluators

Without evaluators, comparing variants means reading their outputs side by side and forming a subjective impression. That works for two or three items, but falls apart fast:

Scale: Can you really judge 100 outputs fairly?
Consistency: Your opinion drifts as you get tired.
Reproducibility: A second reviewer may weigh the same outputs differently and reach a different verdict.

An evaluator applies the same scoring rule to every item, every time. The comparison matrix shows who scored highest, where variants disagree, and which items are hard for every model.

How They Work

When an experiment run finishes an item, each configured evaluator inspects the output and produces a score between 0.0 and 1.0 plus a label:

Label	Meaning
PASS	Output meets quality threshold (typically ≥ 0.8)
PARTIAL	Partially correct (typically 0.5 up to, but not including, 0.8)
FAIL	Below the PARTIAL range (typically < 0.5)
SKIP	Evaluator can't score this item (e.g. missing ground truth)
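
As a rough illustration of how a score maps to a label, here is a minimal sketch. The cutoffs 0.8 and 0.5 are the "typical" values from the table, not guaranteed settings, and the function name label_for is hypothetical rather than part of the product API.

```python
from typing import Optional

# Assumed cutoffs, taken from the "typically" values in the label table.
PASS_THRESHOLD = 0.8
PARTIAL_THRESHOLD = 0.5


def label_for(score: Optional[float]) -> str:
    """Map a 0.0-1.0 score to a label; None means the evaluator could not score."""
    if score is None:
        return "SKIP"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= PASS_THRESHOLD:
        return "PASS"
    if score >= PARTIAL_THRESHOLD:
        return "PARTIAL"
    return "FAIL"
```

Under these assumed cutoffs, a score of exactly 0.8 counts as PASS and a score of exactly 0.5 counts as PARTIAL.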

Scores appear as columns in the experiment results matrix — one column per evaluator, one row per dataset item. The best score in each row is highlighted, and variant averages are shown at the bottom.

Available Evaluators

Grounding

Checks whether the values in the extracted output actually appear in the source document. Useful for extraction experiments where hallucinations are the main risk.

A grounding score of 1.0 means every extracted value was found verbatim (or with minor normalization) in the input. A score below 1.0 means some extracted values could not be found in the source, which usually indicates the model invented them.

No ground truth needed

Grounding doesn't require an expected output in your dataset — it only needs the source document. That makes it cheap to add to any extraction experiment.
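
To make the idea concrete, the sketch below scores grounding as the fraction of extracted values that appear in the source text. It is an approximation under stated assumptions, not the actual evaluator: the normalization (lowercasing, dropping commas, collapsing whitespace) and the names normalize and grounding_score are hypothetical.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop commas, collapse whitespace."""
    text = text.lower().replace(",", "")
    return re.sub(r"\s+", " ", text).strip()


def grounding_score(output: dict, source: str) -> Optional[float]:
    """Fraction of non-null output values found in the source; None if nothing to check."""
    values = [str(v) for v in output.values() if v is not None]
    if not values:
        return None  # nothing extracted, so there is nothing to ground
    haystack = normalize(source)
    found = sum(1 for v in values if normalize(v) in haystack)
    return found / len(values)


source = "Invoice from Acme Corp. Total due: 154.7 USD"
print(grounding_score({"total": 154.7, "vendor": "Acme"}, source))    # 1.0
print(grounding_score({"total": 154.7, "vendor": "Globex"}, source))  # 0.5
```

Plain substring matching is deliberately simple here. It can over-credit short values (for example, "1" matches inside "100"), so a production check would likely match on token or field boundaries.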

Use when: You're testing extraction models and want to detect hallucination. Grounding is selected by default for extraction experiments.

Exact Match

Compares the output to the expectedOutput you've defined on each dataset item, field by field. Score = fraction of expected fields that match exactly.

A score of 1.0 means every expected field is present with the expected value. 0.5 means half the fields match. SKIP means the dataset item doesn't have an expectedOutput to compare against.
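
The scoring rule above can be sketched in a few lines. This is an illustration of the stated formula (matching expected fields divided by total expected fields), with a hypothetical function name; how the real evaluator handles nested fields or type coercion is not covered here.

```python
from typing import Optional


def exact_match_score(output: dict, expected: Optional[dict]) -> Optional[float]:
    """Fraction of expected fields present in output with an equal value.

    Returns None (treated as SKIP) when the item has no expectedOutput.
    Extra fields in the output are ignored, since only expected fields count.
    """
    if not expected:
        return None
    matched = sum(
        1 for field, value in expected.items()
        if field in output and output[field] == value
    )
    return matched / len(expected)


expected = {"total": 154.7, "vendor": "Acme"}
print(exact_match_score({"total": 154.7, "vendor": "ACME"}, expected))  # 0.5
print(exact_match_score({"total": 154.7, "vendor": "Acme"}, None))      # None
```

Note that the comparison is strict: "ACME" does not match "Acme", which is why the first call scores 0.5.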

Use when: Your dataset has gold-standard outputs and you want strict accuracy measurement.

Selecting Evaluators

When you create an experiment, the form includes an Evaluators multi-select where you choose which ones to run. For extraction experiments, Grounding is selected by default — you can add Exact Match or deselect Grounding as needed.

You can also set evaluators via the API when creating or updating an experiment, by including them in the metadata field:

Request

// POST /v1/experiments
{
  "name": "Invoice extraction — GPT-4.1 vs GPT-5.1",
  "type": "extraction",
  "datasetVersionId": "dsv_789",
  "metadata": {
    "evaluators": [
      { "evaluatorId": "grounding" },
      { "evaluatorId": "exact_match" }
    ]
  }
}

When you trigger a run, every configured evaluator scores every result — you don't need to opt in per-run.
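
For reference, the request above could be assembled from Python roughly as follows. This sketch only builds the request object and does not send it. The host api.example.com, the Bearer authorization header, and the helper name are assumptions for illustration; check your instance's base URL and authentication scheme.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your instance


def build_create_experiment_request(name, dataset_version_id, evaluator_ids, api_key):
    """Build (but do not send) a POST /v1/experiments request with evaluators in metadata."""
    payload = {
        "name": name,
        "type": "extraction",
        "datasetVersionId": dataset_version_id,
        "metadata": {
            "evaluators": [{"evaluatorId": eid} for eid in evaluator_ids],
        },
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/experiments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


req = build_create_experiment_request(
    "Invoice extraction", "dsv_789", ["grounding", "exact_match"], "YOUR_API_KEY"
)
# To send it: urllib.request.urlopen(req)
```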

Reading the Matrix

On the experiment results page, evaluator scores appear as short labels below duration, tokens, and cost in each cell:

Row 1 │  GPT-4.1        │  GPT-5.1
      │  dur  2.3s      │  dur  1.5s
      │  cost $0.0066   │  cost $0.0088
      │  GRD  1.00  ◄─  │  GRD  0.87
      │  EXA  0.75      │  EXA  1.00  ◄─

GRD, EXA: abbreviated evaluator IDs (Grounding, Exact Match)
Green highlight (shown as ◄─ in the example above): best score in the row (higher is better)
Amber row marker: variants produced different outputs (regardless of score)
Avg row (bottom): variant-level averages across all items

What Happens Next

Evaluator scores are stored alongside each result, so you can:

Compare variants objectively in the matrix — higher average wins
Find hard items — rows where every variant scores low are candidates for prompt tuning or better training data
Track regressions — re-running an experiment tells you if changes helped or hurt (relative to the previous run's scores)
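
If you export scores and want to reproduce the first two analyses yourself, a minimal sketch might look like this. The input shape (item ID mapped to per-variant scores), the 0.5 cutoff for "hard", and the choice to leave SKIP results (None) out of averages are all assumptions, not documented behavior.

```python
from statistics import mean
from typing import Dict, List, Optional

Rows = Dict[str, Dict[str, Optional[float]]]  # item id -> variant -> score (None = SKIP)


def variant_averages(rows: Rows) -> Dict[str, float]:
    """Average score per variant across all items, ignoring skipped results."""
    by_variant: Dict[str, List[float]] = {}
    for scores in rows.values():
        for variant, score in scores.items():
            if score is not None:
                by_variant.setdefault(variant, []).append(score)
    return {variant: mean(vals) for variant, vals in by_variant.items()}


def hard_items(rows: Rows, threshold: float = 0.5) -> List[str]:
    """Items where every variant was scored and every score is below the threshold."""
    return [
        item for item, scores in rows.items()
        if scores and all(s is not None and s < threshold for s in scores.values())
    ]


rows = {
    "item_1": {"GPT-4.1": 1.0, "GPT-5.1": 0.87},
    "item_2": {"GPT-4.1": 0.25, "GPT-5.1": 0.4},
}
print(variant_averages(rows))
print(hard_items(rows))  # ['item_2']
```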

API

List available evaluators

Returns the evaluators registered on your instance. Use this to build selection UIs or validate evaluator IDs before submission.

Request

GET /v1/evaluators

Response

[
  { "id": "grounding", "displayName": "Grounding" },
  { "id": "exact_match", "displayName": "Exact Match" }
]
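
One way to use this response for validation is to check requested IDs against the returned list before creating an experiment. The helper below is a hypothetical sketch that works on the already-parsed response and makes no network call.

```python
from typing import Dict, List


def unknown_evaluator_ids(requested: List[str], available: List[Dict[str, str]]) -> List[str]:
    """Return the requested IDs that are not in the GET /v1/evaluators response."""
    known = {evaluator["id"] for evaluator in available}
    return [eid for eid in requested if eid not in known]


available = [
    {"id": "grounding", "displayName": "Grounding"},
    {"id": "exact_match", "displayName": "Exact Match"},
]
print(unknown_evaluator_ids(["grounding", "llm_judge"], available))  # ['llm_judge']
```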

Read scores

Run results include a scores array — one entry per evaluator that ran against that result:

Response excerpt

{
  "id": "res_abc",
  "output": { "total": 154.7, "vendor": "Acme" },
  "scores": [
    {
      "evaluatorId": "grounding",
      "evaluatorName": "Grounding",
      "score": 0.87,
      "label": "PASS",
      "details": { /* per-field breakdown */ }
    }
  ]
}

The shape of the details field depends on the evaluator: Grounding returns a per-field breakdown, and Exact Match returns lists of matched and mismatched fields.
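
As a small illustration of consuming the scores array, the sketch below indexes a parsed result by evaluator ID. The helper name is hypothetical, and details is left out because its shape varies per evaluator.

```python
from typing import Dict, Tuple


def scores_by_evaluator(result: dict) -> Dict[str, Tuple[float, str]]:
    """Map evaluatorId -> (score, label) for one run result; empty if no scores ran."""
    return {
        entry["evaluatorId"]: (entry["score"], entry["label"])
        for entry in result.get("scores", [])
    }


result = {
    "id": "res_abc",
    "output": {"total": 154.7, "vendor": "Acme"},
    "scores": [
        {
            "evaluatorId": "grounding",
            "evaluatorName": "Grounding",
            "score": 0.87,
            "label": "PASS",
            "details": {},
        }
    ],
}
print(scores_by_evaluator(result))  # {'grounding': (0.87, 'PASS')}
```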

Coming Soon

More evaluators are on the roadmap:

LLM-as-Judge — use an LLM with a rubric prompt to score outputs qualitatively (e.g. response quality, factuality, tone)
Semantic Similarity — embedding-based comparison for freeform text outputs
JSON Schema Valid — validate output structure against a schema
Latency / Cost — score against budgets and flag outliers
Webhook — call your own scoring service

The evaluator interface is pluggable, so custom evaluators can be added without changing the experiment engine.