Evaluators
Evaluators turn raw experiment outputs into comparable scores. Instead of eyeballing which variant's output looks better, you get a number per item and an average per variant — so you can pick a winner with confidence.
Why Evaluators
Without evaluators, comparing variants means reading their outputs side by side and forming a subjective impression. That works for two or three items, but falls apart fast:
- Scale: Can you really judge 100 outputs fairly?
- Consistency: Your opinion drifts as you get tired.
- Reproducibility: A second reviewer may judge the same outputs differently than you do.
An evaluator applies the same scoring rule to every item, every time. The comparison matrix shows which variant scored highest, where variants disagree, and which items are hard for every model.
How They Work
When an experiment run finishes an item, each configured evaluator inspects the output and produces a score between 0.0 and 1.0 plus a label:
| Label | Meaning |
|---|---|
| PASS | Output meets quality threshold (typically ≥ 0.8) |
| PARTIAL | Partially correct (typically 0.5 to just under 0.8) |
| FAIL | Below the partial range (typically < 0.5) |
| SKIP | Evaluator can't score this item (e.g. missing ground truth) |
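The table above can be read as a simple mapping from score to label. The following is a minimal sketch, assuming the "typical" cutoffs of 0.8 and 0.5 from the table; the function name `label_for` and the use of `None` to represent an unscorable item are illustrative assumptions, and an individual evaluator may apply different thresholds.

```python
from typing import Optional

# Assumed cutoffs, taken from the "typical" values in the table above.
PASS_THRESHOLD = 0.8
PARTIAL_THRESHOLD = 0.5


def label_for(score: Optional[float]) -> str:
    """Map a 0.0-1.0 score to a label. None stands for 'could not score'."""
    if score is None:
        return "SKIP"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= PASS_THRESHOLD:
        return "PASS"
    if score >= PARTIAL_THRESHOLD:
        return "PARTIAL"
    return "FAIL"
```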
Scores appear as columns in the experiment results matrix — one column per evaluator, one row per dataset item. The best score in each row is highlighted, and variant averages are shown at the bottom.
Available Evaluators
Grounding
Checks whether the values in the extracted output actually appear in the source document. Useful for extraction experiments where hallucinations are the main risk.
A grounding score of 1.0 means every extracted value was found verbatim (or with minor normalization) in the input. Lower scores indicate the model invented values that aren't supported by the source.
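To make the idea concrete, here is a rough sketch of how a grounding-style score could be computed. It is not the evaluator's actual implementation: the normalization rules (lowercasing, removing thousands separators, collapsing whitespace) and the names `normalize` and `grounding_score` are assumptions chosen for illustration.

```python
import re
from typing import Any, Mapping, Optional


def normalize(text: str) -> str:
    """Lowercase, drop thousands separators, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    return re.sub(r"\s+", " ", text).strip()


def grounding_score(output: Mapping[str, Any], source: str) -> Optional[float]:
    """Fraction of non-empty extracted values that appear in the source text.

    Returns None when there is nothing to check. Formatting differences
    such as 154.7 vs 154.70 are not handled by this sketch.
    """
    values = [str(v) for v in output.values() if v is not None and str(v).strip()]
    if not values:
        return None
    haystack = normalize(source)
    found = sum(1 for v in values if normalize(v) in haystack)
    return found / len(values)
```

A value that cannot be located in the source lowers the score, which is how an invented field shows up in the results.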
No ground truth needed
Grounding doesn't require an expected output in your dataset — it only needs the source document. That makes it cheap to add to any extraction experiment.
Use when: You're testing extraction models and want to detect hallucination. Grounding is selected by default for extraction experiments.
Exact Match
Compares the output to the expectedOutput you've defined on each dataset item, field by field. Score = fraction of expected fields that match exactly.
A score of 1.0 means every expected field is present with the expected value. 0.5 means half the fields match. SKIP means the dataset item doesn't have an expectedOutput to compare against.
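The scoring rule described above can be sketched in a few lines. This is an illustrative version only: the function name `exact_match_score` is hypothetical, `None` stands in for SKIP, and the sketch compares top-level fields with plain equality, which may differ from how the real evaluator treats nested values or types.

```python
from typing import Any, Mapping, Optional


def exact_match_score(
    output: Mapping[str, Any],
    expected: Optional[Mapping[str, Any]],
) -> Optional[float]:
    """Fraction of expected fields present in the output with an equal value."""
    if not expected:
        return None  # no expectedOutput to compare against: SKIP
    matched = sum(
        1
        for field, want in expected.items()
        if field in output and output[field] == want
    )
    return matched / len(expected)
```

Fields that appear in the output but not in the expected output are ignored here, since the score is defined over expected fields.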
Use when: Your dataset has gold-standard outputs and you want strict accuracy measurement.
Selecting Evaluators
When you create an experiment, the form includes an Evaluators multi-select where you choose which ones to run. For extraction experiments, Grounding is selected by default — you can add Exact Match or deselect Grounding as needed.
You can also set evaluators via the API when creating or updating an experiment, by including them in the metadata field:
Request
```json
// POST /v1/experiments
{
  "name": "Invoice extraction — GPT-4.1 vs GPT-5.1",
  "type": "extraction",
  "datasetVersionId": "dsv_789",
  "metadata": {
    "evaluators": [
      { "evaluatorId": "grounding" },
      { "evaluatorId": "exact_match" }
    ]
  }
}
```
When you trigger a run, every configured evaluator scores every result — you don't need to opt in per-run.
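For reference, the request above could be assembled from Python roughly as follows. This is a sketch under assumptions: the base URL, the bearer-token `Authorization` header, and the helper name `build_create_experiment_request` are not specified on this page and are placeholders. Only the path and the body shape come from the example above.

```python
import json
import urllib.request
from typing import Sequence


def build_create_experiment_request(
    base_url: str,
    api_key: str,
    name: str,
    dataset_version_id: str,
    evaluator_ids: Sequence[str],
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/experiments request."""
    payload = {
        "name": name,
        "type": "extraction",
        "datasetVersionId": dataset_version_id,
        "metadata": {
            "evaluators": [{"evaluatorId": e} for e in evaluator_ids],
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/experiments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check your instance's auth scheme.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately keeps the payload easy to inspect before anything goes over the network.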
Reading the Matrix
On the experiment results page, evaluator scores appear as short labels below duration, tokens, and cost in each cell:
```
Row 1 │ GPT-4.1         │ GPT-5.1
      │ dur   2.3s      │ dur   1.5s
      │ cost  $0.0066   │ cost  $0.0088
      │ GRD   1.00 ◄─   │ GRD   0.87
      │ EXA   0.75      │ EXA   1.00 ◄─
```
- GRD, EXA: evaluator IDs, abbreviated
- Green highlight (◄─ above): best score in the row (higher is better)
- Amber row marker: variants produced different outputs (regardless of score)
- Avg row (bottom): variant-level averages across all items
What Happens Next
Evaluator scores are stored alongside each result, so you can:
- Compare variants objectively in the matrix — higher average wins
- Find hard items — rows where every variant scores low are candidates for prompt tuning or better training data
- Track regressions — re-running an experiment tells you if changes helped or hurt (relative to the previous run's scores)
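The first two uses in the list above come down to simple aggregation over stored scores. The sketch below assumes you have already collected one evaluator's scores into a nested dictionary keyed by item and then by variant, with skipped items left out. That layout, the function names, and the 0.5 cutoff for "hard" items are illustrative assumptions, not part of the product.

```python
from statistics import mean
from typing import Dict, List

# scores[item_id][variant_name] -> score from a single evaluator
Scores = Dict[str, Dict[str, float]]


def variant_averages(scores: Scores) -> Dict[str, float]:
    """Average score per variant across all items it was scored on."""
    by_variant: Dict[str, List[float]] = {}
    for row in scores.values():
        for variant, score in row.items():
            by_variant.setdefault(variant, []).append(score)
    return {variant: mean(vals) for variant, vals in by_variant.items()}


def hard_items(scores: Scores, threshold: float = 0.5) -> List[str]:
    """Items where even the best variant scored below the threshold."""
    return [
        item
        for item, row in scores.items()
        if row and max(row.values()) < threshold
    ]
```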
API
List available evaluators
Returns the evaluators registered on your instance. Use this to build selection UIs or validate evaluator IDs before submission.
Request
GET /v1/evaluators
Response
```json
[
  { "id": "grounding", "displayName": "Grounding" },
  { "id": "exact_match", "displayName": "Exact Match" }
]
```
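One way to use this response for validation is to check requested IDs against it before creating an experiment. The helper below is a small sketch that operates on the already-parsed JSON list; the function name `unknown_evaluator_ids` and the ID `llm_judge` in the usage line are hypothetical.

```python
from typing import Iterable, List, Mapping


def unknown_evaluator_ids(
    requested: Iterable[str],
    available: Iterable[Mapping[str, str]],
) -> List[str]:
    """Return requested IDs that are missing from the list response."""
    known = {entry["id"] for entry in available}
    return [rid for rid in requested if rid not in known]


available = [
    {"id": "grounding", "displayName": "Grounding"},
    {"id": "exact_match", "displayName": "Exact Match"},
]
missing = unknown_evaluator_ids(["grounding", "llm_judge"], available)
```

An empty result means every requested evaluator is registered on the instance.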
Read scores
Run results include a scores array — one entry per evaluator that ran against that result:
Response excerpt
```json
{
  "id": "res_abc",
  "output": { "total": 154.7, "vendor": "Acme" },
  "scores": [
    {
      "evaluatorId": "grounding",
      "evaluatorName": "Grounding",
      "score": 0.87,
      "label": "PASS",
      "details": { /* per-field breakdown */ }
    }
  ]
}
```
The shape of the details field depends on the evaluator: Grounding returns a per-field score breakdown, and Exact Match returns lists of matched and mismatched fields.
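When consuming results in code, it is often convenient to index the scores array by evaluator ID. The sketch below works on a parsed result object shaped like the excerpt above; the function name `scores_by_evaluator` is an assumption, and `details` is deliberately left out because its shape varies per evaluator.

```python
from typing import Any, Dict, Mapping


def scores_by_evaluator(result: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Index a result's scores array by evaluatorId, keeping score and label."""
    return {
        entry["evaluatorId"]: {
            "score": entry.get("score"),
            "label": entry.get("label"),
        }
        for entry in result.get("scores", [])
    }


result = {
    "id": "res_abc",
    "output": {"total": 154.7, "vendor": "Acme"},
    "scores": [
        {
            "evaluatorId": "grounding",
            "evaluatorName": "Grounding",
            "score": 0.87,
            "label": "PASS",
            "details": {},
        }
    ],
}
indexed = scores_by_evaluator(result)
```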
Coming Soon
More evaluators are on the roadmap:
- LLM-as-Judge — use an LLM with a rubric prompt to score outputs qualitatively (e.g. response quality, factuality, tone)
- Semantic Similarity — embedding-based comparison for freeform text outputs
- JSON Schema Valid — validate output structure against a schema
- Latency / Cost — score against budgets and flag outliers
- Webhook — call your own scoring service
The evaluator interface is pluggable, so custom evaluators can be added without changing the experiment engine.