Data Extraction

Define a schema once, then extract structured data from any text, consistently, every time.

How It Works

2kw.ai takes a standard JSON Schema definition and uses it to extract structured output from unstructured text. You define the fields, types, and constraints — 2kw.ai returns validated, schema-conformant JSON.

The workflow is: define schema → commit version → send text → receive structured output.

Schemas use JSON Schema

If you've used JSON Schema before, you already know the format. If not, it is simply a way to describe which fields you expect, their types, and which ones are required.

Schema Management

Before you can extract anything, you need a schema. Schemas are organization-scoped and support full version control.

Creating a Schema

Schemas have a name, optional description, and a JSON Schema definition:

{
  "type": "object",
  "properties": {
    "company": { "type": "string" },
    "invoice_number": { "type": "string" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "price": { "type": "number" }
        }
      }
    }
  },
  "required": ["company", "invoice_number"]
}

Version Control

Every time you change your schema, you create a new version. Versions are immutable, which gives you a full audit trail of every change.

You can:

Commit new versions with a change description
Deactivate versions you don't want used anymore
Re-activate a historical version (creates a new version as a copy, preserving the audit trail)
Pin extractions to a specific version for reproducibility

Each new version auto-increments the version number and updates the latest label.

Re-activation creates a copy

Re-activating version 3 doesn't revert the schema to it. Instead, it creates version N+1 (where N is the current highest version number) with the same content. The original stays untouched.
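To make the copy-on-reactivate behavior concrete, here is a minimal in-memory sketch. The `SchemaVersion` and `SchemaHistory` classes are hypothetical names used for illustration only; they are not part of the 2kw.ai API and say nothing about how the service stores versions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaVersion:
    """One immutable version (hypothetical model, not an API type)."""
    number: int
    json_schema: str          # serialized JSON Schema
    change_description: str


@dataclass
class SchemaHistory:
    versions: list = field(default_factory=list)

    def commit(self, json_schema: str, change_description: str) -> SchemaVersion:
        # Version numbers auto-increment; earlier versions are never rewritten.
        version = SchemaVersion(len(self.versions) + 1, json_schema, change_description)
        self.versions.append(version)
        return version

    def reactivate(self, number: int) -> SchemaVersion:
        # Re-activation copies the old content into a brand-new version.
        old = self.versions[number - 1]
        return self.commit(old.json_schema, f"Re-activate v{number}")

    @property
    def latest(self) -> SchemaVersion:
        return self.versions[-1]


history = SchemaHistory()
for i in range(1, 6):
    history.commit(f'{{"title": "draft {i}"}}', f"change {i}")

copy = history.reactivate(3)
print(copy.number)                                        # 6
print(copy.json_schema == history.versions[2].json_schema)  # True
```

Version 3 is still present and unchanged after the call, and `latest` now points at version 6.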

Labels

Labels are named pointers to specific schema versions. They decouple your application code from version numbers, enabling controlled rollouts.

How Labels Work

Every schema gets a latest label automatically, updated on each new version. You create custom labels for deployment stages:

Label	Points to	Purpose
`latest`	v5 (auto)	Always the newest version
`production`	v3	What your API consumers use
`staging`	v5	What you're testing

Your application resolves schemas by ID and label. When you're ready to promote a version, just move the label pointer.

Managing Labels

Create a label:

`POST /api/v1/schemas/{schemaId}/labels`

{
  "name": "production",
  "schemaVersionId": "version-uuid"
}

Update a label to point to a different version:

`PUT /api/v1/schemas/{schemaId}/labels/{labelName}`

{
  "schemaVersionId": "new-version-uuid"
}

Label names must be lowercase alphanumeric with hyphens (e.g., production, staging, v2-rollback). The system latest label cannot be deleted.
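If you want to reject bad label names before calling the API, a client-side check could look like the sketch below. The regular expression is an assumption derived from the sentence above (lowercase letters and digits, separated by single hyphens); the server's exact rule may differ, so treat a server-side rejection as authoritative.

```python
import re

# Assumed pattern: lowercase alphanumeric groups joined by single hyphens.
LABEL_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_label_name(name: str) -> bool:
    """Return True if the name fits the assumed label-name pattern."""
    return LABEL_NAME.fullmatch(name) is not None


def can_delete_label(name: str) -> bool:
    """The system label `latest` cannot be deleted."""
    return name != "latest"


for candidate in ["production", "v2-rollback", "Production", "my_label"]:
    print(candidate, is_valid_label_name(candidate))
```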

Resolving Schemas

`GET /api/v1/schemas/{schemaId}/resolve`

Resolve a schema by ID and optional label to get the JSON Schema definition for a specific version:

Request:

curl "https://api.2kw.ai/v1/schemas/{schemaId}/resolve?label=production" \
  -H "Authorization: Bearer sk_your_api_key"

Parameter	Type	Required	Description
`label`	string	No	Label to resolve (defaults to `latest`)

Response:

{
  "schemaId": "uuid",
  "schemaName": "invoice-schema",
  "versionId": "uuid",
  "versionNumber": 3,
  "jsonSchema": {
    "type": "object",
    "properties": {
      "company": { "type": "string" },
      "total": { "type": "number" }
    }
  },
  "label": "production"
}
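The same resolve call can be prepared from Python with the standard library. This is a sketch: `build_resolve_request` is a hypothetical helper, and the base URL is copied from the curl example above. The function only builds the request object; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.2kw.ai/v1"  # host and prefix taken from the curl example


def build_resolve_request(schema_id, api_key, label=None):
    """Build a GET request for the resolve endpoint (hypothetical helper)."""
    url = f"{BASE_URL}/schemas/{quote(schema_id, safe='')}/resolve"
    if label is not None:
        # Omitting the label lets the API fall back to `latest`.
        url += "?" + urlencode({"label": label})
    return Request(url, headers={"Authorization": f"Bearer {api_key}"}, method="GET")


request = build_resolve_request("your-schema-id", "sk_your_api_key", label="production")
print(request.get_method(), request.full_url)
```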

Validate Before Committing

`POST /api/v1/schemas/{schemaId}/validate`

Check whether a schema is valid before you commit it. The response contains structured errors and warnings. Send the candidate schema in the request body:

{
  "jsonSchema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" }
    }
  }
}

Test Against Sample Text

`POST /api/v1/schemas/{schemaId}/test`

Run an extraction against sample text without persisting the result. Great for iterating on your schema:

Request:

curl -X POST https://api.2kw.ai/v1/schemas/{schemaId}/test \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{
    "jsonSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "email": { "type": "string" }
      }
    },
    "sampleText": "Contact Jane Smith at [email protected]",
    "model": "gpt-5.1"
  }'

The response includes token usage and processing time so you can estimate costs:

{
  "success": true,
  "extractedData": { "name": "Jane Smith", "email": "[email protected]" },
  "inputTokens": 85,
  "outputTokens": 22,
  "processingDurationMs": 890,
  "modelUsed": "gpt-5.1"
}

Running Extractions

Synchronous

`POST /api/v1/extractions`

The standard extraction endpoint. Send text, get structured data back immediately.

Request body:

Field	Type	Required	Description
`schemaId`	string	Yes	Schema to use for extraction
`schemaVersionId`	string	No	Pin to a specific version (defaults to latest active)
`inputText`	string	Conditional	Text to extract data from (required if no `inputImages`)
`inputImages`	array	Conditional	Base64-encoded images to extract from, max 10 (required if no `inputText`)
`model`	string	Yes	Platform model name (e.g., `gpt-5.1`) or `provider/model` for BYOK

Request:

curl -X POST https://api.2kw.ai/v1/extractions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{
    "schemaId": "your-schema-id",
    "inputText": "Invoice #INV-2024-001 from Acme Corp. Total: $1,250.00",
    "model": "gpt-5.1"
  }'

Extracting from Images

You can also extract structured data from images (up to 10 per request). Each image must be base64-encoded with its MIME type:

Image extraction

curl -X POST https://api.2kw.ai/v1/extractions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{
    "schemaId": "your-schema-id",
    "inputImages": [
      {
        "data": "iVBORw0KGgoAAAANSUhEUg...",
        "mimeType": "image/png"
      }
    ],
    "model": "gpt-5.1"
  }'

Supported image MIME types: image/png, image/jpeg, image/gif, image/webp. You can combine inputText and inputImages in a single request.
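The request-body rules above (text or images required, at most 10 images, four supported MIME types) can be checked client-side before sending. The helpers below are a hypothetical sketch, not an official client; the field names come from the request table.

```python
import base64
import json

SUPPORTED_MIME_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}
MAX_IMAGES = 10


def encode_image(raw_bytes, mime_type):
    """Turn raw image bytes into one `inputImages` entry."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    return {"data": base64.b64encode(raw_bytes).decode("ascii"), "mimeType": mime_type}


def build_extraction_body(schema_id, model, input_text=None, images=None,
                          schema_version_id=None):
    """Assemble the JSON body for POST /api/v1/extractions."""
    images = list(images or [])
    if not input_text and not images:
        raise ValueError("provide inputText, inputImages, or both")
    if len(images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per request")
    body = {"schemaId": schema_id, "model": model}
    if schema_version_id:
        body["schemaVersionId"] = schema_version_id
    if input_text:
        body["inputText"] = input_text
    if images:
        body["inputImages"] = images
    return body


png_header = b"\x89PNG\r\n\x1a\n"  # first bytes of any PNG file, used as stand-in data
body = build_extraction_body(
    "your-schema-id",
    "gpt-5.1",
    input_text="Invoice #INV-2024-001 from Acme Corp.",
    images=[encode_image(png_header, "image/png")],
)
print(json.dumps(body, indent=2))
```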

Asynchronous

`POST /api/v1/extractions/async`

For large inputs, submit an extraction for background processing. You get back a 202 Accepted with a Location header to poll:

HTTP/1.1 202 Accepted
Location: /api/v1/extractions/{id}
Retry-After: 5

Poll the Location URL until status changes from PENDING/PROCESSING to COMPLETED or FAILED.
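A polling loop for this flow might look like the sketch below. The transport is injected as a `fetch` callable so the loop itself stays testable; `fetch`, its return shape, and the 5-second fallback (taken from the `Retry-After: 5` example above) are assumptions, not part of the API contract.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_extraction(fetch, location, max_attempts=60, sleep=time.sleep):
    """Poll until the extraction reaches a terminal status.

    `fetch(location)` is a caller-supplied function (hypothetical) that performs
    the GET and returns `(body_dict, retry_after_seconds_or_None)`.
    """
    for _ in range(max_attempts):
        body, retry_after = fetch(location)
        if body["status"] in TERMINAL_STATUSES:
            return body
        # Fall back to 5 seconds when the response carries no Retry-After value.
        sleep(retry_after if retry_after is not None else 5)
    raise TimeoutError(f"extraction at {location} did not finish in time")


# Demo with a stubbed fetch and a recording sleep, so nothing touches the network.
stub_responses = iter([
    ({"status": "PENDING"}, 5),
    ({"status": "PROCESSING"}, None),
    ({"status": "COMPLETED"}, None),
])
recorded_waits = []
final = poll_extraction(lambda _loc: next(stub_responses),
                        "/api/v1/extractions/{id}",
                        sleep=recorded_waits.append)
print(final["status"], recorded_waits)
```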

Estimate Tokens First

`POST /api/v1/extractions/estimate`

Before running an extraction, you can estimate its token usage, and from that its cost. Send the same schema and input text you'd use for a real extraction:

Request body:

Field	Type	Required	Description
`schemaId`	string	Yes	Schema to estimate for
`schemaVersionId`	string	No	Pin to a specific version (defaults to latest active)
`inputText`	string	Yes	Text to estimate token usage for

Response:

{
  "inputTokens": 4200,
  "estimatedOutputTokens": 350,
  "strategy": "SINGLE_SHOT"
}

The strategy field tells you which processing path will be used (SINGLE_SHOT, CHUNKED, or ASYNC).

Extraction Strategies

2kw.ai automatically picks the best strategy based on input size:

Strategy	Token Range	What Happens
Single-shot	Under 50K tokens	Entire input processed in one LLM call
Chunked	50K - 100K tokens	Input split into overlapping chunks, results merged and validated
Async	Over 100K tokens	Background processing with polling
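Read as code, the table amounts to a simple threshold check. This sketch is only a reading of the table: the table does not say which side the exact boundaries (50,000 and 100,000 tokens) fall on, so the choice below is an assumption, and the service may weigh other factors as well.

```python
def pick_strategy(input_tokens: int) -> str:
    """Map an input token count to a strategy name, per the table's ranges."""
    if input_tokens < 50_000:
        return "SINGLE_SHOT"
    if input_tokens <= 100_000:   # assumption: both boundaries count as chunked
        return "CHUNKED"
    return "ASYNC"


for tokens in (4_200, 50_000, 100_000, 250_000):
    print(tokens, pick_strategy(tokens))
```

In practice, prefer the `strategy` value returned by the estimate endpoint over a local guess.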

Chunking details

Each chunk holds at most 8,000 tokens and overlaps its neighbor by 200 tokens. Each chunk receives context about what was already extracted from previous chunks (carry-forward), so the model can extend partial items rather than re-extract them. Results are automatically merged, deduplicated, and validated.
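To see how the chunk size and overlap interact, the sketch below computes token windows for a given input length. It assumes a plain sliding window over token offsets; the real splitter may also respect sentence or entity boundaries, which this does not model.

```python
def chunk_windows(total_tokens, max_tokens=8_000, overlap=200):
    """Return (start, end) token offsets for overlapping chunks."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than the chunk size")
    windows = []
    start = 0
    while True:
        end = min(start + max_tokens, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            return windows
        # The next chunk re-reads the last `overlap` tokens of this one.
        start = end - overlap


print(chunk_windows(12_500))  # [(0, 8000), (7800, 12500)]
print(chunk_windows(20_000))  # [(0, 8000), (7800, 15800), (15600, 20000)]
```

The 200 shared tokens between neighboring windows are where duplicate or partial items come from, which is why the merge and validation steps exist.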

Quality Scoring

Every extraction result includes quality scores that help you assess how trustworthy the extracted data is. These scores are purely informational — structuredOutput is never modified.

Grounding vs Confidence

2kw.ai distinguishes between two types of quality signals:

Grounding measures whether an extracted value has evidence in the source text. If your schema extracts a sourceFile field and the value "gear.geo" appears in the input, that's grounded. If "phantom.geo" doesn't appear anywhere in the input, it was likely hallucinated by the model. Grounding is deterministic, free (no extra LLM calls), and available on every extraction.

Confidence measures how certain the model is about a value — regardless of whether it appears in the source text. A derived value like widthMm: 150.5 (computed from bounding box coordinates) might not appear verbatim in the source, but the model could be highly confident about it. Confidence requires additional model calls and is a planned future feature.

Currently, 2kw.ai provides grounding scores. The API is designed so confidence scores can be added later without breaking existing integrations — the scoring.methods array tells you which scoring methods were applied.

Scores never modify output

Grounding scores are metadata only. structuredOutput always contains the complete extraction result, exactly as the model produced it (after deduplication for chunked extractions). Your application decides how to use the scores — flag low-scoring items, highlight them in the UI, or filter them client-side.

How Grounding Works

For each extracted field, 2kw.ai searches the original input text for the extracted value:

Basis	Score	Meaning
`exact_match`	1.0	Value found verbatim in the source text
`normalized_match`	0.9	Found after normalizing separators (underscores, hyphens, spaces)
`partial_match`	0.8	Substring or filename stem found (e.g. `gear` matches `gear.geo`)
`unverifiable`	0.5	Numbers, booleans, or strings shorter than 3 characters — can't be meaningfully searched
`not_found`	0.0 – 0.1	String value not found in source text. Key fields score 0.0 (strong hallucination signal), regular fields score 0.1.

Per-item scores are weighted averages of their fields, with key fields (identifiers like filenames, IDs, names) weighted 3x higher than regular fields. This means a hallucinated filename dominates the item score even if other fields look fine. Unverifiable fields (numbers, booleans, short strings) are excluded from this average — they don't drag the score down or inflate it.
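The scoring rules above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the service's implementation: exact matching is a plain substring test, normalization only collapses underscores, hyphens, and whitespace, and a partial match means the part of the value before its last dot appears in the source. The function names are hypothetical, and the sketch makes no attempt to reproduce the exact figures in API responses.

```python
import re

SEPARATORS = re.compile(r"[_\-\s]+")


def normalize(text):
    """Collapse underscores, hyphens, and whitespace into single spaces."""
    return SEPARATORS.sub(" ", text).strip()


def score_field(value, source_text, is_key=False):
    """Return (score, basis) for one extracted value."""
    # Numbers, booleans, nulls, and very short strings cannot be searched meaningfully.
    if not isinstance(value, str) or len(value) < 3:
        return 0.5, "unverifiable"
    if value in source_text:
        return 1.0, "exact_match"
    if normalize(value) in normalize(source_text):
        return 0.9, "normalized_match"
    stem = value.rsplit(".", 1)[0]
    if stem != value and len(stem) >= 3 and stem in source_text:
        return 0.8, "partial_match"
    return (0.0 if is_key else 0.1), "not_found"


def score_item(item, source_text, key_fields):
    """Weighted average of field scores; key fields count 3x, unverifiable ones are skipped."""
    total = 0.0
    weight_sum = 0.0
    for name, value in item.items():
        score, basis = score_field(value, source_text, name in key_fields)
        if basis == "unverifiable":
            continue
        weight = 3.0 if name in key_fields else 1.0
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else None


source = "Parts list: gear.geo in 1.4301, qty 5; shaft.geo in stainless steel."
print(score_field("gear.geo", source, is_key=True))      # (1.0, 'exact_match')
print(score_field("stainless_steel", source))            # (0.9, 'normalized_match')
print(score_field("phantom.geo", source, is_key=True))   # (0.0, 'not_found')
print(score_item({"sourceFile": "phantom.geo", "material": "1.4301", "quantity": 5},
                 source, {"sourceFile"}))                # 0.25
```

The last line shows the effect of the 3x weighting: a perfectly grounded material cannot rescue an item whose filename is not in the source.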

Extraction Response

The full extraction response includes structuredOutput, process metadata, and resultMetadata with dedup stats and grounding scores:

{
  "id": "extraction-uuid",
  "status": "COMPLETED",
  "strategy": "CHUNKED",
  "result": {
    "structuredOutput": {
      "company": "ACME GmbH",
      "parts": [
        { "sourceFile": "gear.geo", "material": "1.4301", "quantity": 5 },
        { "sourceFile": "shaft.geo", "material": "Steel", "quantity": 2 }
      ]
    },
    "metadata": {
      "inputTokens": 12500,
      "outputTokens": 340,
      "cost": 0.135,
      "durationMs": 4200,
      "model": "gpt-5.1",
      "chunksProcessed": 2
    },
    "resultMetadata": {
      "dedup": {
        "applied": true,
        "itemsRemoved": 1,
        "itemsMerged": 0,
        "resequenced": true,
        "originalItemCount": 3,
        "finalItemCount": 2
      },
      "postValidation": {
        "applied": true,
        "hallucinatedItemsRemoved": 0,
        "requiredFieldsRestored": 0,
        "removedItemKeys": []
      },
      "scoring": {
        "methods": ["grounding"],
        "grounding": { "score": 0.91 },
        "fields": {
          "company": {
            "grounding": { "score": 1.0, "basis": "exact_match" }
          }
        },
        "items": {
          "parts[0]": {
            "grounding": { "score": 0.92 },
            "fields": {
              "sourceFile": { "grounding": { "score": 1.0, "basis": "exact_match" } },
              "material":   { "grounding": { "score": 1.0, "basis": "exact_match" } },
              "quantity":   { "grounding": { "score": 0.5, "basis": "unverifiable" } }
            }
          },
          "parts[1]": {
            "grounding": { "score": 0.95 },
            "fields": {
              "sourceFile": { "grounding": { "score": 1.0, "basis": "exact_match" } },
              "material":   { "grounding": { "score": 0.9, "basis": "normalized_match" } },
              "quantity":   { "grounding": { "score": 0.5, "basis": "unverifiable" } }
            }
          }
        }
      }
    }
  }
}
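Because scores never modify the output, acting on them is up to the client. A minimal sketch of flagging low-scoring items from a response shaped like the one above; `low_scoring_items` and the 0.93 threshold are illustrative choices, not API features.

```python
def low_scoring_items(response, threshold):
    """Return (path, score) pairs for items whose grounding score is below the threshold."""
    scoring = response["result"]["resultMetadata"]["scoring"]
    flagged = []
    for path, entry in scoring.get("items", {}).items():
        score = entry["grounding"]["score"]
        if score < threshold:
            flagged.append((path, score))
    return flagged


# Trimmed-down copy of the response structure shown above.
sample = {
    "result": {
        "resultMetadata": {
            "scoring": {
                "methods": ["grounding"],
                "grounding": {"score": 0.91},
                "items": {
                    "parts[0]": {"grounding": {"score": 0.92}},
                    "parts[1]": {"grounding": {"score": 0.95}},
                },
            }
        }
    }
}

print(low_scoring_items(sample, threshold=0.93))  # [('parts[0]', 0.92)]
```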

Deduplication

When an extraction uses the chunked strategy, overlapping chunks can produce duplicate items. 2kw.ai automatically detects and merges these before returning results. Items with the same identifier fields (like sourceFile, partNumber, or name) are grouped — gaps between chunks are filled, exact duplicates removed, and sequential numbering fixed.

The resultMetadata.dedup object tells you what happened:

Field	Description
`applied`	Whether dedup was needed
`itemsRemoved`	Number of exact duplicates removed
`itemsMerged`	Number of partial items merged into one
`resequenced`	Whether sequential numbering was fixed
`originalItemCount`	Items before dedup
`finalItemCount`	Items after dedup

Dedup just works for most schemas

If your array items have clearly named identifier fields (sourceFile, partId, name, etc.), dedup handles everything automatically. See the Advanced section below for how merging works under the hood and how to override the defaults.

Post-Merge Validation

After deduplication, chunked extractions go through a validation pass that catches two types of issues:

Hallucination removal — When chunks overlap at entity boundaries, models sometimes fabricate items that don't exist in the source (e.g., inventing bbq_0_3_1.geo when only bbq_0_3_0.geo exists). The validator checks each item's literal reference fields (like sourceFile) against all chunk texts. If a filename doesn't appear anywhere in the source, the item is removed.

Required field restoration — When results are merged across chunks, required top-level fields (like date, inquiryNumber) can be silently dropped. The validator checks the schema's required array and inserts null for any missing required field, making the absence explicit rather than silent.

The resultMetadata.postValidation object tells you what happened:

Field	Description
`applied`	Whether post-merge validation ran
`hallucinatedItemsRemoved`	Number of items removed because their identifier wasn't found in the source
`requiredFieldsRestored`	Number of required fields that were missing and filled with `null`
`removedItemKeys`	The identifier values of removed items (for debugging)
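The two checks can be sketched as a single pass over the merged output. This is an illustration under assumptions, not the service's code: it handles one array property and one literal field, uses a plain substring test against each chunk text, and `post_validate` is a hypothetical name. The returned stats mirror the `postValidation` fields in the table.

```python
def post_validate(output, required, chunk_texts, array_field, literal_field):
    """Drop items whose literal reference is absent from every chunk; restore required keys."""
    kept = []
    removed_keys = []
    for item in output.get(array_field, []):
        reference = item.get(literal_field)
        found = any(reference in text for text in chunk_texts) if isinstance(reference, str) else True
        if found:
            kept.append(item)
        else:
            removed_keys.append(reference)

    result = dict(output)
    if array_field in output:
        result[array_field] = kept

    restored = 0
    for name in required:
        if name not in result:
            result[name] = None  # make the absence explicit instead of silent
            restored += 1

    stats = {
        "applied": True,
        "hallucinatedItemsRemoved": len(removed_keys),
        "requiredFieldsRestored": restored,
        "removedItemKeys": removed_keys,
    }
    return result, stats


chunks = ["Inquiry 4711: bbq_0_3_0.geo, 2 mm steel", "gear.geo, 1.4301"]
merged = {"parts": [{"sourceFile": "bbq_0_3_0.geo"}, {"sourceFile": "bbq_0_3_1.geo"}]}
validated, stats = post_validate(merged, ["date", "parts"], chunks, "parts", "sourceFile")
print(validated)
print(stats)
```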

Hallucination detection uses literal fields

By default, only file-reference fields (sourceFile, fileName, etc.) are checked against the source text. These fields should appear verbatim in the input. Derived values like summaries or computed measurements are never checked — they wouldn't be found via text matching even when correct.

To mark additional fields as literal references, use the literal field hint in your extraction config:

"fieldHints": {
  "sourceFile": { "scoring": "literal" },
  "documentId": { "scoring": "literal" }
}

Context Carry-Forward

When processing chunks sequentially, each chunk after the first receives a summary of what was already extracted from previous chunks. This helps the model:

Avoid re-extracting items it already found in earlier chunks
Extend partial items that span chunk boundaries (e.g., adding missing contours to a part that started in the previous chunk)
Reduce hallucinations by knowing which entities already exist

Carry-forward is automatic and requires no configuration. It works with any schema.

Extraction Configuration

You can configure per-property scoring and dedup behavior on your schema version using the extractionConfig field. This is stored alongside the JSON Schema but is never sent to the LLM — it controls pipeline behavior only.

{
  "jsonSchema": {
    "type": "object",
    "properties": {
      "parts": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "sourceFile": { "type": "string" },
            "partNumber": { "type": "integer" },
            "material": { "type": "string" },
            "quantity": { "type": "integer" }
          }
        }
      }
    }
  },
  "extractionConfig": {
    "validation": {
      "enabled": true,
      "properties": {
        "parts": {
          "confidenceThreshold": 0.5,
          "keyFields": ["sourceFile"],
          "fieldHints": {
            "partNumber": { "scoring": "system_assigned" },
            "quantity": { "scoring": "default_absent" }
          }
        }
      }
    }
  }
}

Config	Description
`confidenceThreshold`	Score below which items are flagged in the UI (informational, does not remove items)
`keyFields`	Override which fields are used as identifiers for dedup and scoring weight. Auto-detected if not set.
`fieldHints`	Override scoring behavior for specific fields. Each hint is an object with a `scoring` key.

Available field hints:

Hint	Effect	Use for
`system_assigned`	Always scores 1.0	Fields assigned by the system (partNumber, auto-generated IDs)
`default_absent`	Scores 0.8 when value is empty, normal scoring otherwise	Fields where empty means "not applicable" (tolerances, surface treatment)
`literal`	Enables hallucination detection — value must appear verbatim in source text	Fields that reference entities in the source (filenames, document IDs, reference codes)

Most fields don't need hints

Grounding scoring works automatically for string fields by searching the source text. You only need hints for fields where the default scoring doesn't apply — typically 2-3 fields per schema.
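As a rough illustration of how the hints in the table change a field's score, the sketch below looks up a hint in an `extractionConfig` shaped like the example above and applies it. Both helper names are hypothetical, and treating `None`, `""`, and `[]` as "empty" is an assumption.

```python
def hint_for(extraction_config, property_name, field_name):
    """Look up the scoring hint for a field, or None if no hint is configured."""
    hints = (extraction_config.get("validation", {})
             .get("properties", {})
             .get(property_name, {})
             .get("fieldHints", {}))
    return hints.get(field_name, {}).get("scoring")


def apply_hint(hint, value, default_score):
    """Adjust a field's grounding score according to its hint."""
    if hint == "system_assigned":
        return 1.0
    if hint == "default_absent" and value in (None, "", []):
        return 0.8
    # `literal` affects hallucination detection, not the score itself.
    return default_score


config = {
    "validation": {
        "enabled": True,
        "properties": {
            "parts": {
                "keyFields": ["sourceFile"],
                "fieldHints": {
                    "partNumber": {"scoring": "system_assigned"},
                    "quantity": {"scoring": "default_absent"},
                },
            }
        },
    }
}

print(apply_hint(hint_for(config, "parts", "partNumber"), 3, 0.5))   # 1.0
print(apply_hint(hint_for(config, "parts", "quantity"), None, 0.1))  # 0.8
print(apply_hint(hint_for(config, "parts", "material"), "Steel", 1.0))  # 1.0
```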

Advanced

Key Field Auto-Detection

When keyFields is not set in your extraction config, 2kw.ai auto-detects identifier fields from your schema using these patterns:

Pattern	Examples
Exact name: `id`, `name`, `key`	`id`, `name`
Ends with: `id`, `name`, `file`, `key`, `code`, `number`	`sourceFile`, `partNumber`, `materialCode`
Starts with: `source`, `file`	`sourceDocument`, `fileName`

These fields serve double duty: they're used for dedup grouping (matching items across chunks) and are weighted 3x in grounding scores (a hallucinated identifier dominates the item score).

If auto-detection picks the wrong fields — or misses yours — set keyFields explicitly in your extraction config.
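To predict which fields auto-detection is likely to pick, you can apply the patterns from the table yourself. The sketch below assumes case-insensitive matching on the property name; `detect_key_fields` is a hypothetical helper, and the service's matching may be stricter.

```python
EXACT_NAMES = {"id", "name", "key"}
SUFFIXES = ("id", "name", "file", "key", "code", "number")
PREFIXES = ("source", "file")


def detect_key_fields(field_names):
    """Return the field names that match any identifier pattern, in input order."""
    detected = []
    for name in field_names:
        lowered = name.lower()
        if (lowered in EXACT_NAMES
                or lowered.endswith(SUFFIXES)
                or lowered.startswith(PREFIXES)):
            detected.append(name)
    return detected


print(detect_key_fields(["sourceFile", "partNumber", "material", "quantity"]))
# ['sourceFile', 'partNumber']
print(detect_key_fields(["materialCode", "fileName", "description", "valid"]))
# ['materialCode', 'fileName', 'valid']
```

The second call shows why an explicit `keyFields` override is useful: under this naive reading, a boolean field named `valid` ends with `id` and would be treated as an identifier.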

Dedup Merge Algorithm

Understanding how merging works helps when debugging unexpected results in chunked extractions. Here's a complete example before we break down each step.

End-to-end example

A 20-page bill of materials is too large for a single extraction. 2kw.ai splits it into two overlapping chunks. The item housing.geo spans the boundary — Chunk 1 sees the beginning of it (material is mentioned) but cuts off before the quantity. Chunk 2 picks up in the overlap region and sees the quantity, but the material description is already behind it.

Chunk 1 extracts 3 items, the last one incomplete:

{
  "parts": [
    { "sourceFile": "gear.geo",    "material": "1.4301",    "quantity": 5,    "partNumber": 1 },
    { "sourceFile": "shaft.geo",   "material": "Steel",     "quantity": 2,    "partNumber": 2 },
    { "sourceFile": "housing.geo", "material": "Aluminum",  "quantity": null,  "partNumber": 3 }
  ]
}

Chunk 2 extracts housing.geo again (from the overlap) plus one new item:

{
  "parts": [
    { "sourceFile": "housing.geo", "material": null,    "quantity": 3,    "partNumber": 1 },
    { "sourceFile": "bracket.geo", "material": "1.4301", "quantity": 1,    "partNumber": 2 }
  ]
}

Now 2kw.ai merges the two outputs:

1. Detect key fields. sourceFile matches the *file suffix pattern → used as the dedup key. partNumber matches *number → marked as a sequence field.

2. Group by key. Two items share the key sourceFile: "housing.geo" — one from each chunk. The other three items (gear.geo, shaft.geo, bracket.geo) are unique and pass through unchanged.

3. Merge the group. Both housing.geo items have 3 non-null fields (tie). The one from Chunk 1 was encountered first, so it becomes the base. Its quantity is null → filled with 3 from Chunk 2's item:

Base (Chunk 1):  { sourceFile: "housing.geo", material: "Aluminum", quantity: null, partNumber: 3 }
Fill from Chunk 2:                                                   quantity: 3  ←── null filled
Result:          { sourceFile: "housing.geo", material: "Aluminum", quantity: 3,    partNumber: 3 }

4. Resequence. After merge, partNumber values are 1, 2, 3, 2 — duplicate 2. Renumbered to 1, 2, 3, 4.

Final result:

{
  "parts": [
    { "sourceFile": "gear.geo",    "material": "1.4301",    "quantity": 5, "partNumber": 1 },
    { "sourceFile": "shaft.geo",   "material": "Steel",     "quantity": 2, "partNumber": 2 },
    { "sourceFile": "housing.geo", "material": "Aluminum",  "quantity": 3, "partNumber": 3 },
    { "sourceFile": "bracket.geo", "material": "1.4301",    "quantity": 1, "partNumber": 4 }
  ]
}

{
  "dedup": {
    "applied": true,
    "itemsRemoved": 0,
    "itemsMerged": 1,
    "resequenced": true,
    "originalItemCount": 5,
    "finalItemCount": 4
  }
}

The incomplete housing.geo from Chunk 1 and the incomplete one from Chunk 2 were combined into a single complete item. Neither chunk had all the data, but together they did.

Step 1: Group by key fields

Items are grouped by the normalized values (lowercase, trimmed) of their key fields. Items where all key fields are null cannot be fingerprinted and are kept as-is — they are never merged.

If no key fields are detected at all, 2kw.ai falls back to pairwise similarity matching (items with 80%+ field value overlap are grouped).

Step 2: Merge each group

Within a group, the item with the most non-null fields becomes the base. If completeness is equal, the item encountered first (from the earlier chunk) becomes the base. Then, for each remaining item, any field that is null in the base is filled from that item.

The end-to-end example above shows the common case: each chunk sees a different part of the item, and the merge fills the gaps.
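Steps 1 and 2 can be sketched as a small function. It is a simplified model under assumptions: a single key field, no conflict handling, and no similarity fallback. `merge_items` is a hypothetical name, and the input is the concatenation of both chunk outputs from the end-to-end example.

```python
def non_null_count(item):
    """Number of fields that carry a value."""
    return sum(value is not None for value in item.values())


def merge_items(items, key_field):
    """Group items by normalized key, then null-fill the most complete item in each group."""
    groups = {}   # normalized key -> items sharing it, in arrival order
    order = []    # output slots: ("group", key) or ("item", un-keyed item)
    for item in items:
        raw_key = item.get(key_field)
        if raw_key is None:
            order.append(("item", item))  # cannot be fingerprinted, kept as-is
            continue
        key = str(raw_key).strip().lower()
        if key not in groups:
            groups[key] = []
            order.append(("group", key))
        groups[key].append(item)

    merged = []
    for kind, ref in order:
        if kind == "item":
            merged.append(ref)
            continue
        group = groups[ref]
        base = max(group, key=non_null_count)  # max() keeps the first item on ties
        result = dict(base)
        for other in group:
            for name, value in other.items():
                # Only null fields are filled; an empty string counts as a value.
                if result.get(name) is None and value is not None:
                    result[name] = value
        merged.append(result)
    return merged


chunk_1 = [
    {"sourceFile": "gear.geo", "material": "1.4301", "quantity": 5, "partNumber": 1},
    {"sourceFile": "shaft.geo", "material": "Steel", "quantity": 2, "partNumber": 2},
    {"sourceFile": "housing.geo", "material": "Aluminum", "quantity": None, "partNumber": 3},
]
chunk_2 = [
    {"sourceFile": "housing.geo", "material": None, "quantity": 3, "partNumber": 1},
    {"sourceFile": "bracket.geo", "material": "1.4301", "quantity": 1, "partNumber": 2},
]

for part in merge_items(chunk_1 + chunk_2, "sourceFile"):
    print(part)
```

The output has four items, with `housing.geo` carrying both `material` and `quantity`. The `partNumber` values are still 1, 2, 3, 2 at this point; fixing them is the job of the resequencing step.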

Edge case: conflicting values. When both chunks have the same field with different non-null values, 2kw.ai uses provenance-aware conflict resolution: it prefers the value from the "authority" chunk — the chunk that contains the item's identifier (e.g., the chunk where sourceFile: "gear.geo" actually appears in the text). The authority chunk is most likely to have seen the item's header and metadata, making its scalar values more reliable.

Chunk 1: { sourceFile: "gear.geo", material: "Steel" }              (authority — "gear.geo" appears in chunk 1 text)
Chunk 2: { sourceFile: "gear.geo", material: "1.4301", quantity: 5 }

→ Base: Chunk 2 (more fields), but material overridden from Chunk 1 (authority)
→ Result: { sourceFile: "gear.geo", material: "Steel", quantity: 5 }

If no authority chunk can be determined (the identifier doesn't appear in any chunk text, or both chunks contain it), the base item's value wins (most-complete-item-first, same as before).
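The conflict rule can be sketched for a single group as follows. Assumptions: each item is tagged with the index of the chunk it came from, the authority chunk is found by a plain substring search for the identifier, and an authority exists only when exactly one chunk contains it. `merge_with_authority` is a hypothetical name.

```python
def merge_with_authority(tagged_items, chunk_texts, key_field):
    """Merge one group of (chunk_index, item) pairs that share the same key value."""
    identifier = tagged_items[0][1][key_field]
    holders = [index for index, text in enumerate(chunk_texts) if identifier in text]
    authority = holders[0] if len(holders) == 1 else None

    # The most complete item is the base; max() keeps the first one on ties.
    _, base = max(tagged_items,
                  key=lambda pair: sum(value is not None for value in pair[1].values()))
    result = dict(base)
    for chunk_index, item in tagged_items:
        for name, value in item.items():
            if value is None:
                continue
            if result.get(name) is None:
                result[name] = value              # plain null-fill
            elif value != result[name] and chunk_index == authority:
                result[name] = value              # authority chunk wins a conflict
    return result


texts = ["gear.geo, material: Steel", "material 1.4301, quantity 5"]
group = [
    (0, {"sourceFile": "gear.geo", "material": "Steel"}),
    (1, {"sourceFile": "gear.geo", "material": "1.4301", "quantity": 5}),
]
print(merge_with_authority(group, texts, "sourceFile"))
# {'sourceFile': 'gear.geo', 'material': 'Steel', 'quantity': 5}
```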

Edge case: nested arrays. When both chunks extract array fields (like contours or holes) for the same item, the arrays are concatenated and deduplicated rather than one replacing the other. This is critical for items that span chunk boundaries — Chunk 1 might extract contours 1-10, Chunk 2 might extract contours 8-13, and the merge produces the complete set 1-13 (with duplicates 8-10 removed).
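A minimal sketch of concatenate-then-deduplicate for nested arrays, assuming two elements are duplicates only when they are exactly equal as JSON values. The `id` field on each contour is made up for the demo, and the service may use a looser notion of equality.

```python
import json


def merge_arrays(first, second):
    """Concatenate two arrays and drop exact duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for element in list(first) + list(second):
        # A canonical JSON string works as a fingerprint for dicts and lists too.
        fingerprint = json.dumps(element, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            merged.append(element)
    return merged


contours_chunk_1 = [{"id": n} for n in range(1, 11)]   # contours 1-10
contours_chunk_2 = [{"id": n} for n in range(8, 14)]   # contours 8-13
combined = merge_arrays(contours_chunk_1, contours_chunk_2)
print([contour["id"] for contour in combined])
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```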

Empty strings block merging

Only null fields are filled during merge. An empty string "" counts as a value and will not be replaced. If your schema has optional fields, use nullable types ("type": ["string", "null"]) rather than defaulting to "" so that null-fill works correctly across chunks.

Step 3: Resequence

After merging, sequential number fields (names ending in number, index, num, order, sequence, position, or named nr/pos) are checked for broken numbering. If gaps or duplicates are found, the field is renumbered starting at 1.
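The resequencing step might be sketched like this. It assumes the field holds integers on every item and that name matching is case-insensitive; both function names are hypothetical.

```python
SEQUENCE_SUFFIXES = ("number", "index", "num", "order", "sequence", "position")
SEQUENCE_NAMES = {"nr", "pos"}


def is_sequence_field(name):
    """True if the field name looks like a sequential counter."""
    lowered = name.lower()
    return lowered in SEQUENCE_NAMES or lowered.endswith(SEQUENCE_SUFFIXES)


def resequence(items, field):
    """Renumber `field` from 1 when the current values contain gaps or duplicates."""
    values = [item[field] for item in items]
    has_duplicates = len(set(values)) != len(values)
    has_gaps = any(later - earlier != 1 for earlier, later in zip(values, values[1:]))
    if not (has_duplicates or has_gaps):
        return items, False
    renumbered = [{**item, field: position}
                  for position, item in enumerate(items, start=1)]
    return renumbered, True


parts = [
    {"sourceFile": "gear.geo", "partNumber": 1},
    {"sourceFile": "shaft.geo", "partNumber": 2},
    {"sourceFile": "housing.geo", "partNumber": 3},
    {"sourceFile": "bracket.geo", "partNumber": 2},
]
fixed, changed = resequence(parts, "partNumber")
print([part["partNumber"] for part in fixed], changed)  # [1, 2, 3, 4] True
```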

Re-running Extractions

`POST /api/v1/extractions/{id}/rerun`

Need to retry an extraction with the same config? Hit the rerun endpoint. It creates a new extraction; the original stays untouched.

Listing and Filtering

`GET /api/v1/extractions`

Parameter	Type	Description
`search`	string	Filter by model name
`schemaVersionId`	string	Filter by schema version
`status`	string	`PENDING`, `PROCESSING`, `COMPLETED`, or `FAILED`
`page`	number	Page number (0-based)
`size`	number	Page size (default: 20)

Use Cases

Contact extraction: names, emails, phone numbers from unstructured text
Invoice processing: invoice numbers, dates, amounts, line items from documents
Resume parsing: skills, experience, education from CVs
Document analysis: key fields from contracts, reports, forms
Data entry automation: turn free-text notes into structured database records