
Systems for Grounding AI Extraction in the Source Document

We built a grounding system that links every LLM extraction to its exact source location in a contract using semantic OCR linking, context engineering, and document segmentation. Here’s how we did it.

[Illustration: four cards showing the extraction output fields value, raw_text, page_number, and location_hint.]

Authored by the Ironclad AI Systems team: Aliasgar Kutiyanawala, Ellie Zhou, Hersh Singh, and Rohit Mishra

The problem: Legal AI cannot be a black box

In contract management, a confident wrong answer becomes an expensive liability. A renewal date from the wrong section triggers a missed opt-out. A counterparty name from the preamble instead of the signature block ends up on the wrong compliance filing. These errors remain silent and plausible.

The core constraint: the system must not only be right, it must be provably right.

Every prediction needs a citation. Bounding boxes, page numbers, and highlighted spans are table stakes: they let a human verify a prediction in seconds. In-house teams managing thousands of contracts need the system to show its work.

Our legacy extraction system gave us this for free. We initially pre-trained a custom NER (Named Entity Recognition) model that operated directly on OCR tokens, where every prediction carried a pointer to its source location.

When we switched to LLMs, we gained entity extraction accuracy but lost the direct link to the relevant page locations.

No entity can go unmatched, so we built a custom grounding system. This post covers its three components: semantic OCR linking, context engineering, and document segmentation.

Why string matching is not enough

The naive approach of searching the OCR text for the extracted value and highlighting the match handles the easy cases. It fails on the tricky ones:

  • Normalization gaps: The LLM extracted and normalized a date to "January 1, 2025" but the document reads "1/1/2025".
  • Multi-instance ambiguity: "ACME Corp" appears in the preamble, the signature block, and the notice provisions. Three locations for the same string, but only one correct citation.
  • OCR artifacts: Hyphenation across lines, missing tokens, whitespace inconsistencies from scanned documents.
  • Semantic equivalence: The extracted value is "Net 30" but the document says "payment within thirty (30) days of receipt of invoice".

Each defeats lexical similarity. You need a system that reasons about where a value came from, not just what it looks like.
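
To make the failure mode concrete, here is a minimal sketch (in Python, with made-up values) of the naive approach and where it gives out:

# Naive grounding: search the OCR text for the extracted value.
ocr_text = "This Agreement is effective as of 1/1/2025 between ACME Corp and ..."

def naive_ground(value: str, text: str) -> int:
    """Return the character offset of the first exact match, or -1."""
    return text.find(value)

print(naive_ground("January 1, 2025", ocr_text))  # -1: normalization gap
print(naive_ground("ACME Corp", ocr_text))        # first hit only; is it the right one?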

Semantic OCR linking

Building our extraction system required frontier models, given their balance of flexibility and accuracy at scale. But returning consistent bounding boxes remained a problem to solve.

The intuition for why LLMs are poor at producing bounding boxes, and at localization generally, comes down to what the input contains. If we pass in PDF bytes, the model has no quantifiable information about an object's location within the document, regardless of the relative positioning of page components. For this reason, we break extraction and grounding into distinct subproblems.

During extraction, the LLM produces structured predictions: property name, value, raw text, and a location hint. Location hints are natural-language descriptions of where in the document the value was found. A dedicated localization pass takes these predictions plus the full OCR token stream and links them.

Example

The extraction model returns:

property: counterpartySignerTitle
value: "Chief Executive Officer"
raw_text: "Chief Executive Officer"
location_hint: "Within the signature block for the 'Vendor' entity,
               on the line immediately below 'By: /s/ Jane Smith'"

We then receive token metadata from an OCR model, which lets us operate on the physical location of every token. Each token in the stream carries:

"text": full_text[start:end]
"page": page_idx
"start": start_idx
"end": end_idx
"vertices": normalized

The OCR token stream for page 12 contains 847 tokens. “Chief Executive Officer” appears twice: once in the recitals paragraph defining the Vendor’s representative (token indices 31-33), and once in the signature block (token indices 812-814).

String matching returns both. The localization LLM receives the tokens, the property metadata, and the location hint. It identifies the signature block near the /s/ marker, matches tokens 812-814, and returns {"0": [812, 813, 814]}. Those tokens carry bounding-box coordinates, and finally the UI renders a precise highlight on page 12.
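
A rough sketch of that flow, assuming tokens shaped like the stream above; the helper names and the call_llm placeholder are illustrative, not our production API:

# Two-stage link: lexical candidates first, then the localization LLM
# disambiguates using the location hint.
def find_candidates(tokens: list[dict], raw_text: str) -> list[list[int]]:
    """Return every run of token indices whose text matches raw_text."""
    words = raw_text.split()
    hits = []
    for i in range(len(tokens) - len(words) + 1):
        if [t["text"] for t in tokens[i:i + len(words)]] == words:
            hits.append(list(range(i, i + len(words))))
    return hits

def call_llm(prompt: str) -> list[int]:
    """Placeholder for the localization-model call."""
    raise NotImplementedError

def localize(tokens: list[dict], prediction: dict) -> list[int]:
    candidates = find_candidates(tokens, prediction["raw_text"])
    if len(candidates) == 1:
        return candidates[0]          # unambiguous: no LLM call needed
    prompt = (
        f"Property: {prediction['property']}\n"
        f"Location hint: {prediction['location_hint']}\n"
        f"Candidate token spans: {candidates}\n"
        "Pick the span the hint describes; return its token indices."
    )
    return call_llm(prompt)           # e.g. [812, 813, 814]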

The location hint hierarchy

The key architectural decision: have the extraction model describe where it found something, not just what it found. We enforce a strict four-tier priority system:

  1. Structural references (highest): "Section 4.1(b) 'Termination for Cause' under the paragraph beginning 'Either party may'"
  2. Titled section references: "Under the heading 'Governing Law', in the sentence starting with 'This Agreement shall be governed by'"
  3. Contextual references with neighboring tokens: "Within the signature block for the 'Lessor' entity, on the line immediately below 'LESSOR:'"
  4. Page position (lowest): "At the top of the page"

Every hint must include 5-10 words of neighboring context as anchors. The prompt enforces a self-check: “Could a human find this exact text in under 10 seconds using only the page number and this hint?” If not, the hint must improve.

This converts a linear-time search problem into a targeted lookup. This is the difference between “find ACME Corp” and “find ACME Corp in the signature block at the bottom of page 12, near the ‘By:’ line.”
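
Condensed into prompt form, the rules read roughly like this (a paraphrase, not our production prompt):

HINT_RULES = """\
Report a location_hint for every extracted value. Prefer, in order:
1. Structural references (section/clause numbers plus the opening words).
2. Titled section references (nearest heading plus the sentence start).
3. Contextual references (block type plus the immediately adjacent line).
4. Page position (only when nothing above applies).
Always include 5-10 words of neighboring text as anchors.
Self-check: could a human find this exact text in under 10 seconds using
only the page number and this hint? If not, rewrite the hint.
"""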

When the LLM returns no match, the system falls back to heuristic matching on raw text, then normalized value, always attempting to ground rather than returning nothing.
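
A sketch of that cascade, reusing the illustrative localize and find_candidates helpers from the earlier sketch:

def ground(tokens: list[dict], prediction: dict):
    # 1. Semantic localization guided by the location hint.
    span = localize(tokens, prediction)
    # 2./3. Heuristic fallbacks: verbatim raw_text, then normalized value.
    for needle in (prediction["raw_text"], prediction["value"]):
        if span:
            break
        hits = find_candidates(tokens, needle)
        span = hits[0] if hits else None
    return span  # always attempt to ground rather than return nothing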

Context engineering: Pointing instead of generating

A raw OCR API response carries confidence scores, orientation metadata, detected languages, verbose polygon coordinates, etc. We strip it to its essential signal:

{
  "paragraphs": [
    {
      "text": "MASTER SERVICES AGREEMENT",
      "bbox": {"l": 0.24, "t": 0.03, "r": 0.76, "b": 0.05}
    }
  ]
}
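
A minimal sketch of that reduction; the raw-response field names are generic assumptions rather than any specific OCR vendor's schema:

def minimize(raw_response: dict) -> dict:
    """Keep paragraph text and normalized bounding boxes; drop the rest."""
    return {
        "paragraphs": [
            {"text": para["text"], "bbox": para["bbox"]}
            for page in raw_response["pages"]
            for para in page["paragraphs"]
            # confidence, orientation, language, and polygon metadata
            # are intentionally left behind
        ]
    }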

The larger win was on the output side. Our segmentation system originally asked the LLM to reproduce each segment’s text—a generation task. The optimized version returns character offsets into the minimized OCR—a classification task.

Instead of generating "Quote Expiration Date: 10/30/2020" (output tokens that might hallucinate), the model returns {"start": 0, "end": 33, "page": 1}. The text already exists. The model just points at it. This eliminates a class of reproduction errors while cutting output cost.
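
Recovering the text from a pointer is then a slice over OCR output we already trust, as in this small sketch:

# Resolving a pointer is a slice, not a generation step.
pages = {1: "Quote Expiration Date: 10/30/2020 Customer: ACME Corp"}
pointer = {"start": 0, "end": 33, "page": 1}
text = pages[pointer["page"]][pointer["start"]:pointer["end"]]
assert text == "Quote Expiration Date: 10/30/2020"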

Document segmentation

At its core, segmentation is the process of transforming a flat, continuous stream of raw text—like standard OCR output—into a structured, mapped document. It involves breaking a file down into logical, digestible chunks: identifying where paragraphs begin and end, what is a header versus body text, and how elements relate to one another on a page.

Why is it needed?

Without segmentation, documents are just noisy walls of text. This creates two major bottlenecks for downstream AI tasks like clause classification or data extraction:

  1. Noise pollution: Irrelevant text like running headers, footers, and page numbers pollute the model’s context window.
  2. Cross-page fragmentation: Hard page breaks physically fracture contiguous paragraphs. If a critical indemnification clause starts at the bottom of page 3 and finishes on page 4, standard extraction tools treat them as two separate, incomplete thoughts.

Proper segmentation intelligently identifies continuity and filters out the noise, ensuring downstream models get rich, unfragmented context.

The current approach: syntactic segmentation

Historically, segmentation has relied on a syntactic, rules-based approach. It takes plain text and splits it at natural linguistic boundaries a careful reader would recognize (words, sentences, or paragraphs), subject to predefined character limits.

It handles edge cases—like preventing a sentence split after the period in “Ironclad Corp.”—by using explicit abbreviation lists. While fast and generally reliable for plain text, the syntactic approach relies entirely on text characters. It lacks a visual understanding of the document’s layout, meaning it struggles with multi-column formats, complex spacing, and distinguishing between a section header and a random bold word.
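
A toy version of that rules-based splitter, with a deliberately tiny abbreviation list, looks something like this:

import re

ABBREVIATIONS = {"corp.", "inc.", "no.", "sec."}  # illustrative subset

def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending periods, skipping known abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+", text):
        prev_word = text[start:m.start() + 1].split()[-1].lower()
        if prev_word in ABBREVIATIONS:
            continue  # e.g. "Ironclad Corp." is not a sentence boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Ironclad Corp. signed the MSA. Renewal follows."))
# ['Ironclad Corp. signed the MSA.', 'Renewal follows.']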

The leap forward: LLM-based segmentation (V1)

To solve the layout limitations, the next evolution uses a frontier, vision-capable multimodal LLM. Instead of just reading text, the model “looks” at the document visually and semantically.

The LLM is given detailed rules to:

  • Group lines semantically, stitching together paragraphs that flow across page breaks into a single logical chunk.
  • Label every segment accurately (text, header, section_header, footer, etc.).
  • Exclude non-text grids like tables and signature blocks entirely.
  • Follow true visual reading order in multi-column layouts.

This approach provides a rich, structured output validated by exact bounding-box coordinates on the page. It solves the noise and fragmentation problems entirely. However, because V1 relies on autoregressive generation—forcing the LLM to output all that text token by token—it introduces higher latency and computational costs.
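
The output looks roughly like the illustrative segment list below (field names are assumptions, not the production schema). Note the indemnification paragraph stitched across the page 3/4 break:

{
  "segments": [
    {
      "label": "section_header",
      "text": "7. Indemnification",
      "pages": [3],
      "bboxes": [{"l": 0.08, "t": 0.88, "r": 0.40, "b": 0.90}]
    },
    {
      "label": "text",
      "text": "Each party shall indemnify, defend, and hold harmless the other party from any third-party claim arising out of its breach of this Agreement.",
      "pages": [3, 4],
      "bboxes": [
        {"l": 0.08, "t": 0.90, "r": 0.92, "b": 0.97},
        {"l": 0.08, "t": 0.05, "r": 0.60, "b": 0.09}
      ]
    }
  ]
}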

The need for speed: LLM-based segmentation V2 (pointing)

To eliminate the latency bottleneck of V1, we introduced the V2 architecture: Pointing.

Instead of having the large language model painstakingly generate text tokens to create the chunks, the V2 model acts as a visual director. It uses the coordinates of the underlying OCR output and simply “points” to the bounding boxes that make up a segment. By outputting pointers rather than generating text, the pipeline bypasses heavy decoding costs entirely.
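
In rough terms, assuming the OCR lines are indexed, the model's job shrinks to emitting groups of indices; a sketch:

# The V2 model emits groups of OCR line indices instead of text tokens.
ocr_lines = [
    "7. Indemnification",                            # 0
    "Each party shall indemnify, defend, and hold",  # 1
    "harmless the other party from any claim.",      # 2
    "Page 3 of 12",                                  # 3 (running footer)
]

# Illustrative model output: one segment per index group; line 3 is dropped as noise.
segments = [{"label": "section_header", "lines": [0]},
            {"label": "text", "lines": [1, 2]}]

# Segment text is reassembled by lookup, never regenerated by the model.
for seg in segments:
    seg["text"] = " ".join(ocr_lines[i] for i in seg["lines"])

print(segments[1]["text"])
# Each party shall indemnify, defend, and hold harmless the other party from any claim.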

The results: Why the upgrade matters

Transitioning from syntactic rules to LLM-based pointing yields massive improvements in both accuracy and speed.

A surge in accuracy (LLM vs. Current Syntactic)

When comparing the LLM approach to standard syntactic chunking, the reduction in noise and boost in precision are staggering:

  • Massive noise reduction: False positives dropped by roughly 80%, proving the LLM’s superior ability to identify and ignore headers, footers, and irrelevant layout noise.
  • Sharper precision: Precision improved by over 140%, meaning when the LLM makes a chunking decision, it is highly accurate.
  • Overall reliability: The F1 Score (the balance of precision and recall) saw an impressive ~70% improvement, creating a structurally superior foundation for downstream AI tasks.

A surge in speed (V2 Pointing vs. V1 Generation)

The V2 pointing optimization successfully cured the latency hangover of using a large multimodal model. By pointing instead of generating, processing times were slashed by 50% to over 70% depending on the document length.

Ultimately, this means you get all the deep, visual-semantic understanding of a frontier LLM, but at 2x to 3x the speed.

Impact

The move to LLM-based extraction with semantic OCR linking produced up to double-digit improvements in acceptance rate, i.e. the rate at which production users accept AI predictions without editing:

Property Category                                            Acceptance Rate Improvement
Counterparty Information (name, address, signers)            +12 to 15pp
Financial Terms (contract value, payment terms)              +10 to 13pp
Key Dates (agreement date, expiration, renewal)              +4 to 10pp
Legal Provisions (governing law, termination, renewal type)  +3 to 4pp

Counterparty properties, historically weakest due to multi-entity ambiguity, saw the largest gains. Financial terms, where normalization gaps are most common, were the second biggest mover.

These gains came from architectural decisions, not hyperparameter tuning: decoupling extraction from localization, engineering the context window, building structured document representations.

Looking forward

Legal documents are inherently complex. With strong entity extraction in place, autonomous agentic capabilities become more feasible.

By building a robust representation of each document, we enable active assets for users. Turning static pages into usable primitives allows our systems of record to become systems of intelligence.


Ironclad is not a law firm, and this post does not constitute or contain legal advice. To evaluate the accuracy, sufficiency, or reliability of the ideas and guidance reflected here, or the applicability of these materials to your business, you should consult with a licensed attorney.