ENGINEERING · 9 MIN READ

What we learned training OCR on 14 million panchayat ledgers.

Ravi Menon
Co-founder, ML · Published 07 May 2026

When we started DoqSeal in 2024, we made a bet that turned out to be the entire company: that India's document-extraction problem is not a Western OCR problem with a Hindi font added. It is a fundamentally different problem.

Western OCR datasets — IIIT-5K, ICDAR, the IAM handwriting database — are built from clean photographs of receipts and tidy historical manuscripts. Indian rural records are made of carbon copies of carbon copies, photographed under a single fluorescent tube in a tehsildar's office at 4pm in monsoon season, by a clerk who is also answering three questions at once.

The pipeline that didn't work

Our first attempt was the standard one. Take a state-of-the-art transformer-based OCR model, fine-tune it on 200,000 hand-labeled Hindi pages, ship it. Field accuracy on the first customer's data: 43%.

The model failed in a way that surprised us:

Western OCR assumes one signal per pixel. Indian rural records have three: the original, the carbon shadow, and the next page bleeding through.
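A toy way to see this: model each ink layer as light absorption and compose them multiplicatively, the way overlapping inks actually combine on paper. The layer names and alpha values below are illustrative assumptions, not the parameters of our actual degradation model.

```python
import numpy as np

def composite_page(original, carbon_shadow, bleed_through,
                   shadow_alpha=0.35, bleed_alpha=0.2):
    """Superimpose three ink layers the way a rural scan does.

    All inputs are float arrays in [0, 1] where 0 = black ink and
    1 = bare paper. Alphas are illustrative, not measured values.
    """
    page = original
    # Each extra layer removes light independently, so it multiplies in.
    page = page * (1 - shadow_alpha + shadow_alpha * carbon_shadow)
    # Bleed-through arrives mirrored, because it comes from the reverse side.
    page = page * (1 - bleed_alpha + bleed_alpha * np.fliplr(bleed_through))
    return np.clip(page, 0.0, 1.0)

# A white page with one dark stroke per layer.
h, w = 8, 8
orig = np.ones((h, w));   orig[2, :] = 0.1    # the original writing
shadow = np.ones((h, w)); shadow[4, :] = 0.2  # carbon shadow from the copy above
bleed = np.ones((h, w));  bleed[6, :] = 0.3   # the next page showing through

scan = composite_page(orig, shadow, bleed)
```

The point of the exercise: rows 2, 4, and 6 all come out darker than paper, but at different intensities, so a single-signal OCR model has no principled way to decide which stroke is "the" text.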

What worked

The breakthrough was treating extraction as a multi-stage problem rather than end-to-end. Each stage is a small, specialised model:

  1. Page geometry: a custom segmentation model trained to identify document edges, page-number watermarks, and ruling-line grids before any character recognition runs.
  2. De-ghosting: a UNet variant that takes the raw scan and removes carbon-copy shadows, bleed-through, and stamp ink.
  3. Script-aware tokenisation: instead of one general OCR model, we route to nine script-specific decoders. Devanagari shapes are not Tamil shapes; trying to share a vocabulary loses 11 points of accuracy.
  4. Field-level grounding: the final layer takes the recognised text and matches it against a schema. If the schema says "khasra" appears in the top-left, that constrains which character sequences are even considered for that bounding box.
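
The four stages above can be sketched as a chain of small functions. The stage names, the region dict format, and the stub decoders are illustrative assumptions, not our internal API; in production each stage is a separate model behind the same interface.

```python
def segment_geometry(scan):
    """Stage 1: locate page edges, watermarks, and ruling-line grids."""
    return {"scan": scan,
            "regions": [{"bbox": (0, 0, 120, 40),   # top-left of the page
                         "script": "devanagari",
                         "schema_field": "khasra"}]}

def deghost(page):
    """Stage 2: strip carbon shadows, bleed-through, and stamp ink."""
    # In the real pipeline a UNet variant runs here; this sketch is a no-op.
    return page

DECODERS = {
    # Stage 3 routes each region to a script-specific decoder (stubs here).
    "devanagari": lambda crop: "१०४२",
    "tamil":      lambda crop: "",
}

def decode_regions(page):
    for region in page["regions"]:
        region["text"] = DECODERS[region["script"]](page["scan"])
    return page

def ground_fields(page, schema):
    """Stage 4: keep only text that matches a field the schema expects."""
    return {r["schema_field"]: r["text"]
            for r in page["regions"] if r["schema_field"] in schema}

def extract(scan, schema):
    return ground_fields(decode_regions(deghost(segment_geometry(scan))), schema)

record = extract(scan=b"raw-scan-bytes", schema={"khasra": "digits"})
```

One design consequence of the staged shape: each model can be retrained, evaluated, and rolled back independently, which an end-to-end model does not allow.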

With this pipeline in place, the same first customer's data jumped to 97.8% field-level accuracy — better than the human data-entry team they had been employing.
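
For concreteness, here is the kind of metric "field-level accuracy" refers to: the fraction of schema fields extracted exactly. The exact-match definition below is an assumption for illustration, not our published evaluation protocol.

```python
def field_accuracy(predicted, gold):
    """Fraction of gold fields whose predicted value matches exactly.

    Stricter than character-level accuracy: one wrong digit in a
    khasra number fails the whole field. (Illustrative definition.)
    """
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items()
                  if predicted.get(key) == value)
    return correct / len(gold)

pred = {"khasra": "1042", "owner": "राम कुमार", "area": "2.5"}
gold = {"khasra": "1042", "owner": "राम कुमार", "area": "2.50"}
acc = field_accuracy(pred, gold)  # 2 of 3 fields match exactly
```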

The dataset is the moat

None of this would have worked without the data. Over 18 months we partnered with 14 state revenue departments, 6 cooperative banks, and 22 NGOs to assemble what we believe is the largest labeled corpus of Indian administrative documents in existence: 14 million pages, 9 scripts, hand-verified by paralegals fluent in each language.

What's next

We're now training the next-generation model on a 40M-page corpus including pre-Independence documents in Modi script, Sharada, and Kaithi. If you're a historian, archivist, or museum with collections that could benefit from this work, get in touch — we provide free processing for cultural-heritage projects.

Want to put this on your documents?

Start free with 500 pages. Production API in five minutes.