ENGINEERING · 9 MIN READ

What we learned training OCR on 14 million panchayat ledgers.

Ravi Menon
Co-founder, ML · Published 07 May 2026

When we started DoqSeal in 2024, we made a bet that turned out to be the entire company: that India's document-extraction problem is not a Western OCR problem with a Hindi font added. It is a fundamentally different problem.

Western OCR datasets — IIIT-5K, ICDAR, the IAM handwriting database — are built from clean photographs of receipts and tidy historical manuscripts. Indian rural records are made of carbon copies of carbon copies, photographed under a single fluorescent tube in a tehsildar's office at 4pm in monsoon season, by a clerk who is also answering three questions at once.

The pipeline that didn't work

Our first attempt was the standard one. Take a state-of-the-art transformer-based OCR model, fine-tune it on 200,000 hand-labeled Hindi pages, ship it. Field accuracy on the first customer's data: 43%.

The model failed in a way that surprised us:

Western OCR assumes one signal per pixel. Indian rural records have three: the original, the carbon shadow, and the next page bleeding through.
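A toy way to see this: model each ink layer as light absorption and compose them multiplicatively, the way overlapping inks actually combine on paper. The layer names and alpha values below are illustrative assumptions, not the parameters of our actual degradation model.

```python
import numpy as np

def composite_page(original, carbon_shadow, bleed_through,
                   shadow_alpha=0.35, bleed_alpha=0.2):
    """Superimpose three ink layers the way a rural scan does.

    All inputs are float arrays in [0, 1] where 0 = black ink and
    1 = bare paper. Alphas are illustrative, not measured values.
    """
    page = original
    # Each extra layer removes light independently, so it multiplies in.
    page = page * (1 - shadow_alpha + shadow_alpha * carbon_shadow)
    # Bleed-through arrives mirrored, because it comes from the reverse side.
    page = page * (1 - bleed_alpha + bleed_alpha * np.fliplr(bleed_through))
    return np.clip(page, 0.0, 1.0)

# A white page with one dark stroke per layer.
h, w = 8, 8
orig = np.ones((h, w));   orig[2, :] = 0.1    # the original writing
shadow = np.ones((h, w)); shadow[4, :] = 0.2  # carbon shadow from the copy above
bleed = np.ones((h, w));  bleed[6, :] = 0.3   # the next page showing through

scan = composite_page(orig, shadow, bleed)
```

The point of the exercise: rows 2, 4, and 6 all come out darker than paper, but at different intensities, so a single-signal OCR model has no principled way to decide which stroke is "the" text.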

What worked

The breakthrough was treating extraction as a multi-stage problem rather than end-to-end. Each stage is a small, specialised model:

  1. Page geometry: a custom segmentation model trained to identify document edges, page-number watermarks, and ruling-line grids before any character recognition runs.
  2. De-ghosting: a UNet variant that takes the raw scan and removes carbon-copy shadows, bleed-through, and stamp ink.
  3. Script-aware tokenisation: instead of one general OCR model, we route to nine script-specific decoders. Devanagari shapes are not Tamil shapes; trying to share a vocabulary loses 11 points of accuracy.
  4. Field-level grounding: the final layer takes the recognised text and matches it against a schema. If the schema says "khasra" appears in the top-left, that constrains which character sequences are even considered for that bounding box.
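
The four stages above can be sketched as a chain of small functions. The stage names, the region dict format, and the stub decoders are illustrative assumptions, not our internal API; in production each stage is a separate model behind the same interface.

```python
def segment_geometry(scan):
    """Stage 1: locate page edges, watermarks, and ruling-line grids."""
    return {"scan": scan,
            "regions": [{"bbox": (0, 0, 120, 40),   # top-left of the page
                         "script": "devanagari",
                         "schema_field": "khasra"}]}

def deghost(page):
    """Stage 2: strip carbon shadows, bleed-through, and stamp ink."""
    # In the real pipeline a UNet variant runs here; this sketch is a no-op.
    return page

DECODERS = {
    # Stage 3 routes each region to a script-specific decoder (stubs here).
    "devanagari": lambda crop: "१०४२",
    "tamil":      lambda crop: "",
}

def decode_regions(page):
    for region in page["regions"]:
        region["text"] = DECODERS[region["script"]](page["scan"])
    return page

def ground_fields(page, schema):
    """Stage 4: keep only text that matches a field the schema expects."""
    return {r["schema_field"]: r["text"]
            for r in page["regions"] if r["schema_field"] in schema}

def extract(scan, schema):
    return ground_fields(decode_regions(deghost(segment_geometry(scan))), schema)

record = extract(scan=b"raw-scan-bytes", schema={"khasra": "digits"})
```

One design consequence of the staged shape: each model can be retrained, evaluated, and rolled back independently, which an end-to-end model does not allow.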

With this pipeline in place, the same first customer's data jumped to 97.8% field-level accuracy — better than the human data-entry team they had been employing.
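
For concreteness, here is the kind of metric "field-level accuracy" refers to: the fraction of schema fields extracted exactly. The exact-match definition below is an assumption for illustration, not our published evaluation protocol.

```python
def field_accuracy(predicted, gold):
    """Fraction of gold fields whose predicted value matches exactly.

    Stricter than character-level accuracy: one wrong digit in a
    khasra number fails the whole field. (Illustrative definition.)
    """
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items()
                  if predicted.get(key) == value)
    return correct / len(gold)

pred = {"khasra": "1042", "owner": "राम कुमार", "area": "2.5"}
gold = {"khasra": "1042", "owner": "राम कुमार", "area": "2.50"}
acc = field_accuracy(pred, gold)  # 2 of 3 fields match exactly
```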

The dataset is the moat

None of this would have worked without the data. Over 18 months we partnered with 14 state revenue departments, 6 cooperative banks, and 22 NGOs to assemble what we believe is the largest labeled corpus of Indian administrative documents in existence: 14 million pages, 9 scripts, hand-verified by paralegals fluent in each language.

What's next

We're now training the next-generation model on a 40M-page corpus including pre-Independence documents in Modi script, Sharada, and Kaithi. If you're a historian, archivist, or museum with collections that could benefit from this work, get in touch — we provide free processing for cultural-heritage projects.

Want to put this on your documents?

Start free with 500 pages. Production API in five minutes.