doctr Hindi Language support

🚀 The feature

Hindi language support: for Indian users, Hindi should also be included in the docTR vocabs.

Motivation, pitch

In India, most documents are in Hindi, which docTR does not currently support. Hindi support is the one thing we need to make it easy to build POCs and solutions using docTR as a first step in OCR-related work.

Alternatives

No response

Additional context

No response

May 29 '24 14:05 chaudhary-mohit

Let me know what is required w.r.t. datasets to make this happen. BTW, this exists: https://github.com/iitb-research-code/indic-doctr . And there is this corpus: https://huggingface.co/datasets/ai4bharat/sangraha.

Aug 15 '24 09:08 ramSeraph

The vocab was added in https://github.com/mindee/doctr/pull/1687

Aug 15 '24 10:08 felixdittrich92

I haven't spent enough time on this yet. So, my comprehension of the requirements might be a bit lacking. But are you saying adding the vocab is enough, as others can now use the code base with their own models for Hindi?

If that is the case, maybe we can add the vocabs for the rest of the scripts as well in a separate issue. I see the other Indic vocabs here: https://github.com/iitb-research-code/indic-doctr/blob/main/doctr/datasets/vocabs.py

Aug 15 '24 10:08 ramSeraph

If you have a model that was trained with docTR on exactly the added vocab (same character order and length), then yes.
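To illustrate why character order matters, here is a minimal sketch of CTC-style greedy decoding in plain Python. It is not docTR's actual decoder: the function name `greedy_decode`, the toy four-character vocabs, and the index sequence are all hypothetical, and it assumes the blank class sits right after the last vocab character. The model only predicts class indices, so the same indices decoded against a reordered vocab give different text.

```python
def greedy_decode(indices, vocab):
    """Collapse repeated indices and drop blanks, then map indices to characters."""
    blank = len(vocab)  # assumption: the blank class is the index right after the vocab
    chars = []
    prev = None
    for idx in indices:
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


# Toy vocabs (hypothetical): same characters, different order
train_vocab = "कखगघ"
shuffled_vocab = "गकघख"

# Hypothetical per-timestep argmax output of a recognition model
predicted = [0, 0, 4, 2, 2, 4, 3]

print(greedy_decode(predicted, train_vocab))     # decoded with the training vocab
print(greedy_decode(predicted, shuffled_vocab))  # same indices, different characters
```

A vocab of a different length is a separate problem: the number of output classes would presumably no longer match the checkpoint, so loading the weights would be expected to fail rather than silently decode wrong text.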

Aug 15 '24 10:08 felixdittrich92

In general you should already be able to use one of the provided models here: https://github.com/iitb-research-code/indic-doctr/releases

For example:

import os

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo; it must match the checkpoint exactly
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
# Build the recognition model without pretrained weights, using the Hindi vocab
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
# torch.load does not expand "~", so resolve the home directory explicitly
local_model_path = os.path.expanduser("~/xyz/crnn_vgg16_bn_hindi.pt")
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

# Default pretrained detection model combined with the Hindi recognition model
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)

Aug 15 '24 10:08 felixdittrich92

I will check that on some sample documents. Thanks for the clarifications.

Aug 15 '24 10:08 ramSeraph

Part of #1699

Oct 10 '24 17:10 felixdittrich92

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn, db_resnet50
from doctr.io import DocumentFile
from doctr.datasets import VOCABS

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
local_model_path = "crnn_vgg16_bn_hindi.pt"
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)

single_img_doc = DocumentFile.from_images("0022-0024_3_5_2.jpg")
result = predictor(single_img_doc)
print(result.pages[0].export())  # print the first page's result as a nested dict

I get an empty result with no recognized text for a Hindi image. How can I fix this? Please help.

Sep 16 '25 22:09 manit2004
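
One way to narrow down an empty result like the one reported above is to check whether any text regions were detected at all, or whether regions were found but the recognized strings came back empty. The following is only a sketch: `summarize_page` and the sample dict are hypothetical, and it assumes the dict returned by `export()` nests `blocks`, `lines`, and `words`, with a `value` field per word.

```python
def summarize_page(page_export):
    """Count blocks and words in a page export and collect non-empty word values."""
    blocks = page_export.get("blocks", [])
    words = [
        word
        for block in blocks
        for line in block.get("lines", [])
        for word in line.get("words", [])
    ]
    non_empty = [w["value"] for w in words if w.get("value")]
    return {
        "blocks": len(blocks),
        "words": len(words),
        "non_empty_words": len(non_empty),
        "text": non_empty,
    }


# Hypothetical export with one detected word whose recognized text is empty
sample_export = {
    "blocks": [
        {"lines": [{"words": [{"value": "", "confidence": 0.1}]}]},
    ],
}

summary = summarize_page(sample_export)
print(summary)
```

If `blocks` is 0, the detection stage found nothing, so the recognition model and its vocab are probably not the first thing to look at. If words are present but all values are empty, the recognition checkpoint or vocab is the more likely place to investigate.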