Hindi Language support

Open chaudhary-mohit opened this issue 1 year ago • 6 comments

🚀 The feature

Hindi language support: for Indian users, Hindi should also be included in the Doctr vocabs.

Motivation, pitch

In India, most documents are in Hindi, which Doctr does not currently support. Hindi support is the one thing we need; with it, it would be easy to create POCs and build solutions using Doctr as the first step in OCR-related work.

Alternatives

No response

Additional context

No response

chaudhary-mohit avatar May 29 '24 14:05 chaudhary-mohit

Let me know what is required w.r.t. datasets to make this happen. BTW, this exists: https://github.com/iitb-research-code/indic-doctr . And there is this for a corpus: https://huggingface.co/datasets/ai4bharat/sangraha .

ramSeraph avatar Aug 15 '24 09:08 ramSeraph

The vocab was added in https://github.com/mindee/doctr/pull/1687

felixdittrich92 avatar Aug 15 '24 10:08 felixdittrich92

I haven't spent enough time on this yet. So, my comprehension of the requirements might be a bit lacking. But are you saying adding the vocab is enough, as others can now use the code base with their own models for Hindi?

If that is the case, maybe we can add the vocabs for the rest of the scripts as well in a separate issue. I see the other Indic vocabs here: https://github.com/iitb-research-code/indic-doctr/blob/main/doctr/datasets/vocabs.py

ramSeraph avatar Aug 15 '24 10:08 ramSeraph

If you have a model that was trained with doctr on exactly the added vocab (same character order and length), then yes.
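To illustrate why order and length both matter: a recognition model typically decodes each predicted class index to the character at that position in the vocab string, so a reordered vocab yields the wrong characters, and a vocab of a different length changes the size of the output layer. The following is only a minimal sketch of a sanity check, not part of doctr; `first_vocab_mismatch` is a hypothetical helper, and it assumes you have the vocab string the checkpoint was trained with.

```python
from typing import Optional


def first_vocab_mismatch(trained_vocab: str, runtime_vocab: str) -> Optional[int]:
    """Return the index of the first difference between two vocabs, or None if identical.

    Hypothetical helper (not a doctr API): both character order and total
    length have to match for index-based decoding to stay consistent.
    """
    # Compare position by position; a swap shows up as the first differing index.
    for idx, (trained_char, runtime_char) in enumerate(zip(trained_vocab, runtime_vocab)):
        if trained_char != runtime_char:
            return idx
    # Same prefix but different length: the mismatch starts where the shorter one ends.
    if len(trained_vocab) != len(runtime_vocab):
        return min(len(trained_vocab), len(runtime_vocab))
    return None


if __name__ == "__main__":
    print(first_vocab_mismatch("कखग", "कखग"))  # identical vocabs
    print(first_vocab_mismatch("कखग", "कगख"))  # same characters, different order
```

Running such a check before loading the weights makes a silent order mismatch visible early, instead of showing up later as garbled predictions.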

felixdittrich92 avatar Aug 15 '24 10:08 felixdittrich92

In general you should already be able to use one of the provided models here: https://github.com/iitb-research-code/indic-doctr/releases

For example:

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
# Use an absolute path here: "~" is not expanded when the file is opened
local_model_path = "/path/to/crnn_vgg16_bn_hindi.pt"
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)

felixdittrich92 avatar Aug 15 '24 10:08 felixdittrich92

I will check that on some sample documents. Thanks for the clarifications.

ramSeraph avatar Aug 15 '24 10:08 ramSeraph

Part of #1699

felixdittrich92 avatar Oct 10 '24 17:10 felixdittrich92

import torch
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
local_model_path = "crnn_vgg16_bn_hindi.pt"
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)

single_img_doc = DocumentFile.from_images("0022-0024_3_5_2.jpg")
result = predictor(single_img_doc)
print(result.pages[0].export())  # print the result of the first page as a nested dict

I got an empty array with no output for a Hindi image. How can I fix this? Please help.

manit2004 avatar Sep 16 '25 22:09 manit2004