Hindi Language support
🚀 The feature
# Hindi Language Support for Indians
Hindi should also be included in the docTR vocabs, since Indian users need it.
Motivation, pitch
In India, most documents are in Hindi, which docTR does not currently support. Hindi support is the one thing we need so that it becomes easy to build POCs and solutions with docTR as the first step in OCR-related work.
Alternatives
No response
Additional context
No response
Let me know what is required in terms of datasets to make this happen. By the way, https://github.com/iitb-research-code/indic-doctr already exists, and https://huggingface.co/datasets/ai4bharat/sangraha could serve as a corpus.
The vocab was added in https://github.com/mindee/doctr/pull/1687
I haven't spent enough time on this yet, so my understanding of the requirements may be incomplete. But are you saying that adding the vocab is enough, since others can now use the code base with their own Hindi models?
If that is the case, maybe we can add the vocabs for the rest of the scripts as well in a separate issue. I see the other Indic vocabs here: https://github.com/iitb-research-code/indic-doctr/blob/main/doctr/datasets/vocabs.py
If you have a model that was trained with docTR on exactly the added vocab (same character order and length), then yes.
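To illustrate why character order and length both matter, here is a minimal sketch in plain Python. The helpers `compare_vocabs` and `decode` are hypothetical names made up for this example and are not part of docTR. The sketch assumes the recognition head predicts class indices that are mapped back to characters by position in the vocab string, so a reordered vocab silently produces different text.

```python
def compare_vocabs(train_vocab: str, inference_vocab: str) -> list[str]:
    """Return a list of problems; an empty list means the vocabs match exactly."""
    problems = []
    if len(train_vocab) != len(inference_vocab):
        problems.append(
            f"length differs: {len(train_vocab)} vs {len(inference_vocab)}"
        )
    # Report only the first position where the two strings disagree
    for idx, (a, b) in enumerate(zip(train_vocab, inference_vocab)):
        if a != b:
            problems.append(f"first mismatch at index {idx}: {a!r} vs {b!r}")
            break
    return problems


def decode(class_indices: list[int], vocab: str) -> str:
    """Map predicted class indices to characters by position in the vocab (assumption)."""
    return "".join(vocab[i] for i in class_indices)


if __name__ == "__main__":
    train_vocab = "कखग"      # toy vocab used at training time
    shuffled_vocab = "खकग"   # same characters, different order
    predicted = [0, 1, 2]    # indices a model might emit

    print(compare_vocabs(train_vocab, shuffled_vocab))
    print(decode(predicted, train_vocab))
    print(decode(predicted, shuffled_vocab))
```

Running a check like this on the vocab string you pass to the model against the one used for training is a cheap way to rule out a silent mismatch before debugging anything else.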
In general you should already be able to use one of the provided models here: https://github.com/iitb-research-code/indic-doctr/releases
For example:
```python
import os

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)

# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
# expanduser is needed because torch.load does not expand "~" itself
local_model_path = os.path.expanduser("~/xyz/crnn_vgg16_bn_hindi.pt")
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

# Use the default pretrained detection model with the custom recognition model
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)
```
I will check that on some sample documents. Thanks for the clarifications.
Part of #1699
```python
import torch
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)

# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
local_model_path = "crnn_vgg16_bn_hindi.pt"
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(det_arch='db_resnet50', reco_arch=reco_model, pretrained=True)
single_img_doc = DocumentFile.from_images("0022-0024_3_5_2.jpg")
result = predictor(single_img_doc)
print(result.pages[0].export())  # print the export of the first page as a dict
```
I got an empty array with no output for a Hindi image. How can I fix this? Please help.
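One way to narrow down where the empty output comes from is to summarize the exported page. The sketch below is not a confirmed fix. `summarize_page_export` is a hypothetical helper, and it assumes the page export is a nested dict of `blocks` -> `lines` -> `words`, with each word carrying a `value` string.

```python
def summarize_page_export(page: dict) -> dict:
    """Count blocks, words, and words with an empty value in one exported page."""
    blocks = page.get("blocks", [])
    words = [
        word
        for block in blocks
        for line in block.get("lines", [])
        for word in line.get("words", [])
    ]
    empty_words = sum(1 for word in words if not word.get("value"))
    return {"blocks": len(blocks), "words": len(words), "empty_words": empty_words}


if __name__ == "__main__":
    # Toy export used only to show the shape; a real one would come from
    # result.pages[0].export()
    sample_page = {
        "blocks": [
            {"lines": [{"words": [{"value": "नमस्ते"}, {"value": ""}]}]},
        ]
    }
    print(summarize_page_export(sample_page))
```

If `blocks` is 0, the detection stage probably found no text regions, so the image itself or the detection model would be the first thing to inspect. If words exist but their values are empty, the recognition model or its vocab is the more likely suspect.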