
Urdu Text Does Not Get Detected

Open Aeyxen opened this issue 1 year ago • 2 comments

First things first, sincere appreciation for your outstanding work in developing this incredible AI-driven OCR library. It's a fantastic tool that holds immense potential for digital humanities, which is my field of study.

I started my testing with some old Urdu historical documents, and unfortunately, I didn't observe any bounding box (Bbox) detection for the Urdu text within those documents.

Subsequently, I tested it with an image that contains a mix of Hindi, English, and Urdu text. To my delight, it successfully detected the Hindi and English portions of the text. However, it only recognized one line of the Urdu text, which was less than expected. I have attached the image for your reference so that you can better understand the scenario.

image5-602w291h_0_bbox

Aeyxen avatar Jan 16 '24 05:01 Aeyxen

Try the new code/model: pip install -U surya-ocr

VikParuchuri avatar Jan 16 '24 20:01 VikParuchuri

This seems to work

image

and

image

You may need to experiment with the threshold settings to detect more text (see README)

VikParuchuri avatar Jan 16 '24 20:01 VikParuchuri

Noted with thanks

Aeyxen avatar Feb 17 '24 17:02 Aeyxen
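
For anyone who finds this thread later, the threshold suggestion above could be tried along the following lines. This is only a rough sketch: it assumes the detector reads its thresholds from environment variables, and the variable name DETECTOR_TEXT_THRESHOLD and the value 0.4 are assumptions, not confirmed in this thread. Check the README of your installed version for the actual names and defaults. The helper env_with_thresholds is hypothetical and not part of surya.

```python
import os


def env_with_thresholds(overrides, base_env=None):
    """Return a copy of the environment with threshold overrides applied.

    overrides maps a setting name to a float between 0 and 1. The setting
    names are whatever your installed version's README documents; this
    helper does not know or check them.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in overrides.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        # Environment variables are always strings.
        env[name] = str(value)
    return env


# Assumed variable name and value, for illustration only.
env = env_with_thresholds({"DETECTOR_TEXT_THRESHOLD": 0.4}, base_env={})
```

The resulting dictionary can then be passed as the env argument of subprocess.run when launching the detection command, so the override applies to that one run and does not change your shell settings. Starting from a modest change and comparing the bounding boxes on the same Urdu page after each run is probably the easiest way to see whether a setting helps.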