tessdata_best: Works mostly well for pure Hindi text, but does NOT parse English words at all and misses a few numbers

Open ChintanDonda opened this issue 9 months ago • 1 comments

I've used the Hindi (hin) trained data.

It works mostly well for pure Hindi text, but it does NOT recognize English words at all and drops a few digits in numbers.

English words with Hindi text

Example 1:
Expected: आवेदन के नाम लेने से पहले (Registration process के पहले) समझने की बातें
Parsed from the PDF using the code snippet below: आवेदन के नाम लेने से पहले (२८्टा57807) 0700€55 के पहले) समझने की बातें

Example 2:
Expected: तेजस्विता जब किसी मालिकाना वस्तु पर (Possession) अथवा पद पर (Post/Position) निर्भर होते है
Parsed from the PDF using the code snippet below: तेजस्विता जब किसी मालिकाना वस्तु पर (?०८५७५५०) अथवा पद पर (?०5६/?०5ाधं0ा) निर्भर होते है

Example 3:
Expected: वस्तुनिष्ठ आनंद (objective happiness) यह हमेशा अपूर्ण होता है
Parsed from the PDF using the code snippet below: वस्तुनिष्ठ आनंद (०णुं०ता५ह 09000655) यह हमेशा अपूर्ण होता है

English words & Numbers with Hindi text

Example 1:
Expected: आवेदन लेने की प्रक्रिया (Registration process)) हमें 01/06/2024 से शुरू करनी है।
Parsed from the PDF using the code snippet below: आवेदन लेने की प्रक्रिया (९८8्डा[507820770655) हमें 0/06/2024 से शुरू करनी है।
Note: besides the garbled English, the digit 1 in 01 is also missing.

How to reproduce:

from pdf2image import convert_from_path
import pytesseract

# Specify Tesseract executable location
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load and convert PDF to images
# Use a PDF that has Hindi text mixed with some English words/phrases and/or numbers
documents = convert_from_path("path_to_pdf.pdf")

# Extract text from each image in Hindi
page_content = ""
for doc in documents:
    try:
        page_content += pytesseract.image_to_string(doc, lang='hin')
        page_content += "\n"
    except Exception as e:
        # Report the actual error instead of the image object
        print(f"Error in extracting page content: {e}")

# Print the full extracted text so the mis-parsed words are visible
print(page_content)

Any idea how I can also parse the Hindi text mixed with some English words/phrases and/or Numbers?
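
One direction that may be worth trying, sketched here as an assumption rather than a confirmed fix: Tesseract accepts several language codes joined with `+`, so passing `hin+eng` lets the engine consider both Devanagari and Latin script on the same page. This needs `eng.traineddata` installed alongside `hin.traineddata`. The helper names `build_lang` and `ocr_pdf` and the `dpi=300` value are hypothetical, chosen only for illustration, and whether this recovers the dropped digits is untested.

```python
def build_lang(*codes):
    """Join Tesseract language codes into a multi-language string, e.g. 'hin+eng'."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_pdf(pdf_path, codes=("hin", "eng"), dpi=300):
    """OCR every page of a PDF with all given languages enabled (hypothetical helper)."""
    # Imported here so build_lang can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    lang = build_lang(*codes)
    # dpi=300 is an assumption; a higher render resolution may help with small digits.
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

Calling `ocr_pdf("path_to_pdf.pdf")` would run every page with both models loaded. The order of the codes may influence the result, so `codes=("eng", "hin")` could be tried as well.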

Jan 22 '25 11:01 ChintanDonda
