Tesseract returns invalid characters for images with lack of text (for PSM=12)

Open krzysiekj94 opened this issue 2 years ago • 2 comments

Environment

Tesseract Version: 5.1.0
Platform: Windows 32-bit

Current Behavior:

I work with OCR daily across several different processes. After some research, PSM 12 (sparse text with OSD) turned out to be the best fit for me: I want to extract as much text as possible for the document search engine in my program, regardless of the document's rotation. However, I recently noticed that for images containing no text, it returns garbage instead of an empty string. I ran a test on the file. With the other PSM modes, the output is empty. 1208971310

Is there anything else that can be done to make PSM 12 return an empty string in this case? Or is it worth switching to a different PSM mode? The OCR runs in background processes in my app. What is your experience with this?

Expected Behavior:

I would expect that for PSM=12, if there is no text in the image, the mechanism returns an empty string instead of garbage.

Suggested Fix:

I would expect that for PSM=12, if there is no text in the image, the mechanism returns an empty string instead of garbage. Text like the above should not be returned.

Jun 15 '22 13:06 krzysiekj94

At the moment, the user is responsible for preprocessing the image and supplying a "correct" input image to the OCR process.

If you are not satisfied with this, a patch is welcome.
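
One way a caller might take on that responsibility is sketched below. Note that it is post-filtering of the result rather than image preprocessing: it assumes the OCR was run with the `tsv` output config (for example `tesseract image.png stdout --psm 12 tsv`) and keeps only words whose reported confidence reaches a threshold. The function name `confident_text` and the threshold of 60 are illustrative assumptions, not Tesseract recommendations.

```python
import csv
import io


def confident_text(tsv_output: str, min_conf: float = 60.0) -> str:
    """Join the words whose confidence is at least min_conf.

    tsv_output: text produced by Tesseract's `tsv` output config.
    min_conf:   hypothetical cut-off; tune it on your own documents.
    """
    # QUOTE_NONE keeps stray quote characters in recognized text intact.
    reader = csv.DictReader(
        io.StringIO(tsv_output), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            # Non-word rows (page, block, line) carry a confidence of -1.
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip malformed rows
        if text and conf >= min_conf:
            words.append(text)
    return " ".join(words)
```

If every word falls below the threshold, the function returns an empty string, which is the behavior requested in this issue for text-free images. A threshold that is too high will also drop genuine text, so it is a trade-off for a search index.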

Jun 17 '22 09:06 zdenop

We changed the command to write the output to a separate text file instead of printing it on the command line, and we also updated the library. In addition, I used the English language pack instead of the German one. The command was:

tesseract "//image path //" "//text file path for output//" -l eng --psm 12

With this command, the output text file was empty, as it should be.
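
For anyone running this from a background process, the same command could be wrapped roughly as follows. This is a minimal sketch: the helper names `build_command` and `ocr_to_text` are made up for illustration, and it assumes the `tesseract` executable is on the PATH and that the output base is given without the `.txt` extension, which Tesseract appends itself.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(image_path: str, output_base: str,
                  lang: str = "eng", psm: int = 12) -> List[str]:
    """Build the argument list for the command shown above."""
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]


def ocr_to_text(image_path: str, output_base: str) -> str:
    """Run Tesseract and return the contents of the output text file."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base), check=True)
    # strip() also removes the trailing form feed Tesseract may write.
    return Path(output_base + ".txt").read_text(encoding="utf-8").strip()
```

An empty return value from `ocr_to_text` then corresponds to the empty output file described here.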

Oct 15 '23 08:10 Kaustubh-3105
