Tesseract returns invalid characters for images that contain no text (for PSM=12)
Environment
- Tesseract Version: 5.1.0
- Platform: Windows 32-bit
Current Behavior:
I run several OCR processes on a daily basis. After some research, I found that PSM 12 (sparse text with OSD) works best for me: I want to extract as much text as possible for the document search engine in my program, regardless of the rotation of the document. However, I recently noticed that for images that contain no text, it returns garbage instead of an empty string. I ran a test on the file. With other PSM modes, the output is empty.
Is there anything that can be done to make PSM=12 return an empty result in this case? Is it worth switching to a different PSM mode? The OCR runs in background processes in my app. What is your experience with this?
Expected Behavior:
I would expect that for PSM=12, if there is no text in the image, Tesseract returns an empty string instead of garbage.
Suggested Fix:
For PSM=12, an image with no text should produce an empty string rather than garbage. Text like the above should not be returned.
At the moment, the user is responsible for preprocessing the image and supplying a "correct" input image to the OCR process.
If you are not satisfied with this, a patch is welcome.
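Besides preprocessing the image, one possible workaround is to filter the output. The following is a minimal sketch, not an official fix: it asks Tesseract for TSV output (which includes a per-word `conf` column) and keeps only words above a confidence threshold. The function names and the threshold of 60 are assumptions chosen for illustration and would need tuning against real documents.

```python
import subprocess


def filter_tsv_words(tsv_text, min_conf=60.0):
    """Return the words from Tesseract TSV output whose confidence is >= min_conf.

    min_conf=60.0 is an illustrative threshold, not a recommended value.
    """
    lines = tsv_text.splitlines()
    if not lines:
        return ""
    header = lines[0].split("\t")
    conf_idx = header.index("conf")
    text_idx = header.index("text")
    words = []
    for line in lines[1:]:
        # The text column is last, so limit the split to the header width.
        cols = line.split("\t", len(header) - 1)
        if len(cols) != len(header):
            continue
        try:
            conf = float(cols[conf_idx])
        except ValueError:
            continue
        word = cols[text_idx].strip()
        # Non-word rows (blocks, paragraphs, lines) carry conf = -1 and no text.
        if word and conf >= min_conf:
            words.append(word)
    return " ".join(words)


def ocr_sparse_text(image_path, lang="eng", min_conf=60.0):
    """Run `tesseract <image> stdout -l <lang> --psm 12 tsv` and filter the result.

    Hypothetical helper: assumes the tesseract executable is on PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang, "--psm", "12", "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return filter_tsv_words(result.stdout, min_conf)
```

With this approach, an image that only yields low-confidence garbage would come back as an empty string, while genuine text usually survives. A threshold that is too high will also drop real words, so it is a trade-off rather than a guarantee.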
I tweaked the command so that, instead of printing the output on the command line, it writes the output to a separate text file. I also updated the library and used the English language pack instead of the German one. The command used was:
tesseract "//image path //" "//text file path for output//" -l eng --psm 12
With this command, the output text file was empty, as it should be.
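For a background process, the same invocation can be scripted. This is a rough sketch under a few assumptions: the `tesseract` executable is on PATH, Tesseract appends `.txt` to the output base name, and the helper names are made up for illustration. It treats output consisting only of whitespace (including the form feed that Tesseract normally writes as a page separator) as empty.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng", psm=12):
    """Build the argument list for the command shown above."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        lang,
        "--psm",
        str(psm),
    ]


def is_blank(ocr_text):
    """True if the OCR output holds nothing but whitespace or page separators."""
    return ocr_text.strip() == ""


def ocr_to_file(image_path, output_base, lang="eng", psm=12):
    """Run Tesseract, then read back <output_base>.txt.

    Returns an empty string when the file contains no visible text.
    """
    subprocess.run(build_command(image_path, output_base, lang, psm), check=True)
    text = Path(f"{output_base}.txt").read_text(encoding="utf-8")
    return "" if is_blank(text) else text
```

Checking with `is_blank` rather than comparing against `""` avoids treating a file that holds only a page separator as if it contained text.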