Amit Dovev
BTW, the current Intel C/C++ compiler (icx) is based on LLVM/Clang. https://www.intel.com/content/www/us/en/developer/articles/technical/adoption-of-llvm-complete-icx.html
Check the hOCR output. How many 'div' tags (blocks) are there? How many 'p' tags (paragraphs)?
It gets the block detection right. The paragraph detection is wrong.
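A minimal sketch of that check in Python, assuming Tesseract's usual hOCR class names: `ocr_carea` on the block 'div' elements and `ocr_par` on the paragraph 'p' elements. The page-level `ocr_page` 'div' is deliberately not counted as a block. The helper names are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrTagCounter(HTMLParser):
    """Count hOCR elements by (tag, class) pair."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # hOCR marks the role of each element in its class attribute.
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            self.counts[(tag, cls)] += 1


def count_blocks_and_paragraphs(hocr_text):
    """Return (number of blocks, number of paragraphs) in an hOCR string."""
    parser = HocrTagCounter()
    parser.feed(hocr_text)
    parser.close()
    # Assumed class names: ocr_carea = block, ocr_par = paragraph.
    blocks = parser.counts[("div", "ocr_carea")]
    paragraphs = parser.counts[("p", "ocr_par")]
    return blocks, paragraphs
```

To use it, read the `.hocr` file as text (for example `open("output.hocr", encoding="utf-8").read()`, where the file name is only a placeholder) and pass the string to `count_blocks_and_paragraphs`.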
https://github.com/tesseract-ocr/tesseract/blob/272ebf995f99d9c926ce0c951836f3fd1db90a87/src/ccmain/paragraphs.cpp
Try setting the parameter `paragraph_text_based` to `false`.
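One way to try this is through the command-line `-c` option. The sketch below only builds and runs the command; the image path and output base are placeholders, the helper names are made up for this example, and it assumes the `tesseract` executable is on `PATH`.

```python
import subprocess


def build_tesseract_command(image_path, output_base, text_based=False):
    """Build an argv list that runs Tesseract with hOCR output."""
    value = "true" if text_based else "false"
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"paragraph_text_based={value}",
        "hocr",  # config file that selects hOCR output
    ]


def run_tesseract(image_path, output_base, text_based=False):
    """Run Tesseract; raises CalledProcessError on a non-zero exit."""
    cmd = build_tesseract_command(image_path, output_base, text_based)
    subprocess.run(cmd, check=True)
```

Running it once with `text_based=True` and once with `text_based=False` on the same image, writing to two different output bases, gives two hOCR files whose paragraph structure can be compared.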
> should paragraph_text_based be disabled by default

Do you mean for all users of Tesseract, or just for you and your input images? Such a change in default value (for all...
Ray Smith from Google did extensive testing in the past. The testing images were not made public. Currently, Ray is not active in this project.
`paragraph_text_based=false` will only cause Tesseract to skip some steps in its paragraph detection phase. Currently, there is no way (even using the API) to completely disable paragraph detection.
The problem is in the layout analysis phase. AFAIK, there is no solution for this issue.
To solve this issue, major changes to the layout analysis module are needed.