Amit Dovev comments

clang-cl also defines _MSC_VER and _MSC_FULL_VER macros

BTW, the Intel C/C++ compiler is based on Clang. https://www.intel.com/content/www/us/en/developer/articles/technical/adoption-of-llvm-complete-icx.html
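Because several Clang-based compilers also define macros that belong to other compilers, the order of the checks matters when detecting the compiler. The following is a minimal sketch, not code from any of the projects discussed here. The function name `compiler_name` and the returned labels are made up for illustration, and the "clang-msvc" label assumes that Clang together with `_MSC_VER` means a Clang build targeting the MSVC environment, such as clang-cl.

```cpp
#include <string>

// Returns a best-effort label for the compiler building this translation unit.
// Clang-based compilers are tested first because they also define
// _MSC_VER (clang-cl) or __GNUC__ (regular clang) for compatibility.
std::string compiler_name() {
#if defined(__INTEL_LLVM_COMPILER)
    // Intel's LLVM-based compiler (icx/icpx); it also defines __clang__.
    return "intel-llvm";
#elif defined(__clang__) && defined(_MSC_VER)
    // Clang in MSVC-compatible mode, e.g. clang-cl (label is illustrative).
    return "clang-msvc";
#elif defined(__clang__)
    return "clang";
#elif defined(_MSC_VER)
    // Only reached when the compiler is not Clang-based.
    return "msvc";
#elif defined(__GNUC__)
    return "gcc";
#else
    return "unknown";
#endif
}
```

Testing `_MSC_VER` alone would classify clang-cl as MSVC, which is why the `__clang__` checks come first.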

Tesseract inserts extra blank lines

Check the hOCR output. How many 'div' tags (blocks) are there? How many 'p' tags (paragraphs)?
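One rough way to do this count is to search the hOCR text for the class names instead of the bare tag names, since the page itself is also wrapped in a 'div'. The sketch below assumes that blocks appear as `class='ocr_carea'` and paragraphs as `class='ocr_par'`, written with single quotes. The helper names `count_occurrences` and `count_hocr_blocks_and_paragraphs` are hypothetical, and plain substring matching is a shortcut, not a real HTML parser.

```cpp
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of `needle` in `haystack`.
std::size_t count_occurrences(const std::string& haystack,
                              const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

struct HocrCounts {
    std::size_t blocks;      // elements with class='ocr_carea'
    std::size_t paragraphs;  // elements with class='ocr_par'
};

// Assumes the attribute spelling described above; adjust the search
// strings if your hOCR file uses double quotes or a different layout.
HocrCounts count_hocr_blocks_and_paragraphs(const std::string& hocr) {
    return {count_occurrences(hocr, "class='ocr_carea'"),
            count_occurrences(hocr, "class='ocr_par'")};
}
```

If a page that visually contains one paragraph per block reports many more paragraphs than blocks, that points at paragraph detection and not at block detection.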

Tesseract inserts extra blank lines

It gets the block detection right. The paragraph detection is wrong.

Tesseract inserts extra blank lines

https://github.com/tesseract-ocr/tesseract/blob/272ebf995f99d9c926ce0c951836f3fd1db90a87/src/ccmain/paragraphs.cpp

Tesseract inserts extra blank lines

Try setting the parameter `paragraph_text_based` to `false`.

Tesseract inserts extra blank lines

> should paragraph_text_based be disabled by default

Do you mean for all users of Tesseract, or just for you and your input images? Such a change in default value (for all...

Tesseract inserts extra blank lines

Ray Smith from Google did extensive testing in the past. The testing images were not made public. Currently, Ray is not active in this project.

Tesseract inserts extra blank lines

`paragraph_text_based=false` will only cause Tesseract to skip some steps in its paragraph detection phase. Currently, there is no way (even using the API) to completely disable paragraph detection.

Touching letters cause incorrect word zoning and subsequently incorrect OCR.

The problem is in the layout analysis phase. AFAIK, there is no solution for this issue.

Touching letters cause incorrect word zoning and subsequently incorrect OCR.

To solve this issue, major changes to the layout analysis module are needed.