Tesseract creates hOCR output without text results

Open stweil opened this issue 1 year ago • 4 comments

On some page images that are full of text, Tesseract does not detect any text when using the default settings. For such pages it typically prints Empty page!! twice. See issue #3021 for details and examples.

In some rare cases Tesseract prints Empty page!! only once and finds text in a 2nd pass. That text is written to ALTO and text output, but hOCR output does not show that text.

Example:

tesseract https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19140210/max/856399094_1910_035_03.jpg 856399094_1910_035_03 alto hocr txt

stweil avatar Aug 05 '23 15:08 stweil

Tesseract normally runs Recognize from TessBaseAPI::ProcessPage, but most Tesseract renderers also run Recognize themselves if recognition was not done before. The check whether a renderer needs to call Recognize is implemented in two different ways:

TessBaseAPI::GetAltoText, TessBaseAPI::GetTSVText, TessBaseAPI::GetHOCRText, TessBaseAPI::GetLSTMBoxText, TessBaseAPI::GetWordStrBoxText check page_res_ == nullptr.

TessBaseAPI::GetUTF8Text, TessBaseAPI::GetBoxText, TessBaseAPI::GetUNLVText, TessBaseAPI::AllWordConfidences check recognition_done_.

OCR on an "empty" page sets page_res_, but not recognition_done_. Therefore all renderers which check recognition_done_ will trigger an additional OCR pass. Example:

tesseract 'https://ub-backup.bib.uni-mannheim.de/reichsanzeiger/1879-10-01--1914-07-31---001-036/029-1907/0312.jp2' - txt makebox wordstrbox unlv

Output:

Empty page!!
Empty page!!

Empty page!!
Empty page!!

If, for example, TessBaseAPI::GetUTF8Text triggers a 2nd OCR pass and that pass detects text, then all renderers which were processed earlier did not get any text, while the text renderer and all renderers which are processed after it will output the text detected in the 2nd pass.
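The effect can be reproduced with a small toy model. This is only a sketch under the assumptions described above, not Tesseract code: ToyApi, PageResults, GetHocrLike, GetTextLike and RunHocrThenText are hypothetical names, and the "recognition" simply returns canned strings, where an empty string stands for an empty page.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the page results; NOT Tesseract's PAGE_RES.
struct PageResults {
  std::string text;
};

// Toy model of the two checks described above; NOT the real TessBaseAPI.
class ToyApi {
 public:
  // pass_texts[i] is what the i-th OCR pass "finds"; "" models an empty page.
  explicit ToyApi(std::vector<std::string> pass_texts)
      : pass_texts_(std::move(pass_texts)) {}

  int Recognize() {
    assert(passes_ < pass_texts_.size());
    const std::string &found = pass_texts_[passes_++];
    page_res_ = std::make_unique<PageResults>(PageResults{found});
    if (found.empty()) {
      return 0;  // empty page: page_res_ is set, recognition_done_ is not
    }
    recognition_done_ = true;
    return 0;
  }

  // Models the renderers which check page_res_ == nullptr (e.g. hOCR).
  std::string GetHocrLike() {
    if (page_res_ == nullptr) Recognize();
    return page_res_->text;
  }

  // Models the renderers which check recognition_done_ (e.g. text).
  std::string GetTextLike() {
    if (!recognition_done_) Recognize();
    return page_res_->text;
  }

  std::size_t passes() const { return passes_; }

 private:
  std::vector<std::string> pass_texts_;
  std::unique_ptr<PageResults> page_res_;
  bool recognition_done_ = false;
  std::size_t passes_ = 0;
};

struct Outputs {
  std::string hocr;
  std::string txt;
  std::size_t passes = 0;
};

// Models one page: a first pass from ProcessPage, then hOCR, then text.
Outputs RunHocrThenText(std::vector<std::string> pass_texts) {
  ToyApi api(std::move(pass_texts));
  api.Recognize();
  Outputs out;
  out.hocr = api.GetHocrLike();  // page_res_ is set, so no new pass
  out.txt = api.GetTextLike();   // recognition_done_ is false, so a 2nd pass
  out.passes = api.passes();
  return out;
}
```

If the first pass finds nothing and the second pass finds text, the hOCR-like output of this model stays empty while the text-like output contains the text, which matches the behaviour reported in this issue.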

stweil avatar Aug 06 '23 14:08 stweil

..but all Tesseract renderers also run Recognize conditionally...

The pdf renderer does not call Recognize().

amitdo avatar Aug 13 '23 08:08 amitdo

That's correct, thank you. I updated my comment and replaced "all" by "most".

stweil avatar Aug 13 '23 09:08 stweil

It seems odd both that the recognition is not deterministic and that recognition_done_ is not set for an empty page. Is recognition_done_ used in a way where it's important to be able to distinguish an empty page from a non-empty page? It seems like the line at https://github.com/tesseract-ocr/tesseract/blob/637be531f649832032fc477fd7f82249bb7d776b/src/api/baseapi.cpp#L849 could just be moved up a few lines to fix the issue.

A small utility function that the renderers can use, so that they all do the check in the same way, might be another improvement.
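A rough sketch of what I mean, written against a toy class and not against the real TessBaseAPI. FixedToyApi, EnsureRecognized and the empty_page flag are hypothetical names made up for illustration, and whether setting the flag for empty pages is safe in Tesseract itself is exactly the open question above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the page results; NOT Tesseract's PAGE_RES.
struct PageResults {};

// Toy model of the suggested change; NOT the real TessBaseAPI.
class FixedToyApi {
 public:
  // empty_page = true models a page on which OCR finds no text.
  explicit FixedToyApi(bool empty_page) : empty_page_(empty_page) {}

  int Recognize() {
    ++passes_;
    page_res_ = std::make_unique<PageResults>();
    recognition_done_ = true;  // assumption: set before the empty-page return
    if (empty_page_) {
      return 0;
    }
    // ... a real implementation would run the recognition passes here ...
    return 0;
  }

  // Hypothetical shared helper: every renderer would call this instead of
  // testing page_res_ or recognition_done_ on its own.
  bool EnsureRecognized() {
    if (recognition_done_) {
      return true;
    }
    return Recognize() >= 0;
  }

  std::size_t passes() const { return passes_; }

 private:
  bool empty_page_;
  std::unique_ptr<PageResults> page_res_;
  bool recognition_done_ = false;
  std::size_t passes_ = 0;
};
```

With this shape, any number of renderers can call EnsureRecognized after the first pass without starting another OCR pass, even when the page is empty.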

tfmorris avatar Jan 03 '24 01:01 tfmorris