Tesseract creates hOCR output without text results
On some page images full of text, Tesseract does not detect any text when using the default settings. Typically it prints `Empty page!!` twice for such pages. See issue #3021 for details and examples.
In some rare cases Tesseract prints `Empty page!!` only once and finds text in a 2nd pass. That text is written to ALTO and text output, but hOCR output does not show that text.
Example:
tesseract https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19140210/max/856399094_1910_035_03.jpg 856399094_1910_035_03 alto hocr txt
Tesseract normally runs `Recognize` from `TessBaseAPI::ProcessPage`, but most Tesseract renderers also run `Recognize` conditionally unless recognition was already done. The test whether recognition should be called by the renderer is done using two different implementations:
- `TessBaseAPI::GetAltoText`, `TessBaseAPI::GetTSVText`, `TessBaseAPI::GetHOCRText`, `TessBaseAPI::GetLSTMBoxText`, `TessBaseAPI::GetWordStrBoxText` check `page_res_ == nullptr`.
- `TessBaseAPI::GetUTF8Text`, `TessBaseAPI::GetBoxText`, `TessBaseAPI::GetUNLVText`, `TessBaseAPI::AllWordConfidences` check `recognition_done_`.
OCR on an "empty" page sets `page_res_`, but not `recognition_done_`. Therefore every renderer which checks `recognition_done_` will trigger an additional OCR pass. Example:
tesseract 'https://ub-backup.bib.uni-mannheim.de/reichsanzeiger/1879-10-01--1914-07-31---001-036/029-1907/0312.jp2' - txt makebox wordstrbox unlv
Output:
Empty page!!
Empty page!!
Empty page!!
Empty page!!
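The four messages can be accounted for with a toy model of the two checks. This is only a sketch, not Tesseract code: `ToyApi`, its methods and `CountPassesForExample` are made-up names, and the mapping of `txt`, `makebox`, `wordstrbox` and `unlv` to the two check styles is my reading of the lists above.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for TessBaseAPI (not the real class). Only the two state
// members discussed above are modeled.
struct ToyApi {
  std::unique_ptr<int> page_res_;  // placeholder for the real PAGE_RES*
  bool recognition_done_ = false;
  int passes = 0;  // how often recognition ran, i.e. "Empty page!!" count

  // Models a pass over a page that is reported as empty:
  // page_res_ gets set, recognition_done_ stays false.
  void RecognizeEmptyPage() {
    ++passes;
    page_res_.reset(new int(0));
  }

  // Check style of GetAltoText, GetTSVText, GetHOCRText, ...
  void RenderCheckingPageRes() {
    if (page_res_ == nullptr) RecognizeEmptyPage();
  }

  // Check style of GetUTF8Text, GetBoxText, GetUNLVText, ...
  void RenderCheckingDoneFlag() {
    if (!recognition_done_) RecognizeEmptyPage();
  }
};

// Replays "txt makebox wordstrbox unlv" on an always-empty page and
// returns the number of recognition passes.
int CountPassesForExample() {
  ToyApi api;
  api.RecognizeEmptyPage();          // initial pass from ProcessPage
  assert(api.page_res_ != nullptr);  // set even though the page was "empty"
  api.RenderCheckingDoneFlag();      // txt        (assumed: GetUTF8Text)
  api.RenderCheckingDoneFlag();      // makebox    (assumed: GetBoxText)
  api.RenderCheckingPageRes();       // wordstrbox (assumed: GetWordStrBoxText)
  api.RenderCheckingDoneFlag();      // unlv       (assumed: GetUNLVText)
  return api.passes;
}
```

In this model the count is 4: one pass from `ProcessPage` plus one for each of the three renderers that look at `recognition_done_`. The order of the renderers does not change the count.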
If, for example, `TessBaseAPI::GetUTF8Text` triggers a 2nd OCR pass and that pass detects text, then all renderers which were processed earlier get no text, while the text renderer and all renderers processed after it output the detected text from the 2nd pass.
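A small simulation of that ordering effect, again only as a sketch. The recognition results are scripted (pass 1 empty, pass 2 finds text), the renderer order is hypothetical and chosen for illustration, and all names (`ToyApi`, `Outputs`, `RunHypotheticalOrder`) are invented here rather than taken from Tesseract.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model, not Tesseract code. Each Recognize() call yields the next
// scripted result; an empty string stands for "Empty page!!".
class ToyApi {
 public:
  explicit ToyApi(std::vector<std::string> scripted)
      : scripted_(std::move(scripted)) {
    assert(!scripted_.empty());
  }

  void Recognize() {
    // Passes beyond the script repeat the last scripted result.
    const std::size_t i =
        next_ < scripted_.size() ? next_ : scripted_.size() - 1;
    text_ = scripted_[i];
    ++next_;
    has_page_res_ = true;                          // set on every pass
    if (!text_.empty()) recognition_done_ = true;  // not set for empty pages
  }

  // Check style of GetHOCRText / GetAltoText: rerun only without a page result.
  std::string RenderCheckingPageRes() {
    if (!has_page_res_) Recognize();
    return text_;
  }

  // Check style of GetUTF8Text: rerun unless recognition_done_ is set.
  std::string RenderCheckingDoneFlag() {
    if (!recognition_done_) Recognize();
    return text_;
  }

 private:
  std::vector<std::string> scripted_;
  std::size_t next_ = 0;
  std::string text_;
  bool has_page_res_ = false;
  bool recognition_done_ = false;
};

struct Outputs {
  std::string before;  // renderer processed before the text renderer
  std::string text;    // the text renderer itself
  std::string after;   // renderer processed after the text renderer
};

Outputs RunHypotheticalOrder() {
  ToyApi api({"", "detected text"});  // pass 1 empty, pass 2 finds text
  api.Recognize();                    // initial pass from ProcessPage
  Outputs out;
  out.before = api.RenderCheckingPageRes();  // page result exists: no rerun
  out.text = api.RenderCheckingDoneFlag();   // flag unset: triggers 2nd pass
  out.after = api.RenderCheckingPageRes();   // sees the 2nd-pass result
  return out;
}
```

Here `before` stays empty while `text` and `after` both contain the text from the 2nd pass, so the output files of a single run disagree with each other.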
> ..but all Tesseract renderers also run `Recognize` conditionally...
The pdf renderer does not call `Recognize()`.
That's correct, thank you. I updated my comment and replaced "all" by "most".
It seems odd both that the recognition is not deterministic and that `recognition_done_` is not set for an empty page. Is `recognition_done_` used in a way where it's important to be able to distinguish an empty page from a non-empty page? It seems like
https://github.com/tesseract-ocr/tesseract/blob/637be531f649832032fc477fd7f82249bb7d776b/src/api/baseapi.cpp#L849 could just be moved up a few lines to fix the issue.
A small utility function that the renderers can use, so that they all do the check in the same way, might be another improvement.
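Roughly what I have in mind, as a toy sketch rather than a patch. `ToyApi`, `EnsureRecognized` and `PassesWithSharedCheck` are hypothetical names that do not exist in Tesseract, and the model assumes that `recognition_done_` is set before the empty-page early return, as suggested above.

```cpp
#include <cassert>

// Toy stand-in for TessBaseAPI; illustrative only.
struct ToyApi {
  bool recognition_done_ = false;
  int passes = 0;  // number of recognition passes so far

  // Models Recognize() on an empty page, with the flag set before the
  // early return (the proposed change, not the current behavior).
  int Recognize() {
    ++passes;
    recognition_done_ = true;
    return 0;  // empty page: nothing more to do
  }

  // Hypothetical shared helper: every renderer would call this instead of
  // doing its own page_res_ or recognition_done_ check.
  bool EnsureRecognized() {
    if (recognition_done_) return true;
    const bool ok = Recognize() == 0;
    assert(!ok || recognition_done_);  // a successful pass always sets the flag
    return ok;
  }
};

// Runs the initial pass plus `renderer_count` renderers and returns the
// total number of recognition passes.
int PassesWithSharedCheck(int renderer_count) {
  ToyApi api;
  api.Recognize();  // initial pass from ProcessPage
  for (int i = 0; i < renderer_count; ++i) api.EnsureRecognized();
  return api.passes;
}
```

With both changes in place, the model runs exactly one pass no matter how many renderers follow, so all of them would see the same result.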