tesseract
Text extraction never finishes on the attached multipage TIFF.
My friends told me they waited more than a day. I've waited more than an hour. With similar images, the process finishes in a few minutes.
8e3b2319a6bc41cab2f5c4507ea7e212.zip
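const char* GetText2(const char* input) {
  // pixRead returns nullptr if the file cannot be read.
  Pix* image = pixRead(input);
  if (image == nullptr) {
    return nullptr;
  }
  api->SetImage(image);
  // GetUTF8Text allocates the result; the caller must release it with delete[].
  char* outText = api->GetUTF8Text();
  pixDestroy(&image);
  return outText;
}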
API initialization (languages tur and eng, latest tessdata_best):
tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
tesseract::Dict::GlobalDawgCache();
api->Init(_datapath.c_str(), lang, tesseract::OcrEngineMode::OEM_DEFAULT);
api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);
api->SetVariable("include_page_breaks", "1");
api->SetVariable("page_separator", pageSeparator);
The problem might be in:
textord/tablefind.cpp, function void TableFinder::GridMergeTableRegions(), loop: while ((neighbor = rectsearch.NextRectSearch()) != nullptr)
Could this loop run forever?
The problem seems to be caused by the vertical and horizontal lines. When I clean them, OCR finishes in seconds.
You can try setting textord_tabfind_find_tables to false.
Some notes.
Your TIFF file contains 3 images. Each image is 3552 wide and 32000 high.
So your input is equivalent to about 39 A4 pages (3 x 13).
Tesseract's layout analysis algorithm was designed to deal with books and magazines. The table detection can cope with simple tables only.
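For reference, the figure of about 39 A4 pages can be reproduced with a small area calculation. The sketch below assumes an A4 page scanned at 300 DPI, roughly 2480 x 3508 pixels; the notes above do not say which resolution was used, so that value and the function name EquivalentA4Pages are assumptions.

```cpp
// Assumption: one A4 page at 300 DPI is about 2480 x 3508 pixels.
constexpr double kA4WidthPx = 2480.0;
constexpr double kA4HeightPx = 3508.0;

// Returns how many A4-sized pages cover the same pixel area as `count`
// images of the given width and height.
double EquivalentA4Pages(int count, double width_px, double height_px) {
  return count * (width_px * height_px) / (kA4WidthPx * kA4HeightPx);
}
```

Under that assumption, EquivalentA4Pages(3, 3552, 32000) gives a little over 39, that is, about 13 pages per image.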