Text extraction process won't finish with the attached multipage tiff.

Open korhun opened this issue 5 years ago • 4 comments

My friends told me they waited more than a day; I have waited more than an hour. With similar images, the process finishes in a few minutes. Attachment: 8e3b2319a6bc41cab2f5c4507ea7e212.zip


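const char* GetText2(const char* input) {
      Pix* image = pixRead(input);
      if (image == nullptr) {
          return nullptr;  // the input file could not be read
      }
      api->SetImage(image);
      // GetUTF8Text() returns a heap-allocated string; the caller must release it with delete[].
      const char* outText = api->GetUTF8Text();
      pixDestroy(&image);
      return outText;
  }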


API initialization (languages tur and eng, latest tessdata_best):

    tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
    tesseract::Dict::GlobalDawgCache();
    api->Init(_datapath.c_str(), lang, tesseract::OcrEngineMode::OEM_DEFAULT);
    
    api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);

    api->SetVariable("include_page_breaks", "1");
    api->SetVariable("page_separator", pageSeparator);

Feb 17 '20 11:02 korhun

The problem might be in:

textord/tablefind.cpp, function void TableFinder::GridMergeTableRegions(), loop: while ((neighbor = rectsearch.NextRectSearch()) != nullptr)

Could this loop run forever?

Feb 17 '20 14:02 korhun

The problem seems to be caused by the vertical and horizontal lines in the image. When I remove them, OCR finishes in seconds.

Feb 19 '20 11:02 korhun

You can try setting textord_tabfind_find_tables to false.
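
A minimal sketch of that suggestion through the C++ API, assuming `api` is the already-initialized `tesseract::TessBaseAPI*` from the report above. The helper name `DisableTableDetection` is hypothetical and only used for illustration; `SetVariable` returns false when the parameter name is not found.

```cpp
#include <tesseract/baseapi.h>

// Hypothetical helper: turns off table detection during layout analysis.
// Call it after api->Init(...) and before SetImage()/GetUTF8Text().
bool DisableTableDetection(tesseract::TessBaseAPI* api) {
    // "0" sets a boolean parameter to false; the return value is false
    // if the parameter name could not be looked up.
    return api->SetVariable("textord_tabfind_find_tables", "0");
}
```

On the command line, the same parameter can be passed as `-c textord_tabfind_find_tables=0`.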

Apr 28 '20 13:04 amitdo

Some notes.

Your TIFF file contains 3 images, each 3552 pixels wide and 32000 pixels high.

So your input is equivalent to about 39 A4 pages (3x13).

Tesseract's layout analysis algorithm was designed to deal with books and magazines. The table detection can cope with simple tables only.
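
The comment does not say how the 3x13 estimate was derived. One way to reproduce it, assuming a scan resolution of 200 dpi (an assumption, not stated in the thread) and counting only whole portrait A4 pages (297 mm tall) along the image height, is sketched below. The function names are hypothetical.

```cpp
#include <cmath>

// Whole portrait A4 pages that fit into an image of the given pixel height.
// The dpi value is an assumption supplied by the caller.
int A4PagesPerImage(int height_px, double dpi) {
    const double a4_height_in = 297.0 / 25.4;        // A4 height in inches
    const double a4_height_px = a4_height_in * dpi;  // A4 height in pixels
    return static_cast<int>(std::floor(height_px / a4_height_px));
}

// Total A4-page equivalent for several images of the same height.
int TotalA4Pages(int image_count, int height_px, double dpi) {
    return image_count * A4PagesPerImage(height_px, dpi);
}
```

With these assumptions, `A4PagesPerImage(32000, 200.0)` gives 13 and `TotalA4Pages(3, 32000, 200.0)` gives 39, which matches the figure quoted above. A different assumed resolution would give a different count.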

Apr 29 '20 14:04 amitdo
