
Text extraction process won't finish with the attached multipage tiff.

Open korhun opened this issue 5 years ago • 4 comments

My friends told me they waited more than a day without the process finishing. I've waited more than an hour myself. With similar images, the process finishes in a few minutes. Attached file: 8e3b2319a6bc41cab2f5c4507ea7e212.zip


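// Returns the recognized text. The buffer comes from GetUTF8Text(),
// so the caller must free it with delete[].
const char* GetText2(const char* input) {
    Pix* image = pixRead(input);
    if (image == nullptr) return nullptr;  // file could not be read
    api->SetImage(image);
    char* outText = api->GetUTF8Text();
    pixDestroy(&image);
    return outText;
}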


API initialization (languages: tur and eng, with the latest tessdata_best models):

    tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
    tesseract::Dict::GlobalDawgCache();
    api->Init(_datapath.c_str(), lang, tesseract::OcrEngineMode::OEM_DEFAULT);  // returns 0 on success
    
    api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);

    api->SetVariable("include_page_breaks", "1");
    api->SetVariable("page_separator", pageSeparator);

korhun, Feb 17 '20 11:02

The problem might be in:

textord/tablefind.cpp, function void TableFinder::GridMergeTableRegions(), in the loop: while ((neighbor = rectsearch.NextRectSearch()) != nullptr)

Could there be a possibility of an infinite loop here?

korhun, Feb 17 '20 14:02

The problem seems to be caused by the vertical and horizontal lines. When I clean them, OCR finishes in seconds.
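
The comment above does not say how the lines were cleaned, so the following is only a rough sketch of one possible approach, not the method actually used. It applies a simple run-length filter to an already binarized image held in memory. The names `BinaryImage`, `RemoveLongLines` and `min_len` are hypothetical, and the image is assumed to be rectangular with perfectly straight, axis-aligned lines.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory image: 1 = ink, 0 = background, all rows equal length.
using BinaryImage = std::vector<std::vector<std::uint8_t>>;

// Clears every horizontal or vertical run of ink that is at least min_len
// pixels long. Runs are detected on the unmodified input, so a pixel where
// two lines cross is cleared as well.
BinaryImage RemoveLongLines(const BinaryImage& in, std::size_t min_len) {
    BinaryImage out = in;
    const std::size_t rows = in.size();
    const std::size_t cols = rows ? in[0].size() : 0;

    // Horizontal runs, one row at a time.
    for (std::size_t r = 0; r < rows; ++r) {
        std::size_t c = 0;
        while (c < cols) {
            if (!in[r][c]) { ++c; continue; }
            const std::size_t start = c;
            while (c < cols && in[r][c]) ++c;
            if (c - start >= min_len)
                for (std::size_t k = start; k < c; ++k) out[r][k] = 0;
        }
    }

    // Vertical runs, one column at a time.
    for (std::size_t c = 0; c < cols; ++c) {
        std::size_t r = 0;
        while (r < rows) {
            if (!in[r][c]) { ++r; continue; }
            const std::size_t start = r;
            while (r < rows && in[r][c]) ++r;
            if (r - start >= min_len)
                for (std::size_t k = start; k < r; ++k) out[k][c] = 0;
        }
    }
    return out;
}
```

Real scans usually have lines that are several pixels thick or slightly skewed, so a filter this simple would probably need a tolerance for gaps and thickness before it is useful on the attached file.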

korhun, Feb 19 '20 11:02

You can try setting textord_tabfind_find_tables to false.

amitdo, Apr 28 '20 13:04

Some notes.

Your TIFF file contains 3 images. Each image is 3552 pixels wide and 32000 pixels high.

So your input is equivalent to about 39 A4 pages (3x13).
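
The estimate can be reproduced with a short calculation. This is a sketch under an assumption that is not stated in the comment: the pages are compared by pixel area against an A4 sheet (210 mm x 297 mm) scanned at 300 dpi. The function names are hypothetical.

```cpp
// Area of one A4 page (210 mm x 297 mm) in pixels at the given resolution.
double A4AreaInPixels(double dpi) {
    const double kMmPerInch = 25.4;
    return (210.0 / kMmPerInch * dpi) * (297.0 / kMmPerInch * dpi);
}

// Number of A4 pages that a width_px x height_px image covers, by area.
double A4PageEquivalents(double width_px, double height_px, double dpi) {
    return width_px * height_px / A4AreaInPixels(dpi);
}
```

Under the 300 dpi assumption, `A4PageEquivalents(3552, 32000, 300)` comes out at roughly 13 pages per image, which gives about 39 pages for the three images.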

Tesseract's layout analysis algorithm was designed to deal with books and magazines. The table detection can only cope with simple tables.

amitdo, Apr 29 '20 14:04