Chopper interferes with associator in Legacy engine

Open Balearica opened this issue 1 year ago • 0 comments

Environment

  • Tesseract Version: 5.2.0
  • Commit Number: 15200c6fe7733f71a6cf52fbc1e4d94150f9f168
  • Platform: Linux ubuntu 5.15.0-43-generic

Current Behavior:

The legacy engine often fails to recognize words in which characters towards the end of the word are composed of multiple blobs. For example, the images below (the words "and", "high", and "which") are misidentified as "ancl", "higll", and "Whicll".

[Images: mb_words_1_8_0_ancl_and, mb_words_1_9_0_higli_high, mb_words_3_1_0_whicli_which]

This issue is caused by the chopper. With the chopper disabled (chop_enable=0), all 3 words are identified correctly (capitalization notwithstanding). Notably, this is not a direct effect of the chopper splitting the last letter (in most cases the chopper does not change the best_choice). Rather, running the chopper prevents the associator, which should join the last couple of blobs into a single letter, from working correctly.
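To compare the two settings on a single word image, something like the following Python sketch could be used. It only builds the command lines; `word.png` is a placeholder file name, and the choice of `--psm 8` (treat the image as a single word) is my assumption rather than something stated in this report. `--oem 0` selects the legacy engine, which requires traineddata that includes the legacy model.

```python
import shlex


def build_tesseract_command(image_path, chop_enable):
    """Build a legacy-engine Tesseract invocation with the chopper on or off."""
    return [
        "tesseract",
        image_path,
        "stdout",      # write recognized text to standard output
        "--oem", "0",  # 0 selects the legacy engine
        "--psm", "8",  # assumption: treat the image as a single word
        "-c", f"chop_enable={int(chop_enable)}",
    ]


if __name__ == "__main__":
    # "word.png" is a placeholder; substitute one of the example word images.
    for flag in (True, False):
        print(shlex.join(build_tesseract_command("word.png", flag)))
```

Running each printed command and comparing the output should show whether the result changes with the chopper setting.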

The issue appears to be that the chopper creates additional blobs and that this is not properly accounted for, so no pain points exist for the last several blobs by the time the associator runs. In other words, when chop_enable=1 Tesseract does not even try to combine the last 2 blobs in the examples above.

Suggested Fix:

I'm creating a pull request that resolves this by making sure InitialSegSearch (which creates pain points between blobs) runs directly before the associator step, regardless of whether the chopper is run. When this change is made, the words above are identified correctly whether chop_enable=0 or chop_enable=1.
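The ordering problem can be illustrated with a toy model in Python. This is not Tesseract's code: every function name below is hypothetical, blobs are reduced to a simple count, and a "pain point" is reduced to a pair of adjacent blob indices. It only shows why a candidate list built before the blob count grows leaves the trailing blobs uncovered, and why rebuilding it right before the associator step avoids that.

```python
def initial_pain_points(num_blobs):
    """One candidate join for every pair of adjacent blobs."""
    return {(i, i + 1) for i in range(num_blobs - 1)}


def chop(num_blobs, extra_pieces):
    """Toy chopper: splitting blobs only increases the blob count."""
    return num_blobs + extra_pieces


def joins_tried(num_blobs, pain_points):
    """Toy associator: it only tries joins that have a pain point."""
    return sorted(p for p in pain_points if p[1] < num_blobs)


def recognize(num_blobs, extra_pieces, refresh_before_associate):
    """Run the toy pipeline and return the joins the associator considers."""
    pain_points = initial_pain_points(num_blobs)
    num_blobs = chop(num_blobs, extra_pieces)
    if refresh_before_associate:
        # Analogue of the suggested fix: rebuild pain points after chopping.
        pain_points = initial_pain_points(num_blobs)
    return joins_tried(num_blobs, pain_points)


if __name__ == "__main__":
    print(recognize(4, 1, refresh_before_associate=False))
    print(recognize(4, 1, refresh_before_associate=True))
```

In this model, the stale list never contains the pair for the last two blobs, which mirrors the behavior described above; with the refresh, that pair is considered.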

Balearica, Aug 06 '22 03:08