Create pain points before running associator to resolve #3892

Open Balearica opened this issue 2 years ago • 5 comments

See #3892

Aug 06 '22 03:08 Balearica

The problem is that we don't know whether there are cases where this patch will cause worse results.

@stweil, @zdenop, maybe we can accept this addition if it is optional, controlled by a config variable that is turned off by default?

Oct 09 '22 21:10 amitdo

@amitdo Do you have a specific scenario in mind where this change would plausibly cause worse results? Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

The behavior this PR addresses (fully documented in #3892) is clearly a bug, as there is no reason why the associator should randomly skip letters at the end of certain words. I don't think it makes sense to avoid changing this behavior (using the default settings) in the absence of specific concerns regarding this fix. As stated above, I am happy to run an accuracy benchmark if one already exists.
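As a rough illustration of what such an empirical comparison could look like, the sketch below scores OCR output against ground truth using character error rate (edit distance divided by reference length) and lists documents that got worse after a change. This is only a sketch under stated assumptions: it is not the UNLV tooling, the function names `edit_distance`, `char_error_rate`, and `find_regressions` are hypothetical, and the OCR text is assumed to have been produced separately by the baseline and patched builds.

```python
from typing import Dict, List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the ground-truth text."""
    if not ref:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def find_regressions(truth: Dict[str, str],
                     before: Dict[str, str],
                     after: Dict[str, str]) -> List[str]:
    """Return the ids of documents whose error rate increased after the change."""
    worse = []
    for doc_id, ref in truth.items():
        if char_error_rate(ref, after[doc_id]) > char_error_rate(ref, before[doc_id]):
            worse.append(doc_id)
    return sorted(worse)


if __name__ == "__main__":
    # Hypothetical example: the baseline drops trailing letters of some words.
    truth = {"doc1": "running quickly"}
    before = {"doc1": "runnin quickl"}
    after = {"doc1": "running quickly"}
    print(char_error_rate(truth["doc1"], before["doc1"]))
    print(find_regressions(truth, before, after))
```

An empty list from `find_regressions` on a given corpus would only show that no document in that corpus got worse; it says nothing about scripts or layouts the corpus does not cover.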

Oct 09 '22 21:10 Balearica

> Do you have a specific scenario in mind where this change would plausibly cause worse results?

No.

Don't assume that the few currently active developers deeply know all the algorithms used in Tesseract.

> Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

In the past the UNLV dataset and tools were used for testing Tesseract's accuracy.

See:

But the UNLV dataset has only English and Spanish texts. Do you think your patch is fine for all the scripts that Tesseract supports?

Related issue: #3402.

Oct 09 '22 23:10 amitdo