tesseract
Create pain points before running associator to resolve #3892
See #3892
The problem is that we don't know whether there are cases where this patch will cause worse results.
@stweil, @zdenop, maybe we can accept this addition if it is optional, controlled by a config variable, and turned off by default?
@amitdo Do you have a specific scenario in mind where this change would plausibly cause worse results? Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?
The behavior this PR addresses (fully documented in #3892) is clearly a bug, as there is no reason why the associator should randomly skip letters at the end of certain words. I don't think it makes sense to avoid changing this behavior (using the default settings) in the absence of specific concerns regarding this fix. As stated above, I am happy to run an accuracy benchmark if one already exists.
> Do you have a specific scenario in mind where this change would plausibly cause worse results?
No.
Don't assume that the few currently active developers deeply know all the algorithms used in Tesseract.
> Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?
In the past the UNLV dataset and tools were used for testing Tesseract's accuracy.
See:
- The test repo
- UNLV Testing of Tesseract
- https://github.com/eddieantonio/ocreval
But the UNLV dataset contains only English and Spanish texts. Do you think your patch is fine for all the scripts that Tesseract supports?
Related issue: #3402.
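
For the empirical comparison discussed above, a minimal sketch of a character-accuracy check might look like the following. It assumes you already have a ground-truth transcription plus OCR output from a build without the patch and a build with it. The function names (`edit_distance`, `char_accuracy`, `compare`) and the sample strings are hypothetical and invented for illustration; they are not part of Tesseract, the UNLV tools, or ocreval, and the accuracy formula is a simplified assumption rather than the exact metric those tools report.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: (reference length - edit errors) / reference length."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


def compare(ground_truth: str, baseline: str, patched: str) -> dict:
    """Score both OCR outputs against the same ground truth."""
    base = char_accuracy(ground_truth, baseline)
    new = char_accuracy(ground_truth, patched)
    return {"baseline": base, "patched": new, "delta": new - base}


if __name__ == "__main__":
    # Invented strings: the baseline drops the last letter of one word,
    # mimicking the symptom described in #3892. These are not real results.
    truth = "pain points"
    print(compare(truth, baseline="pain point", patched="pain points"))
```

A negative `delta` on any page would flag a regression worth inspecting by hand. Running the same comparison per script or per language would also help answer the question of whether the patch behaves well beyond English and Spanish.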