tesseract
Touching letters cause incorrect word zoning and subsequently incorrect OCR.
When lines are close and letters on different lines touch one another, words are not bounded correctly.
Environment
- Tesseract Version: 5.0.0-alpha, 4.1, and 4.0
- Commit Number:
- Platform: Ubuntu, macOS.
Current Behavior:
level page block par line word left top width height conf text
1 1 0 0 0 0 0 0 229 155 -1
2 1 1 0 0 0 59 31 137 91 -1
3 1 1 1 0 0 59 31 137 91 -1
4 1 1 1 1 0 59 31 124 36 -1
5 1 1 1 1 1 59 31 124 36 89 ankie
4 1 1 1 2 0 66 76 130 46 -1
5 1 1 1 2 1 66 76 130 46 91 ranky
Expected Behavior:
I produced the expected behaviour by adding a single-pixel gap between the touching letters.
level page block par line word left top width height conf text
1 1 0 0 0 0 0 0 229 155 -1
2 1 1 0 0 0 30 31 166 91 -1
3 1 1 1 0 0 30 31 166 91 -1
4 1 1 1 1 0 30 31 153 44 -1
5 1 1 1 1 1 30 31 153 44 91 yankie
4 1 1 1 2 0 33 76 163 46 -1
5 1 1 1 2 1 33 76 163 46 92 Pranky
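For anyone who wants to try the same workaround programmatically, here is a minimal sketch of one way to open a single-pixel gap between two touching text lines. It is an assumption about how the gap could be made, not the method used for the sample above: the helper name `clear_gap_row`, the grayscale list-of-rows representation, and the threshold of 128 are all hypothetical. It picks the row with the least ink inside a caller-supplied band and paints it with the background colour.

```python
def clear_gap_row(pixels, top, bottom, background=255, ink_threshold=128):
    """Blank the row with the least ink in pixels[top:bottom].

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    `top` and `bottom` bound the band where the two text lines touch
    (hypothetical inputs; the caller has to estimate them).
    Returns the index of the row that was cleared.
    """
    if not 0 <= top < bottom <= len(pixels):
        raise ValueError("top/bottom must select at least one row of the image")

    def ink(row):
        # Count pixels dark enough to be treated as part of a glyph.
        return sum(1 for value in row if value < ink_threshold)

    # The row with the fewest dark pixels is the cheapest place to cut.
    gap_row = min(range(top, bottom), key=lambda y: ink(pixels[y]))

    # Overwrite that row with background so the lines no longer touch.
    pixels[gap_row] = [background] * len(pixels[gap_row])
    return gap_row
```

Loading the image into such a list and writing it back is left out here. Whether a gap placed this way restores correct word boxes will depend on the image, and a cut through the wrong row can damage descenders or ascenders.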
Suggested Fix:
Not certain...
The problem is in the layout analysis phase.
AFAIK, there is no solution for this issue.
To solve this issue, major changes to the layout analysis module are needed.
This issue is unlikely to be solved in the foreseeable future.
@stweil, I think we should close this issue as wontfix.