Touching letters cause incorrect word zoning and subsequently incorrect OCR.

Open idrise opened this issue 4 years ago • 3 comments

When lines are close and letters on different lines touch one another, words are not bounded correctly.

Environment

  • Tesseract Version: 5.0.0-alpha, 4.1, and 4.0
  • Commit Number:
  • Platform: Ubuntu, macOS

Current Behavior:

[Image: touchinglines]

level	page	block	par	line	word	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	229	155	-1
2	1	1	0	0	0	59	31	137	91	-1
3	1	1	1	0	0	59	31	137	91	-1
4	1	1	1	1	0	59	31	124	36	-1
5	1	1	1	1	1	59	31	124	36	89	ankie
4	1	1	1	2	0	66	76	130	46	-1
5	1	1	1	2	1	66	76	130	46	91	ranky
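
For anyone comparing runs, here is a minimal sketch of pulling the word-level boxes out of TSV output like the above. It assumes the shortened column names shown in this issue (`page`, `block`, `par`, ...) and that level 5 rows are words; `TSV` and `word_boxes` are hypothetical names used only for illustration.

```python
import csv
import io

# Line- and word-level rows copied from the output above (tab-separated).
TSV = (
    "level\tpage\tblock\tpar\tline\tword\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t59\t31\t124\t36\t-1\n"
    "5\t1\t1\t1\t1\t1\t59\t31\t124\t36\t89\tankie\n"
    "4\t1\t1\t1\t2\t0\t66\t76\t130\t46\t-1\n"
    "5\t1\t1\t1\t2\t1\t66\t76\t130\t46\t91\tranky\n"
)


def word_boxes(tsv_text):
    """Return (text, left, top, width, height) for each word-level row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        # Level 5 marks a word; levels 1 to 4 are page, block, paragraph, line.
        if row["level"] != "5":
            continue
        boxes.append(
            (
                row["text"],
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
            )
        )
    return boxes


for box in word_boxes(TSV):
    print(box)
```

Run on the rows above, the first word box starts at left = 59, whereas the expected output below starts at left = 30, which matches the missing leading letter.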

Expected Behavior:

I produced the expected behaviour by creating a single-pixel gap between the two lines.

[Image: touchinglinesinglepixelgap]

level	page	block	par	line	word	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	229	155	-1
2	1	1	0	0	0	30	31	166	91	-1
3	1	1	1	0	0	30	31	166	91	-1
4	1	1	1	1	0	30	31	153	44	-1
5	1	1	1	1	1	30	31	153	44	91	yankie
4	1	1	1	2	0	33	76	163	46	-1
5	1	1	1	2	1	33	76	163	46	92	Pranky
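
The issue does not say how the single-pixel gap was made. The page size is 229 x 155 in both outputs, so one plausible reading is that a row of pixels was painted over rather than inserted. The sketch below shows that idea on a plain list-of-rows grayscale image; `clear_row`, the tiny stand-in image, and the choice of row are all assumptions for illustration, not the reporter's actual method.

```python
def clear_row(pixels, y, background=255):
    """Return a copy of a grayscale image (list of rows) with row y set to background."""
    if not 0 <= y < len(pixels):
        raise IndexError("row index outside the image")
    result = [list(row) for row in pixels]  # copy so the input stays untouched
    result[y] = [background] * len(result[y])
    return result


# Tiny stand-in image: 0 = ink, 255 = paper.
# Row 1 is where the upper and lower "lines" touch.
image = [
    [255, 0, 0, 255],
    [255, 0, 255, 255],
    [255, 0, 0, 255],
]

separated = clear_row(image, 1)
for row in separated:
    print(row)
```

In the expected output above, line 1 has top 31 and height 44 while line 2 has top 76, which would be consistent with a blank row at y = 75, though that is an inference from the boxes rather than something stated in the report.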

Suggested Fix:

Not certain...

idrise • Dec 24 '19 17:12

The problem is in the layout analysis phase.

AFAIK, there is no solution for this issue.

amitdo • Jan 28 '20 12:01

To solve this issue, major changes to the layout analysis module are needed.

amitdo • Jun 13 '22 11:06

This issue is unlikely to be solved in the foreseeable future.

@stweil, I think we should close this issue as wontfix.

amitdo • Jun 13 '22 11:06