Tesseract produces overlapping bounding boxes for clearly separated lines
Environment
- Tesseract Version: 5.2.0
- Platform: Windows 10, x64
Current Behavior:
The PDF renderer places two different lines on the same line, intermixing their characters.
I cannot post the full original image here because of GDPR, but I can show part of it, the hOCR for that part, and the resulting text in the PDF. Hopefully this is enough, but if not, feel free to contact me.
Part of HOCR;
tesseract.exe "C:\support\redacted.png" "c:\support\redacted" --tessdata-dir "C:\Tesseract\tessdata_best-main" -l eng --psm 4 --oem 1 -c tessedit_create_hocr=1
Selected all text in PDF:
Copying and pasting it into Notepad gives intermixed text:
No: Date: 09.2221420323 09.22
Expected Behavior:
To have two separate lines which can be copy/pasted.
Suggested Fix:
I think the problem is that the bbox of 'No.' is much taller than the invoice number and actually intersects with 'Date.:'. I'm not sure why this happens, but I also notice the missing ':'. I know the trained data I'm using is based on books, not on invoices like the ones I'm working with here. That might be the reason, but I'm hoping it is a fixable bug in the Tesseract LSTM engine. I will look into fine tuning in the future, but in the short term that is not a solution for me.
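To check whether an hOCR file really contains words from different lines whose boxes intersect, something like the following could be used. This is only a minimal sketch, not part of Tesseract: it assumes the usual hOCR layout where `ocrx_word` spans are nested inside `ocr_line` spans and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute. The names `WordBoxParser`, `intersects` and `cross_line_overlaps` are made up for illustration, and the coordinates and text in `SAMPLE` are invented, not taken from the redacted image.

```python
import re
from html.parser import HTMLParser
from itertools import combinations

# hOCR stores the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (line_index, text, bbox) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []   # classes of the currently open <span> elements
        self._line = -1    # index of the current ocr_line
        self._bbox = None  # bbox of the word being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class")
        self._stack.append(cls)
        if cls == "ocr_line":
            self._line += 1
        elif cls == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocrx_word" and self._bbox is not None:
            self.words.append((self._line, "".join(self._text).strip(), self._bbox))
            self._bbox = None


def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cross_line_overlaps(hocr):
    """Return pairs of words on different lines whose boxes intersect."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return [
        (wa, wb)
        for (la, wa, ba), (lb, wb, bb) in combinations(parser.words, 2)
        if la != lb and intersects(ba, bb)
    ]


# Invented example: the first word's box is too tall and reaches into line 2.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 300 60'>
  <span class='ocrx_word' title='bbox 10 10 60 60'>No.</span>
  <span class='ocrx_word' title='bbox 200 12 300 40'>12345</span>
</span>
<span class='ocr_line' title='bbox 10 45 300 75'>
  <span class='ocrx_word' title='bbox 10 45 80 75'>Date:</span>
  <span class='ocrx_word' title='bbox 200 47 300 75'>01.23</span>
</span>
"""

print(cross_line_overlaps(SAMPLE))
```

Running this over a real hOCR file (read it into a string and pass it to `cross_line_overlaps`) lists the word pairs worth inspecting. It only reports the overlap; it does not fix the layout analysis or the PDF output.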
Btw, I got this output when creating the hOCR file. Not sure if it's relevant, but here goes:
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
> Error in boxClipToRectangle: box outside rectangle
> Error in pixScanForForeground: invalid box
> Error in boxClipToRectangle: box outside rectangle
> Error in pixScanForForeground: invalid box
This is a known bug.
It is caused by the frame in the image.
> This is a known bug.
> It is caused by the frame in the image.
Which part is known? The output in my last comment, or the fact that the bounding box of 'No.' is much taller than the actual glyphs?
The link you refer to dates all the way back to 2016, which means there is little chance this will get fixed any time soon? Any idea whether fine tuning will actually help with this? This is just one example of where it occurs; it's a recurring issue. Abbyy does not exhibit the same problem. Not a fair comparison, but that is what I'm up against all the time. I will look into fine tuning as time permits. I hope it will bring Tesseract on par with Abbyy.
> Which part is known?
This only refers to the message
Error in boxClipToRectangle ...
Ok, I don't care about the error message (it was just a possible hint), only the tall bounding box causing the output of PDF render to merge two lines :)
Fine tuning might help to detect the missing dot in 'No.:'.
The wrong, overlapping bounding boxes are caused by a bug in Tesseract's layout analysis phase. Fine tuning can't help here. Don't expect this bug to be fixed soon.
Damn, that's what I feared. Do you know if there is a ticket for this error that I can follow, to get a better understanding of what's going wrong and of its status?