tesseract LSTM: Non-dictionary words with combination of letters and numbers not recognized.

Open Shreeshrii opened this issue 8 years ago • 10 comments

https://groups.google.com/d/msgid/tesseract-ocr/1a3e8773-7151-48f9-92bb-fda888293eab%40googlegroups.com?utm_medium=email&utm_source=footer

While the single "S" is recognized correctly, the text "2S" is recognized as "25".

Here is link to the test image:

https://03054610326450256607.googlegroups.com/attach/b8b86693ac072/2s.png?part=0.4&view=1

Feb 22 '17 03:02 Shreeshrii

On 22-Feb-2017 9:02 PM, "Amit D." [email protected] wrote:

The LSTM engine is trained on text-line images and learns from context, so it does not surprise me that OCR accuracy for a single glyph is not so good.

So, is this another case where legacy engine is better than LSTM?

Feb 22 '17 15:02 Shreeshrii

Yes, the legacy engine (--oem 0) gets this one right.

tesseract4 --psm 7 --oem 0 2s.png 2s-out-oem0-psm7.txt

2s-out-oem0-psm7.txt
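
For anyone who wants to reproduce the comparison, here is a minimal Python sketch that runs the same image through the legacy engine (`--oem 0`) and the LSTM engine (`--oem 1`) and returns both results. The helper names `build_cmd` and `compare_engines` are made up for this example, and the executable is assumed to be called `tesseract` (the command above uses `tesseract4`). It also assumes the installed traineddata contains the legacy model; the LSTM-only `tessdata_fast` and `tessdata_best` files will not work with `--oem 0`.

```python
import shutil
import subprocess
from pathlib import Path


def build_cmd(image, outbase, oem, psm=7, exe="tesseract"):
    """Return the argv list for one tesseract run (nothing is executed here)."""
    return [exe, str(image), str(outbase), "--oem", str(oem), "--psm", str(psm)]


def compare_engines(image, exe="tesseract"):
    """Run the legacy (--oem 0) and LSTM (--oem 1) engines on one image.

    Returns {"legacy": text, "lstm": text}. Output files are written to the
    current directory. Raises FileNotFoundError if the executable is missing.
    """
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} not found on PATH")
    results = {}
    for name, oem in (("legacy", 0), ("lstm", 1)):
        outbase = Path(image).with_suffix("").name + f"-oem{oem}"
        subprocess.run(build_cmd(image, outbase, oem, exe=exe), check=True)
        # tesseract appends ".txt" to the output base name
        results[name] = Path(outbase + ".txt").read_text(encoding="utf-8").strip()
    return results
```

Calling `compare_engines("2s.png")` should make the difference between the two engines visible side by side for the test image from this report.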

Feb 23 '17 08:02 andrewisplinghoff

@zdenop Please label : accuracy.

Mar 28 '18 09:03 Shreeshrii

Another instance was reported in the forum, in the context of recognizing license plates.

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/qxB-aCa3r6E

Test image is

minus-4l

Mar 28 '18 09:03 Shreeshrii

The numbers-dawg has patterns of numbers with punctuation and letters. However, there is currently no way to specify patterns for things like license plates, VINs, or product IDs, which are non-dictionary words made of arbitrary combinations of numbers and letters.

Here are the other two images from error reports:

minus-0o

@theraysmith

Is there a variable which can be set for better accuracy in such cases?

Mar 29 '18 03:03 Shreeshrii

Another issue, reported in the forum

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/6a6sKOXdZsA

"I" is recognized as "1" and "A" as "4" when the rest of the string is digits:

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
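
Until the engine handles this, one possible workaround (not something proposed in this thread, and only usable when the layout of the string is known in advance, as with license plates or product IDs) is to correct look-alike characters after OCR. The sketch below is plain Python and does not call tesseract at all. The `apply_format` function and its `D`/`L`/`?` format string are hypothetical conventions invented for this example, and the confusion table only covers the pairs reported above (S/5, I/1, A/4).

```python
# Confusions reported in this thread: S read as 5, I read as 1, A read as 4.
DIGIT_TO_LETTER = {"5": "S", "1": "I", "4": "A"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def apply_format(text, fmt):
    """Correct look-alike characters using a per-position format.

    fmt uses 'D' for a digit, 'L' for a letter, and '?' for "leave as is".
    The format string is a hypothetical convention for this sketch.
    """
    if len(text) != len(fmt):
        return text  # length mismatch: do not guess
    out = []
    for ch, kind in zip(text, fmt):
        if kind == "L" and ch.isdigit():
            # a digit where a letter is expected: swap in its look-alike letter
            ch = DIGIT_TO_LETTER.get(ch, ch)
        elif kind == "D" and ch.isalpha():
            # a letter where a digit is expected: swap in its look-alike digit
            ch = LETTER_TO_DIGIT.get(ch, ch)
        out.append(ch)
    return "".join(out)
```

With this, `apply_format("123456781", "DDDDDDDDL")` gives `12345678I` and `apply_format("25", "DL")` gives `2S`. It does nothing for free-form text where the position of letters and digits is not known.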

Apr 30 '18 14:04 Shreeshrii

Unfortunately, I've fallen into the same pit. Is there any solution yet? I think I've tried everything, and all the topics about this on the internet have been left without a solution.

Apr 14 '19 11:04 kolakao

Same problem here

Dec 15 '19 12:12 FrancescoSaverioZuppichini

Hello, do you have datasets somewhere available for testing?

Feb 24 '22 14:02 ghost

This thread has been open for 5 years. Has anyone come up with a method for reliably getting tesseract to read a combination of letters and numbers?

Apr 22 '22 15:04 SHANDLEMAN
