
LSTM: Non-dictionary words with combination of letters and numbers not recognized.

Open Shreeshrii opened this issue 8 years ago • 10 comments

https://groups.google.com/d/msgid/tesseract-ocr/1a3e8773-7151-48f9-92bb-fda888293eab%40googlegroups.com?utm_medium=email&utm_source=footer

While the single "S" is recognized correctly, the text "2S" is recognized as "25".

Here is the link to the test image:

https://03054610326450256607.googlegroups.com/attach/b8b86693ac072/2s.png?part=0.4&view=1

Shreeshrii avatar Feb 22 '17 03:02 Shreeshrii

On 22-Feb-2017 9:02 PM, "Amit D." wrote:

The LSTM engine is trained on text-line images and learns from context, so it does not surprise me that OCR accuracy is poor for a single glyph.

So, is this another case where the legacy engine is better than LSTM?


Shreeshrii avatar Feb 22 '17 15:02 Shreeshrii

Yes, the legacy engine (--oem 0) gets this one right.

tesseract4 --psm 7 --oem 0 2s.png 2s-out-oem0-psm7.txt

2s-out-oem0-psm7.txt
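To repeat this comparison on another image, a minimal sketch along the following lines may help. It assumes the binary is on the PATH as `tesseract` (the command above uses a local `tesseract4` name) and that the installed traineddata still contains the legacy model, since the `tessdata_fast` and `tessdata_best` files are LSTM-only. The helper names `build_command` and `compare_engines` are made up for illustration.

```python
import subprocess

def build_command(image, oem, psm=7, binary="tesseract"):
    """Return the argv list for one OCR run that prints text to stdout."""
    # Using "stdout" as the output base makes tesseract print the result
    # instead of writing a .txt file.
    return [binary, image, "stdout", "--oem", str(oem), "--psm", str(psm)]

def compare_engines(image, binary="tesseract"):
    """Run the legacy (--oem 0) and LSTM (--oem 1) engines on the same image."""
    results = {}
    for name, oem in (("legacy", 0), ("lstm", 1)):
        proc = subprocess.run(
            build_command(image, oem, binary=binary),
            capture_output=True,
            text=True,
        )
        # None marks a failed run (for example, a missing legacy model).
        results[name] = proc.stdout.strip() if proc.returncode == 0 else None
    return results
```

Calling `compare_engines("2s.png")` returns a dict with one entry per engine, so the two readings can be compared side by side.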

andrewisplinghoff avatar Feb 23 '17 08:02 andrewisplinghoff

@zdenop Please label : accuracy.

Shreeshrii avatar Mar 28 '18 09:03 Shreeshrii

Another instance was reported in the forum, in the context of recognizing license plates.

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/qxB-aCa3r6E

Test image is

minus-4l

Shreeshrii avatar Mar 28 '18 09:03 Shreeshrii

numbers-dawg contains patterns of numbers with punctuation and letters. However, there is currently no way to specify patterns for license plates, VINs, or product IDs, which are non-dictionary words made of arbitrary combinations of numbers and letters.
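When the expected format is known in advance, one workaround outside Tesseract is to post-correct the recognized text against that format. The sketch below is only an illustration under stated assumptions: the mask notation (`D` for digit, `L` for letter), the function name `apply_mask`, and the look-alike table are hypothetical, and the table contains only the pairs reported in this thread (S/5, I/1, A/4).

```python
# Look-alike pairs reported in this thread; extend for your own data.
DIGIT_TO_LETTER = {"5": "S", "1": "I", "4": "A"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}

def apply_mask(text, mask):
    """Push each character toward the class the mask expects ('D' or 'L')."""
    if len(text) != len(mask):
        return None  # length mismatch: let the caller reject or retry
    out = []
    for ch, kind in zip(text, mask):
        if kind == "L" and ch.isdigit():
            # A digit where a letter is expected: swap in its look-alike letter.
            ch = DIGIT_TO_LETTER.get(ch, ch)
        elif kind == "D" and ch.isalpha():
            # A letter where a digit is expected: swap in its look-alike digit.
            ch = LETTER_TO_DIGIT.get(ch.upper(), ch)
        out.append(ch)
    return "".join(out)

print(apply_mask("25", "DL"))
print(apply_mask("123456781", "DDDDDDDDL"))
```

This only helps when every position has a fixed character class; it cannot resolve a position where both a digit and a letter are legitimate.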

Here are the other two images from error reports:

minus-0o

2s

@theraysmith

Is there a variable which can be set for better accuracy in such cases?

Shreeshrii avatar Mar 29 '18 03:03 Shreeshrii

Another issue reported in the forum:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/6a6sKOXdZsA

When preceded by digits, `I` is read as `1` and `A` as `4`; in all-letter strings both are read correctly:

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
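The substitutions in these results can be pulled out mechanically, which is handy when collecting more test cases. The following is a small sketch, not part of Tesseract; the function name `confusions` is hypothetical, and it only handles one-for-one substitutions between strings of equal length.

```python
def confusions(expected, actual):
    """List (position, expected_char, actual_char) where the strings differ."""
    if len(expected) != len(actual):
        raise ValueError("this sketch handles substitutions only")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]

# Ground truth and OCR output pairs copied from the list above.
pairs = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]

for expected, actual in pairs:
    print(expected, "->", actual, confusions(expected, actual))
```

Only the digit-prefixed strings ending in `I` or `A` yield a substitution; `H` survives in both contexts.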

Shreeshrii avatar Apr 30 '18 14:04 Shreeshrii

Unfortunately, I've fallen into the same pit. Is there a solution yet? I think I've tried everything, and every topic on this matter that I can find online has been left without a solution.

kolakao avatar Apr 14 '19 11:04 kolakao

Same problem here

Hello, do you have datasets somewhere available for testing?

ghost avatar Feb 24 '22 14:02 ghost

This thread has been open for 5 years. Has anyone come up with a method for reliably getting tesseract to read a combination of letters and numbers?

SHANDLEMAN avatar Apr 22 '22 15:04 SHANDLEMAN