tesseract
LSTM: Non-dictionary words with combination of letters and numbers not recognized.
https://groups.google.com/d/msgid/tesseract-ocr/1a3e8773-7151-48f9-92bb-fda888293eab%40googlegroups.com?utm_medium=email&utm_source=footer
While the single "S" is recognized correctly, the text "2S" is recognized as "25".
Here is a link to the test image:
https://03054610326450256607.googlegroups.com/attach/b8b86693ac072/2s.png?part=0.4&view=1
On 22-Feb-2017 9:02 PM, "Amit D." [email protected] wrote:
The LSTM engine is trained on text-line images and learns from context, so it does not surprise me that OCR accuracy for a single glyph is not so good.
So, is this another case where the legacy engine is better than LSTM?
Yes, the legacy engine (--oem 0) gets this one right.
tesseract4 --psm 7 --oem 0 2s.png 2s-out-oem0-psm7.txt
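To make the comparison above repeatable, here is a minimal Python sketch that runs the same image through both engines. It assumes the binary is on the PATH as `tesseract` (the command above uses `tesseract4`), that the installed traineddata includes the legacy model (otherwise `--oem 0` will fail), and that passing `stdout` as the output base prints the text instead of writing a file. The function names are made up for illustration.

```python
import subprocess


def build_command(image, oem, psm=7, binary="tesseract"):
    """Return the argv list for one Tesseract run that prints text to stdout."""
    # "stdout" as the output base makes Tesseract print the result
    # instead of writing <outputbase>.txt.
    return [binary, image, "stdout", "--psm", str(psm), "--oem", str(oem)]


def recognize(image, oem, psm=7, binary="tesseract"):
    """Run Tesseract once and return the recognized text, stripped."""
    result = subprocess.run(
        build_command(image, oem, psm, binary),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.strip()


def compare_engines(image, psm=7, binary="tesseract"):
    """Recognize the same image with the legacy (0) and LSTM (1) engines."""
    engines = (("legacy", 0), ("lstm", 1))
    return {name: recognize(image, oem, psm, binary) for name, oem in engines}
```

Calling `compare_engines("2s.png")` would return a dict with one result per engine, so the two outputs can be checked side by side.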
@zdenop Please label: accuracy.
Another instance was reported in the forum, in the context of recognizing license plates.
Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/qxB-aCa3r6E
Test image is

numbers-dawg has patterns of numbers with punctuation and letters. However, there is currently no way to specify patterns for strings such as license plates, VINs, and product IDs, which are non-dictionary words made of random combinations of numbers and letters.
Here are the other two images from error reports:


@theraysmith
Is there a variable which can be set for better accuracy in such cases?
Another issue was reported in the forum:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/6a6sKOXdZsA
"I" is recognized as "1" and "A" as "4" when the rest of the string is digits:
- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
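For anyone collecting more cases like these, a small Python sketch for listing which characters were swapped may help. It only compares strings of equal length position by position, which is enough for the cases above; the function name is hypothetical and this is not part of Tesseract.

```python
def find_confusions(expected, recognized):
    """Return (index, expected_char, recognized_char) for each mismatch."""
    if len(expected) != len(recognized):
        raise ValueError("strings differ in length; positional diff not possible")
    return [
        (i, e, r)
        for i, (e, r) in enumerate(zip(expected, recognized))
        if e != r
    ]


# The cases reported above, as (ground truth, OCR output) pairs.
cases = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]

for expected, recognized in cases:
    print(expected, "=>", recognized, find_confusions(expected, recognized))
```

Only the two digit-prefixed strings ending in "I" and "A" produce a non-empty list, which matches the pattern described above.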
Unfortunately, I've fallen into the same pit. Is there any solution yet? I think I've tried everything, and all the topics on this matter that I could find on the internet were left without a solution.
Same problem here
Hello, do you have any datasets available somewhere for testing?
This thread has been open for 5 years. Has anyone come up with a method for reliably getting tesseract to read a combination of letters and numbers?