SikuliX1 OCR quality --- lessons learned should be in the docs

Hello!

I'm using the current SikuliX version 2.0.1 on Windows 7 and 10 with Java 1.8.0_231 (64-bit).

I notice that the OCR feature works better than before for some texts, but a little worse for others, even though I am now using the German traineddata.

See my screenshot. The older SikuliX had no problem with this word.

Any ideas?

Nov 26 '19 11:11 LisBerndt

Have you already tried to disable ClearType in Windows Settings?

Nov 27 '19 21:11 balmma

And are you using traineddata from https://github.com/tesseract-ocr/tessdata_fast?

Nov 27 '19 21:11 balmma

Have you already tried another page segmentation mode (e.g. 11)?

Nov 27 '19 21:11 balmma

OK, I tried tessdata_fast, and the word was recognized. But now the recognition of another simple word, which could be recognized before, fails. :( What do you mean by page segmentation?

Nov 28 '19 09:11 LisBerndt

If you want to have maximum accuracy you can also use traineddata from https://github.com/tesseract-ocr/tessdata_best . But OCR will be MUCH slower.

What do you mean by page segmentation?

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

In SikuliX you can choose all PSMs without OSD out of the box. 11 might be a good choice in your case:

tr = TextOCR.start()
tr.setPSM(11)

If you want to try PSMs with OSD (e.g. 12) you have to put https://github.com/tesseract-ocr/tessdata_best/raw/master/osd.traineddata into your tessdata directory.

The most promising measure is almost always to disable ClearType in Windows settings.

Since you do not seem to recognize many real words it may also help to disable the word dictionaries:

tr.setConfigs(["nodict"])

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns

Nov 28 '19 10:11 balmma
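
The PSM, OSD and dictionary hints above can be combined in a small helper. This is only a sketch under assumptions: `psm_is_usable` and `configure_sparse_text` are hypothetical names, the tessdata directory is whatever folder your installation uses, and `tr` is assumed to be the recognizer returned by `TextOCR.start()` with the `setPSM` and `setConfigs` calls shown above.

```python
import os

# Tesseract page segmentation modes that depend on orientation and
# script detection (OSD): 0 = OSD only, 1 = automatic with OSD,
# 12 = sparse text with OSD.
OSD_PSMS = frozenset([0, 1, 12])


def psm_is_usable(psm, tessdata_dir):
    """Return True if the PSM can run with the files present in tessdata_dir."""
    if psm not in range(14):  # Tesseract 4 defines PSM 0..13
        raise ValueError("unknown PSM: %r" % (psm,))
    if psm in OSD_PSMS:
        # OSD modes need osd.traineddata next to the language files.
        return os.path.isfile(os.path.join(tessdata_dir, "osd.traineddata"))
    return True


def configure_sparse_text(tr, tessdata_dir, want_osd=False, use_dictionary=False):
    """Apply sparse-text settings to a recognizer and return the PSM chosen.

    `tr` is assumed to offer setPSM() and setConfigs() as used in this thread.
    """
    # Fall back to PSM 11 when OSD is wanted but osd.traineddata is missing.
    psm = 12 if want_osd and psm_is_usable(12, tessdata_dir) else 11
    tr.setPSM(psm)
    if not use_dictionary:
        tr.setConfigs(["nodict"])
    return psm
```

The fallback from 12 to 11 is a design choice for this sketch; raising an error instead may be preferable if vertical text must be recognized.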

Or perhaps the traineddata files from https://github.com/tesseract-ocr/tessdata are a good compromise for you. They use an integerized version of tessdata_best and are quite a bit faster without sacrificing too much accuracy.

Nov 28 '19 10:11 balmma

If all those measures do not work, can you please provide some example images?

Nov 28 '19 10:11 balmma

@balmma Thanks for the information about what can be done in such cases. I will add it to the docs. Should we add osd.traineddata to the package, or is it sufficient to document when it is needed, where to get it, and where to put it?

Nov 29 '19 09:11 RaiMan

@RaiMan

Should we add osd.traineddata to the package

osd.traineddata is approx. 4.3 MB in zipped form. Because I do not see too many use cases for OSD in SikuliX, it is most probably not worth including it by default. IMHO mentioning it in the docs should be sufficient.

Nov 29 '19 09:11 balmma

OK

Nov 29 '19 09:11 RaiMan

Or perhaps we should run some experiments to get a feeling for whether OSD helps to improve accuracy, similar to the experiments here: https://github.com/RaiMan/SikuliX1/commit/10eb3798961cd5f13b85bf209ad52efe9d7ff3ab#commitcomment-35389746

Nov 29 '19 09:11 balmma

@PfrLisBerndt

Would be interesting to get some sample data from you :-)

Nov 29 '19 09:11 balmma

And for the really interested, the paper about OSD: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35506.pdf

Nov 29 '19 09:11 balmma

Would be interesting to get some sample data from you

Especially for stuff that is not working of course :-)

Nov 29 '19 10:11 balmma

Oh. One very important use case is vertical text. Not possible to recognize without OSD.

Nov 29 '19 10:11 balmma

Not possible to recognize without OSD

Interestingly, the LSTM model seems to be able to recognize vertical text even without OSD. But only if PSM is set to 2 or 3. 11 doesn't seem to work with vertical text. If I add the osd.traineddata, vertical text works as well with modes 1 and 12. Strange, needs some more investigation.

Nov 29 '19 10:11 balmma

Silly question:

Have you ever tried using the eng.traineddata from https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata instead of deu.traineddata? Since you do OCR on rather small images with only a few words on them, this one might even be more accurate than the German one.

Disabling ClearType is not an option? Have you already tried it?

Nov 29 '19 12:11 balmma

And you say that you do OCR on the complete screen. Could you provide a complete screenshot of such a screen?

Nov 29 '19 12:11 balmma

Can I get your mail-address?

Nov 29 '19 13:11 LisBerndt

We switched to e-mail to debug further. I will add a summary here shortly.

Dec 02 '19 10:12 balmma

It turned out that @PfrLisBerndt's use cases need quite a bit of individual fine-tuning.

What did not really help:

Disabling ClearType
Using tessdata_best
Fiddling around with the tr.optimumDPI setting

What did finally help:

For some screens we had to go back to the Tesseract legacy model, since LSTM seems to be very picky about font size and the font sizes on @PfrLisBerndt's screens vary greatly (see also https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#rescaling and https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ). So we had to use traineddata from https://github.com/tesseract-ocr/tessdata and set OEM to 0 using tr.setOEM(0).
On the other hand, on some other screens the legacy model didn't work at all. Here LSTM does a far better job.

To sum up, there seems to be no good "one-size-fits-all" setup. One can't even say that LSTM is superior to the legacy model in all cases.

Dec 02 '19 10:12 balmma

Yes, thanks for your excellent support!

My personal best practice is to switch between oem=0 and oem=3 at runtime and try again if a text could not be recognized.

Dec 02 '19 11:12 LisBerndt
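
The retry practice described above could look roughly like the following. It is a minimal sketch, not the code actually used in this thread: `read_with_oem_fallback` is a hypothetical helper, and the recognition step is passed in as a function so that the SikuliX-specific part stays outside the helper.

```python
def read_with_oem_fallback(recognize, oems=(3, 0), accept=None):
    """Call recognize(oem) for each engine mode until a result is accepted.

    Returns (text, oem) for the first accepted result, or (None, None).
    By default any non-blank text is accepted; pass `accept` to require,
    for example, that an expected word is present.
    """
    for oem in oems:
        text = recognize(oem)
        if not text or not text.strip():
            continue  # nothing recognized with this engine mode
        text = text.strip()
        if accept is None or accept(text):
            return text, oem
    return None, None


# Hypothetical SikuliX wiring (not executed here). setOEM() is the call
# named in this thread; region.text() is assumed as the way to read text:
#
#   def recognize(oem):
#       tr.setOEM(oem)
#       return region.text()
#
#   text, oem = read_with_oem_fallback(recognize)
```

Trying oem=3 first keeps the default LSTM behaviour for screens where it already works and only pays for a second pass when it fails. Note that oem=0 needs traineddata that contains the legacy model.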

Have you also tried OEM 2? This seems to combine those two.

Dec 02 '19 11:12 balmma

Good idea. But no, it doesn't help. Obviously there is a difference.

Dec 02 '19 11:12 LisBerndt

@PfrLisBerndt

My personal best practice is to switch between oem=0 and oem=3 at runtime and try again if a text could not be recognized.

And what .traineddata version are you using in what oem case?

1. https://github.com/tesseract-ocr/tessdata (legacy)
2. https://github.com/tesseract-ocr/tessdata_best (best)
3. https://github.com/tesseract-ocr/tessdata_fast (fast)

As far as I understand, only 1. (legacy) is usable with oem=0, and it contains models for LSTM too. So what does Tesseract use when running with 1. and oem=3, and why does switching help?

Dec 02 '19 11:12 RaiMan

@balmma

In the standard setup (2.0.x) we currently run with the fast traineddata and oem=3, which IMO means that the LSTM engine is used internally.

Should anything be changed?

Do we need better support for switching between the different traineddata/OEM combinations, or is it enough to provide some information about where to get the files, where to put them at runtime, and how to use OEM switching?

Dec 02 '19 11:12 RaiMan

And what .traineddata version are you using in what oem case?

https://github.com/tesseract-ocr/tessdata for both

Dec 02 '19 12:12 balmma

So what does Tesseract use when running with 1. and oem=3, and why does switching help?

OEM 0: Legacy
OEM 1: LSTM only
OEM 2: combine LSTM and Legacy
OEM 3: LSTM if available, legacy otherwise

Switching helps because 0 uses the legacy model only.

Dec 02 '19 12:12 balmma
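
The relationship between the three traineddata repositories and the OEM values discussed above can be written down as a small lookup. This is a sketch for illustration: `SUPPORTED_OEMS` and `engine_used` are hypothetical names, and the table encodes what this thread states (only tessdata contains the legacy model; OEM 3 picks LSTM when it is available).

```python
# Engine modes each traineddata repository can serve. tessdata ships both
# legacy and LSTM models; tessdata_best and tessdata_fast are LSTM-only.
SUPPORTED_OEMS = {
    "tessdata": frozenset([0, 1, 2, 3]),
    "tessdata_best": frozenset([1, 3]),
    "tessdata_fast": frozenset([1, 3]),
}

# Engine that ends up being used for each OEM value. All three repositories
# contain an LSTM model, so OEM 3 always resolves to LSTM here.
ENGINE_FOR_OEM = {0: "legacy", 1: "lstm", 2: "lstm+legacy", 3: "lstm"}


def engine_used(repo, oem):
    """Return the engine Tesseract would run, or raise if the combination cannot work."""
    if repo not in SUPPORTED_OEMS:
        raise ValueError("unknown traineddata repository: %r" % (repo,))
    if oem not in SUPPORTED_OEMS[repo]:
        raise ValueError("OEM %r needs a model that %s does not contain" % (oem, repo))
    return ENGINE_FOR_OEM[oem]
```

With this table, `engine_used("tessdata", 3)` gives `"lstm"` and `engine_used("tessdata", 0)` gives `"legacy"`, which is why switching between the two values changes the result even though the traineddata file stays the same.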

In the standard setup (2.0.x) we currently run with the fast traineddata and oem=3, which IMO means that the LSTM engine is used internally.

correct

Should anything be changed?

No, LSTM should be OK in most situations.

Dec 02 '19 12:12 balmma

I noticed that the deu.traineddata in my tessdata folder has often vanished after a virtual machine has restarted. I don't know the reason yet, or why it does not happen every time. Since the file does not vanish on my developer PC, where I have admin permissions, I assume it has something to do with that. It would be very annoying to copy the file by hand into that hidden folder on currently eight VMs every time it has vanished, so it would be nice if the tessdata were packed into the project. Or is there any workaround?

Dec 05 '19 12:12 LisBerndt
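
One possible workaround, sketched here under assumptions, is to keep a copy of the traineddata files next to the script and restore any missing file at the start of each run. `ensure_traineddata` is a hypothetical helper, and both directory arguments are placeholders: the actual tessdata folder depends on the installation and is not confirmed in this thread.

```python
import os
import shutil


def ensure_traineddata(bundled_dir, tessdata_dir, languages=("deu",)):
    """Copy missing <lang>.traineddata files from bundled_dir into tessdata_dir.

    Returns the list of file names that had to be restored.
    """
    if not os.path.isdir(tessdata_dir):
        os.makedirs(tessdata_dir)
    restored = []
    for lang in languages:
        name = lang + ".traineddata"
        target = os.path.join(tessdata_dir, name)
        if not os.path.isfile(target):
            # Existing files are left untouched; only missing ones are copied.
            shutil.copy(os.path.join(bundled_dir, name), target)
            restored.append(name)
    return restored
```

Calling this before the first OCR call would avoid the manual copy on each VM, provided the account running the script may write to the tessdata folder. It does not explain why the file vanishes in the first place.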