OCR quality --- lessons learned should be in the docs
Hello!
I'm using the current SikuliX version 2.0.1 on Windows 7 and 10 with Java 1.8.0_231 (64-bit).
I notice that the OCR feature works better than before for some texts but a little worse for others, even though I am now using the German traineddata.
See my screenshot. The older SikuliX had no problem with this word.
Any ideas?
Have you already tried to disable ClearType in Windows Settings?
And are you using traineddata from https://github.com/tesseract-ocr/tessdata_fast?
Have you already tried another page segmentation mode (e. g. 11)?
OK, I tried tessdata_fast: the word was recognized. But in exchange, recognition of another simple word, which could be recognized before, now fails. :( What do you mean by page segmentation?
If you want to have maximum accuracy you can also use traineddata from https://github.com/tesseract-ocr/tessdata_best . But OCR will be MUCH slower.
What do you mean by page segmentation?
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
In SikuliX you can choose all PSMs without OSD out of the box. 11 might be a good choice in your case:
tr = TextOCR.start()
tr.setPSM(11)
If you want to try PSMs with OSD (e.g. 12) you have to put https://github.com/tesseract-ocr/tessdata_best/raw/master/osd.traineddata into your tessdata directory.
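A small pre-flight check along the following lines could catch a missing osd.traineddata before a PSM with OSD is selected. This is only a sketch in plain Python: the helper names and the `tessdata_dir` argument are hypothetical and not part of SikuliX, and the set of OSD modes (0, 1 and 12) is taken from the Tesseract PSM list.

```python
import os

# Tesseract page segmentation modes that rely on orientation and
# script detection (OSD): 0 = OSD only, 1 = automatic with OSD,
# 12 = sparse text with OSD.
OSD_PSMS = frozenset([0, 1, 12])


def psm_needs_osd(psm):
    """Return True if the given PSM requires osd.traineddata."""
    return psm in OSD_PSMS


def can_use_psm(psm, tessdata_dir):
    """Return True if the PSM is usable with the files in tessdata_dir.

    tessdata_dir is a hypothetical argument: pass the folder in which
    your installation keeps its traineddata files.
    """
    if not psm_needs_osd(psm):
        return True
    return os.path.isfile(os.path.join(tessdata_dir, "osd.traineddata"))
```

If `can_use_psm(12, ...)` returns False, falling back to PSM 11 avoids the need for the extra file.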
The most promising measure is almost always to disable ClearType in Windows settings.
Since you do not seem to recognize many real words it may also help to disable the word dictionaries:
tr.setConfigs(["nodict"])
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
Or perhaps traineddata files from https://github.com/tesseract-ocr/tessdata are a good compromise for you. They use an integerized version of tessdata_best and are quite a bit faster without sacrificing too much accuracy.
If all those measures do not work, can you please provide some example images?
@ballma
Thanks for the information about what can be done in such cases.
Will add it to the docs.
Should we add osd.traineddata to the package or is it sufficient to talk about in what cases it is needed, where to get it and where to put it?
@RaiMan
Should we add osd.traineddata to the package
osd.traineddata is approx. 4.3 MB in zipped form. Because I do not see too many use cases for OSD in SikuliX, it's most probably not worth integrating it by default. IMHO mentioning it in the docs should be sufficient.
OK
Or perhaps we should run some experiments to get a feeling for whether OSD helps to improve accuracy. Similar to the experiments here: https://github.com/RaiMan/SikuliX1/commit/10eb3798961cd5f13b85bf209ad52efe9d7ff3ab#commitcomment-35389746
@PfrLisBerndt
Would be interesting to get some sample data from you :-)
And for the really interested, the paper about OSD: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35506.pdf
Would be interesting to get some sample data from you
Especially for stuff that is not working of course :-)
Oh. One very important use case is vertical text. Not possible to recognize without OSD.
Not possible to recognize without OSD
Interestingly, the LSTM model seems to be able to recognize vertical text even without OSD. But only if PSM is set to 2 or 3. 11 doesn't seem to work with vertical text. If I add the osd.traineddata, vertical text works as well with modes 1 and 12. Strange, needs some more investigation.
Silly question:
Have you ever tried to use the eng.traineddata from https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata instead of deu.traineddata? Since you do OCR on rather small images with only a few words on them, this one might be even more accurate than the German one.
Disabling ClearType is not an option? Have you already tried it?
And you say that you do OCR on the complete screen. Could you provide a complete screenshot of such a screen?
Can I get your mail-address?
We switched to the e-mail "channel" to debug further. Will add a summary here shortly.
It turned out that @PfrLisBerndt's use cases need quite some individual fine tuning.
What did not really help:
- Disabling ClearType
- Using tessdata_best
- Fiddling around with the tr.optimumDPI setting
What did finally help:
- For some screens we had to go back to the Tesseract legacy model, since LSTM seems to be very picky about font size and font sizes vary greatly on @PfrLisBerndt's screens (see also https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#rescaling and https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ). So we had to use traineddata from https://github.com/tesseract-ocr/tessdata and set OEM to 0 using tr.setOEM(0).
- On the other hand, on some other screens the legacy model didn't work at all. Here LSTM does a far better job.
To sum up, there seems to be no good "one-size-fits-all" setup. One can't even say that LSTM is superior to the legacy model in all cases.
Yes, thanks for your excellent support!
My personal best practice is to switch at runtime between oem=0 and oem=3 and try again if a text could not be recognized.
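The retry described here could be sketched roughly as follows. The function name and the `recognize` callable are hypothetical: `recognize` stands in for whatever code sets the engine mode and runs OCR, so the sketch itself does not depend on SikuliX.

```python
def find_with_oem_fallback(recognize, expected, oems=(3, 0)):
    """Try each OCR engine mode in order until the expected text is found.

    recognize is a hypothetical callable: it takes an OEM number, runs
    OCR with that mode and returns the recognized text as a string.
    Returns the OEM that produced the expected text, or None if no
    mode in oems found it.
    """
    for oem in oems:
        text = recognize(oem)
        if expected in text:
            return oem
    return None
```

In a SikuliX script, `recognize` could call `tr.setOEM(oem)` and then read the text of the region in question. Note that OEM 0 only works with traineddata that contains the legacy model.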
Have you also tried OEM 2? This seems to combine those two.
Good idea. But no, it doesn't help. There is obviously a difference.
@PfrLisBerndt
My personal best practice is to switch at runtime between oem=0 and oem=3 and try again if a text could not be recognized.
And what .traineddata version are you using in what oem case?
1. https://github.com/tesseract-ocr/tessdata (legacy)
2. https://github.com/tesseract-ocr/tessdata_best (best)
3. https://github.com/tesseract-ocr/tessdata_fast (fast)
As far as I understand, only 1. (legacy) is usable with oem=0, and it contains models for LSTM too.
So what does Tesseract use, when running with 1. and oem=3 and why does switching help?
@balmma
In the standard (2.0.x) we currently run with traineddata fast and oem=3, which IMO means, that internally the LSTM engine is used.
Should anything be changed?
Do we need better support for switching between the different combinations of traineddata and OEM, or is it enough to provide some information about where to get the files, where to put them at runtime and how to use the OEM switching?
And what .traineddata version are you using in what oem case?
https://github.com/tesseract-ocr/tessdata for both
So what does Tesseract use, when running with 1. and oem=3 and why does switching help?
- OEM 0: Legacy
- OEM 1: LSTM only
- OEM 2: combine LSTM and Legacy
- OEM 3: LSTM if available, legacy otherwise
Switching helps because OEM 0 uses the legacy model only.
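To make the combinations concrete, here is a small sketch that models which engine would run for a given traineddata repository and OEM value. The function and table names are made up for illustration. The table assumes what was said in this thread: only tessdata contains the legacy model next to the LSTM model, while tessdata_best and tessdata_fast are LSTM only.

```python
# Engines contained in each traineddata repository (assumption based on
# this thread: only tessdata ships the legacy model as well).
ENGINES = {
    "tessdata": {"legacy", "lstm"},
    "tessdata_best": {"lstm"},
    "tessdata_fast": {"lstm"},
}

# Engines that each explicit OEM value asks for. OEM 3 is handled
# separately because it picks whatever is available.
REQUIRED = {0: {"legacy"}, 1: {"lstm"}, 2: {"legacy", "lstm"}}


def effective_engines(repo, oem):
    """Return the set of engines that would run, or None if the
    combination of repository and OEM cannot work."""
    available = ENGINES[repo]
    if oem == 3:
        # LSTM if available, legacy otherwise.
        return {"lstm"} if "lstm" in available else {"legacy"}
    needed = REQUIRED[oem]
    return needed if needed <= available else None
```

With tessdata, `effective_engines("tessdata", 3)` gives the LSTM engine and `effective_engines("tessdata", 0)` gives the legacy engine, which is why switching between the two at runtime produces different results from the same files.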
In the standard (2.0.x) we currently run with traineddata fast and oem=3, which IMO means, that internally the LSTM engine is used.
correct
Should anything be changed?
No, LSTM should be OK in most situations.
I noticed that the deu.traineddata in my tessdata folder has often vanished after a virtual machine restarts. I don't know the reason yet, or why it does not happen every time. Since the file does not vanish on my developer PC, where I have admin permissions, I assume it has something to do with permissions. Recopying the file by hand into that hidden folder on currently eight VMs every time it vanishes would be very annoying, so it would be nice if the tessdata could be packed into the project. Or is there a workaround?
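One possible workaround, not confirmed anywhere in this thread, would be to keep a copy of the traineddata file next to the project and restore it at script start when it is missing. The sketch below is plain Python. The function name and both arguments are hypothetical, and the actual location of the tessdata folder has to be supplied by the caller.

```python
import os
import shutil


def ensure_traineddata(bundled_file, tessdata_dir):
    """Copy bundled_file into tessdata_dir if it is missing there.

    Both arguments are hypothetical: bundled_file is a traineddata
    file kept next to the project, tessdata_dir is the folder the OCR
    engine reads its traineddata from. Returns True if a copy was made.
    """
    target = os.path.join(tessdata_dir, os.path.basename(bundled_file))
    if os.path.isfile(target):
        return False
    if not os.path.isdir(tessdata_dir):
        os.makedirs(tessdata_dir)
    shutil.copyfile(bundled_file, target)
    return True
```

Calling this once before the first OCR call would restore the file on any VM where it vanished, provided the account running the script may write to that folder.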