
Problems with special characters in the Polish version of tessdata

Open krzysiekj94 opened this issue 2 years ago • 7 comments

Environment

  • Tesseract Version: 5.1.0
  • Platform: Windows 32-bit, compiled under MSVC 2017

Current Behavior:

I have the following problem:

  1. I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in a 32-bit .exe application. I built the following dependencies with CMake 3.23 (without the SW build):
     a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37
     b. Source links:
        - tesseract 5.1.0 (https://github.com/tesseract-ocr/tesseract/releases) – 01.03.2022
        - leptonica 1.82.0 (http://www.leptonica.org/download.html) – 22.09.2021
        - libtiff 4.3.0 (http://download.osgeo.org/libtiff) – 20.04.2021
        - libjpeg-turbo 2.1.3 (https://github.com/libjpeg-turbo/libjpeg-turbo/releases) – 25.02.2022
        - zlib 1.2.11 (https://github.com/madler/zlib/tags) – 15.01.2017
        - libpng 1.6.37 (https://github.com/glennrp/libpng/releases/tag/v1.6.37) – 14.04.2019
  2. Then I used the pol.traineddata model from https://github.com/tesseract-ocr/tessdata_fast to process this image: test
     a. I got the result "janQkowalski".
     b. The '@' character was not recognized correctly.
     c. My problem is that I have to scan a lot of address data, including emails, and no matter how large the font is, the '@' character is recognized as 'Q'.
     d. I checked version 4.1.1 and it works properly there. Possibly important: I use only the "pol" language for the OCR parameterization.

Expected Behavior:

I expect special characters such as '@' to be recognized regardless of the language selected (especially for the "pol" model, but the problem may exist for other languages as well). For example, the problem is not reproducible with the German model.

Suggested Fix:

Is there any workaround for this behavior? If this is not the correct place to report it, please let me know; on the other hand, if something worked in 4.1.1 and no longer works, it should be treated as a bug.
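One possible application-side stopgap, while the model issue is open, is to repair the misread character in post-processing. This is only a sketch: the pattern assumes the '@' of an e-mail address is misread as 'Q' and that the domain part contains a dot (so the sample address below is hypothetical; the thread's "janQkowalski" has no domain suffix), and it would also rewrite a legitimate 'Q' inside such a token:

```shell
# Hypothetical post-processing: turn "janQkowalski.pl" back into
# "jan@kowalski.pl". The regex is an assumption about the data,
# not a general fix.
echo "kontakt: janQkowalski.pl" | \
  sed -E 's/([[:alnum:]._-]+)Q([[:alnum:]-]+(\.[[:alnum:]-]+)+)/\1@\2/g'
```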

krzysiekj94 avatar Apr 20 '22 15:04 krzysiekj94

Are you using the same traineddata file with 4.1.1?

Shreeshrii avatar Apr 20 '22 16:04 Shreeshrii

> Are you using the same traineddata file with 4.1.1?

Hello @Shreeshrii,

Exactly the same. If that helps, here's my tessdata folder.

tessdata.zip

krzysiekj94 avatar Apr 20 '22 16:04 krzysiekj94

1). Today I started running tests combined with another language, i.e. "pol+eng" and "pol+deu". Surprisingly, in this case the '@' sign started to be recognized properly in the first example. See below:

image

2). However, strings with '@' are not always recognized correctly for "pol+eng", especially if they are smaller -> example below. File: test

Result for only 'pol' language: image

Result for 'pol+eng' languages:

image

Result for 'pol+deu' languages:

image

'pol+deu' does best here, but the problem is that some Polish letters, e.g. "ó", are recognized as the German "ö", so I would have to think about another solution. Do you have any ideas what could be done so that the special characters are recognized correctly?
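For reference, the runs above combine traineddata files with `+` in the language string on the command line (image and output base names are placeholders):

```shell
tesseract test.png out -l pol        # '@' misread as 'Q' in the report above
tesseract test.png out -l pol+eng    # better, but not for all samples
tesseract test.png out -l pol+deu    # best here, but "ó" can come out as "ö"
```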

krzysiekj94 avatar Apr 22 '22 15:04 krzysiekj94

Hello @krzysiekj94. You could try the Latin model. At first glance it seems to produce good results, but you may need to finetune some characters to get proper results for all Polish characters.

JKamlah avatar Apr 25 '22 11:04 JKamlah

Hello @JKamlah,
Thanks for your reply.

1). In the case of the Latin model ("pl+Latin") I found some OCR mistakes with Polish letters. Sometimes the engine returns "t" instead of "ł". image image

2). In the case of the combined Polish and English model ("pl+eng") I found fewer OCR mistakes with Polish letters. There are no problems with Polish characters here, but Latin is better at dealing with email addresses. Another problem is that "pl+Latin" is slower than "pl+eng": for example, for a 30-page TIFF, OCR with Latin takes about 3 minutes, while "pl+eng" takes half as long. I think the "pl+eng" model wins slightly here.

Do you have any other ideas what I could try?

krzysiekj94 avatar May 06 '22 16:05 krzysiekj94

You could try finetuning. This technique works best if you work with a small set of font types and a fixed range of glyphs that you want to detect properly, so it depends on your task.

  1. Use the "latin" model only, not "pl+latin" (it seems you work with mixed languages that are all based on Latin script?).
  2. Extend the dictionary with Polish words (or just get rid of the existing one). If you have some fixed keywords, add these to the dict(!). Use [combine_tessdata](https://digi.bib.uni-mannheim.de/tesseract/manuals/combine_tessdata.1.html) to (un)pack the model and [dawg2wordlist](https://digi.bib.uni-mannheim.de/tesseract/manuals/dawg2wordlist.1.html)/[wordlist2dawg](https://digi.bib.uni-mannheim.de/tesseract/manuals/wordlist2dawg.1.html) to convert a DAWG to a wordlist and vice versa. The wordlist can be edited with a simple text editor.
  3. Find the characters that are confused most often ("t" instead of "ł", etc.).
  4. Generate artificial ground truth or transcribe some, and finetune the latin model. Or find [some Polish GT](http://dl.psnc.pl/activities/projekty/impact/results/).
  5. Finetune the latin model (you can only train best models) with [tesstrain](https://github.com/tesseract-ocr/tesstrain). You can find some examples in the [tesstrain wiki](https://github.com/tesseract-ocr/tesstrain/wiki).
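The dictionary round-trip in step 2 and the finetuning call in step 5 can be sketched roughly as follows ("pol.traineddata", the wordlist name, and the tesstrain variable values are placeholders; finetuning requires a best model):

```shell
# Step 2: unpack the model, edit the word dictionary, repack.
combine_tessdata -u pol.traineddata pol.           # unpack components
dawg2wordlist pol.lstm-unicharset pol.lstm-word-dawg pol.wordlist
# ... edit pol.wordlist in any text editor: add fixed keywords, etc. ...
wordlist2dawg pol.wordlist pol.lstm-word-dawg pol.lstm-unicharset
combine_tessdata pol.                              # repack into pol.traineddata

# Step 5: finetuning run with tesstrain (Makefile-based);
# the variable values here are purely illustrative.
make training MODEL_NAME=pol_custom START_MODEL=Latin \
  TESSDATA=../tessdata_best MAX_ITERATIONS=2000
```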

To find the best workflow, you could use some tooling to get proper metrics and more insight:

  1. Generate a ground truth set for validation
  2. Evaluate the OCR result with [ocreval](https://github.com/eddieantonio/ocreval)
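The evaluation step can look like this with ocreval's `accuracy` tool (filenames are placeholders):

```shell
# Compare the ground truth against the OCR output and write a report
# with character accuracy and confusion counts.
accuracy ground_truth.txt ocr_output.txt accuracy_report.txt
cat accuracy_report.txt
```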

JKamlah avatar May 09 '22 08:05 JKamlah


Thanks for your suggestions. I will try them.

krzysiekj94 avatar May 09 '22 22:05 krzysiekj94