
Problems with special characters in the Polish version of tessdata

Open krzysiekj94 opened this issue 2 years ago • 7 comments

Environment

  • Tesseract Version: 5.1.0
  • Platform: Windows 32-bit, compiled under MSVC 2017

Current Behavior:

I have the following problem:

  1. I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in a 32-bit .exe application. I built the following dependencies with CMake 3.23 (without the SW build):
     a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37
     b. Source links:
        - tesseract 5.1.0 (https://github.com/tesseract-ocr/tesseract/releases) – 01.03.2022
        - leptonica 1.82.0 (http://www.leptonica.org/download.html) – 22.09.2021
        - libtiff 4.3.0 (http://download.osgeo.org/libtiff) – 20.04.2021
        - libjpeg-turbo 2.1.3 (https://github.com/libjpeg-turbo/libjpeg-turbo/releases) – 25.02.2022
        - zlib 1.2.11 (https://github.com/madler/zlib/tags) – 15.01.2017
        - libpng 1.6.37 (https://github.com/glennrp/libpng/releases/tag/v1.6.37) – 14.04.2019
  2. Then I used the pol.traineddata model from https://github.com/tesseract-ocr/tessdata_fast to process this image: test
     a. I got the result "janQkowalski".
     b. The '@' character was not recognized correctly.
     c. My problem is that I have to scan a lot of address data, including emails, and no matter how large the font is, the '@' character is recognized as 'Q'.
     d. I checked version 4.1.1 and it works properly there. Possibly important: I use only the "pol" language for the OCR parameterization.

Expected Behavior:

I expect special characters such as '@' to be recognized regardless of the language selected (especially for the "pol" model, but the problem may exist for other languages as well). For example, the problem is not reproducible with the German model.

Suggested Fix:

Is there any workaround for this behavior? If this is not the correct place to report it, please let me know; on the other hand, if something worked in 4.1.1 and no longer works, it should be treated as a bug.
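One possible application-side stopgap, while the model issue is open, is to repair the misread character in post-processing. This is only a sketch: the pattern assumes the '@' of an e-mail address is misread as 'Q' and that the domain part contains a dot (so the sample address below is hypothetical; the thread's "janQkowalski" has no domain suffix), and it would also rewrite a legitimate 'Q' inside such a token:

```shell
# Hypothetical post-processing: turn "janQkowalski.pl" back into
# "jan@kowalski.pl". The regex is an assumption about the data,
# not a general fix.
echo "kontakt: janQkowalski.pl" | \
  sed -E 's/([[:alnum:]._-]+)Q([[:alnum:]-]+(\.[[:alnum:]-]+)+)/\1@\2/g'
```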

krzysiekj94 avatar Apr 20 '22 15:04 krzysiekj94

Are you using the same traineddata file with 4.1.1?

Shreeshrii avatar Apr 20 '22 16:04 Shreeshrii

> Are you using the same traineddata file with 4.1.1?

Hello @Shreeshrii,

Exactly the same. If that helps, here's my tessdata folder.

tessdata.zip

krzysiekj94 avatar Apr 20 '22 16:04 krzysiekj94

1). Today I started running tests combined with another language, i.e. "pol+eng" and "pol+deu". Surprisingly, in this case the '@' sign started to be recognized properly in the first example. See below:

image

2). However, strings with '@' are not always recognized correctly for "pol+eng", especially if they are smaller -> example below. File: test

Result for only 'pol' language: image

Result for 'pol+eng' languages:

image

Result for 'pol+deu' languages:

image

'pol+deu' does best here, but the problem is that some Polish letters, e.g. "ó", are recognized as the German "ö", so I would have to think about another solution. Do you have any ideas what could be done so that the special characters are recognized correctly?
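For reference, the runs above combine traineddata files with `+` in the language string on the command line (image and output base names are placeholders):

```shell
tesseract test.png out -l pol        # '@' misread as 'Q' in the report above
tesseract test.png out -l pol+eng    # better, but not for all samples
tesseract test.png out -l pol+deu    # best here, but "ó" can come out as "ö"
```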

krzysiekj94 avatar Apr 22 '22 15:04 krzysiekj94

Hello @krzysiekj94. You could try the Latin model. At first glance it seems to produce good results, but you may need to finetune some characters to get proper results for all Polish characters.

JKamlah avatar Apr 25 '22 11:04 JKamlah

Hello @JKamlah,
Thanks for your reply.

1). In the case of the Latin model ("pl+Latin") I found some OCR mistakes with Polish letters. Sometimes the engine returns "t" instead of "ł". image image

2). In the case of the combined Polish and English model ("pl+eng") I found fewer OCR mistakes with Polish letters. There are no problems with Polish characters here, but Latin is better at dealing with email addresses. Another problem is that "pl+Latin" is slower than "pl+eng": for example, for a 30-page TIFF, OCR with Latin takes about 3 minutes, while "pl+eng" takes half as long. I think the "pl+eng" model wins slightly here.

Do you have any other ideas what I could try?

krzysiekj94 avatar May 06 '22 16:05 krzysiekj94

You could try finetuning. This technique works best if you work with a small set of font types and a fixed range of glyphs that you want to detect properly, so it depends on your task.

  1. Use the "latin" model only, not "pl+latin" (it seems you work with mixed languages that are all based on Latin script?).
  2. Extend the dictionary with Polish words (or just get rid of the existing one). If you have some fixed keywords, add these to the dict(!). Use [combine_tessdata](https://digi.bib.uni-mannheim.de/tesseract/manuals/combine_tessdata.1.html) to (un)pack the model and [dawg2wordlist](https://digi.bib.uni-mannheim.de/tesseract/manuals/dawg2wordlist.1.html)/[wordlist2dawg](https://digi.bib.uni-mannheim.de/tesseract/manuals/wordlist2dawg.1.html) to convert a DAWG to a wordlist and vice versa. The wordlist can be edited with a simple text editor.
  3. Find the characters that are confused most often ("t" instead of "ł", etc.).
  4. Generate artificial ground truth or transcribe some, and finetune the latin model. Or find [some Polish GT](http://dl.psnc.pl/activities/projekty/impact/results/).
  5. Finetune the latin model (you can only train best models) with [tesstrain](https://github.com/tesseract-ocr/tesstrain). You can find some examples in the [tesstrain wiki](https://github.com/tesseract-ocr/tesstrain/wiki).
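The dictionary round-trip in step 2 and the finetuning call in step 5 can be sketched roughly as follows ("pol.traineddata", the wordlist name, and the tesstrain variable values are placeholders; finetuning requires a best model):

```shell
# Step 2: unpack the model, edit the word dictionary, repack.
combine_tessdata -u pol.traineddata pol.           # unpack components
dawg2wordlist pol.lstm-unicharset pol.lstm-word-dawg pol.wordlist
# ... edit pol.wordlist in any text editor: add fixed keywords, etc. ...
wordlist2dawg pol.wordlist pol.lstm-word-dawg pol.lstm-unicharset
combine_tessdata pol.                              # repack into pol.traineddata

# Step 5: finetuning run with tesstrain (Makefile-based);
# the variable values here are purely illustrative.
make training MODEL_NAME=pol_custom START_MODEL=Latin \
  TESSDATA=../tessdata_best MAX_ITERATIONS=2000
```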

To find the best workflow, you could use some tooling to get proper metrics and more insight:

  1. Generate a ground truth set for validation
  2. Evaluate the OCR result with [ocreval](https://github.com/eddieantonio/ocreval)
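The evaluation step can look like this with ocreval's `accuracy` tool (filenames are placeholders):

```shell
# Compare the ground truth against the OCR output and write a report
# with character accuracy and confusion counts.
accuracy ground_truth.txt ocr_output.txt accuracy_report.txt
cat accuracy_report.txt
```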

JKamlah avatar May 09 '22 08:05 JKamlah


Thanks for your suggestions. I will try them.

krzysiekj94 avatar May 09 '22 22:05 krzysiekj94