
Corrupt symbols in dict files (de)

Open nisseb opened this issue 5 years ago • 4 comments

Hi,

I think the German dict file (https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/dicts/de.txt) is partially corrupt. The German umlauts appear in Emacs as \207, \154, etc., and when I generate text (I use a Python 3.6 conda env) all umlauts are omitted.

If I paste a "ü" at the beginning of the file, that umlaut is parsed correctly by the call

with open(file, 'r', encoding='utf8', ...)

in https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/utils.py (row 14), so the UTF-8 decoding itself does work.

I found the issue because EasyOCR (https://github.com/JaidedAI/EasyOCR, a derived work), which uses deep-text-recognition, which in turn uses this repository to generate data, failed to recognize German umlauts such as "ü".
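The symptom described here (umlauts silently missing from the generated text) is what one would expect when ISO-8859-1 bytes are decoded as UTF-8 with decoding errors ignored. A minimal sketch, assuming the loader passes errors="ignore" (an assumption, since the remaining arguments of the call are elided above) and using a made-up sample word rather than an entry from de.txt:

```python
# Hypothetical sample word, not taken from de.txt.
word = "für"

latin1_bytes = word.encode("iso-8859-1")  # "ü" is stored as the single byte 0xFC
utf8_bytes = word.encode("utf-8")         # "ü" is stored as the two bytes 0xC3 0xBC

# 0xFC is not a valid UTF-8 start byte; errors="ignore" drops it without warning.
decoded_latin1 = latin1_bytes.decode("utf-8", errors="ignore")

# A UTF-8 encoded "ü" survives the same decoding step unchanged.
decoded_utf8 = utf8_bytes.decode("utf-8", errors="ignore")

print(repr(decoded_latin1))
print(repr(decoded_utf8))
```

This would also be consistent with a pasted "ü" being read correctly, since an editor saving in UTF-8 writes the two-byte form.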

Best regards, Nils

nisseb, Sep 21 '20 14:09

Current dict file:

file -i trdg/dicts/de.txt
trdg/dicts/de.txt: text/plain; charset=unknown-8bit

I managed to solve it by downloading a new dict file, for which file -i reports:

trdg/dicts/de.txt: text/plain; charset=iso-8859-1

and then changing the decoding from utf8 to iso-8859-1 in the utils.py file (row 14).

This might not be your preferred solution, but it might help others who experience the issue or suggest a way to fix it.

nisseb, Sep 21 '20 14:09

This is an actual issue that should be addressed. I accepted the dictionaries without parsing them first so it's on me.

I'd rather have all dictionaries in UTF-8 so I will convert them all.
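One way such a conversion could look is sketched below. This is not the script used in the repository; convert_to_utf8 is a hypothetical helper, and the source encoding (iso-8859-1 by default) is an assumption that has to be checked per file, because ISO-8859-1 decoding never fails and a wrong guess produces wrong characters without any error.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="iso-8859-1"):
    """Rewrite the file at `path` as UTF-8, assuming it is in `source_encoding`.

    Returns True if the file was rewritten, False if it already was valid UTF-8.
    """
    p = Path(path)
    raw = p.read_bytes()
    try:
        raw.decode("utf-8")  # strict decoding: raises on any invalid byte
        return False         # already valid UTF-8, leave the file untouched
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)
    # Write bytes directly so line endings are preserved exactly.
    p.write_bytes(text.encode("utf-8"))
    return True
```

Checking for valid UTF-8 first keeps the helper safe to run twice on the same file, which would otherwise double-encode the umlauts.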

Thank you for reporting this.

Belval, Sep 21 '20 15:09

Should be fixed by #174. Can you confirm?

Belval, Sep 29 '20 15:09

Hi, the German dictionary (de.txt) is fixed, but there are two more non-UTF-8 dictionaries:

cd TextRecognitionDataGenerator
file -i trdg/dicts/*

gives

de.txt: text/plain; charset=utf-8
en.txt: text/plain; charset=us-ascii
es.txt: text/plain; charset=unknown-8bit
. . .

Not sure if they are also causing issues. Thanks for the fast response!
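For anyone without the file utility, a rough Python equivalent of the check is sketched below. find_non_utf8 is a hypothetical helper, not part of trdg, and it only reports whether a file decodes as strict UTF-8; it does not guess the actual encoding. Pure ASCII files such as en.txt pass, since ASCII is a subset of UTF-8.

```python
from pathlib import Path


def find_non_utf8(dict_dir):
    """Return the names of .txt files in `dict_dir` that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(dict_dir).glob("*.txt")):
        try:
            # Strict decoding raises UnicodeDecodeError on the first invalid byte.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


if __name__ == "__main__":
    # Path taken from the commands above; run from the repository root.
    print(find_non_utf8("trdg/dicts"))
```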

nisseb, Oct 01 '20 12:10