TextRecognitionDataGenerator Corrupt symbols in dict files (de)

Hi,

I think the German dict file (https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/dicts/de.txt) is partially corrupt. The German umlauts appear in Emacs as \207, \154, etc., and when I generate text (I use a Python 3.6 conda env) all umlauts are omitted.

If I paste a "ü" at the beginning of the file, that umlaut is parsed correctly by the call

with open(file, 'r', 'utf8', ...)

in https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/utils.py (row 14), which does work.
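The symptom can be reproduced without the dictionary file. The sketch below assumes the loader decodes as UTF-8 and silently skips bytes it cannot decode; the elided arguments of the open() call above are not shown here, so that error handling is an assumption. The function name and the sample word are hypothetical.

```python
def decode_like_loader(raw: bytes) -> str:
    """Decode bytes as UTF-8, dropping anything that is not valid UTF-8.

    Assumption: this mirrors a loader that opens the file as UTF-8 and
    ignores decoding errors. It is not copied from utils.py.
    """
    return raw.decode("utf-8", errors="ignore")


# Hypothetical sample word, not taken from de.txt.
latin1_bytes = "für".encode("iso-8859-1")  # "ü" stored as the single byte 0xFC
utf8_bytes = "für".encode("utf-8")         # "ü" stored as the two bytes 0xC3 0xBC

dropped = decode_like_loader(latin1_bytes)  # the lone 0xFC byte is skipped
kept = decode_like_loader(utf8_bytes)       # the umlaut survives
```

Under that assumption, a "ü" pasted from a UTF-8 editor is read correctly, while the umlauts already in the file vanish.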

I found the issue because EasyOCR (https://github.com/JaidedAI/EasyOCR, derived work), which uses deep-text-recognition, which in turn uses this repository to generate data, failed to recognize German umlauts such as "ü".

Best regards, Nils

Sep 21 '20 14:09 nisseb

Current dict file:

file -i trdg/dicts/de.txt
trdg/dicts/de.txt: text/plain; charset=unknown-8bit

I managed to solve it by downloading a new one, which was:

trdg/dicts/de.txt: text/plain; charset=iso-8859-1

and then changing the decoding from utf8 to iso-8859-1 in the utils.py file (row 14).

This might not be your preferred solution, but it might help others who experience the issue, or suggest how to fix it.

Sep 21 '20 14:09 nisseb

This is an actual issue that should be addressed. I accepted the dictionaries without parsing them first so it's on me.

I'd rather have all dictionaries in UTF-8 so I will convert them all.
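A minimal sketch of such a conversion, assuming the source encoding of each file is already known. The function names are hypothetical, and this is not necessarily how the conversion was actually done.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from source_encoding to UTF-8."""
    # Note: ISO-8859-1 maps every byte to a character, so a wrong guess
    # produces garbled text rather than an exception.
    return raw.decode(source_encoding).encode("utf-8")


def convert_file_to_utf8(path, source_encoding: str) -> None:
    """Rewrite one file in place as UTF-8 (hypothetical helper)."""
    p = Path(path)
    # Working on bytes leaves the line endings untouched.
    p.write_bytes(to_utf8(p.read_bytes(), source_encoding))


# Example call, assuming the file really is ISO-8859-1:
# convert_file_to_utf8("trdg/dicts/de.txt", "iso-8859-1")
```

Because a wrong source encoding does not raise an error for ISO-8859-1, it is worth checking a few words with umlauts by eye after converting.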

Thank you for reporting this.

Sep 21 '20 15:09 Belval

Should be fixed by #174. Can you confirm?

Sep 29 '20 15:09 Belval

Hi, the German (de.txt) is fixed but there are two more non-utf8 dictionaries:

cd TextRecognitionDataGenerator
file -i trdg/dicts/*

gives

de.txt: text/plain; charset=utf-8
en.txt: text/plain; charset=us-ascii
es.txt: text/plain; charset=unknown-8bit
. . .

Not sure if they are also causing issues. Thanks for the fast response!
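One way to check this without relying on the heuristics of `file` is a strict UTF-8 decode of every dictionary. The sketch below is a rough illustration; the function names are hypothetical and the `*.txt` pattern is an assumption about the directory layout.

```python
from pathlib import Path


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (plain ASCII also passes)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(dict_dir) -> list:
    """List the names of .txt files in dict_dir that are not valid UTF-8."""
    return sorted(
        p.name
        for p in Path(dict_dir).glob("*.txt")
        if not is_valid_utf8(p.read_bytes())
    )


# Example call from the repository root:
# print(find_non_utf8("trdg/dicts"))
```

Any file this reports would lose or garble its non-ASCII characters when read as UTF-8, so those are the ones to convert.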

Oct 01 '20 12:10 nisseb