Corrupt symbols in dict files (de)
Hi,
I think the German dict file (https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/dicts/de.txt) is partially corrupt. The German umlauts appear in Emacs as \207, \154, etc., and when I generate text (I use a Python 3.6 conda env) all umlauts are omitted.
If I paste a "ü" at the beginning of the file, that umlaut is parsed correctly by the:
with open(file, 'r', 'utf8', ...) in https://github.com/Belval/TextRecognitionDataGenerator/blob/master/trdg/utils.py row 14, so that call does work.
I found the issue because EasyOCR (https://github.com/JaidedAI/EasyOCR, derived work), which uses deep-text-recognition, which in turn uses this repository to generate data, failed to recognize German umlauts such as "ü".
Best regards, Nils
Current dict file:
file -i trdg/dicts/de.txt
trdg/dicts/de.txt: text/plain; charset=unknown-8bit
I managed to solve it by downloading a new one, which was:
trdg/dicts/de.txt: text/plain; charset=iso-8859-1
and then changing the decoding from utf8 to iso-8859-1 in the utils.py file (row 14).
This might not be your preferred solution but it might help others or give suggestions for how to fix it if you experience the issue as well.
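For anyone wanting to try the workaround, here is a minimal sketch of it. The helper name load_dict is hypothetical and is not the function in utils.py; it only assumes the dictionary is a plain word list with one entry per line. The second half shows why the encoding matters: the same bytes decode cleanly as iso-8859-1 but are rejected by a strict UTF-8 decode.

```python
def load_dict(path, encoding="iso-8859-1"):
    """Return the non-empty lines of a word list, decoded with `encoding`.

    Hypothetical helper for illustration; not the code in trdg/utils.py.
    """
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# "für" stored as ISO-8859-1: the umlaut is the single byte 0xFC.
raw = b"f\xfcr"
as_latin1 = raw.decode("iso-8859-1")  # decodes to "für"

try:
    raw.decode("utf8")  # strict UTF-8 rejects the lone byte 0xFC
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
```

Note that iso-8859-1 maps every possible byte to a character, so decoding with it never raises an error. That makes it convenient here, but it will also silently accept files that are really in some other encoding.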
This is an actual issue that should be addressed. I accepted the dictionaries without parsing them first so it's on me.
I'd rather have all dictionaries in UTF-8 so I will convert them all.
Thank you for reporting this.
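The conversion to UTF-8 could look roughly like the sketch below. This is not the actual change that went into the repository; convert_to_utf8 is a hypothetical helper, and the source encoding iso-8859-1 is an assumption based on the file reported above. Other dictionaries may use a different legacy encoding and would need that argument adjusted.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="iso-8859-1"):
    """Rewrite `path` as UTF-8 in place; return True if the file was changed.

    Hypothetical helper; `source_encoding` is an assumed legacy encoding.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")  # already valid UTF-8 (plain ASCII included)
        return False
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)  # reinterpret the legacy bytes
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling it once per file under trdg/dicts would leave valid UTF-8 and ASCII files untouched and rewrite only the ones that fail to decode. A wrong guess for the source encoding would not raise an error with iso-8859-1, so the result is worth checking by eye.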
Should be fixed by #174. Can you confirm?
Hi, the German dictionary (de.txt) is fixed, but there are two more non-UTF-8 dictionaries:
cd TextRecognitionDataGenerator
file -i trdg/dicts/*
gives
de.txt: text/plain; charset=utf-8
en.txt: text/plain; charset=us-ascii
es.txt: text/plain; charset=unknown-8bit
. . .
Not sure if they are also causing issues. Thanks for the fast response!
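The same check can be done without the file command. Below is a small sketch that lists the dictionaries which do not decode as strict UTF-8; find_non_utf8 is a hypothetical helper, and it assumes the dictionaries are the *.txt files in a single directory. Files reported as us-ascii pass, since ASCII is a subset of UTF-8.

```python
from pathlib import Path


def find_non_utf8(dict_dir):
    """Return sorted names of *.txt files in `dict_dir` that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(dict_dir).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")  # strict decode, no error handler
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

Running find_non_utf8("trdg/dicts") from the repository root should return the names of the files that still need converting.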