
Improve `ENGLISH_COMPATIBLE`

Open wasertech opened this issue 1 year ago • 0 comments

Use:

uni2ascii -q wiki_fr_lower_accents.txt > wiki_fr_lower.txt

Instead of: https://github.com/common-voice/commonvoice-fr/blob/5699e59244d14bb14d5b7603b91c934b761c9194/DeepSpeech/fr/prepare_lm.sh#L19

https://billposer.org/Software/uni2ascii.html
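As a rough sketch, the swap could be wrapped in a small helper so that a missing binary fails loudly instead of producing an empty file. The function name `strip_accents` is hypothetical and not part of `prepare_lm.sh`; the `uni2ascii -q` invocation is copied as-is from above, and whether `-q` alone gives the desired output should be checked against the uni2ascii manual.

```shell
# Hypothetical helper, not part of prepare_lm.sh.
# Usage: strip_accents <input_file> <output_file>
strip_accents() {
  # Fail early if uni2ascii is not installed.
  if ! command -v uni2ascii >/dev/null 2>&1; then
    echo "uni2ascii not found, please install it first" >&2
    return 127
  fi
  # Same invocation as proposed in this issue.
  uni2ascii -q "$1" > "$2"
}

# Example call (file names taken from this issue):
# strip_accents wiki_fr_lower_accents.txt wiki_fr_lower.txt
```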

Why?

I've dumped the English Wikipedia to build a custom scorer.

Here is the result with iconv:

+ build_lm.sh
+ '[' 1 = 1 ']'
+ OLD_LANG=C.UTF-8
+ export LANG=en_US.UTF-8
+ LANG=en_US.UTF-8
+ pushd /mnt/extracted/
/mnt/extracted ~
+ /home/trainer/en_custom/prepare_lm.sh
+ '[' '!' -f en/wiki_en_lower.txt ']'
+ curl -sSL 'https://gitlab.com/waser-technologies/data/lm/en/wiki-dump/-/raw/main/wiki.en.txt?inline=false'
+ tr '[:upper:]' '[:lower:]'
+ '[' 1 = 1 ']'
+ mv en/wiki_en_lower.txt en/wiki_en_lower_accents.txt
+ head -n 5 en/wiki_en_lower_accents.txt
beliefs on how to abolish the state also differ.
contemporary anarchists such as ward claim that state education serves to perpetuate socioeconomic inequality.
marxists state that this contradiction was responsible for their inability to act.
both positive feedback loops have long been recognized as important for global warming.
cloud albedo has substantial influence over atmospheric temperatures.
+ iconv -f utf-8 -t ascii//TRANSLIT//IGNORE
iconv: illegal input sequence at position 26095

     {!} : Aborted
         : Container exited with code 1.

Even if the iconv route works with our old French wiki dump, I'm fairly sure that a fresh dump made today would hit an illegal input sequence as well.
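To see what iconv is choking on, one option is to dump the bytes around the reported position. This is only a minimal sketch, assuming the position in the error message is a byte offset into the input and that `en/wiki_en_lower_accents.txt` is the file fed to iconv; the helper name `bytes_around` is made up for illustration.

```shell
# Hypothetical helper: hex-dump COUNT bytes of FILE starting at byte OFFSET.
# Usage: bytes_around <file> <offset> [count]
bytes_around() {
  file=$1
  offset=$2
  count=${3:-32}
  # tail -c +N starts output at byte N (1-based), hence offset + 1.
  tail -c "+$((offset + 1))" "$file" | head -c "$count" | od -An -tx1
}

# Example with the position from the log above:
# bytes_around en/wiki_en_lower_accents.txt 26095
```

Any byte of `80` or above in the output is outside plain ASCII and is a candidate for the sequence iconv rejects.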

wasertech, Oct 19 '22 10:10