commonvoice-fr
Improve `ENGLISH_COMPATIBLE`
Use:
uni2ascii -q wiki_fr_lower_accents.txt > wiki_fr_lower.txt
Instead of: https://github.com/common-voice/commonvoice-fr/blob/5699e59244d14bb14d5b7603b91c934b761c9194/DeepSpeech/fr/prepare_lm.sh#L19
https://billposer.org/Software/uni2ascii.html
Why?
I dumped the English Wikipedia to build a custom scorer.
Here is the result with `iconv`:
+ build_lm.sh
+ '[' 1 = 1 ']'
+ OLD_LANG=C.UTF-8
+ export LANG=en_US.UTF-8
+ LANG=en_US.UTF-8
+ pushd /mnt/extracted/
/mnt/extracted ~
+ /home/trainer/en_custom/prepare_lm.sh
+ '[' '!' -f en/wiki_en_lower.txt ']'
+ curl -sSL 'https://gitlab.com/waser-technologies/data/lm/en/wiki-dump/-/raw/main/wiki.en.txt?inline=false'
+ tr '[:upper:]' '[:lower:]'
+ '[' 1 = 1 ']'
+ mv en/wiki_en_lower.txt en/wiki_en_lower_accents.txt
+ head -n 5 en/wiki_en_lower_accents.txt
beliefs on how to abolish the state also differ.
contemporary anarchists such as ward claim that state education serves to perpetuate socioeconomic inequality.
marxists state that this contradiction was responsible for their inability to act.
both positive feedback loops have long been recognized as important for global warming.
cloud albedo has substantial influence over atmospheric temperatures.
+ iconv -f utf-8 -t ascii//TRANSLIT//IGNORE
iconv: illegal input sequence at position 26095
{!} : Aborted
: Container exited with code 1.
Even if the `iconv` route works with our old Wikipedia dump for French, chances are that a fresh dump made now would hit the same `illegal input sequence` error.
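To see what `iconv` is choking on, it can help to look at the bytes around the position it reports. Here is a minimal sketch, assuming the position in the error message is a byte offset into the input file. `show_bytes_around` is a hypothetical helper of mine, not something from `prepare_lm.sh`.

```shell
# Hypothetical helper (not part of prepare_lm.sh): dump the bytes around an offset.
# Usage: show_bytes_around FILE OFFSET [CONTEXT]
show_bytes_around() {
    file=$1
    offset=$2
    context=${3:-16}

    # Start CONTEXT bytes before the offset, clamped to the start of the file.
    start=$((offset - context))
    if [ "$start" -lt 0 ]; then
        start=0
    fi

    # tail -c +N is 1-based, so skip to byte start+1, then keep 2*CONTEXT bytes.
    # od -c prints printable characters as-is and other bytes as octal escapes.
    tail -c "+$((start + 1))" "$file" | head -c "$((context * 2))" | od -An -c
}
```

For the run above that would be something like `show_bytes_around en/wiki_en_lower_accents.txt 26095`. Since context is printed on both sides of the offset, it should not matter much whether `iconv` counts positions from 0 or from 1.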