
Problem adapting the small model's vocabulary

Open Saivaks opened this issue 1 year ago • 2 comments

Hi! To improve Vosk's recognition, I decided to add some domain-specific words to the models' vocabulary. I followed this guide: https://alphacephei.com/vosk/lm. The adaptation completed successfully, as the output below shows. However, I could not find the lgraph folder, which should supposedly contain the data for the small model. The large model, on the other hand, was adapted and works with the new words.

+ rm -rf data/extra.lm.gz data/lang_local data/dict data/lang data/lang_test data/lang_test_rescore
+ rm -rf exp/lgraph
+ rm -rf exp/graph
+ mkdir -p data/dict
+ cp db/phone/extra_questions.txt db/phone/nonsilence_phones.txt db/phone/optional_silence.txt db/phone/silence_phones.txt data/dict
+ ./dict.py
+ ngram-count -wbdiscount -order 4 -text db/extra.txt -lm data/extra.lm.gz
+ ngram -order 4 -lm db/ru.lm.gz -mix-lm data/extra.lm.gz -lambda 0.95 -write-lm data/ru-mix.lm.gz
+ ngram -order 4 -lm data/ru-mix.lm.gz -prune 3e-8 -write-lm data/ru-mixp.lm.gz
+ ngram -lm data/ru-mixp.lm.gz -write-lm data/ru-mix-small.lm.gz
+ utils/prepare_lang.sh data/dict '[unk]' data/lang_local data/lang
utils/prepare_lang.sh data/dict [unk] data/lang_local data/lang
Checking data/dict/silence_phones.txt ...
--> reading data/dict/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/dict/silence_phones.txt is OK

Checking data/dict/optional_silence.txt ...
--> reading data/dict/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/dict/optional_silence.txt is OK

Checking data/dict/nonsilence_phones.txt ...
--> reading data/dict/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking data/dict/lexicon.txt
--> reading data/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/dict/lexicon.txt is OK

Checking data/dict/extra_questions.txt ...
--> data/dict/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory data/dict]

**Creating data/dict/lexiconp.txt from data/dict/lexicon.txt
fstaddselfloops data/lang/phones/wdisambig_phones.int data/lang/phones/wdisambig_words.int 
prepare_lang.sh: validating output directory
utils/validate_lang.pl data/lang
Checking existence of separator file
separator file data/lang/subword_separator.txt is empty or does not exist, deal in word case.
Checking data/lang/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/phones.txt is OK

Checking words.txt: #0 ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking data/lang/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.int corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.csl corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.{txt, int, csl} are OK

Checking data/lang/phones/nonsilence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 192 entry/entries in data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.int corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.csl corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang/phones/silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in data/lang/phones/silence.txt
--> data/lang/phones/silence.int corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.csl corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.{txt, int, csl} are OK

Checking data/lang/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.int corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.csl corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 8 entry/entries in data/lang/phones/disambig.txt
--> data/lang/phones/disambig.int corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.csl corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.{txt, int, csl} are OK

Checking data/lang/phones/roots.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 50 entry/entries in data/lang/phones/roots.txt
--> data/lang/phones/roots.int corresponds to data/lang/phones/roots.txt
--> data/lang/phones/roots.{txt, int} are OK

Checking data/lang/phones/sets.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 50 entry/entries in data/lang/phones/sets.txt
--> data/lang/phones/sets.int corresponds to data/lang/phones/sets.txt
--> data/lang/phones/sets.{txt, int} are OK

Checking data/lang/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 9 entry/entries in data/lang/phones/extra_questions.txt
--> data/lang/phones/extra_questions.int corresponds to data/lang/phones/extra_questions.txt
--> data/lang/phones/extra_questions.{txt, int} are OK

Checking data/lang/phones/word_boundary.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 202 entry/entries in data/lang/phones/word_boundary.txt
--> data/lang/phones/word_boundary.int corresponds to data/lang/phones/word_boundary.txt
--> data/lang/phones/word_boundary.{txt, int} are OK

Checking optional_silence.txt ...
--> reading data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> data/lang/phones/disambig.txt has "#0" and "#1"
--> data/lang/phones/disambig.txt is OK

Checking topo ...

Checking word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...
--> data/lang/phones/word_boundary.txt doesn't include disambiguation symbols
--> data/lang/phones/word_boundary.txt is the union of nonsilence.txt and silence.txt
--> data/lang/phones/word_boundary.txt is OK

Checking word-level disambiguation symbols...
--> data/lang/phones/wdisambig.txt exists (newer prepare_lang.sh)
Checking word_boundary.int and disambig.int
--> generating a 32 word/subword sequence
--> resulting phone sequence from L.fst corresponds to the word sequence
--> L.fst is OK
--> generating a 2 word/subword sequence
--> resulting phone sequence from L_disambig.fst corresponds to the word sequence
--> L_disambig.fst is OK

Checking data/lang/oov.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang/oov.txt
--> data/lang/oov.int corresponds to data/lang/oov.txt
--> data/lang/oov.{txt, int} are OK

--> data/lang/L.fst is olabel sorted
--> data/lang/L_disambig.fst is olabel sorted
--> SUCCESS [validating lang directory data/lang]
+ utils/format_lm.sh data/lang data/ru-mix-small.lm.gz data/dict/lexicon.txt data/lang_test
Converting 'data/ru-mix-small.lm.gz' to FST
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_test/words.txt - data/lang_test/G.fst 
LOG (arpa2fst[5.5.0~1-239d8]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.0~1-239d8]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 5614266 to 633445
fstisstochastic data/lang_test/G.fst 
9.2693e-08 -2.07616
Succeeded in formatting LM: 'data/ru-mix-small.lm.gz'
+ utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test exp/chain/tdnn exp/chain/tdnn/graph
tree-info exp/chain/tdnn/tree 
tree-info exp/chain/tdnn/tree 
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst 
fstminimizeencoded 
fstpushspecial 
fstdeterminizestar --use-log=true 
fstisstochastic data/lang_test/tmp/LG.fst 
-0.0896801 -0.0903191
[info]: LG not stochastic.
fstcomposecontext --context-size=2 --central-position=1 --read-disambig-syms=data/lang_test/phones/disambig.int --write-disambig-syms=data/lang_test/tmp/disambig_ilabels_2_1.int data/lang_test/tmp/ilabels_2_1.82488 data/lang_test/tmp/LG.fst 
fstisstochastic data/lang_test/tmp/CLG_2_1.fst 
-0.0896801 -0.0903191
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/chain/tdnn/graph/disambig_tid.int --transition-scale=1.0 data/lang_test/tmp/ilabels_2_1 exp/chain/tdnn/tree exp/chain/tdnn/final.mdl 
fstrmepslocal 
fsttablecompose exp/chain/tdnn/graph/Ha.fst data/lang_test/tmp/CLG_2_1.fst 
fstminimizeencoded 
fstrmsymbols exp/chain/tdnn/graph/disambig_tid.int 
fstdeterminizestar --use-log=true 
fstisstochastic exp/chain/tdnn/graph/HCLGa.fst 
0.0934866 -0.353758
HCLGa is not stochastic
add-self-loops --self-loop-scale=1.0 --reorder=true exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLGa.fst 
fstisstochastic exp/chain/tdnn/graph/HCLG.fst 
0.0456512 -0.250562
[info]: final HCLG is not stochastic.
+ utils/build_const_arpa_lm.sh data/ru-mix.lm.gz data/lang_test data/lang_test_rescore
arpa-to-const-arpa --bos-symbol=805933 --eos-symbol=805934 --unk-symbol=2 'gunzip -c data/ru-mix.lm.gz | utils/map_arpa_lm.pl data/lang_test_rescore/words.txt|' data/lang_test_rescore/G.carpa 
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:BuildConstArpaLm():const-arpa-lm.cc:1078) Reading gunzip -c data/ru-mix.lm.gz | utils/map_arpa_lm.pl data/lang_test_rescore/words.txt|
utils/map_arpa_lm.pl: Processing "\data\"
utils/map_arpa_lm.pl: Processing "\1-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
utils/map_arpa_lm.pl: Processing "\2-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
utils/map_arpa_lm.pl: Processing "\3-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
utils/map_arpa_lm.pl: Processing "\4-grams:\"
LOG (arpa-to-const-arpa[5.5.0~1-239d8]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
+ rnnlm/change_vocab.sh data/lang/words.txt exp/rnnlm exp/rnnlm_out
rnnlm/change_vocab.sh: Copying config directory.
rnnlm/change_vocab.sh: Re-generating words.txt, unigram_probs.txt, word_feats.txt and word_embedding.final.mat.
rnnlm/get_word_features.py: made features for 805936 words.
rnnlm-get-word-embedding exp/rnnlm_out/word_feats.txt exp/rnnlm_out/feat_embedding.final.mat exp/rnnlm_out/word_embedding.final.mat 
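
One way to confirm which outputs the run actually produced is to check the directories named in the log. In the log above, the only command that mentions exp/lgraph is the initial rm -rf. The sketch below is only an illustration: the helper name `missing_outputs` and the list of directories are assumptions taken from the paths that appear in the log, not part of the Vosk or Kaldi scripts.

```python
from pathlib import Path

# Directories taken from the log above (assumption: these are the outputs
# of interest). exp/lgraph is the one reported as missing.
EXPECTED_DIRS = [
    "exp/chain/tdnn/graph",
    "exp/lgraph",
    "data/lang_test_rescore",
    "exp/rnnlm_out",
]


def missing_outputs(root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        # A directory counts as missing if it does not exist or has no entries.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the directory where the adaptation script was executed.
    for rel in missing_outputs("."):
        print(f"missing or empty: {rel}")
```

Run from the working directory of the adaptation script, this would list exp/lgraph if that folder was never created.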

Saivaks avatar Aug 09 '22 21:08 Saivaks

Unfortunately, I still have not been able to work out why I am not getting the adaptation data for the small model. What could be the cause?

Saivaks avatar Aug 10 '22 22:08 Saivaks

The package for the small model can be found here:

https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22-compile.tar.gz
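
A minimal sketch of fetching and unpacking that archive with Python's standard library. The function names `fetch` and `unpack` and the target directory name are hypothetical, and the layout of the archive is not described in this thread, so the script only lists the top-level entries after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as given in the comment above.
URL = "https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22-compile.tar.gz"


def fetch(url, dest):
    """Download url to dest unless the file is already present."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def unpack(archive, target_dir):
    """Extract a .tar.gz archive and return the sorted top-level entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return sorted(p.name for p in target.iterdir())


if __name__ == "__main__":
    # "small-model-compile" is an arbitrary directory name chosen for this sketch.
    archive = fetch(URL, "vosk-model-small-ru-0.22-compile.tar.gz")
    print(unpack(archive, "small-model-compile"))
```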

nshmyrev avatar Aug 10 '22 23:08 nshmyrev

That really works. Thank you!

Saivaks avatar Aug 11 '22 17:08 Saivaks