Unknown "words" in text.txt when updating the language model

Open hoangkyanh7 opened this issue 3 years ago • 5 comments

I want to build a new grammar from a text.txt. All the commands work except the last one:

    farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt |
    ngramcount | ngrammake |
    fstconvert --fst_type=ngram > Gr.new.fst

  • If all the words in text.txt are in words.txt => OK
  • If there are "new words" in text.txt (unknown words) => there are errors like:

    FATAL: FarCompileStrings: Compiling string number 2 in file text.txt failed with token_type = symbol and entry_type = line

I read the -help output and tried a new command:

    farcompilestrings --fst_type=compact --symbols=words.txt --unknown_symbol="" --keep_symbols text.txt | ngramcount | ngrammake | fstconvert --fst_type=ngram > Gr.new.fst

Another error was raised:

    FATAL: FarCompileStrings: Label "-1" missing from symbol table: words.txt
    FATAL: STListReader::STListReader: Wrong file type:

I know that "You can not introduce new words this way, that is something we will cover later.", but is there any way to deal with "new words" in a big text? Please help! Thanks in advance!

hoangkyanh7 avatar Jun 29 '21 09:06 hoangkyanh7

Are there any ways to deal with "new words" in a big text?

This method cannot introduce new words; you have to recompile the whole graph (see the last section).

nshmyrev avatar Jun 30 '21 10:06 nshmyrev

Thank you! Can I use something like "unk" to replace the new words?

hoangkyanh7 avatar Jul 01 '21 09:07 hoangkyanh7

Yes, it is "[unk]" as in the example code.

nshmyrev avatar Jul 01 '21 11:07 nshmyrev
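
The replacement suggested above can be sketched as a preprocessing step that runs before farcompilestrings. This is only an illustration under stated assumptions, not a method confirmed in this thread: `replace_oov` is a hypothetical helper name, words.txt is assumed to be an OpenFst symbol table with one "symbol id" pair per line, text.txt is assumed to hold one sentence per line with space-separated tokens (consistent with the "token_type = symbol and entry_type = line" message), and "[unk]" is assumed to be present in words.txt.

```shell
# Hypothetical helper: print text.txt with every token that is missing from
# words.txt replaced by an unknown-word token (assumed default: "[unk]").
replace_oov() {
  words_file=$1        # symbol table: "<symbol> <id>" per line
  text_file=$2         # corpus: one sentence per line, space-separated tokens
  unk=${3:-[unk]}      # replacement token; must itself be in the symbol table

  # Warn if the replacement token is not a known symbol, because
  # farcompilestrings would then still fail on it.
  if ! awk -v unk="$unk" '$1 == unk { found = 1 } END { exit !found }' "$words_file"; then
    echo "warning: $unk is not listed in $words_file" >&2
  fi

  awk -v unk="$unk" '
    NR == FNR { vocab[$1] = 1; next }     # first file: collect known symbols
    {
      for (i = 1; i <= NF; i++)
        if (!($i in vocab)) $i = unk      # second file: mask unknown tokens
      print
    }
  ' "$words_file" "$text_file"
}
```

A possible usage is `replace_oov words.txt text.txt > text.unk.txt`, followed by the original pipeline with text.unk.txt in place of text.txt. This does not teach the model any new words: they are all collapsed into the unknown token, and adding real new words still requires recompiling the whole graph, as stated above.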

Can you please share the format of text.txt?

makdatascientist avatar Sep 16 '22 08:09 makdatascientist
