vosk-api
Update packages: how to delete a list of words from the lexicon?
I need to delete a long list of words (about 180,000) from the model built by a Vosk update package (update packages are an excellent idea that I can recommend!). The list contains spelling errors, regional or historical spelling variants, words that are never used in the given domain, and unusual pronunciations. In other projects, there is a file neg.dic with entries like:
word1 pron1
word2
The first line deletes word1 only with the given pronunciation pron1, while the second line deletes all entries for word2. (In some projects, an additional grep -F neg.dic -v ... command will suffice, but it might be more complicated here.)
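For illustration, the neg.dic semantics described above could be sketched in Python as follows. This is only a sketch and not part of the Vosk scripts: the function names load_neg_dic and filter_lexicon are hypothetical, and it assumes a plain-text lexicon with one "word phone phone ..." entry per line.

```python
def load_neg_dic(lines):
    """Parse neg.dic-style lines (assumed format, see above).

    A line with only a word removes every entry for that word;
    a line with a word plus pronunciation removes only that pair.
    """
    drop_words = set()
    drop_pairs = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) == 1:
            drop_words.add(parts[0])
        else:
            drop_pairs.add((parts[0], " ".join(parts[1:])))
    return drop_words, drop_pairs


def filter_lexicon(lexicon_lines, drop_words, drop_pairs):
    """Return the lexicon entries that are not listed in neg.dic."""
    kept = []
    for line in lexicon_lines:
        parts = line.split()
        if not parts:
            continue
        word, pron = parts[0], " ".join(parts[1:])
        if word in drop_words or (word, pron) in drop_pairs:
            continue
        kept.append(" ".join(parts))  # normalise whitespace
    return kept


if __name__ == "__main__":
    # Made-up entries, purely for demonstration.
    neg = ["colour k ah l er", "teh"]
    lexicon = [
        "colour k ah l er",
        "colour k ao l er",
        "teh t eh",
        "the dh ah",
    ]
    words, pairs = load_neg_dic(neg)
    for entry in filter_lexicon(lexicon, words, pairs):
        print(entry)
```

Sets keep each lookup cheap, which matters with a deletion list of this size.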
It should be easy with the scripts: you can delete the words from the main dictionary and, optionally, restrict LM vocabulary. It is not plug-and-play yet, sorry.
and, optionally, restrict LM vocabulary
Good idea, especially to limit the model size. But I could not find a useful ngram option. Is there one, or is this the wrong tool for this task?
But I could not find a useful ngram option
ngram -lm source.lm.gz -vocab your.vocab -limit-vocab -write-lm target.lm.gz
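The vocabulary file passed to -vocab can be derived from the reduced dictionary. Here is a minimal Python sketch, assuming the same "word phone phone ..." lexicon layout and a vocabulary file with one word per line. The function name vocab_from_lexicon is hypothetical, and whether special tokens such as sentence boundaries must be listed explicitly should be checked against the ngram documentation.

```python
def vocab_from_lexicon(lexicon_lines):
    """Collect each distinct word once, in order of first appearance."""
    seen = set()
    vocab = []
    for line in lexicon_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        if word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab


if __name__ == "__main__":
    # Made-up entries; in practice, read the reduced dictionary file
    # and write the result to your.vocab, one word per line.
    lexicon = ["colour k ao l er", "the dh ah", "the dh iy"]
    print("\n".join(vocab_from_lexicon(lexicon)))
```

Words with several pronunciations appear only once in the output, since the LM vocabulary is independent of pronunciation variants.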
Oh, yes. Reading carefully helps ... Thanks.
This reduced the model size by 30%, which is really nice. (The underlying dictionary reduction was 22%.)