
Update packages: how to delete a list of words from the lexicon?

Open svenha opened this issue 1 year ago • 4 comments

I need to delete a long list of words (about 180,000) from the model built by a Vosk update package (update packages are an excellent idea that I can recommend!). The list contains spelling errors, regional or historical spelling variants, words that are never used in the given domain, and unusual pronunciations. In other projects, there is a file neg.dic with entries like:

word1 pron1
word2 

The first line deletes word1 only with the given pronunciation pron1, while the second line deletes all entries for word2. (In some projects, an additional grep -v -F -f neg.dic ... command suffices, but it might be more complicated here.)

svenha, Jul 10 '22 13:07
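
The neg.dic convention described in the post above can be sketched in a few lines of Python. This is only an illustration, not part of the Vosk scripts: the function names `parse_entry` and `filter_lexicon` are made up here, and the code assumes a plain-text lexicon with one entry per line, the word first and the pronunciation after it, separated by whitespace.

```python
def parse_entry(line):
    """Split a dictionary line into (word, pronunciation).

    The pronunciation is '' when the line holds only a word.
    Returns None for blank lines.
    """
    parts = line.split()
    if not parts:
        return None
    return parts[0], " ".join(parts[1:])


def filter_lexicon(lexicon_lines, neg_lines):
    """Return the lexicon lines that survive the deletions listed in neg_lines.

    Assumed neg.dic semantics (taken from the post above):
      'word pron' removes only that word/pronunciation pair,
      'word'      removes every entry for that word.
    """
    drop_words = set()   # words to remove entirely
    drop_pairs = set()   # (word, pronunciation) pairs to remove
    for line in neg_lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        word, pron = entry
        if pron:
            drop_pairs.add((word, pron))
        else:
            drop_words.add(word)

    kept = []
    for line in lexicon_lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        if entry[0] in drop_words or entry in drop_pairs:
            continue
        kept.append(line.rstrip("\n"))
    return kept
```

Both arguments can be open file objects or lists of strings. Unlike a plain grep over the file, this compares whole words and whole pronunciations, so an entry in neg.dic cannot accidentally remove a longer word that merely contains it.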

It should be easy with the scripts: you can indeed delete from the main dictionary and, optionally, restrict LM vocabulary. It is not plug-and-play yet, though, sorry.

nshmyrev, Jul 10 '22 22:07

and, optionally, restrict LM vocabulary

Good idea, especially for limiting the model size. But I could not find a useful ngram option. Is there one, or is this the wrong tool for this task?

svenha, Jul 15 '22 20:07

But I could not find a useful ngram option

ngram -lm source.lm.gz -vocab your.vocab -limit-vocab -write-lm target.lm.gz

nshmyrev, Jul 15 '22 21:07
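
The ngram command above needs a vocabulary file (your.vocab in the example). One way to obtain it is to take the first column of the already reduced dictionary. The sketch below is a hedged illustration only: `vocab_from_lexicon` is a hypothetical helper, and it assumes both that the lexicon has the word as the first whitespace-separated field on each line and that a vocabulary file with one word per line is acceptable as input to -vocab.

```python
def vocab_from_lexicon(lexicon_lines):
    """Collect the distinct words of a lexicon, in order of first appearance.

    Assumes each non-blank line starts with the word, followed by its
    pronunciation; words with several pronunciations are listed once.
    """
    seen = set()
    vocab = []
    for line in lexicon_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        if word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab


def write_vocab(vocab, path):
    """Write one word per line to the given path (UTF-8)."""
    with open(path, "w", encoding="utf-8") as out:
        for word in vocab:
            out.write(word + "\n")
```

Calling `write_vocab(vocab_from_lexicon(open("reduced.dic", encoding="utf-8")), "your.vocab")` would then produce the file passed to -vocab; reduced.dic is a placeholder name for the dictionary after the unwanted words have been removed.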

Oh, yes. Reading carefully helps ... Thanks.

This reduced the model size by 30 %; really nice. (The underlying dictionary reduction was 22 %.)

svenha, Jul 16 '22 07:07