vosk-api
Value of silence_weight
Hi Nickolay! I've started using the vosk-api, and I've had a few utterances where the output was worse than when decoding with kaldi's tcp binary (for example, "no" being recognized instead of "yes"). For most of them (unfortunately not all), I managed to track the difference down to the silence_weight, which is hardcoded to 1e-3 in vosk.
As I understand it, training is done without this silence weight being used (source), so wouldn't it make more sense to set it to 1.0? What do you think?
With 1.0, silence frames are not downweighted in the ivector computation. As a result, accuracy will degrade on test audio with longer silence (1-2 seconds) at the beginning. 1e-3 might indeed be worse for short examples without silence, so it is not an ideal setting.
Ideally we move to something like a conformer encoder, which properly detects context. It's a long-term plan though.
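To illustrate the downweighting discussed above, here is a toy sketch in plain Python. It is not Kaldi's or vosk's ivector extractor: the function `weighted_mean`, the one-dimensional "features", and the frame counts are all hypothetical, chosen only to show how a per-frame weight changes how much leading silence contributes to an utterance-level statistic.

```python
def weighted_mean(frames, is_silence, silence_weight):
    """Weighted per-dimension mean of the frames.

    Speech frames get weight 1.0, silence frames get `silence_weight`.
    This is a stand-in for the statistics an ivector extractor accumulates,
    not the real algorithm.
    """
    dim = len(frames[0])
    total = [0.0] * dim
    weight_sum = 0.0
    for frame, sil in zip(frames, is_silence):
        w = silence_weight if sil else 1.0
        weight_sum += w
        for d in range(dim):
            total[d] += w * frame[d]
    return [t / weight_sum for t in total]


# Hypothetical utterance: 100 leading silence/noise frames (value 0.0)
# followed by 50 speech frames (value 1.0).
frames = [[0.0]] * 100 + [[1.0]] * 50
is_silence = [True] * 100 + [False] * 50

# silence_weight = 1.0: silence counts as much as speech and dominates.
unweighted = weighted_mean(frames, is_silence, 1.0)[0]

# silence_weight = 1e-3: the estimate is driven almost entirely by speech.
downweighted = weighted_mean(frames, is_silence, 1e-3)[0]

print(unweighted, downweighted)
```

In this toy setup the unweighted estimate is pulled two thirds of the way toward the silence frames, while the downweighted one stays close to the speech frames. Whether this helps a real model depends on how the model was trained, which is the point under discussion in this thread.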
Are you sure using 1.0 will cause results to degrade? The model has never seen input without 1.0 so it seems to me that changing it could cause problems. The examples I have actually have ~1 second of silence at the start and that is what is being detected as "no".
> The model has never seen input without 1.0
Models usually don't see long initial silence during training, so the silence-weight value doesn't matter much.
> The examples I have actually have ~1 second of silence at the start and that is what is being detected as "no".
It has to be more than pure silence (the ivector extractor drops such frames); it takes something like quiet white noise over a substantial period (maybe 3-5 seconds) for ivector estimation to break. max-count also plays a role here. Honestly, we haven't tested this in depth; it is mostly an ad-hoc setting.
Hm okay, I will close this issue then.
I can run experiments a bit later and let you know.
I did some more experiments and can confirm your statement that silence_weight 1.0 is worse. Honestly the experience just makes me want to throw out ivectors.
> Honestly the experience just makes me want to throw out ivectors.
Yes, it is the right direction, they are inherently unstable.