pocketsphinx
Bring back -remove_silence, maybe
I maintain that silently removing frames from the input in sphinx_fe is a super bad idea, but it seems that all of the public models out there were trained with this option, and the various users of the library expect it to exist.
I do not have any desire to develop a new VAD solution in the PocketSphinx library, and in the end it seems that debugging -remove_silence is the path of least resistance.
If this happens, then the batch mode API (i.e. ps_process_raw(full_utt=True)) will be fixed to return the correct word alignments.
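To illustrate why dropped frames throw off word alignments, here is a minimal sketch of the bookkeeping involved. It is not PocketSphinx code: build_frame_map, to_original_times, and the keep_mask input are hypothetical names, and the 100 frames-per-second rate is assumed. The idea is that once silence frames are removed, segment boundaries are expressed in trimmed frame indices and have to be mapped back to the original timeline.

```python
# Hypothetical sketch, not the PocketSphinx implementation.
# keep_mask[i] is True if original frame i survived silence removal.

def build_frame_map(keep_mask):
    """Map each trimmed frame index to the original frame index it came from."""
    return [i for i, keep in enumerate(keep_mask) if keep]


def to_original_times(segments, frame_map, frame_rate=100):
    """Convert (word, start, end) in trimmed frames (end inclusive)
    to (word, start_sec, end_sec) on the original timeline.

    frame_rate is frames per second; 100 is an assumed value.
    """
    out = []
    for word, start, end in segments:
        start_sec = frame_map[start] / frame_rate
        end_sec = (frame_map[end] + 1) / frame_rate
        out.append((word, start_sec, end_sec))
    return out


# 3 silence frames, 4 speech, 2 silence, 3 speech
keep_mask = [False] * 3 + [True] * 4 + [False] * 2 + [True] * 3
frame_map = build_frame_map(keep_mask)

# Segments as a decoder would report them on the trimmed frames
segments = [("hello", 0, 3), ("world", 4, 6)]
aligned = to_original_times(segments, frame_map)
```

Without such a map, "world" would appear to start at frame 4 of the trimmed stream, when it actually starts at frame 9 of the input.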
On the other hand, a lot of users of the library seem to do their own VAD anyway (e.g. the Python speech_recognition module). For batch mode, -remove_silence is a terrible idea; one should do proper speaker diarization or use a real modern VAD.
Ideally we would retrain the models to not use this so they have proper silence models :(
Nope. Won't do this, didn't do this. We have a separate endpointer, for reasons detailed here: https://cmusphinx.github.io/2022/08/vad/