sphinxbase
`fe_process_frames_ext` can discard speech data?
fe_interface.c, lines 490-498:
```c
/* Try to read from prespeech buffer */
if (fe->vad_data->in_speech && fe_prespch_ncep(fe->vad_data->prespch_buf) > 0) {
    outidx = fe_copy_from_prespch(fe, inout_nframes, buf_cep, outidx);
    if ((*inout_nframes) < 1) {
        /* mfcc buffer is filled from prespeech buffer */
        *inout_nframes = outidx;
        return 0;
    }
}
```
If *inout_nframes is smaller than prespch_buf's ncep, the function returns here, and the new input speech data is ignored entirely. I have verified this case; it looks like a bug.
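To make the contract issue concrete in isolation, here is a small standalone sketch (all names here are hypothetical toy stand-ins, not the sphinxbase API). The point is that when the output budget is filled entirely from an internal buffer, the function must leave *inout_nsamps untouched, so the caller can tell that none of its fresh samples were consumed and resubmit them on the next call:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: an extractor with a small internal "prespeech" queue of
 * already-computed frames. All names are hypothetical. */
typedef struct {
    int pending[8];     /* frames buffered while waiting for speech onset */
    size_t npending;
} toy_fe_t;

/* Copy up to *inout_nframes frames into out[]. Crucially, when the output
 * budget is exhausted by the internal queue alone, *inout_spch and
 * *inout_nsamps are left untouched: the caller sees that none of its new
 * samples were consumed and can resubmit them. */
static size_t
toy_process(toy_fe_t *fe, const short **inout_spch, size_t *inout_nsamps,
            int *out, size_t *inout_nframes)
{
    size_t outidx = 0;

    /* Drain the internal queue first. */
    while (fe->npending > 0 && *inout_nframes > 0) {
        out[outidx++] = fe->pending[--fe->npending];
        --*inout_nframes;
    }
    if (*inout_nframes == 0)
        return outidx;      /* input untouched: caller must call again */

    /* Then consume fresh input, one frame per 10 samples (toy frame shift). */
    while (*inout_nframes > 0 && *inout_nsamps >= 10) {
        out[outidx++] = (*inout_spch)[0];   /* stand-in for a real MFCC */
        --*inout_nframes;
        *inout_spch += 10;
        *inout_nsamps -= 10;
    }
    return outidx;
}
```

With this contract, a caller that loops until *inout_nsamps reaches zero cannot lose data, even when one call is answered purely from the internal queue.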
The same problem occurs at lines 525-535:
```c
/* Process all remaining frames. */
while (*inout_nframes > 0 && *inout_nsamps >= (size_t)fe->frame_shift) {
    fe_shift_frame(fe, *inout_spch, fe->frame_shift);
    fe_write_frame(fe, buf_cep[outidx], voiced_spch != NULL);
    outidx = fe_check_prespeech(fe, inout_nframes, buf_cep, outidx,
                                out_frameidx, inout_nsamps, orig_nsamps);
    /* Update input-output pointers and counters. */
    *inout_spch += fe->frame_shift;
    *inout_nsamps -= fe->frame_shift;
}
```
If fe_write_frame has changed vad_data->in_speech (false -> true), fe_check_prespeech can completely exhaust *inout_nframes on the contents of vad_data->prespch_buf and terminate this while loop early. The remaining speech data is then skipped, even though the code that follows tries to handle overflow_samps. I'm sure some speech data is skipped here.
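The invariant that loop should preserve can be shown with a toy accounting sketch (again with hypothetical names, not sphinxbase code): however the loop exits, whatever input it did not consume must be parked where the next call can reach it, so that samples-in always equals samples-accounted-for:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy accounting check: every input sample must end up either consumed by a
 * frame or parked in an overflow buffer. Hypothetical names throughout. */
#define TOY_SHIFT 10

typedef struct {
    short overflow[64];
    size_t noverflow;
} toy_state_t;

/* Returns the number of frames emitted. Guarantees that
 * (frames * TOY_SHIFT) + st->noverflow == the original nsamps, even when
 * the loop exits early because frame_budget hit zero mid-stream (the
 * analogue of fe_check_prespeech exhausting *inout_nframes). */
static size_t
toy_process_all(toy_state_t *st, const short *spch, size_t nsamps,
                size_t frame_budget)
{
    size_t nframes = 0;
    while (frame_budget > 0 && nsamps >= TOY_SHIFT) {
        /* ... emit one frame; a prespeech flush may eat extra budget ... */
        ++nframes;
        --frame_budget;
        spch += TOY_SHIFT;
        nsamps -= TOY_SHIFT;
    }
    /* Whatever the loop did not consume is preserved, not just the
     * sub-frame-shift remainder. */
    memcpy(st->overflow, spch, nsamps * sizeof(*spch));
    st->noverflow = nsamps;
    return nframes;
}
```

The bug described above is precisely a violation of this invariant: the loop exits with *inout_nsamps still holding whole frames' worth of samples, but only the sub-frame remainder path runs afterwards.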
Honestly, there are so many issues here. Yes, sometimes data is skipped. We desperately need a frontend rework, not just bug fixes: a totally new architecture with proper estimation of parameters is required. If you are interested in working on this, I can outline the design in a document.
Good to hear that. Yes, I'm interested; however, I probably lack experience with this kind of work. I can't promise anything, but I'll try my best.
In my opinion, despite what is claimed on https://cmusphinx.github.io/wiki/faq/, noise suppression should be done externally. The VAD and noise-removal code has added even more complexity to a frontend that was already too complex. In particular, for a live application we do not want to manage the audio input at all, since that is handled by an external audio graph/pipeline such as GStreamer; this is how it has been done on all platforms for quite some time now. Putting the VAD in the gst-plugin was the right idea.
Given that PocketSphinx development is essentially abandoned, we should revert to the 0.8 frontend code, particularly since alignment in batch mode is a common use case, and in that case we never want to discard any input.
We should also discard the audio library entirely, as its API is backwards for any modern platform, where audio is always pushed to a processing node. The feature extractor should extract features and do nothing else. That is what I did in SoundSwallower, for instance: https://github.com/ReadAlongs/SoundSwallower
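As a sketch of what such a push-style node could look like (a fixed frame length and all names invented for illustration; this is not the SoundSwallower API), the key property is that the node owns any partial-frame remainder, so callers can push chunks of arbitrary size and no caller-side bookkeeping can drop samples:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal push-style framing node: callers push audio in arbitrary chunk
 * sizes and a callback fires once per complete frame. Hypothetical sketch. */
#define FRAME_LEN 160

typedef void (*frame_cb_t)(const short *frame, void *user);

typedef struct {
    short buf[FRAME_LEN];   /* partial-frame remainder lives here */
    size_t fill;
    frame_cb_t cb;
    void *user;
} push_fe_t;

static void
push_fe_init(push_fe_t *fe, frame_cb_t cb, void *user)
{
    fe->fill = 0;
    fe->cb = cb;
    fe->user = user;
}

/* Consumes ALL input unconditionally: leftover samples stay buffered inside
 * the node until the next push completes the frame. */
static void
push_fe_write(push_fe_t *fe, const short *samples, size_t n)
{
    while (n > 0) {
        size_t take = FRAME_LEN - fe->fill;
        if (take > n)
            take = n;
        memcpy(fe->buf + fe->fill, samples, take * sizeof(*samples));
        fe->fill += take;
        samples += take;
        n -= take;
        if (fe->fill == FRAME_LEN) {
            fe->cb(fe->buf, fe->user);  /* would compute MFCCs here */
            fe->fill = 0;
        }
    }
}

/* Example sink: just counts completed frames. */
static void
count_cb(const short *frame, void *user)
{
    (void)frame;
    ++*(int *)user;
}
```

With this shape, the node slots naturally under any pipeline (GStreamer, audio units, etc.) that pushes buffers downstream, and the pull-oriented read/calibrate logic of the current audio library simply disappears.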