                        Add a flush() method for ASR decoding
Is your feature request related to a problem? Please describe.
Hi, we're using a streaming ASR model with chunk_size == 8. When a speaker pauses, our VAD system stops feeding audio chunks to ASR. Often the ASR transcript ends up 2-3 words behind what was actually spoken because the buffered audio has not yet reached the chunk_size == 8 boundary. Once the speaker starts speaking again, the lagging 2-3 words appear, so the ASR system does not lose any words, it just delays them considerably.
Describe the solution you'd like
It would be great if we could signal the decoder to flush() the current transcript even if chunk_size has not been reached, or set a parameter (in milliseconds) that instructs the ASR system to flush() after N milliseconds without receiving any input.
Describe alternatives you've considered
I've tried sending zeroed-out wav chunks to the ASR system when VAD detects silence, but the zeroed-out chunks produce spurious transcripts, like "oh. yes. um.".
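For concreteness, here is a minimal sketch of the kind of API being requested. None of these names (StreamingRecognizer, AcceptWaveform, Flush) are existing wenet APIs; they are hypothetical and only illustrate the desired behavior.

```cpp
// Hypothetical sketch of the requested behavior -- none of these names are
// real wenet APIs; they only illustrate the feature request.
#include <iostream>
#include <string>
#include <vector>

class StreamingRecognizer {
 public:
  // Buffer incoming audio; normally decoding runs once chunk_size frames
  // have accumulated.
  void AcceptWaveform(const std::vector<float>& samples) {
    buffer_.insert(buffer_.end(), samples.begin(), samples.end());
  }

  // Requested feature: decode whatever is buffered, even if fewer than
  // chunk_size frames are available, so the transcript stops lagging
  // behind the audio when the speaker pauses.
  std::string Flush() {
    std::string partial = DecodeBuffered();
    buffer_.clear();
    return partial;
  }

 private:
  std::string DecodeBuffered() { return "<words decoded from the partial chunk>"; }
  std::vector<float> buffer_;
};

int main() {
  StreamingRecognizer asr;
  asr.AcceptWaveform(std::vector<float>(1600, 0.1f));  // ~0.1 s of audio at 16 kHz
  // VAD just detected silence: force the lagging 2-3 words out now.
  std::cout << asr.Flush() << std::endl;
  return 0;
}
```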
One possible reason is that you did not set the finish signal when your VAD system triggers. Please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59
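In other words, when the client-side VAD decides the utterance is over, it should send the end signal so that the server ends up calling FeaturePipeline::set_input_finished(), which is what later sets DecodeState::kEndFeats in the decoder. A minimal sketch of that flow, using simplified stand-in types rather than wenet's actual classes:

```cpp
// Simplified stand-ins, not wenet's actual classes: the only point is that
// the VAD-triggered end-of-speech path should call set_input_finished(),
// which is what later moves the decoder into DecodeState::kEndFeats.
#include <iostream>

enum class DecodeState { kContinue, kEndFeats };

struct FeaturePipeline {
  bool input_finished = false;
  void set_input_finished() { input_finished = true; }
};

struct Decoder {
  DecodeState state = DecodeState::kContinue;
  void Advance(const FeaturePipeline& fp) {
    if (fp.input_finished) state = DecodeState::kEndFeats;
    // ... forward pass on whatever frames are available ...
  }
};

int main() {
  FeaturePipeline fp;
  Decoder decoder;
  fp.set_input_finished();  // VAD fired -> client sends the end signal -> server calls this
  decoder.Advance(fp);
  std::cout << (decoder.state == DecodeState::kEndFeats ? "kEndFeats" : "kContinue") << "\n";
  return 0;
}
```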
@robin1001 any comments on ..
Describe alternatives you've considered
I've tried sending zero-ed out wav-chunks to the ASR system when VAD detects silence, but the zero-ed out chunks produce spurious transcripts, like "oh. yes. um.".
?
@madkote I believe the reason zeroed-out wav data produces spurious transcripts such as "And", "yes", "oh", "um" is that the ASR model was not trained on silence or zeroed-out wav data. So the model is seeing an input it never saw during training, and may be falling back to the outputs with the highest prior, e.g. "and", "yes", "oh", "um".
One possible reason is that you did not set the finish signal when your VAD system triggers. Please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59
Hi Robin, yes, that would be a solution and I will try it. The only issue with this is that a new websocket connection would have to be established every time VAD detects silence, which could be very often. Would your team consider implementing a different websocket signal that flushes ASR like "end" does, except that it Reset()s the ASR decoder and doesn't close the websocket connection, similar to continuous_decoding mode?
Actually, maybe I will submit a PR for implementing a "flush" signal for the websocket server
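A rough sketch of what such a "flush" handler might look like. OnSpeechFlush() and the stand-in types below are hypothetical, not wenet's actual websocket server code; the idea is simply to do what the "end" signal does (finish the feature pipeline so the buffered frames are decoded and the final result is sent), then Reset() the decoder state while keeping the websocket connection open, much like continuous_decoding mode.

```cpp
// Rough sketch only -- OnSpeechFlush() and these stand-in types are
// hypothetical, not wenet's actual websocket server code.
#include <iostream>
#include <string>

struct FeaturePipeline {
  void set_input_finished() { /* mark end of input for this segment */ }
  void Reset() { /* clear buffered frames for the next segment */ }
};

struct AsrDecoder {
  std::string FinalResult() { return "<transcript including the lagging words>"; }
  void Reset() { /* clear decoder/rescoring state */ }
};

class ConnectionHandler {
 public:
  // Hypothetical handler for a "flush" websocket signal.
  void OnSpeechFlush() {
    feature_pipeline_.set_input_finished();   // force the remaining frames out
    SendFinalResult(decoder_.FinalResult());  // emit the delayed words now
    decoder_.Reset();                         // fresh state for the next speech segment
    feature_pipeline_.Reset();
    // Note: the websocket connection is NOT closed here.
  }

 private:
  void SendFinalResult(const std::string& text) { std::cout << text << std::endl; }
  FeaturePipeline feature_pipeline_;
  AsrDecoder decoder_;
};

int main() {
  ConnectionHandler handler;
  handler.OnSpeechFlush();  // e.g. invoked when VAD detects a pause
  return 0;
}
```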
@robin1001 Are you sure that the current code supports flushing ASR outputs when signaling "end" to the websocket with chunk_size > 0?
https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L106 Here we set state to DecodeState::kEndFeats via FeaturePipeline::set_input_finished().
https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L115
But here it seems that if we haven't reached a certain number of frames ("num_frames >= right_context + 1"), decoding will not be called.
"num_frames >= right_context + 1" is not related to end signal, it's for the case when input frames is not enough to produce one output
I implemented a working version of a "flush" signal that does not end the decoding thread, will submit a PR soon
I also encountered this problem, can you share your solution? @scotfang
This issue has been automatically closed due to inactivity.