wenet Add a flush() method for ASR decoding

Is your feature request related to a problem? Please describe.
Hi, we're using a streaming ASR model with chunk_size == 8. When a speaker pauses, our VAD system stops sending audio chunks to the ASR. The ASR transcript is then often 2-3 words behind what was actually spoken, because the audio received so far has not filled a complete chunk of chunk_size == 8. Once the speaker starts speaking again, the lagging 2-3 words appear, so the ASR system does not lose any words; it just delays them a lot.

Describe the solution you'd like
It would be great if we could signal the decoder to flush() the current transcript even if chunk_size has not been reached, or set a parameter in milliseconds that instructs the ASR system to flush() after N milliseconds without receiving any input.
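To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is a toy model, not WeNet's API: `ToyChunkDecoder`, `flush()`, and `maybe_flush()` are hypothetical names, and "decoding" just moves frames from a pending buffer to the transcript.

```python
from typing import List


class ToyChunkDecoder:
    """Toy stand-in for a chunked streaming decoder (hypothetical, not WeNet)."""

    def __init__(self, chunk_size: int = 8) -> None:
        self.chunk_size = chunk_size
        self.pending: List[str] = []     # frames received but not yet decoded
        self.transcript: List[str] = []  # frames already "decoded"

    def accept(self, frames: List[str]) -> None:
        """Buffer frames and decode only whole chunks."""
        self.pending.extend(frames)
        while len(self.pending) >= self.chunk_size:
            self._decode(self.chunk_size)

    def flush(self) -> None:
        """Decode whatever is buffered, even if it is less than one chunk."""
        if self.pending:
            self._decode(len(self.pending))

    def _decode(self, n: int) -> None:
        # A real decoder would run the model here; the toy only moves frames.
        self.transcript.extend(self.pending[:n])
        del self.pending[:n]


def maybe_flush(decoder: ToyChunkDecoder, idle_ms: float, timeout_ms: float) -> bool:
    """Flush if no audio has arrived for at least timeout_ms."""
    if idle_ms >= timeout_ms and decoder.pending:
        decoder.flush()
        return True
    return False
```

With chunk_size == 8 and 11 frames received, 3 frames stay pending until either `flush()` is called explicitly or `maybe_flush()` sees the idle time reach the timeout, which mirrors the two variants described above.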

Describe alternatives you've considered
I've tried sending zero-ed out wav-chunks to the ASR system when VAD detects silence, but the zero-ed out chunks produce spurious transcripts, like "oh. yes. um.".

Feb 12 '22 01:02 scotfang

One possible reason is that you did not set the finish signal when your VAD system triggers. Please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59

Feb 12 '22 03:02 robin1001

@robin1001 any comments on ..

Describe alternatives you've considered
I've tried sending zero-ed out wav-chunks to the ASR system when VAD detects silence, but the zero-ed out chunks produce spurious transcripts, like "oh. yes. um.".

?

Feb 14 '22 09:02 madkote

@madkote I believe zeroed-out wav data produces spurious transcripts such as "and", "yes", "oh", "um" because the ASR model was not trained on silence or zeroed-out wav data. The model is seeing an input that it never saw in training, and may be falling back to the outputs with the highest prior, e.g. "and", "yes", "oh", "um".

Feb 15 '22 17:02 scotfang

One possible reason is that you did not set the finish signal when your VAD system triggers. Please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59

Hi Robin, yes, that would be a solution and I will try it. The only issue is that a new websocket connection has to be established every time VAD detects silence, which could be very often. Would your team consider implementing a different websocket signal that flushes the ASR like "end" does, except that it only Reset()s the ASR decoder and doesn't close the websocket connection, similar to continuous_decoding mode?
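The difference between the existing "end" signal and the proposed one can be sketched as a small message handler. Everything here is an assumption made for illustration: the `ToySession` class, the "flush" signal name, and the reply fields are hypothetical and are not taken from the WeNet server code.

```python
import json
from typing import Any, Dict, List


class ToySession:
    """Hypothetical per-connection state; not the WeNet server's classes."""

    def __init__(self) -> None:
        self.open = True                 # whether the websocket is still open
        self.buffered: List[bytes] = []  # audio received since the last finalize

    def _finalize(self) -> Dict[str, Any]:
        # Stand-in for "decode the remaining audio, then Reset() the decoder".
        n = sum(len(b) for b in self.buffered)
        self.buffered.clear()
        return {"status": "ok", "type": "final_result", "bytes_decoded": n}

    def on_audio(self, data: bytes) -> None:
        self.buffered.append(data)

    def on_text(self, message: str) -> Dict[str, Any]:
        signal = json.loads(message).get("signal")
        if signal == "flush":  # hypothetical: finalize, keep the socket open
            return self._finalize()
        if signal == "end":    # finalize, then close the connection
            reply = self._finalize()
            self.open = False
            return reply
        return {"status": "failed", "message": f"unknown signal: {signal}"}
```

In this sketch a client could send `{"signal": "flush"}` each time VAD detects silence and keep streaming on the same connection, reserving `{"signal": "end"}` for the end of the session.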

Feb 15 '22 23:02 scotfang

Actually, maybe I will submit a PR for implementing a "flush" signal for the websocket server

Feb 16 '22 01:02 scotfang

@robin1001 Are you sure that the current code supports flushing ASR outputs when signaling "end" to the websocket with chunk_size > 0?

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L106 Here we set state to DecodeState::kEndFeats via FeaturePipeline::set_input_finished().

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L115
But here it seems that if we didn't reach a certain chunk-size "num_frames >= right_context + 1", decoding will not be called.

Feb 16 '22 01:02 scotfang

@robin1001 Are you sure that the current code supports flushing ASR outputs when signaling "end" to the websocket with chunk_size > 0?

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L106 Here we set state to DecodeState::kEndFeats via FeaturePipeline::set_input_finished().

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L115 But here it seems that if we didn't reach a certain chunk-size "num_frames >= right_context + 1", decoding will not be called.

"num_frames >= right_context + 1" is not related to the end signal; it handles the case where there are not enough input frames to produce one output.

Feb 16 '22 02:02 robin1001
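A worked example may help show why a minimum of right_context + 1 frames is needed before any output exists. The sketch below assumes a subsampling front end made of two unpadded convolutions along time, each with kernel 3 and stride 2, which gives a right context of 6 frames; whether the model in this thread uses exactly that front end is not stated, and the function names are hypothetical.

```python
def conv_out_len(n: int, kernel: int = 3, stride: int = 2) -> int:
    """Output length of an unpadded 1-D convolution along the time axis."""
    if n < kernel:
        return 0
    return (n - kernel) // stride + 1


def subsampled_frames(num_frames: int) -> int:
    """Encoder frames produced by two stacked kernel-3, stride-2 convolutions."""
    return conv_out_len(conv_out_len(num_frames))


# Assumed right context for the two layers above: (3 - 1) * 1 + (3 - 1) * 2
RIGHT_CONTEXT = 6
```

Under these assumptions, any input shorter than RIGHT_CONTEXT + 1 = 7 frames yields zero encoder frames, so there is nothing to decode regardless of whether the end signal was sent. The threshold is a property of the front end, not of chunk_size.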

I implemented a working version of a "flush" signal that does not end the decoding thread. I will submit a PR soon.

Feb 16 '22 02:02 scotfang

I also encountered this problem. Can you share your solution? @scotfang

Jul 26 '23 10:07 raycool

This issue has been automatically closed due to inactivity.

Jan 31 '24 01:01 github-actions[bot]
