Add a flush() method for ASR decoding

Open scotfang opened this issue 3 years ago • 8 comments

Is your feature request related to a problem? Please describe.
Hi, we're using a streaming ASR model with chunk_size == 8. When a speaker pauses, our VAD system stops sending audio chunks to the ASR. The transcript is then often 2-3 words behind what was actually spoken, because the audio received so far has not reached the next chunk_size == 8 boundary. Once the speaker resumes, the lagging 2-3 words appear, so the ASR system does not lose any words; it just delays them considerably.

Describe the solution you'd like
It would be great if we could signal the decoder to flush() the current transcript even if chunk_size has not been reached, or set a parameter in milliseconds that instructs the ASR system to flush() after N milliseconds, even if it has received no further input.

Describe alternatives you've considered
I've tried sending zeroed-out wav chunks to the ASR system when VAD detects silence, but the zeroed-out chunks produce spurious transcripts such as "oh. yes. um.".

scotfang avatar Feb 12 '22 01:02 scotfang
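
The lag and the requested flush() can be pictured with a toy model. This is only a sketch and not WeNet code: `ChunkedDecoder`, `accept`, `flush`, and `lag` are hypothetical names, frames are plain integers, and "decoding" just records which frames would have been processed.

```python
from typing import List


class ChunkedDecoder:
    """Toy stand-in for a streaming decoder that only decodes full chunks."""

    def __init__(self, chunk_size: int = 8) -> None:
        self.chunk_size = chunk_size
        self._pending: List[int] = []        # frames received but not yet decoded
        self.decoded: List[List[int]] = []   # one entry per decoded chunk

    def accept(self, frames: List[int]) -> None:
        """Buffer incoming frames and decode every complete chunk."""
        self._pending.extend(frames)
        while len(self._pending) >= self.chunk_size:
            self.decoded.append(self._pending[: self.chunk_size])
            self._pending = self._pending[self.chunk_size:]

    def flush(self) -> None:
        """Decode whatever is buffered, even if it is shorter than a chunk."""
        if self._pending:
            self.decoded.append(self._pending)
            self._pending = []

    @property
    def lag(self) -> int:
        """Number of frames held back behind the chunk boundary."""
        return len(self._pending)


decoder = ChunkedDecoder(chunk_size=8)
decoder.accept(list(range(13)))    # speaker pauses after 13 frames
lag_before_flush = decoder.lag     # 13 - 8 = 5 frames are still waiting
decoder.flush()                    # VAD reports silence: force the tail out
lag_after_flush = decoder.lag      # nothing is held back any more
```

The millisecond variant of the request would call the same `flush()` from an idle timer instead of from the VAD callback.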

One possible reason is that you do not send the finish signal when your VAD system triggers. Please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59

robin1001 avatar Feb 12 '22 03:02 robin1001

@robin1001 any comments on this?

> Describe alternatives you've considered
> I've tried sending zero-ed out wav-chunks to the ASR system when VAD detects silence, but the zero-ed out chunks produce spurious transcripts, like "oh. yes. um.".

madkote avatar Feb 14 '22 09:02 madkote

@madkote I believe zeroed-out wav data produces spurious transcripts such as "and", "yes", "oh", "um" because the ASR model was not trained on silence or zeroed-out wav data. The model is seeing an input it never saw in training and may be falling back to the outputs with the highest prior, e.g. "and", "yes", "oh", "um".

scotfang avatar Feb 15 '22 17:02 scotfang

> One possible reason is that you did not set finish signal when your VAD system triggers. please see https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L59

Hi Robin, yes, that would be a solution and I will try it. The only issue is that a new websocket connection has to be established every time VAD detects silence, which could be very often. Would your team consider implementing a different websocket signal that, like "end", flushes the ASR, except that it calls Reset() on the ASR decoder and doesn't close the websocket connection, similar to continuous_decoding mode?

scotfang avatar Feb 15 '22 23:02 scotfang
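
The difference between "end" and the proposed signal can be sketched as a small message dispatcher. Everything here is an assumption made for illustration: the `Session` class, the JSON shape `{"signal": ...}`, and the `"flush"` signal name are hypothetical and are not taken from the linked websocket_server.cc.

```python
import json
from typing import List


class Session:
    """Hypothetical per-connection state; not WeNet's actual classes."""

    def __init__(self) -> None:
        self.open = True
        self.buffered: List[bytes] = []   # audio received since the last final result
        self.finals: List[str] = []       # final results emitted so far

    def _finalize(self) -> None:
        # Stand-in for "decode the buffered tail, emit a final result, reset".
        n_bytes = sum(len(chunk) for chunk in self.buffered)
        self.finals.append(f"final({n_bytes} bytes)")
        self.buffered = []

    def on_binary(self, data: bytes) -> None:
        self.buffered.append(data)

    def on_text(self, message: str) -> None:
        signal = json.loads(message).get("signal")
        if signal == "end":
            self._finalize()
            self.open = False     # as described in the thread: the connection ends
        elif signal == "flush":   # hypothetical new signal
            self._finalize()      # same finalization, but the session stays open
        else:
            raise ValueError(f"unknown signal: {signal!r}")


session = Session()
session.on_binary(b"\x01" * 320)
session.on_text('{"signal": "flush"}')   # VAD pause: flush and keep the connection
still_open_after_flush = session.open
session.on_binary(b"\x01" * 160)
session.on_text('{"signal": "end"}')     # end of stream: flush and close
```

With a signal like this, a client could send one message per VAD pause and reuse the same connection for the next utterance.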

Actually, maybe I will submit a PR for implementing a "flush" signal for the websocket server

scotfang avatar Feb 16 '22 01:02 scotfang

@robin1001 Are you sure that the current code supports flushing ASR outputs when signaling "end" to the websocket with chunk_size > 0?

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L106 Here we set state to DecodeState::kEndFeats via FeaturePipeline::set_input_finished().

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L115
But here it seems that if we didn't reach a certain chunk-size "num_frames >= right_context + 1", decoding will not be called.

scotfang avatar Feb 16 '22 01:02 scotfang

"num_frames >= right_context + 1" is not related to the end signal; it covers the case where there are not enough input frames to produce a single output.

robin1001 avatar Feb 16 '22 02:02 robin1001
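
Why a check of the form "num_frames >= right_context + 1" guards against producing zero outputs can be seen with a little arithmetic. The sketch below assumes a front end of two unpadded convolutions with kernel 3 and stride 2 (a common 4x subsampling layout, assumed here rather than read from the linked file), which gives a right context of 6 frames. The function names are hypothetical.

```python
def conv_out_len(n: int, kernel: int = 3, stride: int = 2) -> int:
    """Output length of one unpadded 1-D convolution over n frames."""
    return max(0, (n - kernel) // stride + 1)


def subsampled_len(num_frames: int) -> int:
    """Output length after two stacked kernel-3, stride-2 convolutions."""
    return conv_out_len(conv_out_len(num_frames))


# Receptive field of the stack is 3 + (3 - 1) * 2 = 7 frames,
# so one output needs the current frame plus 6 frames of right context.
RIGHT_CONTEXT = 6


def enough_frames(num_frames: int, right_context: int = RIGHT_CONTEXT) -> bool:
    """True when the input is long enough to yield at least one output."""
    return num_frames >= right_context + 1


# The frame-count check and the convolution arithmetic agree for every length.
agree = all(enough_frames(n) == (subsampled_len(n) >= 1) for n in range(50))
```

Under these assumptions, six buffered frames yield no output at all regardless of any end signal, while seven yield exactly one.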

I implemented a working version of a "flush" signal that does not end the decoding thread, will submit a PR soon

scotfang avatar Feb 16 '22 02:02 scotfang

I also encountered this problem. Can you share your solution? @scotfang

raycool avatar Jul 26 '23 10:07 raycool

This issue has been automatically closed due to inactivity.

github-actions[bot] avatar Jan 31 '24 01:01 github-actions[bot]