espnet_onnx Question on stream_asr.end() function for streaming asr

Hi @Masao-Someki,

In the readme the example for streaming asr shows the use of start() and end() methods:

from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name)

# start streaming asr
stream_asr.start()
while streaming:
  wav = <some code to get wav>
  assert len(wav) == stream_asr.hop_size
  stream_text = stream_asr(wav)[0][0]

# You can get non-streaming asr result with end function
nbest = stream_asr.end()
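
The loop above requires every chunk to be exactly hop_size samples long. Below is a minimal sketch of one way to cut a buffered recording into such chunks for a single microphone session. `FakeStreamingASR` is a hypothetical stand-in that only mimics the start() / call / end() / hop_size surface shown in the README; it does no recognition, and zero-padding the last chunk is an assumption of this sketch, not documented espnet_onnx behaviour.

```python
class FakeStreamingASR:
    """Hypothetical stand-in for illustration; NOT the espnet_onnx class."""

    def __init__(self, hop_size):
        self.hop_size = hop_size
        self.chunks_seen = 0

    def start(self):
        # Reset per-session state when the microphone is opened.
        self.chunks_seen = 0

    def __call__(self, wav):
        assert len(wav) == self.hop_size
        self.chunks_seen += 1
        return [[f"partial-{self.chunks_seen}"]]

    def end(self):
        # Called once when the microphone is closed.
        return [[f"final-after-{self.chunks_seen}-chunks"]]


def iter_chunks(samples, hop_size):
    """Yield consecutive hop_size-long chunks, zero-padding the last one."""
    for begin in range(0, len(samples), hop_size):
        chunk = list(samples[begin:begin + hop_size])
        chunk += [0.0] * (hop_size - len(chunk))
        yield chunk


def run_session(asr, samples):
    """One start()/end() pair around all chunks of one recording."""
    asr.start()
    partial = ""
    for chunk in iter_chunks(samples, asr.hop_size):
        partial = asr(chunk)[0][0]
    return partial, asr.end()[0][0]


partial_text, final_text = run_session(FakeStreamingASR(hop_size=4), [0.1] * 10)
print(partial_text, final_text)
```

Ten samples with a hop of four give three chunks, the last one padded with two zeros.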

In a real streaming scenario should the start() and end() methods be called whenever the microphone is opened and closed?

I am asking because I noticed that the end() function in https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 calls self.batch_beam_search(), which restarts decoding from position 0 and causes a rather large delay for longer speech inputs. If I change https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 to use self.beam_search() instead, the entire utterance is not decoded again at the end, and the delay disappears.

Could you please clarify why self.batch_beam_search() is used in stream_asr.end() function?

Thanks!

Dec 13 '22 09:12 espnetUser

Hi @espnetUser, When I implemented the streaming model, the batch_beam_search was faster, so I chose this function. However, I fixed some bugs related to beam search after I made the comparison, so maybe we need to replace the beam search function...

Dec 17 '22 14:12 Masao-Someki

Hi @Masao-Someki,

My question only concerns the beam search call in the stream_asr.end() method (https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151), not the beam search function in general.

Let me explain with an example. Here is a list of debug logs that show the times and beam searches used as well as the position (output index) in the beam search plus some comments about when stream_asr.start()/end() calls were made:

### open microphone/start of audio (18.4 seconds duration in total) 
### --> call stream_asr.start()
2022-12-20 01:56:58,831 (batch_beam_search_online_sim:90) DEBUG: Position: 0
2022-12-20 01:56:59,677 (batch_beam_search_online_sim:90) DEBUG: Position: 0
2022-12-20 01:57:00,500 (batch_beam_search_online_sim:90) DEBUG: Position: 0
2022-12-20 01:57:00,515 (batch_beam_search_online_sim:90) DEBUG: Position: 1
2022-12-20 01:57:00,577 (batch_beam_search_online_sim:90) DEBUG: Position: 2
2022-12-20 01:57:01,467 (batch_beam_search_online_sim:90) DEBUG: Position: 1
2022-12-20 01:57:01,539 (batch_beam_search_online_sim:90) DEBUG: Position: 2
...
2022-12-20 01:57:16,181 (batch_beam_search_online_sim:90) DEBUG: Position: 37
2022-12-20 01:57:16,493 (batch_beam_search_online_sim:90) DEBUG: Position: 36
2022-12-20 01:57:16,723 (batch_beam_search_online_sim:90) DEBUG: Position: 37
2022-12-20 01:57:17,768 (batch_beam_search_online_sim:90) DEBUG: Position: 36
2022-12-20 01:57:17,998 (batch_beam_search_online_sim:90) DEBUG: Position: 37
### after ~18 seconds, at this point complete hypo (streamed text return) is shown on screen
### --> microphone closed/end of audio 
### --> call stream_asr.end() which uses different beam_search call and starts decoding from position 0 again
2022-12-20 01:57:18,284 (beam_search:333) DEBUG: position 0
2022-12-20 01:57:18,313 (beam_search:333) DEBUG: position 1
2022-12-20 01:57:18,427 (beam_search:333) DEBUG: position 2
2022-12-20 01:57:18,541 (beam_search:333) DEBUG: position 3
2022-12-20 01:57:18,653 (beam_search:333) DEBUG: position 4
2022-12-20 01:57:18,772 (beam_search:333) DEBUG: position 5
...
2022-12-20 01:57:25,451 (beam_search:333) DEBUG: position 42
2022-12-20 01:57:25,682 (beam_search:333) DEBUG: position 43
2022-12-20 01:57:25,948 (beam_search:333) DEBUG: position 44
### --> stream_asr.end() returns after another ~8 seconds of delay

As you can see, the stream_asr.end() function, which calls self.batch_beam_search(), restarts decoding at position 0, causing a delay of about 8 seconds after the end of speech.
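
The size of that delay can be read directly from the timestamps in the log. A small sketch that does the arithmetic, using only the three timestamps quoted above (the helper name `seconds_between` is made up for this example):

```python
from datetime import datetime

# Timestamp layout used by the debug log lines above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(first, second):
    """Return the elapsed time in seconds between two log timestamps."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    stop = datetime.strptime(second, LOG_TIME_FORMAT)
    return (stop - start).total_seconds()


last_streaming_step = "2022-12-20 01:57:17,998"  # last online search step
first_end_step = "2022-12-20 01:57:18,284"       # end(): position 0
last_end_step = "2022-12-20 01:57:25,948"        # end(): position 44

end_decoding_time = seconds_between(first_end_step, last_end_step)
delay_after_stream = seconds_between(last_streaming_step, last_end_step)
print(f"decoding inside end(): {end_decoding_time:.3f} s")
print(f"delay after last streamed step: {delay_after_stream:.3f} s")
```

This gives roughly 7.7 seconds of decoding inside end() and about 7.95 seconds between the last streamed step and the final result.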

So I am wondering if the following line https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 should be changed from

best_hyps = self.batch_beam_search(
            np.array(self.enc_feats, dtype=np.float32))

to

best_hyps = self.beam_search(
            np.array(self.enc_feats, dtype=np.float32))

making stream_asr.end() work with online beam search? Or am I using stream_asr.end() incorrectly here?

I hope this makes my question clearer.

Dec 20 '22 14:12 espnetUser

Hi @espnetUser and @Masao-Someki, did you verify that the ONNX streaming inference and the original model's streaming inference produce the same results? I am getting different outputs in some cases. Is that possible? Please reply.

Feb 08 '23 11:02 sanjuktasr

@Masao-Someki @espnetUser @ShigekiKarita @Fhrozen Could any of you please tell me whether the online batch beam search is equivalent to the espnet bin streaming inference? Or at least point me to a way to find out.

Feb 09 '23 12:02 sanjuktasr

Hello, the output of my pth model does not match the output of the ONNX streaming ASR model. Please help.

Mar 01 '23 12:03 sanjuktasr

What do you mean by "not the same"? Could you share the logs so we can look at any errors?

Mar 02 '23 13:03 Fhrozen

What do you mean by "not the same"? Could you share the logs so we can look at any errors?

The code runs fine, but for some sentences in my dataset the ONNX model's inference output is not the same as the pth model's output. Some examples are below. Approximately 15% of the outputs do not match.

PTH_op: βCλə ζJ @Bθəμə OCWCJ
ONNX_op: βCλə βCλə ζJ @Bθəμə OCWCJ

PTH_op: ζB&ə ∞əF ∇əΩə ∇əΩə ψF∞θə ∞əF B!ə ζF∞C ζF∞C ψF∞θə OL λC@Bθəμə OəλL
ONNX_op: ζB&ə ∞əF ∇əΩə ∇əΩə ψF∞θə ∞əF B!ə ψF∞θə ψF∞θə OL λC@Bθəμə OəλL

PTH_op: OθB Bαə ∞əF VBλə ∞əF VBλə ψF∞θə ψF∞θə αB⊃Və ζαC∞ə JOə πL OL @Bθəμə Oəλə ζəOə&J ΩK⊃
ONNX_op: OθB Bαə ∞əF VBλə ∞əF VBλə ζF∞θə ζF∞θə αB⊃Və &C∞ə JOə πL OL @Bθəμə Oəλə ζəOə&J ΩK⊃
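
Mismatches like these are easier to track when they are counted rather than compared by eye. A minimal sketch, assuming the outputs are whitespace-separated tokens, that computes a token-level edit distance between a pth output and an ONNX output (the function name is made up for this example; the pair is the first one listed above):

```python
def token_edit_distance(reference, hypothesis):
    """Levenshtein distance over whitespace-separated tokens."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    # previous[j] = distance between the processed reference prefix
    # and the first j hypothesis tokens.
    previous = list(range(len(hyp_tokens) + 1))
    for i, ref_token in enumerate(ref_tokens, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp_tokens, start=1):
            substitute = previous[j - 1] + (ref_token != hyp_token)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(delete, insert, substitute))
        previous = current
    return previous[-1]


pth_output = "βCλə ζJ @Bθəμə OCWCJ"
onnx_output = "βCλə βCλə ζJ @Bθəμə OCWCJ"

distance = token_edit_distance(pth_output, onnx_output)
print(f"token edits between pth and ONNX output: {distance}")
```

For this pair the only difference is the repeated first token in the ONNX output, so the distance is one insertion.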

I will share the logs ASAP.

Mar 03 '23 04:03 sanjuktasr

Hello @Fhrozen, any idea? I had actually switched logging off. Also, there are no errors in the code; the ONNX output simply does not match the pth model output, as in the examples I gave before.

Mar 06 '23 10:03 sanjuktasr

Mismatches between pth and ONNX models are common and can be larger depending on the language. Just in case, try changing the decoding hyperparameters, such as beam size, CTC weight, and similar settings. You can find the config file in the same folder as the ONNX models.

Mar 06 '23 10:03 Fhrozen

Hi @sanjuktasr, sorry for the late reply. Just for clarification, which version of espnet_onnx do you use? If you use the latest PyPI version, would you clone this repository and check whether the accuracy drop still occurs with the latest version on GitHub? And please check the decoding configuration as @Fhrozen mentioned. Hyperparameters are defined in ~/.cache/espnet_onnx/<tag_name> by default. The outputs of ONNX and PyTorch are not completely the same, but with CI tests we assume the difference is small enough to get the same hypothesis.

Mar 06 '23 11:03 Masao-Someki

Hey, thank you. I have verified the configurations several times. Some of the outputs still do not match the pth model inference. I will get back to you ASAP on the other things, such as the ONNX version. We could connect, if possible, to understand what is going wrong.

Mar 06 '23 12:03 sanjuktasr

@Masao-Someki @Fhrozen

espnet-onnx 0.1.10
onnx 1.12.0
onnxruntime 1.13.1

Mar 08 '23 05:03 sanjuktasr

I do not think those details are enough. You need to modify the values in your config.yml file, which should look something like this:

beam_search:
  beam_size: 5
  maxlenratio: 0.0
  minlenratio: 0.0
  pre_beam_ratio: 1.5
  pre_beam_score_key: full
ctc:
  model_path: full/ctc.onnx
  quantized_model_path: quantize/ctc_qt.onnx
decoder:
  dec_type: XformerDecoder
  model_path: full/xformer_decoder.onnx
  n_layers: 6
  odim: 512
  quantized_model_path: quantize/xformer_decoder_qt.onnx

Look at the beam_search section; those are the values you need to change.

Mar 08 '23 07:03 Fhrozen

@Fhrozen

beam_search:
  beam_size: 10
  maxlenratio: 0.0
  minlenratio: 0.0
  pre_beam_ratio: 1.5
  pre_beam_score_key: full

This part is for beam search and I have checked the configuration for other modules also. They look fine.

What could be the probable causes of the mismatch, other than configuration?

frontend
encoder
decoder
ctc
beam search

How should I debug this problem? Could you please help with that? Also, FYI, my model is not quantized.

Mar 08 '23 09:03 sanjuktasr

Hello @Fhrozen @Masao-Someki, I have checked the configs thoroughly several times, and there are no issues there. Can you tell me what the possible reasons for this problem are? I am using the original, unmodified code base.

Mar 09 '23 10:03 sanjuktasr

@sanjuktasr Is the frontend output the same? We fixed a librosa issue before (#71), so this might be a cause. If the frontend output is the same, maybe we have some issues with beam search. I cannot work on this project on weekdays, so I will see if there is any bug with streaming asr this weekend.
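
The check suggested here (compare the frontend first, then move on to later modules) can be automated by comparing intermediate outputs stage by stage. A minimal sketch in plain Python, assuming you have already dumped each stage's output from both pipelines as flat lists of floats; the stage names, the numbers, and the 1e-4 tolerance are made-up placeholders, not values from espnet_onnx:

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference; inf if the lengths differ."""
    if len(reference) != len(candidate):
        return float("inf")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def first_diverging_stage(stages, tolerance=1e-4):
    """Return the name of the first stage whose outputs differ, or None."""
    for name, reference, candidate in stages:
        if max_abs_diff(reference, candidate) > tolerance:
            return name
    return None


# (stage name, PyTorch output, ONNX output) in pipeline order; toy numbers.
stages = [
    ("frontend", [0.10, 0.20, 0.30], [0.10, 0.20, 0.30]),
    ("encoder", [1.00, 2.00], [1.00, 2.05]),
    ("ctc", [0.5], [0.9]),
]

print(first_diverging_stage(stages))
```

Everything downstream of the first diverging stage will differ as well, so only the earliest mismatch is worth investigating.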

Mar 09 '23 12:03 Masao-Someki

@Masao-Someki I checked with the available code, and the frontend output was not the same. I forced the frontend to match by taking the frontend values from the original pth model and applying them to the ONNX configuration, but there was still no improvement. With this modification, though, the errors no longer occur on the same sentences. Thanks a lot, @Masao-Someki, and do let me know if you find any bugs or issues that might be causing this.

Mar 13 '23 04:03 sanjuktasr

@Masao-Someki @Fhrozen I tried to keep the same streaming inference code for the frontend and beam search, and replaced only the encoder with ONNX. The results did not match again; the output with the ONNX encoder was almost gibberish. Please tell me whether I should modify this strategy or try a different one.

Mar 14 '23 11:03 sanjuktasr

@sanjuktasr

I checked with the available code, and the frontend output was not the same. I forced the frontend to match by taking the frontend values from the original pth model and applying them to the ONNX configuration, but there was still no improvement.

Would you check if the stft configuration is using the correct padding mode as follows in stft.py:

stft_kwargs = dict(
    n_fft=self.config.n_fft,
    win_length=self.config.win_length,
    hop_length=self.config.hop_length,
    center=self.config.center,
    window=self.config.window,
    pad_mode="reflect",  # <- check this line
)
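
To see why the padding mode matters: with center=True, an STFT pads the signal on both sides before framing, so a different pad_mode changes the samples that the first and last frames see. A small pure-Python sketch contrasting reflect padding with zero padding on a toy signal (the helpers are written for this example and are not taken from stft.py):

```python
def reflect_pad(samples, pad):
    """Mirror the signal around its edge samples without repeating them."""
    if not 0 <= pad < len(samples):
        raise ValueError("pad must be smaller than the signal length")
    left = samples[1:pad + 1][::-1]
    right = samples[-pad - 1:-1][::-1]
    return left + samples + right


def constant_pad(samples, pad, value=0):
    """Pad both sides with a constant value."""
    return [value] * pad + samples + [value] * pad


signal = [1, 2, 3, 4]
print(reflect_pad(signal, 2))   # edges continue the waveform
print(constant_pad(signal, 2))  # edges drop to zero
```

The interior samples are identical in both cases; only the frames touching the start and end of the audio differ, which is where feature mismatches from a wrong padding mode would show up.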

do let me know if you find any bugs or issues that might be causing this.

I've found an index issue during the inference, and I'm working on this. You can fix the issue by deleting the +1 in streaming.py like:

offset = (
    self.config.block_size
    - self.config.look_ahead
    - self.config.hop_size
)  # delete +1 here

The model output would be the same with this bugfix, but the resulting sentence might differ. I have changed the beam search in the end() function, so I think this change is the cause.
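
As a worked example of the off-by-one, here is the offset computed with and without the +1. The values block_size=40, look_ahead=16, and hop_size=16 are assumed for illustration only; use the values from your own encoder config.

```python
def block_offset(block_size, look_ahead, hop_size, extra=0):
    """Offset as in the snippet above; extra=1 reproduces the old '+ 1'."""
    return block_size - look_ahead - hop_size + extra


# Assumed example values, not read from any real config.
block_size, look_ahead, hop_size = 40, 16, 16

offset_with_plus_one = block_offset(block_size, look_ahead, hop_size, extra=1)
offset_fixed = block_offset(block_size, look_ahead, hop_size)
print(offset_with_plus_one, offset_fixed)
```

With these values the index moves from 9 to 8, a shift of one frame per block.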

Mar 15 '23 11:03 Masao-Someki

@Masao-Someki ok thanks will check and let you know. Thanks for the update.

Mar 15 '23 12:03 sanjuktasr

I made two changes:
1. Fixed the offset variable issue in streaming.py.
2. Changed batch_beam_search to beam_search in the end function.

There is still no change in accuracy. I am using the features output by the pth model version.

Mar 15 '23 13:03 sanjuktasr

@sanjuktasr To obtain the same result, I think we need to use the batch_beam_search_online (https://github.com/espnet/espnet/blob/master/espnet/nets/batch_beam_search_online.py)

Mar 15 '23 15:03 Masao-Someki

Thanks @Masao-Someki, will try that and update.

Mar 16 '23 11:03 sanjuktasr

@Masao-Someki I have implemented the online batch beam search in the code, but there is still no improvement. Is it possible that the ONNX export is causing these deviations? Please let me know. Thanks and regards. :-)

Mar 17 '23 06:03 sanjuktasr

Hi @sanjuktasr and @espnetUser, thank you for your reports; I fixed streaming-related bugs in #83. I removed BatchedBeamSearch in the end function in this PR because we do not necessarily need this as @espnetUser pointed out.

Mar 21 '23 11:03 Masao-Someki

Hi @Masao-Someki @espnetUser, after implementing the fixes, there are still issues. FYI:

0
φCμə θF @Bθəμə !F ζJφJ∞ə βLλə ζCOζə ζJφJ∞ə ρCλL ρCλL ζCOζə ζCOζə ⊂λC ⊂λC
φCμə θF @Bθəμə !F ζJφJ∞ə βLλə ζCOζə ζJφJ∞ə ρCλL ρCλL ζCOζə ⊂λC ⊂λC

1
OMμə αλCφCθəζə ∞əεγəλə
OMμə αλCφCθəζə ∞əε

2
BC φB∞!ə !F Oə∞JO!ə !F ζJφJ∞ə ζJφJ∞ə ζCOζə ρCλL !F !F !F !F ζJφJ∞ə βLλə
BC φB∞!ə !F Oə∞JO!ə !F ζJφJ∞ə ζJφJ∞ə ζCOζə ρCλL !F !F !F !F ζJ

3
OBC∞@μC μCSOə !F ζCOζə βLλə ζJφJ∞ə ρCλL !F ζJφJ∞ə ρCλL !F φə∞ə !F
OBC∞@μC μCSOə !F ζCOζə βLλə ζJφJ∞ə ρCλL !F ζJφJ∞ə ρCλL !F φə∞ə !F

4
φF@ə θF αμCρə Oə∞JO!ə εC !F ζJφJ∞ə βLλə ⊂λC βLλə βLλə βLλə βLλə ⊂λə ζJφJ∞ə βLλə
φF@ə θF αμCρə Oə∞JO!ə εC !F ζJφJ∞ə βLλə ⊂λC βLλə βLλə βLλə βBCφə ⊂λC ζJφJ∞ə βLλ

5
OF@ə θF OMμə μBζ!ə @Bθəμ@ə ∞əεγəλə
OF@ə θF OMμə μBζ!ə @Bθəμ@ə ∞əεγəλ

6
φCμə θF @Bθəμə !F ζJφJ∞ə βLλə ζCOζə ζJφJ∞ə ρCλL ρCλL ζCOζə ζCOζə ⊂λC ⊂λC
φCμə θF @Bθəμə !F ζJφJ∞ə βLλə ζCOζə ζJφJ∞ə WCλL ρCλL ζCOζə ζCOζə ⊂λ

7
OMμə αλCφCθəζə ∞əεγəλə
OMμə αλCφCθəζə ∞əεγəλ

8
BC φB∞!ə !F Oə∞JO!ə !F ζJφJ∞ə ζJφJ∞ə ζCOζə ρCλL !F !F !F ζJφJ∞ə βLλə
BC φB∞!ə !F Oə∞JO!ə !F ζJφJ∞ə ζJφJ∞ə ζCOζə ρCλL !F !F !F !F ζJφJ∞ə

9
φF@ə θF αμCρə Oə∞JO!ə εC !F ζJφJ∞ə βLλə ⊂λC βLλə βLλə βLλə βBCφə ⊂λC ζJφJ∞ə βLλə
φF@ə θF αμCρə Oə∞JO!ə εC !F ζJφJ∞ə βLλə ⊂λC βLλə βLλə βLλə βBCφə ⊂λC ζJφJ∞ə βLλə

The issues mostly consist of incomplete speech, and there are some other issues as well. Thanks for the fix anyway. The ONNX modules (encoder and decoder) are working fine. Please help me fix this. Thanks a lot for your help. :-)

Mar 21 '23 13:03 sanjuktasr

@sanjuktasr Thank you. It looks like the final look-ahead tensor is not being recognized. I think we need to change the following lines so that the final look-ahead tensor is included, from https://github.com/espnet/espnet_onnx/blob/46b06f129167c8e27fb36e4ddf15bfe50420f5f2/espnet_onnx/asr/asr_streaming.py#L132-L136 to

 process_num = (len(speech) - self.initial_wav_length + look_ahead_wav_len) // self.hop_size + 1

where

look_ahead_wav_len = (
    self.config.encoder.frontend.stft.hop_length
    * self.config.encoder.subsample
    * self.config.encoder.look_ahead
    + (
        self.config.encoder.frontend.stft.n_fft
        // self.config.encoder.frontend.stft.hop_length
    )
    * self.config.encoder.frontend.stft.hop_length
)
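
A worked example of these two expressions with concrete numbers may help. Every value below is an assumption chosen for illustration (16 kHz audio, hop_length=128, n_fft=512, subsample=4, look_ahead=16, hop_size=8192, initial_wav_length=20000); none is read from a real config. The comparison line simply drops the look-ahead term; it is not a quote of the existing code at the link above.

```python
# Assumed STFT and encoder settings (illustrative only).
stft_hop_length = 128
n_fft = 512
subsample = 4
look_ahead = 16

# Assumed streaming settings, in samples (illustrative only).
hop_size = 8192
initial_wav_length = 20000

# 18.4 seconds of audio at an assumed 16 kHz sample rate.
speech_len = 294_400

look_ahead_wav_len = (
    stft_hop_length * subsample * look_ahead
    + (n_fft // stft_hop_length) * stft_hop_length
)

# Proposed expression, including the look-ahead samples.
process_num = (
    speech_len - initial_wav_length + look_ahead_wav_len
) // hop_size + 1

# Same expression without the look-ahead term, for comparison only.
process_num_without = (speech_len - initial_wav_length) // hop_size + 1

print(look_ahead_wav_len, process_num, process_num_without)
```

Under these assumptions the look-ahead term adds 8704 samples and one extra processing step (35 instead of 34), which is the step that would cover the tail of the utterance.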

Mar 21 '23 15:03 Masao-Someki

Hi @Masao-Someki, the issue still persists:

pth : αμCρə OMμə ∞BC∞ə ζJφJ∞ə ⊂λC ζCOζə ∞BC∞ə βBCφə J!ə !F ⊂λC ζJφJ∞ə
hyp : αμCρə OMμə ∞BC∞ə ζJφJ∞ə ⊂λC ζCOζə ∞BC∞ə βBCφə J!ə !F ⊂λC ζJφJ∞

pth : OF@ə θF αμCρə Oə∞JO!ə εC !F ζCOə ζJφJ∞ə J!ə βLλə J!J@ə ρCλLζCOζə ζJφJ∞ə βLλə
hyp : OF@ə θF αμCρə Oə∞JO!ə εC !F ζCOζə ζJφJ∞ə J!ə βLλə J!JJ@ə ζCOζə ζJφJ∞ə βLλə

The last character is still an issue, and some characters are substituted. Overall, the accuracy of this model is degraded. Please let me know if anything can be done to resolve this. Also, since the encoder has a precision of 2 decimal places, can anomalies of this kind be expected? Thanks and regards

Mar 22 '23 11:03 sanjuktasr

Hi @Masao-Someki, the issue with the ONNX encoder-decoder module is solved, as far as I have checked, and the precision is now fine as well. However, the mismatch persists with similar kinds of errors. Please help me identify the cause. Also, how could padding the speech affect the result?

Mar 23 '23 12:03 sanjuktasr

@sanjuktasr

Also, since the encoder has a precision of 2 decimal places, can anomalies of this kind be expected?

Am I correct that abs(torch_output - onnx_output) is larger than 0.01 for your encoder? There is usually a small difference between the PyTorch output and the ONNX output, but 0.01 is too large. I think it should be smaller than 1e-4 ~ 1e-5. We have a parity test that checks whether the MSE is smaller than 1e-10, which is small enough to get the same result. (In practice it is usually smaller than 1e-12 ~ 1e-13.)

https://github.com/espnet/espnet_onnx/blob/46b06f129167c8e27fb36e4ddf15bfe50420f5f2/tests/unit_tests/test_inference_asr.py#L65-L72

If your model has parity issues, would you re-export your model and check again? And if there is no parity issue or difference in decoding configuration, then beam search may still have some problems...
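
A rough sketch of such a parity check in plain Python, using the 1e-10 MSE threshold mentioned above. The arrays are toy numbers, and the function names are made up for this example; the real test at the link above may be structured differently.

```python
def mean_squared_error(reference, candidate):
    """MSE between two equally long flat sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    squared = ((r - c) ** 2 for r, c in zip(reference, candidate))
    return sum(squared) / len(reference)


def passes_parity(reference, candidate, threshold=1e-10):
    """True when the two outputs agree closely enough."""
    return mean_squared_error(reference, candidate) < threshold


# Toy stand-ins for a PyTorch encoder output and two ONNX outputs.
torch_like = [0.1234567, -0.7654321, 0.5]
onnx_close = [0.1234569, -0.7654320, 0.5000003]
onnx_two_decimals = [round(value, 2) for value in torch_like]

print(passes_parity(torch_like, onnx_close))
print(passes_parity(torch_like, onnx_two_decimals))
```

Outputs that agree only to two decimal places fail this check by several orders of magnitude, which is why a difference of that size points to an export or configuration problem rather than normal numerical noise.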

I added a padding process to calculate the final part of the audio file. Usually, in the contextual block conformer/transformer model, we use a look-ahead tensor, which is future information. I thought that the final word was included in the look-ahead and was not calculated in the encoder layer.

Mar 23 '23 13:03 Masao-Someki
