Streaming performance issues on upgrading to release v2.0.0
Describe the bug
On updating to release v2.0.0, I've been noticing some performance issues when running real-time audio streams against a quantized e2e model (no LM) via runtime/server/x86/bin/websocket_server_main. For some stretches of time, performance may be comparable between v1 and v2, but there are points where I can expect to see upwards of a 20s delay on a given response. Outside of a few minor updates related to the switch, nothing else (e.g. resource allocations) has changed on my end.
Thus far, I haven't been able to pinpoint much of a pattern to the lag, except that it consistently happens (among other times) at the start of the stream. Have you observed any similar performance issues between v1 and v2, or is there some v2-specific runtime configuration I may have missed?
Expected behavior
Comparable real-time performance between releases v1 and v2.
Screenshots
The following graphs show the results from a single test. The x-axes represent the progression of the audio file being tested, and the y-axes represent round-trip response times from wenet minus some threshold, i.e. any data points above 0 indicate additional round-trip latency above an acceptable threshold (in my case, 500ms). As you can see, in the v1 graph responses are largely generated and returned below the threshold time (with the exception of a few final-marked transcripts). However, in the v2 graph, there are several lengthy periods during which responses take an unusually long time to return (I've capped the graph at 2s for clearer viewing, but in reality responses are taking up to 20s to return).
Wenet v1
Wenet v2
Additional context
Both tests were run with wenet hosted via AWS ECS/EC2. So far as I've seen, increasing CPU + memory allocations to the wenet container doesn't seem to resolve the issue.
There is no such problem in our experiment.
@Mddct @xingchensong @day9011 Any idea on the problem?
Is there any difference in the test data between the v1 and v2 tests? Does the test set contain any long audio?
The same test data was used for v1 and v2, sent in the same way (chunks of audio sent at 250ms intervals). I'm not 100% sure what you mean by long audio, but the test data I've been using is generally in the 2-10 min range, with endpoints forced at 15s of continuous transcription.
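To make the chunking concrete, here is a minimal pacing sketch. It assumes 16 kHz, 16-bit mono PCM (my assumption; the sample format isn't stated above), and send is a placeholder for whatever actually delivers each chunk to the server:

```python
import time

# Assumed audio format: 16 kHz, 16-bit mono PCM.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 250
# 250 ms of audio = 16000 * 0.25 = 4000 samples = 8000 bytes per chunk.
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def stream_in_real_time(pcm: bytes, send) -> None:
    """Send one 250 ms chunk, then sleep 250 ms, to mimic a live stream."""
    for offset in range(0, len(pcm), CHUNK_BYTES):
        send(pcm[offset:offset + CHUNK_BYTES])
        time.sleep(CHUNK_MS / 1000)
```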
Maybe it is just a warmup issue. In https://github.com/wenet-e2e/wenet/pull/942 we can clearly see that the warmup round (round 1) is significantly slower than the following rounds. We can conclude that both the encoder (processing audio chunks) and the decoder (processing text of various lengths) need warmup: the outliers at the very beginning might be caused by encoder warmup, and the outliers in the middle might be caused by decoder warmup (i.e., the decoder might be rescoring a hypothesis whose length differs from all previous hypotheses). You can run the test twice and check whether the phenomenon still appears in the second round.
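To make the warmup effect concrete, here is a self-contained toy sketch (my own example, not wenet code): it scripts a small dummy encoder with torch.jit and times repeated calls at a few input lengths. With the profiling JIT in recent libtorch/PyTorch versions, the first couple of calls, and often the first call at each new input length, tend to run much slower than the rest, which matches the outliers described above.

```python
import time
import torch

class TinyEncoder(torch.nn.Module):
    """Dummy stand-in for an encoder; shapes and sizes are made up."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, hidden)
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(hidden, hidden) for _ in range(8)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

model = torch.jit.script(TinyEncoder().eval())

with torch.no_grad():
    for num_frames in (16, 67, 128):        # different "chunk" lengths
        for i in range(3):                  # call each length a few times
            feats = torch.randn(1, num_frames, 80)
            start = time.perf_counter()
            model(feats)
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"frames={num_frames} call={i} {elapsed_ms:.1f} ms")
```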
@xingchensong Reviewing the following results, it does look like it may be related.
Cold (up to 25s delay)
Warmed
I ran the test a few times to ensure the system was properly warmed up before taking the second screenshot, and we can see that the response times are significantly lower. Given this, I have a few questions:
- Is this intended behavior for wenet 2 onwards?
- Is there a guideline on how to warm up the system prior to usage (other than passing through arbitrary audio files)? In my case, users stream audio to our system and rely on real-time transcripts to make critical decisions, so a 25s delay would be unacceptable in production. Relatedly, it's critical for my system to minimize the time to reach ready-state (from launching a wenet container to having it accept audio streams), so having to spend a minute or more prepping the server would be non-ideal.
- Yes, and it is mostly caused by the different versions of libtorch.
- I think the time spent on warmup is not a big issue in a production environment, since we can set an upper limit on concurrent requests for the current container, and when concurrency reaches 80% (or less) of that threshold, just pre-launch a new container and warm it up (see the sketch after the links below).
1. https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
2. https://www.tensorflow.org/tfx/serving/saved_model_warmup
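Purely as an illustration of that pre-launch warmup (this is not wenet's own tooling): a sketch of a gate script run at container start, which streams a couple of short wav files of different durations through the websocket server and only then drops a "ready" marker for the health check. The URL, the JSON start/end signals, and the file names are all assumptions on my part; the bundled C++ websocket client is the authoritative reference for the real protocol.

```python
# Hypothetical warmup gate run at container start, before the health
# check is allowed to pass. Streams a few short utterances of different
# durations so both encoder and decoder paths get exercised.
# The URL, JSON signals, and wav names below are assumptions, not
# wenet's documented protocol -- mirror the bundled websocket client.
import asyncio
import json
import wave

import websockets  # pip install websockets

SERVER_URL = "ws://127.0.0.1:10086"                 # assumed host/port
WARMUP_WAVS = ["warmup_2s.wav", "warmup_10s.wav"]   # placeholder 16 kHz mono wavs
CHUNK_MS = 250                                      # match production chunking

async def stream_one(path: str) -> None:
    with wave.open(path, "rb") as wav:
        chunk_bytes = int(wav.getframerate() * wav.getsampwidth() * CHUNK_MS / 1000)
        pcm = wav.readframes(wav.getnframes())

    async with websockets.connect(SERVER_URL) as ws:
        await ws.send(json.dumps({"signal": "start", "nbest": 1,
                                  "continuous_decoding": True}))
        for offset in range(0, len(pcm), chunk_bytes):
            await ws.send(pcm[offset:offset + chunk_bytes])
        await ws.send(json.dumps({"signal": "end"}))
        # Drain responses until the server stops answering or closes.
        try:
            while True:
                await asyncio.wait_for(ws.recv(), timeout=2.0)
        except (asyncio.TimeoutError, websockets.exceptions.ConnectionClosed):
            pass

async def main() -> None:
    for path in WARMUP_WAVS:
        await stream_one(path)

if __name__ == "__main__":
    asyncio.run(main())
    # Only now let the container report healthy, e.g. via a marker file.
    open("/tmp/wenet_ready", "w").close()
```

Using utterances of a couple of different lengths here is deliberate, since the decoder appears to need warming up per hypothesis/chunk length, as discussed further down the thread.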
Perhaps it would be better to provide a warmup interface when deploying the server.
@xingchensong I also encountered this problem. Warmup is effective, but I found an issue: when the length of the data passed to decode() changes, it needs to be warmed up again. Is there a simpler way?
The log looks like this:
asr_decoder.cc:104] Required 60 get 60
asr_decoder.cc:113] forward takes 12152 ms, search takes 10 ms
asr_decoder.cc:104] Required 59 get 59
asr_decoder.cc:113] forward takes 17248 ms, search takes 6 ms ## warmup fail
asr_decoder.cc:104] Required 52 get 52
asr_decoder.cc:113] forward takes 4708 ms, search takes 6 ms ## warmup fail
asr_decoder.cc:104] Required 48 get 13
asr_decoder.cc:113] forward takes 17446 ms, search takes 1 ms ## warmup fail
asr_decoder.cc:104] Required 47 get 47
asr_decoder.cc:113] forward takes 8462 ms, search takes 5 ms ## warmup fail
....
asr_decoder.cc:104] Required 52 get 52
asr_decoder.cc:113] forward takes 56 ms, search takes 7 ms ## warmed
asr_decoder.cc:104] Required 60 get 60
asr_decoder.cc:113] forward takes 137 ms, search takes 16 ms ## warmed
I confirm that the time is spent in auto outputs = model_->get_method("forward_encoder_chunk")(inputs).toTuple()->elements();
As for rescoring, I haven't confirmed whether it has the same problem.
I tested python recognize.py and watched the log as the chunk size changed, but I didn't see this problem.
The log looks like this:
#################take time:0.4461658000946045 chunk size:67
#################take time:0.03869509696960449 chunk size:67
...
#################take time:0.03129744529724121 chunk size:18
2022-09-13 07:36:45,817 INFO test_data_20200811_1_825e7573_29a2_4b49_9f8b_1c863332ab7b_8_11.wav 你肯定只有到店里去了才能了才能了解
#################take time:0.04446840286254883 chunk size:67
#################take time:0.04583621025085449 chunk size:67
...
#################take time:0.03129100799560547 chunk size:7
@robin1001 Could you please help me? I don't have any ideas.
Just use multiple utterances with different lengths to warm up; it should work.
Maybe we can bump libtorch to v1.11 or higher.
cc @robin1001, do we need recipe checking if we want to upgrade torch from 1.0 to 1.11?
What kind of checking? It's better to upgrade to 1.12 since it is the latest stable version.
Training recipes, WER metrics, etc.
No, it would take too much effort to check; just go.