
RTFX and Latency numbers for streaming pruned transducer stateless X

Open raikarsagar opened this issue 2 years ago • 12 comments

Hi, do we have standard RTFX and latency numbers for the streaming and non-streaming pruned transducer stateless X models? I am configuring Triton perf benchmarking. Please let me know if there are any specific steps to follow for benchmarking apart from the perf analyzer.

raikarsagar avatar Feb 15 '23 12:02 raikarsagar

@yuekaizhang Could you please have a look?

csukuangfj avatar Mar 01 '23 02:03 csukuangfj

Hi, you can use sherpa/triton/client/decode_manifest.py to decode a whole dataset. Here is a reference for benchmarking on a Chinese dataset: https://k2-fsa.github.io/sherpa/triton/client/index.html#decode-manifests.

Once the server is launched, you can use the pre-built Docker image soar97/triton-k2:22.12.1 for the client. You also need to prepare the dataset yourself; here is a reference: https://colab.research.google.com/drive/1JX5Ph2onYm1ZjNP_94eGqZ-DIRMLlIca?usp=sharing.

yuekaizhang avatar Mar 01 '23 03:03 yuekaizhang

@yuekaizhang @csukuangfj Hi, I was able to set up the Triton server with the streaming zipformer model successfully. However, there seems to be a disconnect between the RTF numbers we achieve using client.py with a custom cutset and the throughput numbers we see from perf-analyzer. Here are the initial throughput, RTF, and latency numbers we were able to achieve:

  • Triton perf-analyzer: (results screenshot attached)

  • Using client.py: RTF 0.0082 -> RTFX = 121.95
    total_duration: 70140.002 seconds (19.48 hours)
    processing time: 574.573 seconds (0.16 hours)
    latency_variance: 55.60
    latency_50_percentile: 1058.63
    latency_90_percentile: 1144.97
    latency_99_percentile: 1280.65
    average_latency_ms: 1029.45
    NOTE: num_tasks was set to 200, which is comparable to concurrency 200 in the perf-analyzer test above.
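A minimal sketch of how figures like these can be derived, assuming the client has recorded the total audio duration, the wall-clock processing time, and per-utterance latencies. This is illustrative only, not the actual client.py code; the latency array is placeholder data.

```python
import numpy as np

# Aggregate numbers from the client.py run reported above.
total_audio_seconds = 70140.002   # total duration of the decoded audio (19.48 hours)
processing_seconds = 574.573      # wall-clock time spent decoding it (0.16 hours)

rtf = processing_seconds / total_audio_seconds   # real-time factor, ~0.0082
rtfx = 1.0 / rtf                                 # ~122; rounding RTF to 0.0082 first gives 121.95

# Placeholder per-utterance latencies; a real client would collect one value per request.
latencies_ms = np.random.default_rng(0).normal(loc=1029.45, scale=80.0, size=2000)

print(f"RTF  = {rtf:.4f}")
print(f"RTFX = {rtfx:.2f}")
print(f"p50/p90/p99 latency (ms): "
      f"{np.percentile(latencies_ms, 50):.2f} / "
      f"{np.percentile(latencies_ms, 90):.2f} / "
      f"{np.percentile(latencies_ms, 99):.2f}")
print(f"average latency (ms): {latencies_ms.mean():.2f}")
```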

I have some questions:

  1. What is the difference between throughput in perf_analyzer and RTF in client.py? Can we compare the perf-analyzer throughput with RTFX?
  2. If throughput and RTFX are comparable, why are we seeing such a difference between the two setups? Am I missing some setting here, or does the client have to be modified in some way?

Thanks in advance

uni-sagar-raikar avatar Jul 13 '23 06:07 uni-sagar-raikar

  1. Throughput is not RTFX. The throughput computation is a bit more involved.
  2. Difference between perf_analyzer and https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py: perf_analyzer --streaming uses a single wav file, whereas client.py can use a whole dataset. Also, with the --simulate-streaming option in client.py, audio chunks are sent paced by the chunk duration (see the sketch below).
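To illustrate the idea behind --simulate-streaming, here is a minimal sketch of pacing chunk submission by the chunk's real-time duration. It is not the actual client.py code; send_chunk is a hypothetical callback standing in for the real Triton request.

```python
import time
import numpy as np

def send_simulated_stream(samples: np.ndarray, sample_rate: int,
                          chunk_seconds: float, send_chunk) -> None:
    """Send audio chunk by chunk, waiting out each chunk's real-time duration."""
    chunk_size = int(chunk_seconds * sample_rate)
    for start in range(0, len(samples), chunk_size):
        t0 = time.time()
        send_chunk(samples[start:start + chunk_size])  # hypothetical: issues the actual gRPC/HTTP request
        elapsed = time.time() - t0
        # Sleep for whatever is left of the chunk's real-time duration before the next send.
        time.sleep(max(0.0, chunk_seconds - elapsed))

# Example usage with 3 seconds of dummy 16 kHz audio and a no-op sender.
send_simulated_stream(np.zeros(16000 * 3, dtype=np.float32), 16000, 0.32, lambda chunk: None)
```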

For Triton ASR benchmarking, I strongly recommend using https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py.

perf_analyzer is useful if you are interested in detailed per-module costs (e.g., encoder, decoder, queue time, infer time).

Would you mind sharing the stats_summary.txt generated by client.py here as well?

yuekaizhang avatar Jul 13 '23 08:07 yuekaizhang

Understood, I am sharing the stats_summary here. This summary is for a run with num_workers set to 100. Looking at it, it seems the initial inferences take more time. Would you recommend an explicit warmup? stats_summary.txt

uni-sagar-raikar avatar Jul 13 '23 09:07 uni-sagar-raikar


The Triton model config has a warmup option. For benchmarking, you may discard the initial results and use only the stable execution runs.
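As an illustration of "discard the initial results", here is a minimal sketch that drops the first few warmup runs before computing statistics, assuming you have a flat list of per-run latencies in submission order (the numbers below are made up).

```python
import numpy as np

def stable_run_stats(latencies_ms, num_warmup_runs=3):
    """Drop the first few (warmup) measurements and summarize the stable runs."""
    stable = np.asarray(latencies_ms[num_warmup_runs:], dtype=float)
    return {
        "count": int(stable.size),
        "avg_ms": float(stable.mean()),
        "p50_ms": float(np.percentile(stable, 50)),
        "p99_ms": float(np.percentile(stable, 99)),
    }

# Example: the first runs are slow (session warmup); the rest are stable.
latencies = [2500.0, 1800.0, 1200.0] + [1030.0 + (i % 40) for i in range(200)]
print(stable_run_stats(latencies, num_warmup_runs=3))
```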

yuekaizhang avatar Jul 13 '23 10:07 yuekaizhang

Does warmup have to be configured in the model repo directory, or in a specific config.pbtxt? It would be great if you could point me to it. Also, in the stats file I see batch sizes varying from 1 to N; is this profiling broken down by batch size, or is it per inference run?

uni-sagar-raikar avatar Jul 13 '23 10:07 uni-sagar-raikar

For warmup, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-warmup

For stats.json, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md

For stats_summary.txt, I just convert it from stats.json.

e.g. "batch_size 19, 18 times, infer 7875.14 ms, avg 437.51 ms, 23.03 ms input 47.86 ms, avg 2.66 ms, output 35.48 ms, avg 1.97 ms "

Since the service was started, a total of 18 executions were run with batch_size 19, and those 18 executions took 7875.14 ms in total. The first avg is 7875.14/18 (per execution) and the second is 7875.14/18/19 (per sequence in the batch); the input and output numbers are the host-to-device and device-to-host transfer times.
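A minimal sketch (not part of the repo's tooling) that reproduces the two averages described above from the aggregate numbers in that stats_summary.txt line:

```python
# Aggregate numbers from the example line above.
total_infer_ms = 7875.14   # total inference time for this batch size
num_executions = 18        # executions run with this batch size
batch_size = 19            # sequences per execution

avg_per_execution_ms = total_infer_ms / num_executions               # ~437.51 ms
avg_per_sequence_ms = total_infer_ms / num_executions / batch_size   # ~23.03 ms

print(f"avg per execution: {avg_per_execution_ms:.2f} ms")
print(f"avg per sequence:  {avg_per_sequence_ms:.2f} ms")
```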

yuekaizhang avatar Jul 13 '23 10:07 yuekaizhang

@yuekaizhang Do you have any standard benchmark results for the conformer-transducer/zipformer-transducer models? I am not seeing any improvement with warmup or other configuration changes. In fact, the non-Triton sherpa setup is performing far better than Triton.

uni-sagar-raikar avatar Jul 13 '23 13:07 uni-sagar-raikar

We have not started benchmarking and profiling yet. How did you configure your warmup setting? Also, we will later support the TensorRT backend, which should need less warmup time than ONNX.

yuekaizhang avatar Jul 14 '23 01:07 yuekaizhang

@yuekaizhang Since the zipformer streaming model is sequential, I just warmed it up with some dry runs. Also, I wanted to ask about the logs in the GitHub client repo, where the RTF numbers look good. May I know which model that is? We would like to reproduce those results on a GPU instance.

uni-sagar-raikar avatar Jul 21 '23 09:07 uni-sagar-raikar


It's from this model repo: https://huggingface.co/yuekai/model_repo_streaming_conformer_wenetspeech_icefall/tree/main. To reproduce, you can run "git-lfs install" and then "git clone https://huggingface.co/yuekai/model_repo_streaming_conformer_wenetspeech_icefall". However, it was evaluated on the aishell-1 test set, which is Chinese.

yuekaizhang avatar Jul 24 '23 07:07 yuekaizhang