
Best configuration of online websocket server for optimal RTF.

Open · uni-sagar-raikar opened this issue 2 years ago · 7 comments

Hi,

I am trying to benchmark the (old) zipformer model with the online websocket server in sherpa-onnx. The websocket server has some new config parameters compared to sherpa, so I would like to know what they mean and what optimal values we can set to achieve better concurrency.

The parameters in question are: --max-batch-size, --loop-interval-ms, --num-threads, --num-work-threads, --num-io-threads.

Thanks

uni-sagar-raikar avatar Aug 25 '23 04:08 uni-sagar-raikar

I would like to know what they mean and what optimal values we can set to achieve better concurrency.

Sorry, you have to tune them on your computer and we don't have a recommended setting.

--max-batch-size specifies the maximum batch size for inference. If there are not enough active clients at the moment, the server waits for --loop-interval-ms; after that, it runs inference on the current connections even if there are fewer of them than --max-batch-size.
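In other words, the batching loop behaves roughly like the sketch below (a simplified illustration with made-up names, not the actual sherpa-onnx code; the real logic lives in sherpa-onnx/csrc/online-websocket-server-impl.cc):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct Stream {};  // stands in for one connected client's audio stream

// Placeholder: the real server collects streams that have enough
// buffered features, up to --max-batch-size of them.
std::vector<Stream*> CollectReadyStreams(int /*max_batch_size*/) { return {}; }

// Placeholder: the real server runs the recognizer on the whole batch.
void RunInference(const std::vector<Stream*>& batch) {
  std::printf("decoding a batch of %zu streams\n", batch.size());
}

void BatchingLoop(int max_batch_size, int loop_interval_ms, int rounds) {
  for (int i = 0; i < rounds; ++i) {
    // Wait --loop-interval-ms between inference rounds.
    std::this_thread::sleep_for(std::chrono::milliseconds(loop_interval_ms));

    auto batch = CollectReadyStreams(max_batch_size);
    if (batch.empty()) continue;

    // Inference runs even if batch.size() < max_batch_size;
    // the server does not wait for the batch to fill up.
    RunInference(batch);
  }
}

int main() {
  BatchingLoop(/*max_batch_size=*/100, /*loop_interval_ms=*/10, /*rounds=*/5);
  return 0;
}
```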

--num-threads specifies the number of threads onnxruntime uses to run the neural network.

--num-work-threads specifies the size of the thread pool that runs the recognizer. Each thread in the pool can run the recognizer independently.

--num-io-threads specifies the size of the thread pool whose threads accept and handle connections from clients.
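Roughly speaking, the server keeps two pools: the I/O threads serve the network, and the work threads run feature extraction and decoding, so a slow decode never blocks a socket. A minimal sketch of that layout (simplified, using asio; not the exact sherpa-onnx code):

```cpp
#include <asio.hpp>

#include <cstdio>
#include <thread>
#include <vector>

int main() {
  int num_io_threads = 2;    // --num-io-threads: accept/read/write sockets
  int num_work_threads = 4;  // --num-work-threads: features + decoding

  asio::io_context io_conn;  // serves the network
  asio::io_context io_work;  // serves the recognizer

  // In the real server, an I/O handler that has received enough audio
  // posts the decoding job to the work pool so that network threads
  // are never blocked by inference.
  asio::post(io_conn, [] { std::printf("socket handled on an I/O thread\n"); });
  asio::post(io_work, [] { std::printf("decoding done on a work thread\n"); });

  std::vector<std::thread> threads;
  for (int i = 0; i < num_io_threads; ++i)
    threads.emplace_back([&io_conn] { io_conn.run(); });
  for (int i = 0; i < num_work_threads; ++i)
    threads.emplace_back([&io_work] { io_work.run(); });

  for (auto& t : threads) t.join();  // run() returns once the queues drain
  return 0;
}
```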

csukuangfj avatar Aug 25 '23 06:08 csukuangfj

Hi @csukuangfj ,

Is there any benchmarking already done for sherpa-onnx? We are finding that latency is higher than expected compared to our sherpa setup.

Also, what's the exact difference between --num-threads and --num-work-threads? Does --num-work-threads also cover feature extraction?

uni-sagar-raikar avatar Sep 01 '23 03:09 uni-sagar-raikar

Also, what's the exact difference between --num-threads and --num-work-threads?

Please see the comment above. If something is not clear, please point out which part.


Does --num-work-threads also cover feature extraction?

Yes, you are right; please see the code below https://github.com/k2-fsa/sherpa-onnx/blob/a0a747a0c0df93cad346144d0f8f9c43bcacca83/sherpa-onnx/csrc/online-websocket-server-impl.cc#L304


Could you show the exact commands you are using for benchmarking?


Is there any benchmarking already done for sherpa-onnx?

Sorry, we have not done that.

csukuangfj avatar Sep 01 '23 04:09 csukuangfj

--num-threads specifies the number of threads onnxruntime uses to run the neural network. -> Does this mean just onnxruntime's neural-network threads and nothing for feature extraction? If you could point me to the code where this is handled at the ORT level, that would be great.

Here is the command we are using for the websocket server:

```
/workspace/sherpa-onnx/build/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --tokens=/opt/k2_sherpa/ai_model/tokens.txt \
  --encoder=/opt/k2_sherpa/ai_model/encoder.onnx \
  --decoder=/opt/k2_sherpa/ai_model/decoder.onnx \
  --joiner=/opt/k2_sherpa/ai_model/joiner.onnx \
  --max-batch-size=100 \
  --loop-interval-ms=10 \
  --num_threads=100 \
  --num-work-threads=100 \
  --num-io-threads=16 \
  --sample-rate=8000 \
  --provider=cuda
```

Additionally, I was looking at IO binding in onnxruntime; without it, extra copies between CPU and GPU generally contribute to higher latency. Since feature extraction happens on the CPU, is this taken care of?
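For reference, what I mean by IO binding is roughly the following (an illustration only, not sherpa-onnx code; the model path "model.onnx" and the tensor names "x"/"y" are made up):

```cpp
#include <onnxruntime_cxx_api.h>

#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "iobinding-demo");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_options{};
  opts.AppendExecutionProvider_CUDA(cuda_options);  // assumes the CUDA EP
  Ort::Session session(env, "model.onnx", opts);

  // Create an input tensor on the CPU (features are computed on the CPU).
  std::vector<float> input(1 * 80, 0.0f);
  std::vector<int64_t> shape{1, 80};
  Ort::MemoryInfo cpu_mem =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value x = Ort::Value::CreateTensor<float>(
      cpu_mem, input.data(), input.size(), shape.data(), shape.size());

  // Bind the input once; ask ORT to allocate the output on the CUDA device,
  // so repeated Run() calls avoid re-staging buffers on every call.
  Ort::IoBinding binding(session);
  binding.BindInput("x", x);
  Ort::MemoryInfo cuda_mem("Cuda", OrtDeviceAllocator, /*device_id=*/0,
                           OrtMemTypeDefault);
  binding.BindOutput("y", cuda_mem);

  session.Run(Ort::RunOptions{}, binding);
  std::vector<Ort::Value> outputs = binding.GetOutputValues();
  return 0;
}
```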

Thanks

uni-sagar-raikar avatar Sep 01 '23 04:09 uni-sagar-raikar

Does this mean just onnxruntime's neural-network threads and nothing for feature extraction?

Yes, you are right.


If you could point me to the code where this is handled at the ORT level, that would be great.

Please see https://github.com/k2-fsa/sherpa-onnx/blob/a0a747a0c0df93cad346144d0f8f9c43bcacca83/sherpa-onnx/csrc/session.cc#L24-L26
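Those lines pass --num-threads into onnxruntime's session options. In terms of the ORT C++ API, it is roughly:

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeSessionOptions(int num_threads) {
  Ort::SessionOptions opts;
  // Threads that parallelize a single operator (e.g., one matmul).
  opts.SetIntraOpNumThreads(num_threads);
  // Threads that run independent operators concurrently.
  opts.SetInterOpNumThreads(num_threads);
  return opts;
}

int main() {
  Ort::SessionOptions opts = MakeSessionOptions(4);
  (void)opts;  // pass to Ort::Session in real code
  return 0;
}
```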

By the way, in the future, you can use

git grep num_threads

or

git grep num-threads

to search the code.


Here is the command we are using for the websocket server:

Thanks! What is the command for k2-fsa/sherpa?

Have you tested the CPU performance?

onnxruntime has better performance on CPU than PyTorch.


I was looking at IO binding in onnxruntime; without it, extra copies between CPU and GPU generally contribute to higher latency

Sorry, we have no experience with IO binding.

csukuangfj avatar Sep 01 '23 04:09 csukuangfj

And how did you start the client and how many clients are you using for testing?

csukuangfj avatar Sep 01 '23 04:09 csukuangfj

We have an async streaming client with a varying number of parallel streams. We are trying from 50 to 300 parallel streams.

uni-sagar-raikar avatar Sep 01 '23 04:09 uni-sagar-raikar