Best configuration of online websocket server for optimal RTF.
Hi,
I am trying to benchmark the zipformer (old) model with the online websocket server in sherpa-onnx. The websocket server has some new config parameters compared to sherpa, so I would like to know what they mean and what values to set to achieve better concurrency.
In particular, I would like to know about the following parameters: --max-batch-size, --loop-interval-ms, --num-threads, --num-work-threads, and --num-io-threads.
Thanks
So I would like to know what they mean and what values to set to achieve better concurrency.
Sorry, you have to tune them on your computer and we don't have a recommended setting.
--max-batch-size specifies the maximum batch size used for inference. If there are not enough clients at the current time, the server waits for --loop-interval-ms; after that, it runs inference on the current connections even if there are fewer of them than --max-batch-size.
--num-threads specifies the number of threads onnxruntime uses to run the neural network.
--num-work-threads specifies the thread pool size to run the recognizer. Each thread in the pool can run the recognizer independently.
--num-io-threads specifies the thread pool size, where each thread can accept connections from the clients.
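For illustration only, here is a start-up sketch that uses just the flags above; the model paths are placeholders and the values are only starting points for you to tune, not recommendations:

sherpa-onnx-online-websocket-server \
  --port=6006 \
  --tokens=./tokens.txt \
  --encoder=./encoder.onnx \
  --decoder=./decoder.onnx \
  --joiner=./joiner.onnx \
  --max-batch-size=16 \
  --loop-interval-ms=10 \
  --num-threads=2 \
  --num-work-threads=4 \
  --num-io-threads=2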
Hi @csukuangfj ,
Is there any benchmarking already done for sherpa-onnx? We are finding that the latency numbers are higher than expected compared to our sherpa setup.
Also, what's the exact difference between --num-threads and --num-work-threads?
--num-work-threads -> does this refer to feature extraction as well?
Also, what's the exact difference between --num-threads and --num-work-threads?
Please see the above comment. If you think it is not clear, please point out which part is not clear.
--num-work-threads -> does this refer to feature extraction as well?
Yes, you are right; please see the code below https://github.com/k2-fsa/sherpa-onnx/blob/a0a747a0c0df93cad346144d0f8f9c43bcacca83/sherpa-onnx/csrc/online-websocket-server-impl.cc#L304
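Conceptually, the work pool behaves like the standalone sketch below. It uses asio's thread pool directly and is only an illustration, not the server's actual code; the task body is a placeholder for the per-connection feature extraction and decoding.

#include <asio.hpp>  // standalone Asio; an equivalent asio is bundled with sherpa-onnx

int main() {
  // --num-work-threads controls the size of a pool like this one.
  asio::thread_pool work_pool(4);

  for (int i = 0; i != 8; ++i) {
    asio::post(work_pool, []() {
      // Placeholder: in the server, each posted task takes audio from a
      // connection, runs feature extraction, and feeds the recognizer.
    });
  }

  work_pool.join();  // wait for all posted work to finish
  return 0;
}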
Could you show the exact commands you are using for benchmarking?
Is there any benchmarking already done for sherpa-onnx?
Sorry, we have not done that.
--num-threads specifies the number of threads onnxruntime uses to run the neural network. -> Does this mean just the onnxruntime-level neural network threads and nothing for feature extraction? If you could point me to the code where this is taken care of at the ort level, that would be great.
Here is the command we are using for the websocket server:
/workspace/sherpa-onnx/build/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --tokens=/opt/k2_sherpa/ai_model/tokens.txt \
  --encoder=/opt/k2_sherpa/ai_model/encoder.onnx \
  --decoder=/opt/k2_sherpa/ai_model/decoder.onnx \
  --joiner=/opt/k2_sherpa/ai_model/joiner.onnx \
  --max-batch-size=100 \
  --loop-interval-ms=10 \
  --num_threads=100 \
  --num-work-threads=100 \
  --num-io-threads=16 \
  --sample-rate=8000 \
  --provider=cuda
Additionally, I was looking at I/O binding in onnxruntime; not using it generally contributes to higher latency. Since feature extraction happens on the CPU, is this taken care of?
Thanks
Does this mean just the onnxruntime-level neural network threads and nothing for feature extraction?
Yes, you are right.
If you could point me to the code where this is taken care of at the ort level, that would be great.
Please see https://github.com/k2-fsa/sherpa-onnx/blob/a0a747a0c0df93cad346144d0f8f9c43bcacca83/sherpa-onnx/csrc/session.cc#L24-L26
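For reference, here is a minimal standalone sketch of what that code does with the thread count, written against the public onnxruntime C++ API; it is only an approximation, not the exact sherpa-onnx code.

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");

  Ort::SessionOptions opts;
  // --num-threads ends up here: it sets the intra-op thread count that
  // onnxruntime uses when running each model graph.
  opts.SetIntraOpNumThreads(2);
  opts.SetInterOpNumThreads(1);

  // The encoder/decoder/joiner sessions are then created with these options,
  // e.g. Ort::Session session(env, "encoder.onnx", opts);
  return 0;
}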
By the way, in the future, you can use
git grep num_threads
or
git grep num-threads
to search the code.
Here is the command we are using for the websocket server:
Thanks! What is the command for k2-fsa/sherpa?
Have you tested the CPU performance?
onnxruntime has better performance on CPU than PyTorch.
I was looking at I/O binding in onnxruntime; not using it generally contributes to higher latency.
Sorry, we have no experience with I/O binding.
And how did you start the client and how many clients are you using for testing?
We have an async streaming client with a varying number of parallel streams. We are trying 50 to 300 parallel streams.
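The load is roughly equivalent to launching many copies of the bundled client, something like the sketch below; the flag names (--server-ip, --server-port, --seconds-per-message) are assumptions on my side and may differ across versions, so please check the client's --help.

# launch 100 parallel streaming clients against the server started above
for i in $(seq 1 100); do
  /workspace/sherpa-onnx/build/bin/sherpa-onnx-online-websocket-client \
    --server-ip=127.0.0.1 \
    --server-port=6006 \
    --seconds-per-message=0.1 \
    /path/to/test-8khz.wav &
done
wait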