ONNX inference optimization
Hello, development team!
At the moment, I’m experimenting with giga-rnnt-v2, focusing on parallel inference of the model.
What has been done so far:
- The model sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2 was downloaded from https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models
- ONNX inference was launched in Python using sherpa-onnx on both CPU and GPU. A one-minute audio file was transcribed; the test ran with 1 pool and 8 threads (see the sketch after this list).
- ONNX inference was also launched in Go using sherpa-onnx-go on CPU, tested with 1 to 20 threads, using goroutines to process 1 to 100 audio samples (each ~12 seconds long) in parallel.
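For reference, the Python run was set up roughly as below. This is a minimal sketch: the file names inside the extracted archive, the WAV file name, and the use of `model_type="nemo_transducer"` are my assumptions, not copied from the actual test script.

```python
import soundfile as sf
import sherpa_onnx

# Extracted from sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2;
# the exact file names below are assumed.
model_dir = "sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19"

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{model_dir}/encoder.onnx",
    decoder=f"{model_dir}/decoder.onnx",
    joiner=f"{model_dir}/joiner.onnx",
    tokens=f"{model_dir}/tokens.txt",
    num_threads=8,                  # value used in the CPU test
    provider="cpu",                 # switched to "cuda" for the GPU run
    model_type="nemo_transducer",
)

# Decode the one-minute test recording (assumed to be a mono WAV file).
samples, sample_rate = sf.read("test-1min.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```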
Here are some questions that came up:
- In Go, changing num_threads in the ONNX config does not affect CPU utilization: it stays at 100% whether 1 or 20 threads are used. What could be the reason?
- In Python, inference of the one-minute recording takes 7 seconds on GPU and 10 seconds on CPU, with num_threads=8 in a single pool. I would expect GPU inference to be significantly faster; if that expectation is wrong, please clarify.
- What are some standard ways to increase the model's throughput at the expense of latency (e.g., batching, as sketched below)?
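On the last point, the one approach I am aware of is decoding several utterances in a single `decode_streams()` call, so that per-utterance latency grows (each utterance waits for the whole batch) but overall throughput improves. A sketch, reusing the `recognizer` from the setup above; the WAV file names are placeholders:

```python
import soundfile as sf

wav_files = ["a.wav", "b.wav", "c.wav"]  # placeholder file names

# Accumulate several utterances, then decode them in one batched call.
# Each utterance waits for the whole batch (higher latency), but the
# model runs over a batched input (higher throughput).
streams = []
for wav in wav_files:
    samples, sample_rate = sf.read(wav, dtype="float32")
    s = recognizer.create_stream()
    s.accept_waveform(sample_rate, samples)
    streams.append(s)

recognizer.decode_streams(streams)

for wav, s in zip(wav_files, streams):
    print(wav, "->", s.result.text)
```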
I have also asked the GigaAM dev team about this (not sure whether that is the right venue): https://github.com/salute-developers/GigaAM/issues/34#issue-3019358942
@csukuangfj can you please help with this issue?