zipformer-ctc onnx multi-thread inference is slow...
Hello. Thank you for the great project.
I trained a zipformer-CTC streaming model with the icefall toolkit and exported it to ONNX using the icefall export code. I implemented multi-threaded streaming inference for the zipformer-CTC model. I have plenty of CPU cores, but when I increase num-threads, inference actually gets slower.
Do you have any ideas about this issue?
Thank you.
Hi @dohe0342! I'm Dosu and I'm helping the sherpa-onnx team.
Multi-threaded inference with zipformer-CTC ONNX models doesn't always scale well, even on machines with many CPU cores. The documentation shows that the num_threads parameter is supported, but most examples use single-threaded or low-thread-count settings, and there are no official tips for optimizing multi-threaded performance or troubleshooting slowdowns as the thread count grows [source].
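For reference, num_threads is normally set when constructing the recognizer. Below is a minimal sketch using the sherpa-onnx Python API; the from_zipformer2_ctc constructor arguments and the file names are assumptions based on a typical zipformer-CTC export, so adjust them to your model:

```python
import sherpa_onnx

# Assumed file names; replace with your exported zipformer-CTC model files.
recognizer = sherpa_onnx.OnlineRecognizer.from_zipformer2_ctc(
    tokens="tokens.txt",
    model="model.onnx",
    num_threads=1,  # start with 1, then increase and measure the real-time factor
    sample_rate=16000,
    feature_dim=80,
    decoding_method="greedy_search",
)
```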
Common causes of poor scaling include thread contention, memory-bandwidth limits, and bottlenecks in the model architecture itself. For zipformer models, adding threads can therefore sometimes make inference slower [source].
You can try profiling CPU and memory usage, experimenting with different threading settings (such as the intra-op vs. inter-op thread pools in ONNX Runtime; see the sketch below), and adjusting chunk sizes: larger chunks may improve throughput but add latency. Also compare single-threaded and multi-threaded runs to see where the bottleneck is.
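If you are calling ONNX Runtime directly on the exported model, the intra-op and inter-op pools mentioned above are configured through SessionOptions. A minimal sketch, where "model.onnx" is a placeholder for your exported zipformer-CTC encoder:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Threads used to parallelize work inside a single operator (e.g. a matmul).
sess_options.intra_op_num_threads = 4
# Threads used to run independent operators concurrently. Streaming encoder
# graphs are largely sequential, so values above 1 often add contention
# rather than speed.
sess_options.inter_op_num_threads = 1
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# "model.onnx" is a placeholder for the exported zipformer-CTC model file.
session = ort.InferenceSession("model.onnx", sess_options=sess_options)
```

A pattern worth testing is keeping each session single-threaded and instead processing multiple streams in parallel, one worker per stream, so scaling comes from concurrency across utterances rather than from parallelism inside a single forward pass.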
There are no documented solutions specific to zipformer-CTC multi-threaded inference, so sharing your findings could help others facing similar issues. If you want more details on ONNX Runtime threading or export options, let me know!
Sorry, the information you provided is too limited.