candle No Rayon mode

No Rayon mode

Open XiangpengHao opened this issue 1 month ago • 8 comments

Hi candle devs!

I'm using candle to run inference on a small model (Qwen-0.6B-4b) on the CPU. Profiling shows significant overhead in rayon, primarily because the multi-threaded partitions are too fine-grained, i.e., the cost of thread communication outweighs the benefit of running in parallel. Even with RAYON_NUM_THREADS=1, the overhead of thread management, cache affinity penalties, etc. is still high.
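To illustrate the granularity problem, here is a minimal sketch that only goes parallel when the input is large enough to plausibly pay for the thread hand-off. It uses `std::thread::scope` instead of rayon to stay self-contained. `scale_in_place` and `MIN_PARALLEL_LEN` are hypothetical names, the cutoff value is an untuned assumption, and none of this is candle's actual code.

```rust
use std::thread;

/// Hypothetical cutoff: below this many elements, splitting the work
/// across threads is assumed to cost more than it saves. Not tuned.
const MIN_PARALLEL_LEN: usize = 1 << 16;

/// Scales every element in place, going parallel only for large inputs.
fn scale_in_place(data: &mut [f32], factor: f32, num_threads: usize) {
    if num_threads <= 1 || data.len() < MIN_PARALLEL_LEN {
        // Small input: a plain loop avoids all thread hand-off cost.
        for x in data.iter_mut() {
            *x *= factor;
        }
        return;
    }
    // Large input: one contiguous chunk per thread (ceiling division).
    let chunk_len = (data.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_len) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= factor;
                }
            });
        }
    });
}
```

For a small tensor, the call never touches a thread, which is the behavior I would expect to help with tiny per-token ops in a 0.6B model.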

I made a small prototype that removes most of the rayon use here: https://github.com/XiangpengHao/candle/tree/ed0fcc9428d8a0b11838d8b488dc40a7ac89fcc1

I ran a small benchmark with Qwen-0.6B-4b: on current main I got 53.62 tokens/s (using all threads); without rayon (so single-threaded), I got 69.54 tokens/s.

I suggest we (1) offer a way to opt out of rayon, and (2) optimize multi-threaded CPU inference so that it can fully use the CPU resources.
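One possible shape for (1) is compile-time gating behind a Cargo feature, sketched below. The feature name `rayon` and the function `add_bias` are assumptions for illustration only. I'm not claiming candle exposes such a feature today, and the prototype linked above may do this differently.

```rust
// Hypothetical Cargo feature named "rayon" (an assumption, not an
// existing candle feature). With the feature off, no rayon code is
// compiled, so no thread pool is ever created.

#[cfg(feature = "rayon")]
fn add_bias(rows: &mut [Vec<f32>], bias: f32) {
    use rayon::prelude::*;
    // Parallel path: rows are distributed over rayon's thread pool.
    rows.par_iter_mut()
        .for_each(|row| row.iter_mut().for_each(|x| *x += bias));
}

#[cfg(not(feature = "rayon"))]
fn add_bias(rows: &mut [Vec<f32>], bias: f32) {
    // Sequential fallback: same result, computed on the calling thread.
    for row in rows.iter_mut() {
        for x in row.iter_mut() {
            *x += bias;
        }
    }
}
```

Callers use `add_bias` the same way in both configurations, so the opt-out stays a build-time decision rather than something every call site has to know about.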

Related: #2499 #1103 #2877

If you are open to the changes, I'm happy to work on it!

Oct 15 '25 17:10 XiangpengHao
