
No Rayon mode

Open XiangpengHao opened this issue 2 months ago • 8 comments

Hi candle devs!

I'm using candle to run inference on a small model (Qwen-0.6B-4b) on CPU. Profiling the inference shows significant overhead from rayon, primarily because the multi-threaded work is partitioned too finely: the cost of thread communication outweighs the benefit of the parallelism. Even with RAYON_NUM_THREADS=1, the overhead of thread management, cache-affinity penalties, etc. is still high.

I made a small prototype that removes most of the rayon use here: https://github.com/XiangpengHao/candle/tree/ed0fcc9428d8a0b11838d8b488dc40a7ac89fcc1

I ran a small benchmark with qwen-0.6B-4b: on current main I got 53.62 tokens/s (using all threads); without rayon (i.e., single-threaded), I got 69.54 tokens/s.

I suggest we (1) offer a way to opt out of rayon, and (2) optimize multi-threaded CPU inference so that it fully uses the available CPU resources. A sketch of what (1) could look like is below.
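For example, one possible shape for the opt-out (the helper name and feature flag here are illustrative, not an existing candle API) is a cargo feature that compiles rayon out entirely, with a sequential fallback behind the same helper:

```rust
// Hypothetical sketch: a feature-gated "maybe parallel" helper. With the
// (illustrative) "rayon" feature enabled, callers get a rayon parallel
// iterator; without it, a plain sequential one.
#[cfg(feature = "rayon")]
pub fn maybe_par_chunks_mut(v: &mut [f32], chunk: usize) -> rayon::slice::ChunksMut<'_, f32> {
    use rayon::prelude::*;
    v.par_chunks_mut(chunk)
}

#[cfg(not(feature = "rayon"))]
pub fn maybe_par_chunks_mut(v: &mut [f32], chunk: usize) -> std::slice::ChunksMut<'_, f32> {
    v.chunks_mut(chunk)
}
```

Callers written as `maybe_par_chunks_mut(v, n).for_each(...)` compile against both variants, since rayon's `ParallelIterator` and std's `Iterator` both provide `for_each`.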

Related: #2499 #1103 #2877

If you are open to the changes, I'm happy to work on it!

XiangpengHao avatar Oct 15 '25 17:10 XiangpengHao

It wouldn't be too hard to create our own iterator impl that lets you choose whether to use rayon, or, even better, to refactor our methods so we can pass parallelism in as an argument. But first we need to cement that the issue is not rayon itself; the issue is how it is being used. And we can fix it! We can use the gemm crate as a north star for rayon usage, since it uses rayon with great success. The gemm function takes parallelism as input so that it can decide, based on the problem size, what level of parallelism to use (example).
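To make that concrete, here is a rough sketch of what passing parallelism as an argument could look like. The function name and the `MIN_PAR_LEN` threshold are illustrative, and handling of the requested thread count is omitted for brevity; gemm's actual `Parallelism` enum is the reference point, not this code:

```rust
use rayon::prelude::*;

// Illustrative only: an explicit parallelism argument in the spirit of gemm.
pub enum Parallelism {
    None,
    Rayon(usize), // requested thread count; 0 = rayon's default
}

pub fn vec_add(dst: &mut [f32], a: &[f32], b: &[f32], par: Parallelism) {
    debug_assert!(dst.len() == a.len() && dst.len() == b.len());
    // Below this size, fork/join overhead costs more than threading saves.
    const MIN_PAR_LEN: usize = 1 << 16;
    if matches!(par, Parallelism::None) || dst.len() < MIN_PAR_LEN {
        // Sequential path: requested explicitly, or the problem is too small.
        for ((d, &x), &y) in dst.iter_mut().zip(a).zip(b) {
            *d = x + y;
        }
    } else {
        // Parallel path: coarse chunks keep per-task scheduling cost low.
        dst.par_chunks_mut(MIN_PAR_LEN)
            .zip(a.par_chunks(MIN_PAR_LEN))
            .zip(b.par_chunks(MIN_PAR_LEN))
            .for_each(|((d, x), y)| {
                for ((d, &x), &y) in d.iter_mut().zip(x).zip(y) {
                    *d = x + y;
                }
            });
    }
}
```

The key point is that the caller, which knows the problem size and the deployment constraints, makes the parallelism decision instead of every inner loop deciding for itself.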

As you know, parallelism is only useful for sufficiently large data. When we use it on inner loops, or where there are overlapping reads from the same source, it will definitely slow things down.

There are a lot of easy wins to be made in the CPU implementation. Let me know if you're interested in contributing. Imo yanking out rayon is, unfortunately, not going to fix it ☺️

ivarflakstad avatar Oct 16 '25 21:10 ivarflakstad

That makes sense -- I'm coming from a slightly different context where inference runs in a resource-constrained environment (e.g., only a single core is allowed) and a higher-level orchestrator wants precise control over concurrency. In that case, rayon is pure overhead, and currently there's no way to opt out.

I think improving how we use Rayon and allowing an opt-out are two parallel efforts that can complement each other nicely. 😊

XiangpengHao avatar Oct 16 '25 22:10 XiangpengHao

I don't expect any difference in performance, but I'm just curious - how does RAYON_NUM_THREADS=0 perform?

ivarflakstad avatar Oct 17 '25 13:10 ivarflakstad

I think we already have our solution here ☺️

ivarflakstad avatar Oct 21 '25 14:10 ivarflakstad

Hi there, rayon has a with_max_len (and a with_min_len counterpart) that bound how many items each rayon job processes; tuning these could help reduce the communication overhead. I can take a look at the multithreaded perf of candle for this specific model and see if it helps 👍🏼

AmineDiro avatar Nov 07 '25 14:11 AmineDiro

Great! 😊 We do use with_min_len and with_max_len in a couple of places (cpu quantized matmul, for example), but par_iter is used all over, so there are definitely more cases to cover.
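For anyone picking this up, a minimal illustration of those two knobs (the 1024/8192 values are arbitrary placeholders, not tuned numbers):

```rust
use rayon::prelude::*;

// `with_min_len` stops rayon from splitting work below a floor (fewer,
// larger jobs -> less scheduling/communication overhead), while
// `with_max_len` caps job size so large inputs still fan out across threads.
fn scale_in_place(data: &mut [f32], k: f32) {
    data.par_iter_mut()
        .with_min_len(1024) // each job handles at least 1024 elements
        .with_max_len(8192) // ...and at most 8192, so big slices still split
        .for_each(|x| *x *= k);
}
```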

ivarflakstad avatar Nov 07 '25 14:11 ivarflakstad

> I don't expect any performance difference, but I'm curious - how does RAYON_NUM_THREADS=0 perform?

I initially opened (and quickly closed 😅) a PR thinking I had fixed an issue with RAYON_NUM_THREADS=0 not being respected. However, after checking the matmul call, it turns out the correct environment variable value to disable rayon parallelism in the gemm call is actually RAYON_NUM_THREADS=1.

This happens because the get_num_threads function defaults to using the number of cores when RAYON_NUM_THREADS is unset or set to 0.
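For reference, the relevant logic looks roughly like this (a paraphrase, not a verbatim copy of candle-core; check the source for the exact code):

```rust
// Sketch of get_num_threads: it reads the same environment variable as
// rayon, and treats 0 (or an unparsable value) the same as "unset",
// i.e. it falls back to one thread per core.
pub fn get_num_threads() -> usize {
    match std::env::var("RAYON_NUM_THREADS")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
    {
        Some(n) if n > 0 => n,
        // 0, unset, or unparsable: default to the number of cores.
        _ => num_cpus::get(),
    }
}
```

So only RAYON_NUM_THREADS=1 actually pins the gemm call to a single thread.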

I also profiled the Qwen-0.5B model on CPU (6 cores, 12 threads - Ryzen 5 3600X) to see if there's any rayon communication overhead elsewhere. The main parallelism overhead appears to come from the gemm call.

[Image: profiler trace]

On CPU, I get around 13 tokens/s normally and 12.5 tokens/s with RAYON_NUM_THREADS=1.

@XiangpengHao - could you try running the model again with RAYON_NUM_THREADS=1 and see if you observe similar results?

> I think we already have our solution here ☺️

@ivarflakstad - I can look into implementing a more general solution for disabling rayon across the codebase. That said, it depends on how much overhead rayon actually introduces in other code paths, beyond the communication overhead in the gemm calls.

AmineDiro avatar Nov 08 '25 12:11 AmineDiro

> That said, it depends on how much overhead rayon actually introduces

Exactly. Ideally it would be zero cost, but that doesn't seem to be the case. Some isolated testing / benchmarking would be great here.

ivarflakstad avatar Nov 08 '25 16:11 ivarflakstad