[core][experimental] Higher than expected overhead for shared memory channels with NCCL
What happened + What you expected to happen
Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, showing low 10s of us / channel op.
However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us / DAG call, more than 4x what's expected.
Versions / Dependencies
3.0dev
Reproduction script
See linked microbenchmarks.
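In rough outline, the single-actor shared-memory microbenchmark measures a loop like the sketch below (illustrative only, not the linked script: the actor and iteration count are made up, and the compiled-DAG read API is experimental and has changed across versions):

```python
import time
import ray
from ray.dag import InputNode

@ray.remote
class EchoActor:
    def echo(self, x):
        return x

ray.init()
actor = EchoActor.remote()

# Compile a one-node accelerated DAG backed by shared-memory channels.
with InputNode() as inp:
    dag = actor.echo.bind(inp)
compiled_dag = dag.experimental_compile()

# Time round trips through the compiled DAG.
n = 10_000
start = time.perf_counter()
for _ in range(n):
    ray.get(compiled_dag.execute(b"x"))
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} calls/s, {elapsed / n * 1e6:.1f} us/call")
```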
Issue Severity
None
Discussed during standup today > the goal is to improve NCCL performance so the overhead is within 50%, rather than 4x. This may also help vLLM performance (the impact there is uncertain), but it will draw that gap closer as well.
Environment setup - about to dive into debugging the overhead.
There are actually two things - one is the perf regression in our existing perf dashboard. The other one is the overhead around our serializer (cloudpickle) > the first one we should fix for Developer Preview.
#45953 has been merged, which was the P0 regression. The remaining overheads have been tracked down to python serialization of the tensor metadata, which we should be able to optimize later.
Ray overhead has regressed to ~500us / NCCL op.
Just to recap - where @jackhumphries left off: we know some of the extraneous overhead is in our serializer, so we need to build a custom serializer to optimize those paths (pretty low-hanging fruit).
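As a rough illustration of that direction, Ray already supports registering a custom serializer for a specific type, which could replace cloudpickle on the hot tensor-metadata path. `TensorMeta` below is a hypothetical stand-in for the channel's internal metadata type, not the actual class:

```python
import struct
import ray

class TensorMeta:
    """Hypothetical tensor metadata: a dtype code plus a 2-D shape."""
    def __init__(self, dtype_code: int, rows: int, cols: int):
        self.dtype_code = dtype_code
        self.rows = rows
        self.cols = cols

# Fixed-size struct packing instead of cloudpickle for the hot path.
def _serialize(meta: TensorMeta) -> bytes:
    return struct.pack("<iqq", meta.dtype_code, meta.rows, meta.cols)

def _deserialize(data: bytes) -> TensorMeta:
    return TensorMeta(*struct.unpack("<iqq", data))

ray.init()
ray.util.register_serializer(
    TensorMeta, serializer=_serialize, deserializer=_deserialize
)
```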
Dropping the bug so it doesn't pop up in our dashboards cc @stephanie-wang @rkooo567
Did some benchmarking comparing Tensor Transfer Latency between ADAG and Pure NCCL across 4 GPUs.
InputNode -> W1 -> W2 -> W3 -> W4
~~ADAG seems to be 3-5x slower than pure NCCL send/recv.~~
Update [08/02/2024]
According to the experiment results, ADAG seems to have comparable performance with pure NCCL send/recv (~5% extra overhead).
- The experiment results: https://docs.google.com/spreadsheets/d/1G8xmJmuMJBPqBDvkJzjh5boAscJZ-lpt2nVdDyVx7H8/edit?usp=sharing
- The code for benchmarking: https://gist.github.com/woshiyyya/31181c98f818f136ac275e188d48b528
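For context, the ADAG side of the chain above is wired up roughly as in the sketch below (the worker class and shapes are illustrative; `TorchTensorType` / `with_type_hint` is the experimental type-hint API at the time and may have changed since):

```python
import torch
import ray
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

@ray.remote(num_gpus=1)
class Worker:
    def fwd(self, t: torch.Tensor) -> torch.Tensor:
        # Move to this worker's GPU (a no-op if it arrived via NCCL).
        return t.cuda()

workers = [Worker.remote() for _ in range(4)]

# InputNode -> W1 -> W2 -> W3 -> W4, with GPU-to-GPU edges over NCCL.
with InputNode() as inp:
    dag = workers[0].fwd.bind(inp)
    for w in workers[1:]:
        dag = w.fwd.bind(dag.with_type_hint(TorchTensorType(transport="nccl")))
compiled_dag = dag.experimental_compile()
```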
Interesting... @woshiyyya what changed?
@anyscalesam the original benchmark script had some issues
btw we should prioritize fixing this soon. maybe let's take it next week.
Need for vLLM
Interesting... @woshiyyya what changed?
@anyscalesam I changed the return value of the last DAG node to None, this reduced an extra communication from that actor to driver.
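That is, the last worker holds on to the received tensor and returns None, so only a trivially small object goes back to the driver over shared memory instead of the tensor itself. A minimal sketch (the method and attribute names are made up):

```python
import ray

@ray.remote(num_gpus=1)
class LastWorker:
    def recv(self, t):
        self.result = t  # keep the tensor on this worker's GPU
        return None      # nothing heavy is shipped back to the driver
```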
@stephanie-wang I will take a look at this issue next week. Would you mind pointing me to the tests you are referring to? Thanks!
I ran accelerated_dag_gpu_microbenchmark.py on my GPU machine, and got the following results:
exec_nccl_gpu per second 5798.01 +- 16.49
exec_ray_dag_gpu_nccl_static_shape_direct_return per second 3041.9 +- 3.4
- `benchmark_nccl.py` uses the link provided in https://github.com/ray-project/ray/issues/45319#issuecomment-2263685630.
- `benchmark_adag.py`
python3 benchmark_nccl.py
# torch.Size([4, 24576]) Avg Time elapsed: 2.4376929737627506 ms
python3 benchmark_adag.py
# Avg execution time: 1.6622302401810884 ms
The main performance difference between pure NCCL and Ray CG comes from the two scripts measuring different things.
- For pure NCCL, we only measure the time for NCCL I/O.
- For Ray CG, we benchmark the entire execute process, including overheads such as sending input to a DAG node via shared memory and returning the output to the driver via shared memory (see the timing sketch after this list).
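A sketch of the two timing scopes, assuming the pure-NCCL script times torch.distributed point-to-point calls with CUDA events (this is illustrative, not the exact benchmark code):

```python
import time
import ray
import torch
import torch.distributed as dist

# (a) Pure NCCL: time only the NCCL I/O on the sender, using CUDA events.
def time_nccl_send_ms(t: torch.Tensor, dst: int) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.send(t, dst=dst)  # NCCL point-to-point send
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

# (b) Ray CG: time the whole execute() round trip from the driver, which
# also includes the shared-memory input and output channels.
def time_raycg_execute_ms(compiled_dag, inp) -> float:
    start = time.perf_counter()
    ray.get(compiled_dag.execute(inp))
    return (time.perf_counter() - start) * 1e3
```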
I performed experiments based on the following two scripts:
The experimental results are shown in the screenshots below.
- Ray CG = NCCL I/O (a) + Overhead (b)
- Pure NCCL = NCCL I/O (a)
If the above equations are correct, increasing the data transfer size will make the average execution times of Ray CG and Pure NCCL more similar.
- When shape is small (4, 24576), the execution time of Ray CG (0.2506 ms) is 95% higher than Pure NCCL (0.128 ms).
If the above equations are correct, the execution time of Ray CG should be approximately equal to Pure NCCL + Overhead.
We use a very small NCCL data transfer size, shape = (1, 1), to simulate the Overhead.
RayCG (0.2506 ms) ~= Pure NCCL (0.128 ms) + Overhead (0.14 ms)
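A sketch of that overhead estimate, assuming a compiled chain like the one sketched earlier (the shape and iteration count are illustrative):

```python
import time
import ray
import torch

def estimate_overhead_ms(compiled_dag, n: int = 100) -> float:
    """Estimate the non-NCCL overhead term (b) with a near-empty tensor."""
    tiny = torch.zeros(1, 1)  # shape (1, 1): NCCL I/O time is negligible
    start = time.perf_counter()
    for _ in range(n):
        ray.get(compiled_dag.execute(tiny))
    return (time.perf_counter() - start) / n * 1e3
```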
I documented the process of verifying the calculation in https://github.com/ray-project/ray/pull/48860#issuecomment-2512348571. Closing this issue. @stephanie-wang, feel free to reopen the issue if I missed anything.