[core][experimental] Higher than expected overhead for shared memory channels with NCCL
What happened + What you expected to happen
Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, showing low 10s of us / channel op.
However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us / DAG call, more than 4x what's expected.
Versions / Dependencies
3.0dev
Reproduction script
See linked microbenchmarks.
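In rough outline, the single-actor shared-memory microbenchmark measures a loop like the sketch below (illustrative only, not the linked script: the actor and iteration count are made up, and the compiled-DAG read API is experimental and has changed across versions):

```python
import time
import ray
from ray.dag import InputNode

@ray.remote
class EchoActor:
    def echo(self, x):
        return x

ray.init()
actor = EchoActor.remote()

# Compile a one-node accelerated DAG backed by shared-memory channels.
with InputNode() as inp:
    dag = actor.echo.bind(inp)
compiled_dag = dag.experimental_compile()

# Time round trips through the compiled DAG.
n = 10_000
start = time.perf_counter()
for _ in range(n):
    ray.get(compiled_dag.execute(b"x"))
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} calls/s, {elapsed / n * 1e6:.1f} us/call")
```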
Issue Severity
None
Discussed during standup today > the goal is to improve NCCL performance so the overhead is within 50%, rather than 4x. This may also help vLLM performance (the impact there is uncertain), but it will draw that gap closer as well.
Environment setup - about to dive into debugging the overhead.
There are actually two things - one is the perf regression in our existing perf dashboard. The other one is the overhead around our serializer (cloudpickle) > the first one we should fix for Developer Preview.
#45953 has been merged, which was the P0 regression. The remaining overheads have been tracked down to python serialization of the tensor metadata, which we should be able to optimize later.
Ray overhead has regressed to ~500us / NCCL op.
Just to recap - where @jackhumphries left off: we know some of the extraneous overhead is in our serializer, so we need to build a custom serializer to optimize those paths (pretty low-hanging fruit).
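As a rough illustration of that direction, Ray already supports registering a custom serializer for a specific type, which could replace cloudpickle on the hot tensor-metadata path. `TensorMeta` below is a hypothetical stand-in for the channel's internal metadata type, not the actual class:

```python
import struct
import ray

class TensorMeta:
    """Hypothetical tensor metadata: a dtype code plus a 2-D shape."""
    def __init__(self, dtype_code: int, rows: int, cols: int):
        self.dtype_code = dtype_code
        self.rows = rows
        self.cols = cols

# Fixed-size struct packing instead of cloudpickle for the hot path.
def _serialize(meta: TensorMeta) -> bytes:
    return struct.pack("<iqq", meta.dtype_code, meta.rows, meta.cols)

def _deserialize(data: bytes) -> TensorMeta:
    return TensorMeta(*struct.unpack("<iqq", data))

ray.init()
ray.util.register_serializer(
    TensorMeta, serializer=_serialize, deserializer=_deserialize
)
```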
Dropping the bug so it doesn't pop up in our dashboards cc @stephanie-wang @rkooo567
Did some benchmarking comparing Tensor Transfer Latency between ADAG and Pure NCCL across 4 GPUs.
InputNode -> W1 -> W2 -> W3 -> W4
~~ADAG seems to be 3-5x slower than pure NCCL send/recv.~~
Update [08/02/2024]
According to the experiment results, ADAG seems to have comparable performance with pure NCCL send/recv (~5% extra overhead).
- The experiment results: https://docs.google.com/spreadsheets/d/1G8xmJmuMJBPqBDvkJzjh5boAscJZ-lpt2nVdDyVx7H8/edit?usp=sharing
- The code for benchmarking: https://gist.github.com/woshiyyya/31181c98f818f136ac275e188d48b528
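For context, the ADAG side of the chain above is wired up roughly as in the sketch below (the worker class and shapes are illustrative; `TorchTensorType` / `with_type_hint` is the experimental type-hint API at the time and may have changed since):

```python
import torch
import ray
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

@ray.remote(num_gpus=1)
class Worker:
    def fwd(self, t: torch.Tensor) -> torch.Tensor:
        # Move to this worker's GPU (a no-op if it arrived via NCCL).
        return t.cuda()

workers = [Worker.remote() for _ in range(4)]

# InputNode -> W1 -> W2 -> W3 -> W4, with GPU-to-GPU edges over NCCL.
with InputNode() as inp:
    dag = workers[0].fwd.bind(inp)
    for w in workers[1:]:
        dag = w.fwd.bind(dag.with_type_hint(TorchTensorType(transport="nccl")))
compiled_dag = dag.experimental_compile()
```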
Interesting... @woshiyyya what changed?
@anyscalesam the original benchmark script had some issues
btw we should prioritize fixing this soon. maybe let's take it next week.
Need for vLLM
Interesting... @woshiyyya what changed?
@anyscalesam I changed the return value of the last DAG node to None, this reduced an extra communication from that actor to driver.
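That is, the last worker holds on to the received tensor and returns None, so only a trivially small object goes back to the driver over shared memory instead of the tensor itself. A minimal sketch (the method and attribute names are made up):

```python
import ray

@ray.remote(num_gpus=1)
class LastWorker:
    def recv(self, t):
        self.result = t  # keep the tensor on this worker's GPU
        return None      # nothing heavy is shipped back to the driver
```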
@stephanie-wang I will take a look at this issue next week. Would you mind pointing me to the tests you are referring to? Thanks!
I ran accelerated_dag_gpu_microbenchmark.py on my GPU machine, and got the following results:
exec_nccl_gpu per second 5798.01 +- 16.49
exec_ray_dag_gpu_nccl_static_shape_direct_return per second 3041.9 +- 3.4
- `benchmark_nccl.py` uses the link provided in https://github.com/ray-project/ray/issues/45319#issuecomment-2263685630.
- `benchmark_adag.py`
python3 benchmark_nccl.py
# torch.Size([4, 24576]) Avg Time elapsed: 2.4376929737627506 ms
python3 benchmark_adag.py
# Avg execution time: 1.6622302401810884 ms
The main performance difference between pure NCCL and Ray CG comes from the two scripts measuring different things.
- For pure NCCL, we only measure the time for NCCL I/O.
- For Ray CG, we benchmark the entire execute process, including overheads such as sending input to a DAG node via shared memory and returning the output to the driver via shared memory (see the timing sketch after this list).
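A sketch of the two timing scopes, assuming the pure-NCCL script times torch.distributed point-to-point calls with CUDA events (this is illustrative, not the exact benchmark code):

```python
import time
import ray
import torch
import torch.distributed as dist

# (a) Pure NCCL: time only the NCCL I/O on the sender, using CUDA events.
def time_nccl_send_ms(t: torch.Tensor, dst: int) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.send(t, dst=dst)  # NCCL point-to-point send
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

# (b) Ray CG: time the whole execute() round trip from the driver, which
# also includes the shared-memory input and output channels.
def time_raycg_execute_ms(compiled_dag, inp) -> float:
    start = time.perf_counter()
    ray.get(compiled_dag.execute(inp))
    return (time.perf_counter() - start) * 1e3
```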
I performed experiments based on the following two scripts:
The experimental results are shown in the screenshots below.
- Ray CG = NCCL I/O (a) + Overhead (b)
- Pure NCCL = NCCL I/O (a)
If the above equations are correct, increasing the data transfer size will make the average execution times of Ray CG and Pure NCCL more similar.
- When shape is small (4, 24576), the execution time of Ray CG (0.2506 ms) is 95% higher than Pure NCCL (0.128 ms).
If the above equations are correct, the execution time of Ray CG should be approximately equal to Pure NCCL + Overhead.
We use a very small NCCL data transfer size, shape = (1, 1), to simulate the Overhead.
RayCG (0.2506 ms) ~= Pure NCCL (0.128 ms) + Overhead (0.14 ms)
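A sketch of that overhead estimate, assuming a compiled chain like the one sketched earlier (the shape and iteration count are illustrative):

```python
import time
import ray
import torch

def estimate_overhead_ms(compiled_dag, n: int = 100) -> float:
    """Estimate the non-NCCL overhead term (b) with a near-empty tensor."""
    tiny = torch.zeros(1, 1)  # shape (1, 1): NCCL I/O time is negligible
    start = time.perf_counter()
    for _ in range(n):
        ray.get(compiled_dag.execute(tiny))
    return (time.perf_counter() - start) / n * 1e3
```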
I documented the process of verifying the calculation in https://github.com/ray-project/ray/pull/48860#issuecomment-2512348571. Closing this issue. @stephanie-wang, feel free to reopen the issue if I missed anything.