
Performance issue: allreduce_benchmark slower than ncclAllReduce

Open · haolinyan opened this issue on Mar 3, 2025 · 1 comment

First of all, I'd like to express my sincere gratitude to all the contributors of this repository! I'm able to run the allreduce_benchmark smoothly, but unfortunately, its performance is significantly inferior to that of NCCL. I'm reaching out in the hope of getting some assistance from you.

Experimental Setup

I launched 4 Docker containers on a DGX-1 machine. The Docker image was built from rdma.dockerfile. The command to start the containers is as follows:

# Each container can only use one V100 GPU
sudo docker run -dit --gpus device=$id \
            --net=host --cap-add=IPC_LOCK \
            --shm-size=32768m \
            --cap-add SYS_ADMIN \
            --cap-add SYS_RESOURCE  \
            --device=/dev/infiniband/uverbs$id \
            --name switchml-rdma$id \
            my_image_name:tag

I compiled all the files using the command:

make RDMA=1 TIMEOUTS=0 VCL=1 DEBUG=0

I modified the configuration file as follows:

num_workers = 4
num_worker_threads = 8
max_outstanding_packets = 256
packet_numel = 64
backend = rdma
prepostprocessor = bypass 
msg_numel = 64

After that, I ran the allreduce_benchmark test:

./bin/allreduce_benchmark --tensor-numel=1048576 --tensor-type=int32

The output is shown below:

Signal handler thread started. Waiting for any signals.
Submitting 5 warmup jobs.
Warmup finished.
Submitting 10 jobs.
Job(s) #0# finished. Duration: #531972764# ns Goodput: #0.0630755# Gbps.
Job(s) #1# finished. Duration: #539886204# ns Goodput: #0.0621509# Gbps.
Job(s) #2# finished. Duration: #575978341# ns Goodput: #0.0582564# Gbps.
Job(s) #3# finished. Duration: #567985377# ns Goodput: #0.0590762# Gbps.
Job(s) #4# finished. Duration: #587951282# ns Goodput: #0.0570701# Gbps.
Job(s) #5# finished. Duration: #572039624# ns Goodput: #0.0586575# Gbps.
Job(s) #6# finished. Duration: #560061712# ns Goodput: #0.059912# Gbps.
Job(s) #7# finished. Duration: #575842023# ns Goodput: #0.0582702# Gbps.
Job(s) #8# finished. Duration: #583945036# ns Goodput: #0.0574616# Gbps.
Job(s) #9# finished. Duration: #556026482# ns Goodput: #0.0603468# Gbps.
All jobs finished.


Min 531972764 ns 0.0630755 Gbps
Max 587951282 ns 0.0570701 Gbps
Median 572039624 ns 0.0586575 Gbps
Mean 565168884 ns 0.0593706 Gbps
Std dev 1.73446e+07 ns
Cleaning up.
Signal handler thread is exiting 

As you can see, the results are quite poor: roughly 0.06 Gbps for a 4 MiB tensor, even though I'm using four mlx5 NICs, each with a 100 Gbps port.
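The reported Goodput is consistent with the tensor size and the durations above; here is a quick standalone check (a sketch for illustration, not code from the repository):

// Sanity check of the Goodput figure for job #0 above.
// Standalone sketch; 4 MiB comes from --tensor-numel=1048576 with int32.
#include <cstdio>
#include <cstdint>

int main() {
  const double tensor_bytes = 1048576.0 * sizeof(int32_t);      // 4 MiB payload
  const double duration_ns  = 531972764.0;                      // job #0 duration
  const double goodput_gbps = tensor_bytes * 8.0 / duration_ns; // bits per ns == Gbps
  std::printf("Goodput: %g Gbps\n", goodput_gbps);              // prints ~0.0631
  return 0;
}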

[Image] Ports 3/0, 5/0, 7/0, and 9/0 are used in our experiments.

NCCL Test

To make the comparison as fair as possible, I used 4 containers with a similar configuration. I started them with --net=none and manually moved each NIC's interface into its container's network namespace, so that each container could only use one mlx5 NIC.

# Each container can only use one V100 GPU
sudo docker run --name $container_name  \
                --rm \
                --gpus $gpu_device \
                --net=none \
                --cap-add IPC_LOCK \
                --cap-add NET_ADMIN \
                --shm-size=32768m \
                --cap-add SYS_ADMIN \
                --cap-add SYS_RESOURCE  \
                --device /dev/infiniband/rdma_cm \
                --device /dev/infiniband/issm$id \
                --device /dev/infiniband/umad$id \
                --device /dev/infiniband/uverbs$id \
                --hostname worker-$id \
                -v $share_dir:/shared \
                -dit $image /bin/bash

Then I ran the NCCL allreduce experiment; each allreduce processed 4 MB of data. I first performed 5 warmup runs and then the timed run:

mpirun -np 4 -hostfile hostfile.txt -x NCCL_IB_GID_INDEX=3 ./nccl_allreduce_test int32

The results are as follows:

[MPI Rank 0] Success with data type int32, Time taken: 0.002519 seconds, Throughput: 13.320537 Gbps
[MPI Rank 2] Success with data type int32, Time taken: 0.002435 seconds, Throughput: 13.780054 Gbps
[MPI Rank 3] Success with data type int32, Time taken: 0.002456 seconds, Throughput: 13.662228 Gbps
[MPI Rank 1] Success with data type int32, Time taken: 0.002563 seconds, Throughput: 13.091858 Gbps
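
For context, the test harness has roughly the following shape. This is a minimal sketch for illustration, not the exact nccl_allreduce_test source; the 1M-element int32 buffer, the error-checking macros, and the MPI_Wtime timing are assumptions:

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  std::printf("CUDA error: %s\n", cudaGetErrorString(e)); MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  std::printf("NCCL error: %s\n", ncclGetErrorString(r)); MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  CUDACHECK(cudaSetDevice(0));   // one V100 per container, so device 0 everywhere

  const size_t count = 1048576;  // 1M int32 elements = 4 MiB, matching the SwitchML run
  int *sendbuf, *recvbuf;
  CUDACHECK(cudaMalloc(&sendbuf, count * sizeof(int)));
  CUDACHECK(cudaMalloc(&recvbuf, count * sizeof(int)));

  // Rank 0 creates the NCCL unique id and broadcasts it over MPI.
  ncclUniqueId id;
  if (rank == 0) NCCLCHECK(ncclGetUniqueId(&id));
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));
  cudaStream_t stream;
  CUDACHECK(cudaStreamCreate(&stream));

  // 5 warmup allreduces, then one timed allreduce.
  for (int i = 0; i < 5; ++i)
    NCCLCHECK(ncclAllReduce(sendbuf, recvbuf, count, ncclInt32, ncclSum, comm, stream));
  CUDACHECK(cudaStreamSynchronize(stream));

  double start = MPI_Wtime();
  NCCLCHECK(ncclAllReduce(sendbuf, recvbuf, count, ncclInt32, ncclSum, comm, stream));
  CUDACHECK(cudaStreamSynchronize(stream));
  double elapsed = MPI_Wtime() - start;

  double gbps = count * sizeof(int) * 8.0 / elapsed / 1e9;
  std::printf("[MPI Rank %d] Time taken: %f seconds, Throughput: %f Gbps\n", rank, elapsed, gbps);

  NCCLCHECK(ncclCommDestroy(comm));
  CUDACHECK(cudaFree(sendbuf));
  CUDACHECK(cudaFree(recvbuf));
  MPI_Finalize();
  return 0;
}

Something like this would be compiled with nvcc or mpicxx, linked against NCCL, and launched with the mpirun command shown above.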

I've ensured that all containers communicate through the switch:

[Image: switch byte counters]

In the figure, each container received a total of 40156402 bytes. In NCCL's ring allreduce with 4 ranks, each rank receives 3 chunks of 1 MB in the reduce-scatter phase and another 3 in the allgather phase, i.e. 1 MB * 2 * 3 = 6 MB per allreduce, and 36 MB over the 6 allreduces (5 warmup + 1 timed). This is close to the observed counter, which confirms that the 4 ranks really communicate over the NICs and the switch rather than over intra-host paths.
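
A quick check of that arithmetic (a standalone sketch; the 4 MB buffer is assumed to be 1,048,576 int32 elements):

// Expected bytes received per rank under NCCL's ring allreduce, for comparison
// with the switch counters above.
#include <cstdio>

int main() {
  const double buffer_bytes = 1048576.0 * 4.0;  // 4 MiB per allreduce (assumed)
  const int nranks = 4;
  const int allreduces = 6;                     // 5 warmup + 1 timed
  // Ring allreduce: each rank receives (N-1) chunks in reduce-scatter and
  // (N-1) chunks in allgather, i.e. 2*(N-1)/N of the buffer per allreduce.
  const double per_allreduce = 2.0 * (nranks - 1) / nranks * buffer_bytes;  // 6 MiB
  std::printf("expected per rank: %.0f bytes over %d allreduces\n",
              per_allreduce * allreduces, allreduces);  // ~37.7 MB vs. 40156402 observed
  return 0;
}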

I'm really looking forward to your help in resolving the performance issue of allreduce_benchmark. Thank you!

haolinyan · Mar 3, 2025

Alternatively, could you provide the NCCL script corresponding to the figure below? I will re-verify. Thanks again!

[Image]

haolinyan · Mar 5, 2025