[Issue]: RCCL Alltoall collective performs much worse than MPI_Alltoall on Frontier.
Problem Description
I ran my code on Frontier to test scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replace the MPI_Alltoall call with the RCCL alltoall call (nccl_Alltoall), it performs much worse than MPI. Why is this happening?
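For context (no reproducer was attached, so this is not taken from the actual code): a minimal sketch of the kind of swap being described, assuming contiguous device buffers, GPU-aware Cray MPICH on Frontier, and RCCL's ncclAllToAll extension; the function and buffer names here are hypothetical.

```c
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>

/* Hypothetical sketch: exchange `count` doubles with every rank.
 * d_send/d_recv are device buffers holding nranks * count doubles each. */

/* MPI path: relies on GPU-aware Cray MPICH (MPICH_GPU_SUPPORT_ENABLED=1)
 * to accept device pointers directly. */
void alltoall_mpi(double *d_send, double *d_recv, int count)
{
    MPI_Alltoall(d_send, count, MPI_DOUBLE,
                 d_recv, count, MPI_DOUBLE, MPI_COMM_WORLD);
}

/* RCCL path: count is the per-peer element count; the call is asynchronous
 * on the given stream, so synchronize before timing or reusing the buffers. */
void alltoall_rccl(double *d_send, double *d_recv, size_t count,
                   ncclComm_t comm, hipStream_t stream)
{
    ncclAllToAll(d_send, d_recv, count, ncclDouble, comm, stream);
    hipStreamSynchronize(stream);
}
```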
Operating System
SLES (Frontier)
CPU
AMD EPYC 7763 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
rccl
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
@manver-iitk a couple of questions:
- how many nodes did you use (1 node vs. more than 1 node)?
- if more than 1 node, how did you configure RCCL on Frontier? Did you use the RCCL libfabric plugin for inter-node communication? If not, RCCL will fall back to TCP sockets as far as I know (since Frontier does not support the verbs API), which might explain why RCCL is so much slower than MPI_Alltoall.
Hello, @manver-iitk . Has this issue been resolved for you?
Hello @corey-derochie-amd, my issue still persists.
@edgargabriel I have also installed the aws_ofi_rccl plugin for inter-node communication, but the timings are still about 2x to 3x those of plain MPI. I'm using 4 to 8 nodes.
Hi, for alltoall RCCL uses a fan-out algorithm, which is very crude (everyone sends to and receives from everyone), whereas MPI implementations use more sophisticated algorithms. This is an area where we acknowledge NCCL/RCCL lags behind. Unfortunately, optimizing alltoall for multi-node is not high on our priority list.
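For context, a rough sketch of what that fan-out pattern looks like when expressed with grouped point-to-point calls (the function name, buffer layout, and use of doubles are illustrative, not RCCL's internal implementation):

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>

/* Illustrative fan-out alltoall: inside one group call, every rank posts a
 * send to and a receive from every peer (including itself), so all transfers
 * are issued at once with no staging or message aggregation.
 * Error checking is omitted for brevity. */
void alltoall_fanout(const double *d_send, double *d_recv, size_t count,
                     int nranks, ncclComm_t comm, hipStream_t stream)
{
    ncclGroupStart();
    for (int peer = 0; peer < nranks; ++peer) {
        ncclSend(d_send + (size_t)peer * count, count, ncclDouble,
                 peer, comm, stream);
        ncclRecv(d_recv + (size_t)peer * count, count, ncclDouble,
                 peer, comm, stream);
    }
    ncclGroupEnd();
}
```

An MPI library, by contrast, can typically choose algorithms such as pairwise exchange that limit the number of in-flight messages per rank, which is what the "more algorithmic" comparison above refers to.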
We do not expect to match MPI's Alltoall performance. Closing the ticket.