rccl [Issue]: low performance all_gather

Problem Description

I use the aws-ofi-rccl plugin with libfabric 1.23 on a Cray SS11 system.

I run on 4 nodes, each with 4 MI250X (8 GCDs per node, one rank per GCD). See the CPU/GPU details below.

I launch with Slurm:

srun --nodes=4 --ntasks-per-node=8 --cpus-per-task=8 --threads-per-core=1 --label \
    -- all_gather_perf -b 64K -e 4G -f 2 -g 1

No cgroup restriction is getting in the way of DMA.

all_gather performance seems low for message sizes below 16777216 bytes:

 0: #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
 0: #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 0:        65536           512     float    none      -1    107.5    0.61    0.59      0    105.6    0.62    0.60      0
 0:       131072          1024     float    none      -1    210.0    0.62    0.60      0    202.6    0.65    0.63      0
 0:       262144          2048     float    none      -1    211.1    1.24    1.20      0    200.9    1.30    1.26      0
 0:       524288          4096     float    none      -1    692.3    0.76    0.73      0    678.9    0.77    0.75      0
 0:      1048576          8192     float    none      -1   1456.6    0.72    0.70      0   1304.0    0.80    0.78      0
 0:      2097152         16384     float    none      -1    70248    0.03    0.03      0    73593    0.03    0.03      0
 0:      4194304         32768     float    none      -1    71951    0.06    0.06      0    59139    0.07    0.07      0
 0:      8388608         65536     float    none      -1    72576    0.12    0.11      0    25554    0.33    0.32      0
 0:     16777216        131072     float    none      -1   1071.2   15.66   15.17      0   1057.5   15.86   15.37      0
 0:     33554432        262144     float    none      -1   1223.2   27.43   26.58      0   1223.3   27.43   26.57      0
 0:     67108864        524288     float    none      -1   1273.4   52.70   51.05      0   1268.6   52.90   51.25      0
 0:    134217728       1048576     float    none      -1   1457.6   92.08   89.20      0   1453.4   92.34   89.46      0
 0:    268435456       2097152     float    none      -1   2864.2   93.72   90.79      0   2853.3   94.08   91.14      0
 0:    536870912       4194304     float    none      -1   5695.6   94.26   91.32      0   5674.5   94.61   91.65      0
 0:   1073741824       8388608     float    none      -1    11366   94.47   91.52      0    11319   94.86   91.90      0
 0:   2147483648      16777216     float    none      -1    22549   95.24   92.26      0    22498   95.45   92.47      0
 0:   4294967296      33554432     float    none      -1    44904   95.65   92.66      0    44867   95.73   92.74      0
 0: # Errors with asterisks indicate errors that have exceeded the maximum threshold.
 0: # Out of bounds values : 0 OK
 0: # Avg bus bandwidth    : 37.9868

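The low performance up to 8388608 bytes seems unexpected. In particular, the 2097152 to 8388608 byte sizes take about 70 ms out-of-place, versus about 1 ms at 16777216 bytes. On the same nodes, all_reduce_perf gives: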

 0: #                                                              out-of-place                       in-place
 0: #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
 0: #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 0:        65536         16384     float     sum      -1    78.91    0.83    1.61      0    77.42    0.85    1.64      0
 0:       131072         32768     float     sum      -1    101.6    1.29    2.50      0    102.0    1.29    2.49      0
 0:       262144         65536     float     sum      -1    157.8    1.66    3.22      0    161.5    1.62    3.15      0
 0:       524288        131072     float     sum      -1    202.4    2.59    5.02      0    202.2    2.59    5.02      0
 0:      1048576        262144     float     sum      -1    289.5    3.62    7.02      0    289.1    3.63    7.03      0
 0:      2097152        524288     float     sum      -1    265.7    7.89   15.29      0    330.2    6.35   12.31      0
 0:      4194304       1048576     float     sum      -1    366.0   11.46   22.20      0    364.7   11.50   22.28      0
 0:      8388608       2097152     float     sum      -1   1184.5    7.08   13.72      0    912.9    9.19   17.80      0
 0:     16777216       4194304     float     sum      -1    760.7   22.05   42.73      0    820.6   20.45   39.61      0
 0:     33554432       8388608     float     sum      -1   1326.7   25.29   49.00      0   1290.3   26.01   50.39      0
 0:     67108864      16777216     float     sum      -1   2444.7   27.45   53.19      0   2457.6   27.31   52.91      0
 0:    134217728      33554432     float     sum      -1   3486.7   38.49   74.58      0   3876.5   34.62   67.08      0
 0:    268435456      67108864     float     sum      -1   6326.1   42.43   82.21      0   5652.8   47.49   92.01      0
 0:    536870912     134217728     float     sum      -1    11268   47.65   92.32      0    11273   47.62   92.27      0
 0:   1073741824     268435456     float     sum      -1    22500   47.72   92.46      0    22504   47.71   92.45      0
 0:   2147483648     536870912     float     sum      -1    44867   47.86   92.74      0    44876   47.85   92.72      0
 0:   4294967296    1073741824     float     sum      -1    89585   47.94   92.89      0    89596   47.94   92.88      0
 0: # Errors with asterisks indicate errors that have exceeded the maximum threshold.
 0: # Out of bounds values : 0 OK
 0: # Avg bus bandwidth    : 43.7269
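
When comparing the two tables, it may help to recall how the busbw column relates to algbw. To my understanding of the nccl-tests conventions (which rccl-tests follows), the factor is (n-1)/n for all_gather and 2(n-1)/n for all_reduce, where n is the number of ranks. The sketch below only sanity-checks that relation against the last row of each table, assuming n = 32 (4 nodes x 8 tasks per node, from the srun line). The function name is made up for illustration.

```python
def bus_bandwidth(algbw_gbs: float, nranks: int, collective: str) -> float:
    """Convert algorithm bandwidth to bus bandwidth (nccl-tests convention)."""
    factors = {
        "all_gather": (nranks - 1) / nranks,
        "all_reduce": 2 * (nranks - 1) / nranks,
    }
    return algbw_gbs * factors[collective]


# Assumption: 4 nodes x 8 tasks per node, one rank per task.
NRANKS = 4 * 8

# Out-of-place algbw values from the 4294967296-byte rows of the two tables.
print(round(bus_bandwidth(95.65, NRANKS, "all_gather"), 2))
print(round(bus_bandwidth(47.94, NRANKS, "all_reduce"), 2))
```

Both results agree with the reported busbw values (92.66 and 92.89) to within rounding, so at large sizes the two collectives reach about the same bus bandwidth; the gap is confined to the smaller sizes.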

Operating System

NAME="Red Hat Enterprise Linux" VERSION="8.10 (Ootpa)"

CPU

AMD EPYC 7A53 64-Core Processor

GPU

Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

ROCm Version

ROCm 6.2.1

ROCm Component

rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Jan 17 '25 15:01 etiennemlb

Hi @etiennemlb. An internal ticket has been created to investigate your issue. Thanks!

Jan 17 '25 18:01 ppanchad-amd

Hi @etiennemlb,

Thank you for posting the question. As you noticed, all_gather_perf and all_reduce_perf behave differently, especially at smaller message sizes. The main reason lies in the nature of these operations:

Reduction Operation: This allows computation (e.g., "sum" in your log) to overlap with communication. As data arrives from other processes, it can be reduced with local data immediately, and the result can be sent on to the next process. This pipelining effect can lead to more efficient use of the communication bus.
Gather Operation: This collects data from all processes and distributes the combined data to all of them, so each process ends up with a complete set of data from every other process. The operation is almost entirely data movement, with minimal computation.

For both operations, efficiency also increases with message size, because the fixed overhead of initiating a transfer is amortized over more data, which leads to better utilization of the communication pipeline.
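
The amortization argument can be illustrated with a simple latency-plus-bandwidth model, time = overhead + size / peak. This is only a sketch: the overhead and peak values are hypothetical, picked to show the shape of the curve rather than fitted to this system, and the helper names are illustrative. For comparison, it also scans a few out-of-place all_gather_perf times copied from the issue.

```python
def effective_bandwidth_gbs(size_bytes: int, overhead_us: float, peak_gbs: float) -> float:
    """Bandwidth seen by the caller under time = overhead + size / peak."""
    bytes_per_us = peak_gbs * 1e3  # 1 GB/s == 1e3 bytes per microsecond
    total_us = overhead_us + size_bytes / bytes_per_us
    return size_bytes / total_us / 1e3


def sizes_where_time_drops(rows):
    """Return sizes whose measured time is lower than that of the previous, smaller size."""
    return [size for (_, prev_t), (size, t) in zip(rows, rows[1:]) if t < prev_t]


# Hypothetical parameters, chosen only to show the shape of the curve.
OVERHEAD_US = 100.0
PEAK_GBS = 95.0

# Out-of-place all_gather_perf times (us) copied from the issue.
MEASURED_ALL_GATHER = [
    (1048576, 1456.6),
    (2097152, 70248.0),
    (4194304, 71951.0),
    (8388608, 72576.0),
    (16777216, 1071.2),
]

for size, _ in MEASURED_ALL_GATHER:
    print(size, round(effective_bandwidth_gbs(size, OVERHEAD_US, PEAK_GBS), 2))
print("time drops at:", sizes_where_time_drops(MEASURED_ALL_GATHER))
```

Under this model the time can only grow with size, so the drop at 16777216 bytes in the measured all_gather times is something a fixed per-transfer overhead would not produce on its own.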

If you have further comments, please let us know.

Jan 23 '25 22:01 huanrwan-amd

Another setting worth checking is NUMA auto-balancing: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html#disable-numa-auto-balancing. Quoting that page: "If the output is 1, you can disable NUMA auto-balancing by running the following command: sudo sysctl kernel.numa_balancing=0."
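
A minimal sketch of that check in Python, assuming the setting is exposed at /proc/sys/kernel/numa_balancing as on mainline Linux kernels. The function name is illustrative, and the script only reads the value; it does not change it.

```python
from pathlib import Path
from typing import Optional


def numa_balancing_enabled(path: str = "/proc/sys/kernel/numa_balancing") -> Optional[bool]:
    """Return True/False for the current setting, or None if it cannot be read."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None  # file missing or unreadable (e.g. kernel built without NUMA balancing)
    return value != "0"


state = numa_balancing_enabled()
if state is None:
    print("NUMA auto-balancing setting not available on this system")
elif state:
    print("NUMA auto-balancing is enabled; consider: sudo sysctl kernel.numa_balancing=0")
else:
    print("NUMA auto-balancing is already disabled")
```

Running it on each compute node (for example through srun) would show whether the setting differs between nodes.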

May 13 '25 06:05 huanrwan-amd
