[Issue]: Unexpected Behavior LL Protocol

Open arkhadem opened this issue 1 year ago • 2 comments

Problem Description

Hi Everyone,

I have found a very strange behavior in rccl-rocm-6.1.2 that I cannot explain with my limited knowledge of the LL implementation. It shows up in the AllGather, RING, LL test.

In the LL implementation, each channel has 256 threads, and each thread sends/receives 8B of data per trip. So each trip of any LL primitive transfers 256 threads x 8B = 2KB of data. I found that if the data size is not divisible by 128B (16 threads), the latency is very high.
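The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers quoted in this issue (256 threads per channel, 8B per thread per trip, 128B = 16 threads), not RCCL code; the function name `describe` and the constant names are made up for the example.

```python
# Constants taken from the issue text, not read from RCCL itself.
THREADS_PER_CHANNEL = 256
BYTES_PER_THREAD = 8
TRIP_BYTES = THREADS_PER_CHANNEL * BYTES_PER_THREAD  # 2048 B = 2 KB per trip
ALIGN_BYTES = 16 * BYTES_PER_THREAD                  # 128 B = 16 threads


def describe(nbytes):
    """Split a transfer into full 2 KB trips plus a partial last trip (hypothetical helper)."""
    full_trips, tail_bytes = divmod(nbytes, TRIP_BYTES)
    # Threads needed for the partial trip, rounding up to whole 8 B slots.
    tail_threads = -(-tail_bytes // BYTES_PER_THREAD)
    return {
        "full_trips": full_trips,
        "tail_bytes": tail_bytes,
        "tail_threads": tail_threads,
        "aligned_128": nbytes % ALIGN_BYTES == 0,
    }


if __name__ == "__main__":
    for size in (2048, 2048 + 16, 2048 + 128):
        print(size, describe(size))
```

For example, a 2064B transfer is one full trip plus a 16B tail handled by 2 threads, and it is not 128B-aligned, which is the case reported as slow.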

In the following experiment, I increase the data size by 16B in each step, meaning that 2 more threads transfer data. Every 8 steps (x 2 threads = 16 threads, or 1/4 of the warp size), the latency is low. Otherwise, the latency is much higher (~150us difference). Does anyone understand why this is happening?

[Figure: 3MI300X]

Sincerely,

Alireza

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

Using 3 fully-connected GPUs:

RCCL_MSCCL_ENABLE=0 NCCL_PROTO=LL NCCL_ALGO=RING NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16 LD_LIBRARY_PATH=rccl-rocm-6.1.2/build/release/:$LD_LIBRARY_PATH ./build/all_gather_perf -g 3 -b 50331648 -e 50334720 -i 48 -s 1
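To see which sizes in this sweep fall on a 128B boundary, the loop below enumerates them in Python. It is a rough sketch under one assumption: that the `-b`/`-e`/`-i` values are total bytes across all ranks and that each of the 3 GPUs contributes one third, which matches the 16B-per-step description above (48B / 3). The names `sweep`, `BEGIN`, `END`, and `STEP` are hypothetical.

```python
# Values copied from the command line above (-g 3 -b 50331648 -e 50334720 -i 48).
NUM_GPUS = 3
BEGIN, END, STEP = 50331648, 50334720, 48
ALIGN_BYTES = 128  # 16 threads x 8 B, as described in the problem statement


def sweep():
    """Return (step, total_bytes, per_rank_bytes, aligned_128) for every size in the sweep."""
    rows = []
    for step, total in enumerate(range(BEGIN, END + 1, STEP)):
        # Assumption: the per-rank share is the total size divided by the GPU count.
        per_rank = total // NUM_GPUS
        rows.append((step, total, per_rank, per_rank % ALIGN_BYTES == 0))
    return rows


if __name__ == "__main__":
    for step, total, per_rank, aligned in sweep():
        print(f"step {step:2d}  total {total}  per-rank {per_rank}  128B-aligned: {aligned}")
```

Under that assumption the per-rank size grows by 16B per step, so only every 8th step is 128B-aligned, which lines up with the steps reported as fast.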

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

arkhadem, Jun 28 '24 18:06

Hi @arkhadem,

Thanks for reporting this - I've created an internal ticket to look into this and will update this ticket when we have some information about it.

gilbertlee-amd, Jul 03 '24 16:07

Hi @arkhadem,

I'm unable to reproduce this behavior with the latest ROCm release (6.3) and rccl-develop branch. Could you please try it out and let me know if you still see this issue?

rahulvaidya20, Feb 03 '25 23:02

@arkhadem, closing this ticket due to lack of response. Please feel free to reopen it if you still see the issue with the latest ROCm. Thanks!

ppanchad-amd, May 12 '25 15:05