Measuring time spent for reduction operation in AllReduce.

Open jain-jainendra opened this issue 11 months ago • 4 comments

I am trying to measure the time spent in the reduction operation of RCCL AllReduce. I found that it eventually calls this part of the code in common_kernel.h:

         #pragma unroll Unroll
         for (int u=0; u < Unroll; u++) {
           if (s < PreOpSrcs) tmp[u] = applyPreOp(preFn, tmp[u]);
           acc[u] = applyReduce(redFn, acc[u], tmp[u]);
         }

How can we measure the time spent in the applyReduce function? I tried _clock64 and wall_clock64, but they were not helpful.

jain-jainendra, Jan 21 '25 08:01

Hi @jain-jainendra. An internal ticket has been created to assist with your issue. Thanks!

ppanchad-amd, Jan 24 '25 15:01

Hi @jain-jainendra, in the rccl-tests examples (https://github.com/ROCm/rccl-tests), the ./all_reduce_perf test can help benchmark the time spent in the reduction operation.

e.g. https://rocm.docs.amd.com/en/develop/how-to/rocm-for-ai/training/train-a-model.html#running-the-rccl-bandwidth-test

[Image] It shows the out-of-place and in-place time for the reduce operation.

Please let us know if you need any further help.

huanrwan-amd, Feb 03 '25 21:02

I am already using RCCL tests. However, they report the complete time for the allreduce operation, which includes both communication and computation. I want to measure only the time spent in computation, i.e. the reduction operation.

jain-jainendra, Feb 07 '25 04:02

Hi @jain-jainendra, at the code level you could add:

         unsigned long long start = clock64();
         acc[u] = applyReduce(redFn, acc[u], tmp[u]); 
         unsigned long long end = clock64();

This tracks the time spent in applyReduce(). As you may be aware, in large-scale systems reduction operations are performed asynchronously and in parallel. As a result, the computation and communication phases are interdependent and are highly optimized in libraries such as RCCL and NCCL.

huanrwan-amd, Feb 10 '25 23:02
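
The clock64() snippet above takes two counter readings but does not show what to do with them. As a rough, standalone sketch (not RCCL code), the kernel below accumulates the difference across the unrolled iterations and writes one total per thread to a buffer the host can read. It is written in CUDA; HIP exposes the same clock64() device function, so the kernel should port directly. The names timedReduce, reduceSum, kUnroll and the ticks buffer are hypothetical and used only for illustration; reduceSum stands in for applyReduce with a sum reduction. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kUnroll = 4;  // hypothetical unroll factor for this sketch

// Hypothetical stand-in for applyReduce(redFn, a, b), here a plain sum.
__device__ __forceinline__ float reduceSum(float a, float b) { return a + b; }

// Each thread reduces kUnroll elements of src into dst and records the
// counter ticks that elapsed around the reduce calls only.
__global__ void timedReduce(float* dst, const float* src, long long* ticks, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  long long total = 0;
#pragma unroll
  for (int u = 0; u < kUnroll; u++) {
    int i = tid * kUnroll + u;
    if (i < n) {
      float a = dst[i];  // loads are issued before the first counter read
      float b = src[i];
      long long start = clock64();
      a = reduceSum(a, b);
      long long end = clock64();
      total += end - start;  // accumulate the delta instead of discarding it
      dst[i] = a;
    }
  }
  ticks[tid] = total;  // one slot per launched thread
}

int main() {
  const int n = 1024, threads = 64;
  const int items = (n + kUnroll - 1) / kUnroll;      // work items, one per thread
  const int blocks = (items + threads - 1) / threads;
  const int nThreads = blocks * threads;

  float *dst, *src;
  long long* ticks;
  cudaMallocManaged(&dst, n * sizeof(float));
  cudaMallocManaged(&src, n * sizeof(float));
  cudaMallocManaged(&ticks, nThreads * sizeof(long long));
  for (int i = 0; i < n; i++) { dst[i] = 1.0f; src[i] = 2.0f; }

  timedReduce<<<blocks, threads>>>(dst, src, ticks, n);
  cudaDeviceSynchronize();  // wait for the kernel before reading results

  long long sum = 0;
  for (int t = 0; t < nThreads; t++) sum += ticks[t];
  printf("dst[0] = %g, mean ticks per thread = %lld\n", dst[0], sum / nThreads);

  cudaFree(dst);
  cudaFree(src);
  cudaFree(ticks);
  return 0;
}
```

Treat the numbers as a rough indication only. The compiler and hardware are free to reorder or overlap the reduce with the counter reads, and the reads themselves add overhead that may be comparable to a single add. Inside RCCL the reduction is also interleaved with loads and stores that move data between peers, so summed per-call deltas will not necessarily separate into a clean "computation only" time.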