veScale

[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`

Open nooblyh opened this issue 1 year ago • 4 comments

In the README of ndtimeline, you mention implementing interfaces to obtain the streams used for NCCL communication, specifically `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`. However, these interfaces do not seem to be present in the patches directory. Do you plan to release the implementation of these interfaces?

nooblyh avatar Aug 12 '24 06:08 nooblyh

Thanks a lot for your comments. We will supplement these two functions soon.

vocaltract avatar Aug 13 '24 17:08 vocaltract

> Thanks a lot for your comments. We will supplement these two functions soon.

Does ndtimeline support asynchronous NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API to PyTorch and used ndtimeline in Megatron-LM to capture a timeline of the asynchronous NCCL communication. However, once I enabled cudaEvent recording, the asynchronous NCCL communication started executing sequentially with the computation kernels. Is this expected, and how can I fix it?

XLzed avatar Aug 18 '24 06:08 XLzed

> Thanks a lot for your comments. We will supplement these two functions soon.

> Does ndtimeline support asynchronous NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API to PyTorch and used ndtimeline in Megatron-LM to capture a timeline of the asynchronous NCCL communication. However, once I enabled cudaEvent recording, the asynchronous NCCL communication started executing sequentially with the computation kernels. Is this expected, and how can I fix it?

I found that the problem is caused by the environment variable `CUDA_DEVICE_MAX_CONNECTIONS=1`, which is required for TP/SP communication overlap. However, I don't know why it causes asynchronous NCCL ops instrumented with ndtimer to run sequentially.
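One possible explanation, stated as an assumption and not as a confirmed diagnosis of ndtimeline: `CUDA_DEVICE_MAX_CONNECTIONS` sets the number of hardware work queues between the host and the device, so with a value of 1 every stream shares a single queue and work is dispatched in host submission order. The toy model below is a pure-Python sketch of that head-of-line blocking. The `simulate` function, the stream numbers, and the durations are all hypothetical and only illustrate the idea; they do not measure real GPU behavior.

```python
def simulate(tasks, num_queues):
    """Toy model of stream work dispatched through FIFO hardware queues.

    tasks: list of (name, stream_id, duration) in host submission order.
    Assumption: a task is dispatched only after the task ahead of it in the
    same queue has been dispatched, and only after the previous task on its
    own stream has finished. GPU compute capacity is treated as unlimited.
    """
    queue_last_dispatch = {}  # queue index -> dispatch time of its latest task
    stream_last_finish = {}   # stream id -> finish time of its latest task
    schedule = {}
    for name, stream, duration in tasks:
        queue = stream % num_queues  # hypothetical stream-to-queue mapping
        start = max(queue_last_dispatch.get(queue, 0.0),
                    stream_last_finish.get(stream, 0.0))
        finish = start + duration
        queue_last_dispatch[queue] = start
        stream_last_finish[stream] = finish
        schedule[name] = (start, finish)
    return schedule


# Stream 0 is the compute stream, stream 1 is the communication stream.
tasks = [
    ("compute_0", 0, 4.0),
    ("compute_1", 0, 4.0),
    ("all_reduce", 1, 3.0),
]

one_queue = simulate(tasks, num_queues=1)
two_queues = simulate(tasks, num_queues=2)

# With one queue, all_reduce sits behind compute_1, which is itself waiting
# for compute_0, so it cannot start until t=4.0 (a false dependency).
print("1 queue :", one_queue["all_reduce"])   # (4.0, 7.0)
# With two queues, all_reduce has its own queue and starts immediately.
print("2 queues:", two_queues["all_reduce"])  # (0.0, 3.0)
```

In this model the communication op is delayed purely because of where it sits in the shared queue, which suggests that any extra work submitted around it (for example event records) could shift the overlap when only one connection is available. Whether this is what happens with ndtimer would need to be checked with a profiler.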

XLzed avatar Aug 20 '24 06:08 XLzed

> In the README of ndtimeline, you mention implementing interfaces to obtain the streams used for NCCL communication, specifically `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`. However, these interfaces do not seem to be present in the patches directory. Do you plan to release the implementation of these interfaces?

You can check here

vocaltract avatar Aug 20 '24 10:08 vocaltract

The PR is now merged.

vocaltract avatar Aug 26 '24 03:08 vocaltract