[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`
In the README of ndtimeline, you mentioned implementing interfaces to obtain the streams used for NCCL communication, specifically get_p2p_cuda_stream_id and get_coll_cuda_stream_id. However, these interfaces do not seem to be present in the patches directory. Do you plan to release the specific implementation of these interfaces?
Thanks a lot for your comments. We will add these two functions soon.
Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API in PyTorch and used ndtimeline in Megatron-LM to try to capture the timeline of async NCCL communication, but when I enabled cudaEvent recording, the async NCCL communication started executing sequentially with the computation kernels. Is this normal, and how can I fix it?
I found that the problem is caused by the environment variable CUDA_DEVICE_MAX_CONNECTIONS=1, which is required for TP/SP communication overlap. But I don't know why it makes the async NCCL ops with ndtimer run sequentially.
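One plausible explanation (not confirmed by the maintainers here): CUDA_DEVICE_MAX_CONNECTIONS caps the number of hardware work queues the driver exposes, so with a value of 1 all streams funnel into a single queue and work is executed roughly in issue order. Extra cudaEvent records interleaved on the communication stream can then serialize against compute kernels. A minimal sketch of setting the variable so it actually takes effect (the commented torch import is illustrative; the variable must be set before the CUDA context is created):

```python
import os

# CUDA_DEVICE_MAX_CONNECTIONS limits the number of hardware work queues
# (connections) between the host and the GPU. It is read when the CUDA
# context is created, so it must be set before the first CUDA call in
# the process, or the driver silently ignores it.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# Import CUDA frameworks only after setting the variable
# (illustrative; Megatron-LM sets this in its launch scripts).
# import torch

print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
```

With the variable unset (or at its larger default), streams map to separate queues and can overlap freely, but Megatron-LM's TP/SP overlap relies on the single-queue issue order, so simply removing it trades one behavior for the other.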
You can check here
The PR is now merged.