[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`
In the README of ndtimeline, you mentioned implementing interfaces to obtain the streams used for NCCL communication, specifically get_p2p_cuda_stream_id and get_coll_cuda_stream_id. However, these interfaces do not seem to be present in the patches directory. Do you plan to release the specific implementation of these interfaces?
Thanks a lot for your comments. We will add these two functions soon.
Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API in PyTorch and used ndtimeline in Megatron-LM to try to capture the timeline of async NCCL communication, but when I enabled cudaEvent recording, the async NCCL communication started executing sequentially with the computation kernels. Is this normal, and how can I fix it?
I found that the problem is caused by the environment variable CUDA_DEVICE_MAX_CONNECTIONS=1, which is required for TP/SP communication overlap. But I don't know why it makes the async NCCL ops with ndtimer run sequentially.
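One plausible explanation (not confirmed by the maintainers here): CUDA_DEVICE_MAX_CONNECTIONS caps the number of hardware work queues the driver exposes, so with a value of 1 all streams funnel into a single queue and work is executed roughly in issue order. Extra cudaEvent records interleaved on the communication stream can then serialize against compute kernels. A minimal sketch of setting the variable so it actually takes effect (the commented torch import is illustrative; the variable must be set before the CUDA context is created):

```python
import os

# CUDA_DEVICE_MAX_CONNECTIONS limits the number of hardware work queues
# (connections) between the host and the GPU. It is read when the CUDA
# context is created, so it must be set before the first CUDA call in
# the process, or the driver silently ignores it.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# Import CUDA frameworks only after setting the variable
# (illustrative; Megatron-LM sets this in its launch scripts).
# import torch

print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
```

With the variable unset (or at its larger default), streams map to separate queues and can overlap freely, but Megatron-LM's TP/SP overlap relies on the single-queue issue order, so simply removing it trades one behavior for the other.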
You can check here
The PR is now merged.