Megatron-LM [QUESTION] Why should CUDA_DEVICE_MAX_CONNECTIONS=1 be set when using seq_parallel or async comm?

I have a question: why is it necessary to set CUDA_DEVICE_MAX_CONNECTIONS=1 after enabling seq_parallel? There is a note about this in the bwd_compute function, which says it is needed so that communication is launched first and can overlap with computation. But I don't understand how the two are necessarily connected. Can anyone help explain? Thank you.

Oct 09 '23 12:10 Infi-zc

It forces kernels to execute on the GPU in the same order in which they were queued from the host. It's there for GEMM and TP communication overlap: it lets the communication kernel be scheduled on the GPU ahead of the GEMM, so that the communication kernel has GPU resources allocated before the GEMM takes all of them. From my understanding, it's essentially like having a single stream.
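To make the setup concrete, here is a minimal sketch of setting the variable from Python. It assumes the value has to be in the environment before the process initializes CUDA (so before the first CUDA call from a framework such as PyTorch); the helper name `single_connection_env` is hypothetical and not part of Megatron-LM.

```python
import os


def single_connection_env(base_env=None):
    """Return a copy of an environment mapping with CUDA_DEVICE_MAX_CONNECTIONS=1.

    Hypothetical helper for illustration; pass the result as `env=` when
    launching a training subprocess.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
    return env


# For the current process: set it at the very top of the entry script,
# before anything initializes CUDA (assumption: the value is read at
# CUDA initialization, so setting it later has no effect).
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```

The same thing can be done from the shell by exporting the variable before launching the training script; the sketch only shows where in the process lifetime the assignment would need to happen.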

Oct 12 '23 18:10 wdykas

@wdykas Thanks for your reply! I have some further questions about that env var.

1. Does this environment variable have an upper limit?
2. What is the default value?

Oct 16 '23 12:10 Infi-zc

Marking as stale. No activity in 60 days.

Dec 15 '23 18:12 github-actions[bot]

@wdykas Thanks for your reply! I have some further questions about that env var.
1. Does this environment variable have an upper limit?
2. What is the default value?

See here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-environment-variables
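For reference, the environment-variable table in the linked guide lists CUDA_DEVICE_MAX_CONNECTIONS as accepting 1 to 32 connections, with a default of 8. A small sketch that encodes those two numbers is below. The function name `requested_max_connections` is hypothetical, and raising on an out-of-range value is an assumption made for illustration; it is not a description of what the CUDA driver itself does with such values.

```python
import os

ENV_VAR = "CUDA_DEVICE_MAX_CONNECTIONS"
DEFAULT_CONNECTIONS = 8                    # default per the CUDA programming guide
MIN_CONNECTIONS, MAX_CONNECTIONS = 1, 32   # documented range


def requested_max_connections(env=None):
    """Return the connection count requested through the environment.

    Falls back to the documented default when the variable is unset.
    Out-of-range values raise ValueError (illustrative choice only).
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None:
        return DEFAULT_CONNECTIONS
    value = int(raw)
    if not MIN_CONNECTIONS <= value <= MAX_CONNECTIONS:
        raise ValueError(
            f"{ENV_VAR}={value} is outside the documented range "
            f"{MIN_CONNECTIONS}..{MAX_CONNECTIONS}"
        )
    return value
```

Calling `requested_max_connections({})` gives the default, and `requested_max_connections({"CUDA_DEVICE_MAX_CONNECTIONS": "1"})` gives the value discussed in this thread.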

Feb 23 '24 13:02 JigaoLuo

Marking as stale. No activity in 60 days.

Apr 24 '24 18:04 github-actions[bot]
