[QUESTION] Why should CUDA_DEVICE_MAX_CONNECTIONS=1 be set when using seq_parallel or async comm?
I have a question: why is it necessary to set CUDA_DEVICE_MAX_CONNECTIONS=1 after enabling seq_parallel? A note in the bwd_compute function says this is so that communication is launched first and can overlap with computation, but I don't understand how the two are connected. Can anyone help explain? Thank you.
It forces kernels to execute on the GPU in the same order in which the host queued them. This matters for overlapping GEMMs with TP communication: the communication kernel can be scheduled on the GPU ahead of the GEMM, so it gets its GPU resources allocated before the GEMM takes all of them. From my understanding, it's essentially like having a single stream.
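For anyone landing here later, a minimal sketch of setting the variable from Python rather than from the shell. The helper name `pin_single_connection` is made up for illustration and is not part of any library. The sketch assumes the variable is read when CUDA is initialized, so it has to be in the environment before the first CUDA call (in practice, before the framework touches the GPU).

```python
import os


def pin_single_connection(env=None):
    """Hypothetical helper: request a single host-to-device work queue.

    Returns the previous value (or None) so the caller can log or restore it.
    """
    env = os.environ if env is None else env
    previous = env.get("CUDA_DEVICE_MAX_CONNECTIONS")
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
    return previous


# Call this at the very top of the launch script.
# Assumption: no CUDA call has been made in this process yet.
previous_value = pin_single_connection()

# Only after this point import the framework and enqueue work, with the
# communication kernel launched before the GEMM it should overlap with.
```

Exporting the variable in the launch shell (`export CUDA_DEVICE_MAX_CONNECTIONS=1`) has the same effect and avoids any import-order concerns.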
@wdykas Thanks for your reply! I have some further questions about that env var.
- Does this environment variable have an upper limit?
- What is the default value?
Marking as stale. No activity in 60 days.
See here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-environment-variables
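As I read the table at that link, the variable accepts 1 to 32 connections and defaults to 8; please confirm against the guide for your CUDA version. Below is a small sketch of a sanity check a launch script could run. The function name `requested_connections` is hypothetical, and raising on out-of-range values is a choice made for this sketch, not a description of what the CUDA driver does.

```python
import os

# Values as read from the linked CUDA environment-variables table (assumption:
# they match the CUDA version in use).
DEFAULT_CONNECTIONS = 8
MIN_CONNECTIONS, MAX_CONNECTIONS = 1, 32


def requested_connections(env=None):
    """Hypothetical helper: return the connection count the environment asks for."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_DEVICE_MAX_CONNECTIONS")
    if raw is None:
        # Variable not set: fall back to the documented default.
        return DEFAULT_CONNECTIONS
    value = int(raw)
    if not MIN_CONNECTIONS <= value <= MAX_CONNECTIONS:
        # Sketch-only behavior: fail loudly instead of guessing what CUDA does.
        raise ValueError(
            f"CUDA_DEVICE_MAX_CONNECTIONS={value} is outside "
            f"{MIN_CONNECTIONS}..{MAX_CONNECTIONS}"
        )
    return value
```

A training script that relies on the overlap could, for example, check `requested_connections() == 1` at startup and warn otherwise.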