[Question] FSDP+TP CUDA_DEVICE_MAX_CONNECTIONS
In the Megatron repo (https://github.com/NVIDIA/Megatron-LM/blob/4429e8ebe21fb011529d7401c370841ce530785a/megatron/training/arguments.py#L779),
it's recommended that FSDP use larger values of CUDA_DEVICE_MAX_CONNECTIONS, while Megatron TP requires it to be 1. Is that also the case for the torch implementation of TP using DTensor?
How should I configure this environment variable when using the torch implementations of FSDP(2) and/or TP/CP/SP?
@weifengpy Do you have insights on this?
@ChenchaoZhao @fegin For FSDP2 + torch native TP, we recommend setting CUDA_DEVICE_MAX_CONNECTIONS to the number of CUDA streams, for example 16 or 32. This makes sure compute and NCCL kernels can execute in parallel.
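A minimal sketch of what this could look like in practice, assuming a torchrun launch and FSDP2 + native TP on a 2-D device mesh. The value 32 and the tp_size of 2 are illustrative choices, not recommendations from this thread; the key point is that CUDA_DEVICE_MAX_CONNECTIONS must be set before any CUDA context is created, so it goes at the very top of the script (or is exported in the shell before torchrun).

```python
# Set the env var before torch/CUDA is initialized; the driver reads it when
# the CUDA context is created. "32" is an illustrative value.
import os
os.environ.setdefault("CUDA_DEVICE_MAX_CONNECTIONS", "32")

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 2-D mesh: outer "dp" dim for FSDP2 sharding, inner "tp" dim for tensor
    # parallel. tp_size=2 is a hypothetical choice; world_size = dp * tp.
    tp_size = 2
    dp_size = dist.get_world_size() // tp_size
    mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

    # ... apply parallelize_module(...) with mesh["tp"] and
    #     fully_shard(...) with mesh["dp"] to your model here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Equivalently, the variable can be exported at launch time, e.g. `CUDA_DEVICE_MAX_CONNECTIONS=32 torchrun --nproc-per-node=8 train.py`.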
Thanks for the quick answer. Does that mean PyTorch native TP is superior to Megatron TP, which requires the variable to be 1 in order to turn on TP comm overlap (comm+GEMM)?