[Question] FSDP+TP CUDA_DEVICE_MAX_CONNECTIONS
In the Megatron repo (https://github.com/NVIDIA/Megatron-LM/blob/4429e8ebe21fb011529d7401c370841ce530785a/megatron/training/arguments.py#L779),
it's recommended that FSDP use larger values of CUDA_DEVICE_MAX_CONNECTIONS, while Megatron TP requires it to be 1. Is that also the case for the torch implementation of TP using DTensor?
How should I configure this environment variable when using the torch implementations of FSDP(2) and/or TP/CP/SP?
@weifengpy Do you have insights on this?
@ChenchaoZhao @fegin For FSDP2 + torch native TP, we recommend setting CUDA_DEVICE_MAX_CONNECTIONS to the number of CUDA streams, for example 16 or 32. This makes sure compute and NCCL kernels can execute in parallel.
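A minimal sketch of what this could look like in practice, assuming a torchrun launch and FSDP2 + native TP on a 2-D device mesh. The value 32 and the tp_size of 2 are illustrative choices, not recommendations from this thread; the key point is that CUDA_DEVICE_MAX_CONNECTIONS must be set before any CUDA context is created, so it goes at the very top of the script (or is exported in the shell before torchrun).

```python
# Set the env var before torch/CUDA is initialized; the driver reads it when
# the CUDA context is created. "32" is an illustrative value.
import os
os.environ.setdefault("CUDA_DEVICE_MAX_CONNECTIONS", "32")

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 2-D mesh: outer "dp" dim for FSDP2 sharding, inner "tp" dim for tensor
    # parallel. tp_size=2 is a hypothetical choice; world_size = dp * tp.
    tp_size = 2
    dp_size = dist.get_world_size() // tp_size
    mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

    # ... apply parallelize_module(...) with mesh["tp"] and
    #     fully_shard(...) with mesh["dp"] to your model here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Equivalently, the variable can be exported at launch time, e.g. `CUDA_DEVICE_MAX_CONNECTIONS=32 torchrun --nproc-per-node=8 train.py`.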
Thanks for the quick answer. Does that mean PyTorch native TP is superior to Megatron TP, which requires the variable to be 1 in order to turn on TP comm overlap (comm+GEMM)?