Alp Dener
This primarily has to do with the communication launch overhead in NCCL. Fine-grained overlap at the problem sizes we deal with in Transformer Engine requires frequent movement of small chunks...
@umiswing The effect on GEMM performance depends on the particular overlap algorithm and its configuration. For example, layers that are overlapped with a `ring-exchange` method should not impact GEMM performance...
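For concreteness, here is a sketch of how the overlap method is typically selected per layer through `ub_cfgs` when initializing Userbuffers. The exact `initialize_ub` signature and config keys vary between TE versions, so treat the names below as assumptions rather than a verbatim API:

```python
import torch
import transformer_engine.pytorch as te

# Assumption: ub_cfgs maps per-GEMM keys to overlap settings, and
# "ring_exchange" chunks the communication so it can hide behind the GEMM.
# Key names and the signature may differ across TE versions.
te.initialize_ub(
    shape=[4096, 8192],   # [sequence * batch, hidden] of the overlapped GEMMs
    tp_size=8,
    use_fp8=False,
    dtype=torch.bfloat16,
    ub_cfgs={"fc1_fprop": {"method": "ring_exchange"}},
)
```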
Hi @liuhatry -- I tested PR #986 earlier today on 2 nodes of 8xH100s and confirmed that `examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py` works correctly for the following use cases:

- `tp_size =...
Hi @liuhatry -- I updated PR #986 to prefer the Gloo backend over NCCL whenever possible for bootstrapping Userbuffers. The application code still has to initialize NCCL process groups for TE...
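If useful, here is a minimal sketch of that initialization on the application side, assuming a recent PyTorch that accepts per-device backend strings (the exact flags TE expects may differ):

```python
import os

import torch
import torch.distributed as dist

# Register Gloo for CPU tensors and NCCL for CUDA tensors in one call; TE
# can then bootstrap Userbuffers over Gloo while model communication still
# runs over NCCL.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")

# LOCAL_RANK is set by torchrun; its availability is an assumption here.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```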
Hi @liuhatry -- if the Gloo backend in PyTorch distributed can't do an all-gather over processes on a single host CPU, that suggests something is broken outside of Transformer Engine...
Hi @liuhatry -- you're correct, Gloo supports `all_gather()` but not `all_gather_into_tensor()`. Can you confirm that the following snippet works?

```python
import os
import socket
import torch
import torch.distributed as dist
...
```
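For reference, a minimal sketch of this kind of Gloo check, assuming the goal is to all-gather each rank's hostname (the body below is illustrative rather than the exact snippet):

```python
import socket

import torch
import torch.distributed as dist

# Bootstrap a Gloo process group; MASTER_ADDR/MASTER_PORT and friends are
# expected to be set by the launcher (e.g. torchrun).
dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()

# Encode the hostname into a fixed-size uint8 tensor so we can use
# all_gather(), which Gloo supports (unlike all_gather_into_tensor()).
name = socket.gethostname().encode()[:64].ljust(64, b"\0")
local = torch.tensor(list(name), dtype=torch.uint8)
gathered = [torch.empty(64, dtype=torch.uint8) for _ in range(world_size)]
dist.all_gather(gathered, local)

hostnames = [bytes(t.tolist()).rstrip(b"\0").decode() for t in gathered]
if dist.get_rank() == 0:
    print(hostnames)

dist.destroy_process_group()
```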
The UDS (Unix Domain Socket) error you’re seeing is coming from the CUDA Multicast handle initialization. Userbuffers bootstrapping needs to communicate CUDA Multicast handles between processes, but these handles are...
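For background: these shareable handles are exported as POSIX file descriptors, and file descriptors can only cross process boundaries over a Unix domain socket via SCM_RIGHTS. A minimal sketch of that mechanism in Python (names like `SOCK_PATH` are illustrative, not part of TE's bootstrap code):

```python
import socket

# Illustrative only: one process hands a file descriptor (e.g., an exported
# CUDA shareable handle) to another over a Unix domain socket.
SOCK_PATH = "/tmp/ub_handle.sock"  # hypothetical path

def send_handle(fd: int) -> None:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SOCK_PATH)
        # send_fds (Python 3.9+) wraps sendmsg() with SCM_RIGHTS.
        socket.send_fds(sock, [b"handle"], [fd])

def recv_handle() -> int:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(SOCK_PATH)
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            _, fds, _, _ = socket.recv_fds(conn, 1024, 1)
            return fds[0]
```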
Revisiting an issue from earlier:

> I checked the code (https://github.com/denera/TransformerEngine/blob/userbuffers-missing-data-parallel-pg/transformer_engine/pytorch/module/base.py#L128) and found that `socket.gethostname()` returns the same result in my env, and the local_size is 16.
>
> ```
> (Pdb) hostnames
> ['TENCENT64.site', 'TENCENT64.site', 'TENCENT64.site', 'TENCENT64.site',...
> ```
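To make the failure mode concrete: when every rank reports the same hostname, counting matching hostnames yields the full world size rather than the per-node rank count. A rough sketch of the problem and one possible workaround (the `GROUP_RANK` fallback is an assumption about torchrun-launched jobs, not TE's actual fix):

```python
import os
import socket

# With every rank reporting the same hostname, counting matches gives the
# whole world size instead of the per-node rank count:
hostnames = ["TENCENT64.site"] * 16      # as in the (Pdb) output above
local_size = hostnames.count(hostnames[0])
print(local_size)  # 16 -- every rank looks "local", though there are 2 nodes

# Possible workaround when hostnames are not unique per node: key on a
# launcher-provided node identifier instead. GROUP_RANK is set by torchrun.
node_key = os.environ.get("GROUP_RANK", socket.gethostname())
```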
Hi @liuhatry -- I recently merged PR #986 into TE/main after confirming that it resolves the multi-node issues for us in NeMo and Mcore. These changes also update the example problem...
Hi @liuhatry -- I've reproduced the issue with TE/main but I'm able to resolve it by adding `use_local_synchronization=True` to the group creation. This should eliminate the requirement for all ranks...
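For reference, a sketch of that change, assuming the groups come from `torch.distributed.new_group` (the ranks shown are illustrative):

```python
import torch.distributed as dist

# use_local_synchronization=True (available in recent PyTorch) makes
# new_group() synchronize only among the group's own members, so ranks
# outside the group no longer need to participate in its creation.
tp_group = dist.new_group(
    ranks=[0, 1, 2, 3],   # illustrative tensor-parallel ranks
    backend="nccl",
    use_local_synchronization=True,
)
```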