Alp Dener
This primarily has to do with the communication launch overhead in NCCL. Fine-grained overlap at the problem sizes we deal with in Transformer Engine requires frequent movement of small chunks...
@umiswing The effect on GEMM performance depends on the particular overlap algorithm and its configuration. For example, layers that are overlapped with a `ring-exchange` method should not impact GEMM performance...
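For concreteness, here is a sketch of how the overlap method is typically selected per layer through `ub_cfgs` when initializing Userbuffers. The exact `initialize_ub` signature and config keys vary between TE versions, so treat the names below as assumptions rather than a verbatim API:

```python
import torch
import transformer_engine.pytorch as te

# Assumption: ub_cfgs maps per-GEMM keys to overlap settings, and
# "ring_exchange" chunks the communication so it can hide behind the GEMM.
# Key names and the signature may differ across TE versions.
te.initialize_ub(
    shape=[4096, 8192],   # [sequence * batch, hidden] of the overlapped GEMMs
    tp_size=8,
    use_fp8=False,
    dtype=torch.bfloat16,
    ub_cfgs={"fc1_fprop": {"method": "ring_exchange"}},
)
```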
Hi @liuhatry -- I tested PR #986 earlier today on 2 nodes of 8xH100s and confirmed that `examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py` works correctly for the following use cases:

- `tp_size =...
Hi @liuhatry -- I updated PR #986 to prefer the Gloo backend over NCCL whenever possible for bootstrapping Userbuffers. The application code still has to initialize NCCL process groups for TE...
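If useful, here is a minimal sketch of that initialization on the application side, assuming a recent PyTorch that accepts per-device backend strings (the exact flags TE expects may differ):

```python
import os

import torch
import torch.distributed as dist

# Register Gloo for CPU tensors and NCCL for CUDA tensors in one call; TE
# can then bootstrap Userbuffers over Gloo while model communication still
# runs over NCCL.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")

# LOCAL_RANK is set by torchrun; its availability is an assumption here.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```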
Hi @liuhatry -- if the Gloo backend in PyTorch distributed can't do an all-gather over processes on a single host CPU, that suggests something is broken outside of Transformer Engine...
Hi @liuhatry -- you're correct, Gloo supports `all_gather()` but not `all_gather_into_tensor()`. Can you confirm that the following snippet works?

```python
import os
import socket
import torch
import torch.distributed as dist
...
```
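For reference, a minimal sketch of this kind of Gloo check, assuming the goal is to all-gather each rank's hostname (the body below is illustrative rather than the exact snippet):

```python
import socket

import torch
import torch.distributed as dist

# Bootstrap a Gloo process group; MASTER_ADDR/MASTER_PORT and friends are
# expected to be set by the launcher (e.g. torchrun).
dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()

# Encode the hostname into a fixed-size uint8 tensor so we can use
# all_gather(), which Gloo supports (unlike all_gather_into_tensor()).
name = socket.gethostname().encode()[:64].ljust(64, b"\0")
local = torch.tensor(list(name), dtype=torch.uint8)
gathered = [torch.empty(64, dtype=torch.uint8) for _ in range(world_size)]
dist.all_gather(gathered, local)

hostnames = [bytes(t.tolist()).rstrip(b"\0").decode() for t in gathered]
if dist.get_rank() == 0:
    print(hostnames)

dist.destroy_process_group()
```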
The UDS (Unix Domain Socket) error you’re seeing is coming from the CUDA Multicast handle initialization. Userbuffers bootstrapping needs to communicate CUDA Multicast handles between processes, but these handles are...
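For background: these shareable handles are exported as POSIX file descriptors, and file descriptors can only cross process boundaries over a Unix domain socket via SCM_RIGHTS. A minimal sketch of that mechanism in Python (names like `SOCK_PATH` are illustrative, not part of TE's bootstrap code):

```python
import socket

# Illustrative only: one process hands a file descriptor (e.g., an exported
# CUDA shareable handle) to another over a Unix domain socket.
SOCK_PATH = "/tmp/ub_handle.sock"  # hypothetical path

def send_handle(fd: int) -> None:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SOCK_PATH)
        # send_fds (Python 3.9+) wraps sendmsg() with SCM_RIGHTS.
        socket.send_fds(sock, [b"handle"], [fd])

def recv_handle() -> int:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(SOCK_PATH)
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            _, fds, _, _ = socket.recv_fds(conn, 1024, 1)
            return fds[0]
```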
Revisiting an issue from earlier:

> I checked the code (https://github.com/denera/TransformerEngine/blob/userbuffers-missing-data-parallel-pg/transformer_engine/pytorch/module/base.py#L128) and found that `socket.gethostname()` returns the same result in my env, and the local_size is 16.
>
> ```
> (Pdb) hostnames
> ['TENCENT64.site', 'TENCENT64.site', 'TENCENT64.site', 'TENCENT64.site',...
> ```
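To make the failure mode concrete: when every rank reports the same hostname, counting matching hostnames yields the full world size rather than the per-node rank count. A rough sketch of the problem and one possible workaround (the `GROUP_RANK` fallback is an assumption about torchrun-launched jobs, not TE's actual fix):

```python
import os
import socket

# With every rank reporting the same hostname, counting matches gives the
# whole world size instead of the per-node rank count:
hostnames = ["TENCENT64.site"] * 16      # as in the (Pdb) output above
local_size = hostnames.count(hostnames[0])
print(local_size)  # 16 -- every rank looks "local", though there are 2 nodes

# Possible workaround when hostnames are not unique per node: key on a
# launcher-provided node identifier instead. GROUP_RANK is set by torchrun.
node_key = os.environ.get("GROUP_RANK", socket.gethostname())
```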
Hi @liuhatry -- I recently merged PR #986 into TE/main after confirming that it resolves the multi-node issues for us in NeMo and Mcore. These changes also update the example problem...
Hi @liuhatry -- I've reproduced the issue with TE/main but I'm able to resolve it by adding `use_local_synchronization=True` to the group creation. This should eliminate the requirement for all ranks...
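For reference, a sketch of that change, assuming the groups come from `torch.distributed.new_group` (the ranks shown are illustrative):

```python
import torch.distributed as dist

# use_local_synchronization=True (available in recent PyTorch) makes
# new_group() synchronize only among the group's own members, so ranks
# outside the group no longer need to participate in its creation.
tp_group = dist.new_group(
    ranks=[0, 1, 2, 3],   # illustrative tensor-parallel ranks
    backend="nccl",
    use_local_synchronization=True,
)
```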