Megatron-LM
[QUESTION] Why does Megatron-LM use the gloo backend when creating parallel groups?
Why does Megatron-LM use the gloo backend, instead of the value passed by --distributed-backend, when creating parallel groups? For example, in the data-parallel group setup in megatron/core/parallel_state.py:
for i in range(pipeline_model_parallel_size):
    start_rank = i * num_pipeline_model_parallel_groups
    end_rank = (i + 1) * num_pipeline_model_parallel_groups
    for j in range(context_parallel_size * tensor_model_parallel_size):
        ranks = range(
            start_rank + j, end_rank, context_parallel_size * tensor_model_parallel_size
        )
        # Data-parallel group on the default backend (pg_options carry NCCL tuning options).
        group = torch.distributed.new_group(
            ranks, pg_options=get_nccl_options('dp', nccl_comm_cfgs)
        )
        # Companion gloo group over the same ranks, used for CPU-tensor collectives.
        group_gloo = torch.distributed.new_group(ranks, backend="gloo")
        if rank in ranks:
            _DATA_PARALLEL_GROUP = group
            _DATA_PARALLEL_GROUP_GLOO = group_gloo
            _DATA_PARALLEL_GLOBAL_RANKS = ranks
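For reference, the new_group calls that do not pass backend= inherit whatever backend was given to torch.distributed.init_process_group, which Megatron-LM populates from --distributed-backend; only the calls that explicitly pass backend="gloo" override it. A minimal sketch of that inheritance (illustrative only, assuming a torchrun-style launch that sets the usual environment variables; this is not Megatron-LM code):

import os
import torch.distributed as dist

def init_distributed(backend: str = "nccl"):
    # backend plays the role of the --distributed-backend value here.
    dist.init_process_group(
        backend=backend,
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

    default_group = dist.new_group()            # inherits the default backend
    cpu_group = dist.new_group(backend="gloo")  # explicit override for CPU tensors

    print(dist.get_backend(default_group))      # e.g. "nccl"
    print(dist.get_backend(cpu_group))          # "gloo"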
Most of the groups use the default backend. Only a few groups also use the gloo backend, because gloo is needed for communication of CPU tensors.
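To make the CPU-tensor point concrete: NCCL only operates on CUDA tensors, so a collective on a CPU tensor has to go through a gloo group. A minimal sketch (illustrative, not Megatron-LM code), assuming the default process group was initialized with NCCL on every rank:

import torch
import torch.distributed as dist

def demo_cpu_collective():
    # Assumes dist.init_process_group("nccl", ...) has already run on every rank.
    gloo_group = dist.new_group(backend="gloo")

    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor)                    # fine on the default NCCL group

    cpu_tensor = torch.ones(1)
    # dist.all_reduce(cpu_tensor)                  # would fail: NCCL has no CPU support
    dist.all_reduce(cpu_tensor, group=gloo_group)  # works: gloo handles CPU tensors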
Hi, I found some places in the code that are hard-coded to use gloo, for example https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py#L557, but I ran into an issue when creating the gloo backend. If I manually change all the "gloo" occurrences to "nccl", it works. What is the impact? Will it be okay if we replace all "gloo" with "nccl"? Thank you.
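Regarding the impact: replacing the gloo groups with NCCL only works as long as no code path actually performs a collective on CPU tensors through them; as far as I can tell, Megatron-LM keeps the gloo data-parallel group precisely for such paths (for example, the distributed optimizer gathering state on the CPU). A hedged sketch of the difference (illustrative helper, not Megatron-LM code):

import torch
import torch.distributed as dist

# Illustrative helper: gather a CPU tensor from every rank in a group.
def gather_cpu_state(group, state: torch.Tensor):
    world = dist.get_world_size(group=group)
    out = [torch.empty_like(state) for _ in range(world)]
    if dist.get_backend(group) == "gloo":
        dist.all_gather(out, state, group=group)   # CPU tensors gathered directly
    else:
        # With an NCCL group the data has to be staged through GPU memory,
        # which costs device memory and extra host/device copies.
        staged = state.cuda()
        out_gpu = [torch.empty_like(staged) for _ in range(world)]
        dist.all_gather(out_gpu, staged, group=group)
        out = [t.cpu() for t in out_gpu]
    return out

So swapping "gloo" for "nccl" may appear to work until one of those CPU-side code paths is actually exercised.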