Megatron-LM
[QUESTION] Why does Megatron-LM use the gloo backend when creating parallel groups?
Why does Megatron-LM use the gloo backend, instead of the value passed by --distributed-backend, when creating parallel groups? For example, in the data-parallel group setup in megatron/core/parallel_state.py:
for i in range(pipeline_model_parallel_size):
    start_rank = i * num_pipeline_model_parallel_groups
    end_rank = (i + 1) * num_pipeline_model_parallel_groups
    for j in range(context_parallel_size * tensor_model_parallel_size):
        ranks = range(
            start_rank + j, end_rank, context_parallel_size * tensor_model_parallel_size
        )
        # Data-parallel group on the default backend (pg_options carry NCCL tuning options).
        group = torch.distributed.new_group(
            ranks, pg_options=get_nccl_options('dp', nccl_comm_cfgs)
        )
        # Companion gloo group over the same ranks, used for CPU-tensor collectives.
        group_gloo = torch.distributed.new_group(ranks, backend="gloo")
        if rank in ranks:
            _DATA_PARALLEL_GROUP = group
            _DATA_PARALLEL_GROUP_GLOO = group_gloo
            _DATA_PARALLEL_GLOBAL_RANKS = ranks
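For reference, the new_group calls that do not pass backend= inherit whatever backend was given to torch.distributed.init_process_group, which Megatron-LM populates from --distributed-backend; only the calls that explicitly pass backend="gloo" override it. A minimal sketch of that inheritance (illustrative only, assuming a torchrun-style launch that sets the usual environment variables; this is not Megatron-LM code):

import os
import torch.distributed as dist

def init_distributed(backend: str = "nccl"):
    # backend plays the role of the --distributed-backend value here.
    dist.init_process_group(
        backend=backend,
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

    default_group = dist.new_group()            # inherits the default backend
    cpu_group = dist.new_group(backend="gloo")  # explicit override for CPU tensors

    print(dist.get_backend(default_group))      # e.g. "nccl"
    print(dist.get_backend(cpu_group))          # "gloo"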
Most of the groups use the default backend. Only a few groups also use the gloo backend, because gloo is needed for communication of CPU tensors.
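To make the CPU-tensor point concrete: NCCL only operates on CUDA tensors, so a collective on a CPU tensor has to go through a gloo group. A minimal sketch (illustrative, not Megatron-LM code), assuming the default process group was initialized with NCCL on every rank:

import torch
import torch.distributed as dist

def demo_cpu_collective():
    # Assumes dist.init_process_group("nccl", ...) has already run on every rank.
    gloo_group = dist.new_group(backend="gloo")

    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor)                    # fine on the default NCCL group

    cpu_tensor = torch.ones(1)
    # dist.all_reduce(cpu_tensor)                  # would fail: NCCL has no CPU support
    dist.all_reduce(cpu_tensor, group=gloo_group)  # works: gloo handles CPU tensors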
Hi, I found some places in the code that are hard-coded to use gloo, for example https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py#L557, but I ran into an issue when creating the gloo backend. If I manually change all the "gloo" occurrences to "nccl", it works. What is the impact? Will it be okay if we replace all "gloo" with "nccl"? Thank you.
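Regarding the impact: replacing the gloo groups with NCCL only works as long as no code path actually performs a collective on CPU tensors through them; as far as I can tell, Megatron-LM keeps the gloo data-parallel group precisely for such paths (for example, the distributed optimizer gathering state on the CPU). A hedged sketch of the difference (illustrative helper, not Megatron-LM code):

import torch
import torch.distributed as dist

# Illustrative helper: gather a CPU tensor from every rank in a group.
def gather_cpu_state(group, state: torch.Tensor):
    world = dist.get_world_size(group=group)
    out = [torch.empty_like(state) for _ in range(world)]
    if dist.get_backend(group) == "gloo":
        dist.all_gather(out, state, group=group)   # CPU tensors gathered directly
    else:
        # With an NCCL group the data has to be staged through GPU memory,
        # which costs device memory and extra host/device copies.
        staged = state.cuda()
        out_gpu = [torch.empty_like(staged) for _ in range(world)]
        dist.all_gather(out_gpu, staged, group=group)
        out = [t.cpu() for t in out_gpu]
    return out

So swapping "gloo" for "nccl" may appear to work until one of those CPU-side code paths is actually exercised.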