[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced, how can I solve it?

Open · TeddLi opened this issue 10 months ago · 8 comments

**Your question**
/workspace/megatron/megatron/core/models/gpt/gpt_layer_specs.py:77: UserWarning: The fp8 argument in "get_gpt_layer_with_transformer_engine_spec" has been deprecated and will be removed soon. Please update your code accordingly.
  warnings.warn(
[rank7]: Traceback (most recent call last):
[rank7]:   File "/workspace/megatron/pretrain_gpt.py", line 300, in <module>
[rank7]:     pretrain(
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 386, in pretrain
[rank7]:     iteration, num_floating_point_operations_so_far = train(
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 1478, in train
[rank7]:     train_step(forward_step_func,
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 766, in train_step
[rank7]:     losses_reduced = forward_backward_func(
[rank7]:   File "/workspace/megatron/megatron/core/pipeline_parallel/schedules.py", line 1877, in forward_backward_pipelining_without_interleaving
[rank7]:     config.finalize_model_grads_func(
[rank7]:   File "/workspace/megatron/megatron/core/distributed/finalize_model_grads.py", line 225, in finalize_model_grads
[rank7]:     model_chunk.finish_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/distributed_data_parallel.py", line 447, in finish_grad_sync
[rank7]:     bucket_group.finish_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 368, in finish_grad_sync
[rank7]:     self.start_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 306, in start_grad_sync
[rank7]:     with stream_context, _coalescing_manager(communication_group, async_ops=async_op) as cm:
[rank7]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
[rank7]:     next(self.gen)
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2031, in _coalescing_manager
[rank7]:     work = group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)
[rank7]: RuntimeError: Backend nccl does not support reduce_scatter_tensor_coalesced
[rank4]: Traceback (most recent call last):
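
For anyone who wants to isolate this outside Megatron-LM: below is a minimal sketch of the same code path (a hypothetical `repro.py`; it assumes at least 2 GPUs, a `torchrun --nproc_per_node=2 repro.py` launch, and the same environment as the failing run; `_coalescing_manager` is a private PyTorch API, used here only because it is what `param_and_grad_buffer.py` calls):

```python
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _coalescing_manager

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)  # single-node sketch; use LOCAL_RANK on multi-node setups

# One flat gradient buffer per rank, reduce-scattered into a local shard,
# roughly as Megatron's distributed data parallel does per bucket.
grad_buffer = torch.ones(world_size * 4, device="cuda")
shard = torch.empty(4, device="cuda")

# Inside the coalescing manager, reduce_scatter_tensor is deferred and then
# issued on exit as reduce_scatter_tensor_coalesced, which is the call that
# fails in the traceback above.
with _coalescing_manager(group=dist.group.WORLD, async_ops=False):
    dist.reduce_scatter_tensor(shard, grad_buffer)

dist.destroy_process_group()
```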

TeddLi · Jan 30, 2025

@TeddLi, did you find the reason for this problem? I'm encountering the same issue.

Salmon-f42 · Mar 01, 2025

Marking as stale. No activity in 60 days.

github-actions[bot] · Apr 30, 2025

Any progress? I'm also encountering the same issue, with the same traceback as above.

charles9304 · May 27, 2025

Run without `TORCH_DISTRIBUTED_DEBUG=DETAIL`. When that environment variable is set, PyTorch wraps each process group in a debug wrapper for extra consistency checking, and the wrapper does not forward coalesced collectives such as `reduce_scatter_tensor_coalesced`, so the call fails even though the plain NCCL backend supports it.
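
A minimal sketch of a launcher-side guard (hypothetical; it assumes this runs before `torch.distributed.init_process_group`, and the fallback to `OFF` is my choice, not anything Megatron-LM does itself):

```python
import os

# TORCH_DISTRIBUTED_DEBUG=DETAIL wraps each process group in a debug wrapper
# that does not forward coalesced collectives such as
# reduce_scatter_tensor_coalesced, which surfaces as the RuntimeError above.
level = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
if level == "DETAIL":
    # Must happen before torch.distributed.init_process_group(); OFF and
    # INFO keep the native NCCL process group.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "OFF"
    print("Downgraded TORCH_DISTRIBUTED_DEBUG from DETAIL to OFF")
```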

DWarez · Jul 23, 2025

Same error. Anyone willing to share a solution? Thanks.

HauffQian · Aug 26, 2025

Same error while slime was using Megatron to train a model. Detailed logs:

(MegatronTrainRayActor pid=57143) rollout 0: {'rollout/raw_reward': 0.46875, 'rollout/total_lengths': 6880.0625, 'rollout/response_lengths': 6724.4375, 'rollout/rewards': 3.725290298461914e-09, 'rollout/truncated': 0.515625, 'rollout/rollout_log_probs': -0.3078222069889307, 'rollout/ref_log_probs': -0.30862119793891907, 'rollout/log_probs': -0.30862119793891907, 'rollout/advantages': 8.381903171539307e-09, 'rollout/returns': 8.381903171539307e-09}
(MegatronTrainRayActor pid=57672) WARNING:megatron.core.rerun_state_machine:Implicit initialization of Rerun State Machine!
Traceback (most recent call last):
  File "/root/slime/train.py", line 93, in <module>
    train(args)
  File "/root/slime/train.py", line 55, in train
    ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 968, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::MegatronTrainRayActor.train() (pid=57667, ip=aistudio-mnnfgqzf-ptjob-master-0, actor_id=4b4bc9dc6681a0e6485d591802000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f0580929190>)
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 257, in train
    return self.train_actor(rollout_id, rollout_data)
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 348, in train_actor
    train(
  File "/root/slime/slime/backends/megatron_utils/model.py", line 430, in train
    loss_dict, grad_norm = train_one_step(
  File "/root/slime/slime/backends/megatron_utils/model.py", line 309, in train_one_step
    losses_reduced = forward_backward_func(
  File "/root/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 641, in forward_backward_no_pipelining
    config.finalize_model_grads_func(
  File "/root/Megatron-LM/megatron/core/distributed/finalize_model_grads.py", line 422, in finalize_model_grads
    model_chunk.finish_grad_sync()
  File "/root/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 588, in finish_grad_sync
    bucket_group.finish_grad_sync()
  File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 455, in finish_grad_sync
    self.start_grad_sync()
  File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 382, in start_grad_sync
    with stream_context, _coalescing_manager(communication_group, async_ops=async_op) as cm:
  File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2578, in _coalescing_manager
    work = group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)
RuntimeError: Backend nccl does not support reduce_scatter_tensor_coalesced
2025-11-04 20:40:53,923 ERR cli.py:74 -- Job 'raysubmit_sP8WG1jwJmGBS9Fy' failed
2025-11-04 20:40:53,923 INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1

shyringo · Nov 05, 2025

A similar problem with `group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)` is reported in https://github.com/modelscope/ms-swift/issues/6495.

shuoyinn · Nov 08, 2025