Megatron-LM
[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced, how could I solve it
**Your question**
/workspace/megatron/megatron/core/models/gpt/gpt_layer_specs.py:77: UserWarning: The fp8 argument in "get_gpt_layer_with_transformer_engine_spec" has been deprecated and will be removed soon. Please update your code accordingly.
warnings.warn(
[rank7]: Traceback (most recent call last):
[rank7]: File "/workspace/megatron/pretrain_gpt.py", line 300, in <module>
[rank7]: pretrain(
[rank7]: File "/workspace/megatron/megatron/training/training.py", line 386, in pretrain
[rank7]: iteration, num_floating_point_operations_so_far = train(
[rank7]: File "/workspace/megatron/megatron/training/training.py", line 1478, in train
[rank7]: train_step(forward_step_func,
[rank7]: File "/workspace/megatron/megatron/training/training.py", line 766, in train_step
[rank7]: losses_reduced = forward_backward_func(
[rank7]: File "/workspace/megatron/megatron/core/pipeline_parallel/schedules.py", line 1877, in forward_backward_pipelining_without_interleaving
[rank7]: config.finalize_model_grads_func(
[rank7]: File "/workspace/megatron/megatron/core/distributed/finalize_model_grads.py", line 225, in finalize_model_grads
[rank7]: model_chunk.finish_grad_sync()
[rank7]: File "/workspace/megatron/megatron/core/distributed/distributed_data_parallel.py", line 447, in finish_grad_sync
[rank7]: bucket_group.finish_grad_sync()
[rank7]: File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 368, in finish_grad_sync
[rank7]: self.start_grad_sync()
[rank7]: File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 306, in start_grad_sync
[rank7]: with stream_context, _coalescing_manager(communication_group, async_ops=async_op) as cm:
[rank7]: File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
[rank7]: next(self.gen)
[rank7]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2031, in _coalescing_manager
[rank7]: work = group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)
[rank7]: RuntimeError: Backend nccl does not support reduce_scatter_tensor_coalesced
[rank4]: Traceback (most recent call last):
@TeddLi, did you find the cause of this problem? I'm hitting the same issue.
Marking as stale. No activity in 60 days.
Any progress? I'm also encountering the same issue.
Run without `TORCH_DISTRIBUTED_DEBUG=DETAIL`. At that debug level, PyTorch wraps each process group in a checking wrapper that does not implement coalesced collectives such as `reduce_scatter_tensor_coalesced`, which is exactly the call Megatron's gradient buffer makes in `start_grad_sync`.
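If it helps anyone, here is a minimal pre-flight check you can run before launching training. The helper name is mine, not part of Megatron-LM; it only inspects the environment variable:

```python
import os
import warnings

def check_distributed_debug_env():
    """Return False and warn if TORCH_DISTRIBUTED_DEBUG=DETAIL is set:
    at that level PyTorch wraps each process group in a checking wrapper
    that rejects coalesced collectives like reduce_scatter_tensor_coalesced."""
    level = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    if level == "DETAIL":
        warnings.warn(
            "TORCH_DISTRIBUTED_DEBUG=DETAIL is set; unset it (or use OFF/INFO) "
            "before launching Megatron-LM training."
        )
        return False
    return True

# Simulate the failing environment, then the fixed one.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
print(check_distributed_debug_env())  # False
os.environ.pop("TORCH_DISTRIBUTED_DEBUG")
print(check_distributed_debug_env())  # True
```

Unsetting the variable in the launch script (or exporting `TORCH_DISTRIBUTED_DEBUG=OFF`) made the error go away for me.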
Same error here. Anyone willing to share a solution? Thanks.
Same error while slime was using Megatron to train a model. Detailed logs:
(MegatronTrainRayActor pid=57143) rollout 0: {'rollout/raw_reward': 0.46875, 'rollout/total_lengths': 6880.0625, 'rollout/response_lengths': 6724.4375, 'rollout/rewards': 3.725290298461914e-09, 'rollout/truncated': 0.515625, 'rollout/rollout_log_probs': -0.3078222069889307, 'rollout/ref_log_probs': -0.30862119793891907, 'rollout/log_probs': -0.30862119793891907, 'rollout/advantages': 8.381903171539307e-09, 'rollout/returns': 8.381903171539307e-09}
(SGLangEngine pid=19781) [2025-11-04 20:40:09] INFO: 33.212.71.18:55852 - "GET /health HTTP/1.1" 200 OK [repeated 8x across cluster]
(SGLangEngine pid=19781) [2025-11-04 20:40:19] INFO: 33.212.71.18:55366 - "GET /health HTTP/1.1" 200 OK [repeated 8x across cluster]
(SGLangEngine pid=19781) [2025-11-04 20:40:29] INFO: 33.212.71.18:47082 - "GET /health HTTP/1.1" 200 OK [repeated 8x across cluster]
(SGLangEngine pid=19781) [2025-11-04 20:40:39] INFO: 33.212.71.18:36594 - "GET /health HTTP/1.1" 200 OK [repeated 8x across cluster]
(MegatronTrainRayActor pid=57672) WARNING:megatron.core.rerun_state_machine:Implicit initialization of Rerun State Machine!
(SGLangEngine pid=19779) [2025-11-04 20:38:34 TP0] Decode batch. #running-req: 5, #token: 41357, token usage: 0.08, cuda graph: True, gen throughput (token/s): 437.63, #queue-req: 0, [repeated 3x across cluster]
(SGLangEngine pid=19786) [2025-11-04 20:38:34 TP1] Cache flushed successfully! [repeated 28x across cluster]
(MegatronTrainRayActor pid=57667) [rank14]:W1104 20:38:36.556000 57667 site-packages/torch/distributed/distributed_c10d.py:2960] _tensor_to_object size: 6065866 hash value: 5915022525618458291 [repeated 14x across cluster]
Traceback (most recent call last):
File "/root/slime/train.py", line 93, in
A similar problem with `group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)` is reported at https://github.com/modelscope/ms-swift/issues/6495.
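As a workaround sketch only (this is not Megatron's actual code path): the coalesced call can in principle be replaced by one plain reduce-scatter per bucket. The helper below takes the collective as a parameter so it can run without a multi-GPU setup; `fake_reduce_scatter` stands in for `torch.distributed.reduce_scatter_tensor` at world size 1, where the operation degenerates to a copy.

```python
import torch

def reduce_scatter_per_bucket(outputs, inputs, reduce_scatter_fn):
    """Run reduce_scatter_fn once per (output, input) bucket pair instead of
    batching the calls through torch.distributed's _coalescing_manager,
    which the wrapped backend rejects. Returns the per-call work handles."""
    handles = []
    for out, inp in zip(outputs, inputs):
        handles.append(reduce_scatter_fn(out, inp))
    return handles

def fake_reduce_scatter(out, inp):
    # Stand-in for torch.distributed.reduce_scatter_tensor at world size 1,
    # where reduce-scatter is just a copy of the single shard.
    out.copy_(inp)
    return None

outputs = [torch.zeros(4), torch.zeros(2)]
inputs = [torch.arange(4.0), torch.arange(2.0)]
reduce_scatter_per_bucket(outputs, inputs, fake_reduce_scatter)
print(outputs[0].tolist())  # [0.0, 1.0, 2.0, 3.0]
```

Issuing the collectives one at a time gives up the fusion the coalescing manager provides, so expect somewhat lower communication efficiency; unsetting `TORCH_DISTRIBUTED_DEBUG=DETAIL` is the cleaner fix.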