qwen-moe ZeRO-3 training hangs
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
Launched with torchrun:
train_bash.py --deepspeed ${deepspeed_config_file} \
    --stage pt \
    --template qwen \
    --model_name_or_path ${pretrained_model} \
    --do_train \
    --dataset_dir ${dataset_dir} \
    --finetuning_type full \
    --output_dir ${output_dir} \
    --cache_path ${data_cache} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --num_train_epochs ${num_train_epochs} \
    --logging_steps 10 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --logging_strategy steps \
    --save_steps 1000 \
    --plot_loss \
    --bf16 \
    --bf16_full_eval \
    --report_to wandb \
    --overwrite_cache \
    --overwrite_output_dir \
    --preprocessing_num_workers 128 \
    --flash_attn \
    --cutoff_len 8192 \
    --save_total_limit 3
DeepSpeed config:
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": 4.194304e+06,
    "stage3_prefetch_bucket_size": 3.774874e+06,
    "stage3_param_persistence_threshold": 2.048000e+04,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1e-10
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-06,
      "betas": [0.9, 0.999],
      "eps": 1e-08,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": 1.154500e+04,
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-06,
      "warmup_num_steps": 578
    }
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 576,
  "train_micro_batch_size_per_gpu": 3,
  "wall_clock_breakdown": false
}
Encountered the following:
0% 1/11545 [02:07<409:18:40, 127.64s/it]
0% 2/11545 [03:32<328:40:10, 102.50s/it]
0% 3/11545 [04:55<300:42:30, 93.79s/it]
0% 4/11545 [06:16<283:47:31, 88.52s/it]
0% 5/11545 [07:35<273:18:40, 85.26s/it]
0% 6/11545 [08:57<269:31:55, 84.09s/it]
0% 7/11545 [10:16<264:27:13, 82.51s/it][E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801028 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801028 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801078 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801185 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801412 milliseconds before timing out.
real-30:161:272 [0] NCCL INFO comm 0x55d52ef87f90 rank 7 nranks 48 cudaDev 7 busId e7000 - Abort COMPLETE
Traceback (most recent call last):
File "/LLaMA-Factory-main-0411/src/llmtuner/train/tuner.py", line 31, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "/LLaMA-Factory-main-0411/src/llmtuner/train/pt/workflow.py", line 217, in run_pt
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1849, in train
return inner_training_loop(
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3128, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3151, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1352, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1210, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 932, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 850, in forward
current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 260, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 290, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 434, in __all_gather_params
self._all_gather_params(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 463, in _all_gather_params
handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1217, in all_gather_coalesced
handles = _dist_allgather_fn(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 93, in _dist_allgather_fn
return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 219, in all_gather_into_tensor
return self.all_gather_function(output_tensor=output_tensor,
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: NCCL communicator was aborted on rank 7. Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
real-30:158:537 [4] NCCL INFO [Service thread] Connection closed by localRank 7
Expected behavior
No response
System Info
python3.8, torch==2.0.1+cu117, transformers==4.40.0.dev0, trl==0.8.1, accelerate==0.27.2, deepspeed==0.14.0
The model loads and training enters the Trainer loop with batch_size=4, max length 8192, ZeRO stage 3, multi-node multi-GPU on A800s, GPU memory usage close to 80 GB.
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_NET_PLUGIN=none
Others
No response
Same problem here. It is an NCCL timeout caused by loading too much data. I set ddp_timeout=180000000, but it didn't seem to take effect. Did you solve it in the end?
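For reference, a minimal sketch of where that value has to end up (an assumption about the mechanism, not the LLaMA-Factory code path): transformers' --ddp_timeout is given in seconds and, as far as I understand, is only honored once it reaches the timeout argument of torch.distributed.init_process_group; the 1800000 ms in the watchdog message is the 30-minute default, so if the process group is created before the argument takes effect, the default stays in place.

# Hypothetical manual initialization just to illustrate the parameter;
# the HF Trainer / DeepSpeed normally initializes the process group itself.
from datetime import timedelta
import torch.distributed as dist

if not dist.is_initialized():
    # e.g. --ddp_timeout 7200 would correspond to a 2-hour collective timeout
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=7200))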
我的数据加载是先处理完cache存起来加载的,目前已经是进入训练阶段,根据错误提示,应该是卡在了all_gather阶段 补充一下: 一样的环境跑其他的模型,如qwen1.5非moe模型没有问题,可以正常训练
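In case it helps others hitting the same trace: with ZeRO-3, each expert's parameters are all-gathered lazily in a forward pre-hook (the fetch_sub_module call in the traceback), so when different ranks route tokens to different experts the ranks can issue mismatched collectives and one of them waits at all_gather forever; dense models like qwen1.5 have no such data-dependent branching, which would explain why they train fine. A minimal sketch of the commonly suggested workaround, assuming DeepSpeed >= 0.13 (which provides deepspeed.utils.set_z3_leaf_modules) and the transformers Qwen2MoeSparseMoeBlock class seen in the traceback:

# Sketch only: mark the sparse MoE block as a ZeRO-3 "leaf" so DeepSpeed gathers
# all of its experts at once instead of hooking each expert individually,
# keeping the collectives identical across ranks.
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

def patch_qwen2_moe_for_zero3(model):
    # Call once after the model is loaded and before trainer.train().
    set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
    return model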
Duplicate of https://github.com/hiyouga/LLaMA-Factory/issues/3147