qwen-moe ZeRO-3 training hangs
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
Launched with torchrun:
train_bash.py --deepspeed ${deepspeed_config_file} \
    --stage pt \
    --template qwen \
    --model_name_or_path ${pretrained_model} \
    --do_train \
    --dataset_dir ${dataset_dir} \
    --finetuning_type full \
    --output_dir ${output_dir} \
    --cache_path ${data_cache} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --num_train_epochs ${num_train_epochs} \
    --logging_steps 10 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --logging_strategy steps \
    --save_steps 1000 \
    --plot_loss \
    --bf16 \
    --bf16_full_eval \
    --report_to wandb \
    --overwrite_cache \
    --overwrite_output_dir \
    --preprocessing_num_workers 128 \
    --flash_attn \
    --cutoff_len 8192 \
    --save_total_limit 3
DeepSpeed config:
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": 4.194304e+06,
    "stage3_prefetch_bucket_size": 3.774874e+06,
    "stage3_param_persistence_threshold": 2.048000e+04,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1e-10
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-06,
      "betas": [0.9, 0.999],
      "eps": 1e-08,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": 1.154500e+04,
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-06,
      "warmup_num_steps": 578
    }
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 576,
  "train_micro_batch_size_per_gpu": 3,
  "wall_clock_breakdown": false
}
Encountered the following:
0% 1/11545 [02:07<409:18:40, 127.64s/it]
0% 2/11545 [03:32<328:40:10, 102.50s/it]
0% 3/11545 [04:55<300:42:30, 93.79s/it]
0% 4/11545 [06:16<283:47:31, 88.52s/it]
0% 5/11545 [07:35<273:18:40, 85.26s/it]
0% 6/11545 [08:57<269:31:55, 84.09s/it]
0% 7/11545 [10:16<264:27:13, 82.51s/it][E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801028 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801028 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801078 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801185 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801412 milliseconds before timing out.
real-30:161:272 [0] NCCL INFO comm 0x55d52ef87f90 rank 7 nranks 48 cudaDev 7 busId e7000 - Abort COMPLETE
Traceback (most recent call last):
File "/LLaMA-Factory-main-0411/src/llmtuner/train/tuner.py", line 31, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "/LLaMA-Factory-main-0411/src/llmtuner/train/pt/workflow.py", line 217, in run_pt
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1849, in train
return inner_training_loop(
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3128, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 3151, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1352, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1210, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 932, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 850, in forward
current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 260, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 290, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 434, in __all_gather_params
self._all_gather_params(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 463, in _all_gather_params
handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1217, in all_gather_coalesced
handles = _dist_allgather_fn(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 93, in _dist_allgather_fn
return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 219, in all_gather_into_tensor
return self.all_gather_function(output_tensor=output_tensor,
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: NCCL communicator was aborted on rank 7. Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=323081, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800951 milliseconds before timing out.
real-30:158:537 [4] NCCL INFO [Service thread] Connection closed by localRank 7
Expected behavior
No response
System Info
python3.8, torch==2.0.1+cu117, transformers==4.40.0.dev0, trl==0.8.1, accelerate==0.27.2, deepspeed==0.14.0
The model loads and training enters the Trainer loop with batch_size=4, max length 8192, ZeRO stage 3, multi-node multi-GPU on A800s, GPU memory usage close to 80 GB.
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_NET_PLUGIN=none
Others
No response
Same problem here. It is an NCCL timeout caused by loading too much data. I set ddp_timeout=180000000, but it didn't seem to take effect. Did you solve it in the end?
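For reference, a minimal sketch of where that value has to end up (an assumption about the mechanism, not the LLaMA-Factory code path): transformers' --ddp_timeout is given in seconds and, as far as I understand, is only honored once it reaches the timeout argument of torch.distributed.init_process_group; the 1800000 ms in the watchdog message is the 30-minute default, so if the process group is created before the argument takes effect, the default stays in place.

# Hypothetical manual initialization just to illustrate the parameter;
# the HF Trainer / DeepSpeed normally initializes the process group itself.
from datetime import timedelta
import torch.distributed as dist

if not dist.is_initialized():
    # e.g. --ddp_timeout 7200 would correspond to a 2-hour collective timeout
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=7200))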
我的数据加载是先处理完cache存起来加载的,目前已经是进入训练阶段,根据错误提示,应该是卡在了all_gather阶段 补充一下: 一样的环境跑其他的模型,如qwen1.5非moe模型没有问题,可以正常训练
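In case it helps others hitting the same trace: with ZeRO-3, each expert's parameters are all-gathered lazily in a forward pre-hook (the fetch_sub_module call in the traceback), so when different ranks route tokens to different experts the ranks can issue mismatched collectives and one of them waits at all_gather forever; dense models like qwen1.5 have no such data-dependent branching, which would explain why they train fine. A minimal sketch of the commonly suggested workaround, assuming DeepSpeed >= 0.13 (which provides deepspeed.utils.set_z3_leaf_modules) and the transformers Qwen2MoeSparseMoeBlock class seen in the traceback:

# Sketch only: mark the sparse MoE block as a ZeRO-3 "leaf" so DeepSpeed gathers
# all of its experts at once instead of hooking each expert individually,
# keeping the collectives identical across ranks.
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

def patch_qwen2_moe_for_zero3(model):
    # Call once after the model is loaded and before trainer.train().
    set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
    return model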
Duplicate of https://github.com/hiyouga/LLaMA-Factory/issues/3147