
Multi-GPU LoRA training times out

Open Louis-y-nlp opened this issue 1 year ago • 9 comments

Hello, when training on multiple V100 GPUs I always run into timeout errors; both the 4-GPU and 2-GPU setups fail. Single-GPU training doesn't seem to have this problem, but it is slow: fine-tuning on 50k samples takes about 12 hours.

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1805926 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805991 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Launch script

accelerate launch src/train_sft.py \
    --model_name_or_path ${model} \
    --do_train \
    --dataset my_dataset \
    --prompt_template alpaca \
    --finetuning_type lora --lora_target W_pack \
    --output_dir ${out_model} \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --auto_find_batch_size true --per_device_train_batch_size 16

default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/work/data/codes/LLaMA-Efficient-Tuning/deepspeed_config_stage2.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Louis-y-nlp avatar Jun 25 '23 03:06 Louis-y-nlp

Are you using nohup?

hiyouga avatar Jun 25 '23 03:06 hiyouga

No background process; I'm running it directly inside Docker.

Louis-y-nlp avatar Jun 25 '23 03:06 Louis-y-nlp

Try https://github.com/huggingface/accelerate/issues/223
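
For reference, a minimal sketch of one way to raise the process-group timeout, assuming the training script creates the Accelerator itself (with the HF Trainer, the --ddp_timeout training argument, in seconds, should achieve the same thing without code changes):

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout from the default 30 minutes to 4 hours,
# so a long logging/saving step on one rank does not abort the whole job.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=4))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])

Note this only postpones the timeout; if a rank is genuinely deadlocked, the job will still fail once the longer limit is reached.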

hiyouga avatar Jun 25 '23 03:06 hiyouga

Simply increasing the timeout doesn't seem to solve the problem. After some testing it turns out to be stuck at the logging step; presumably the other ranks hang while waiting for rank 0 to compute the loss, so for now I've set logging_steps to 1e9. The run log also looks odd: there are multiple progress bars. With logging_steps set to 20, the progress bars look like this:

  0%|▏                                                                                                                                                                        | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
  0%|▎                                                                                                                                                                        | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >>   Number of trainable parameters = 4,194,304
  0%|▎                                                                                                                                                                        | 2/1170 [00:54<8:55:09, 27.49s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s]
  1%|█▍                                                                                                                                                                      | 20/2343 [05:24<7:37:56, 11.83s/it]

Meanwhile, GPU utilization stays at 100% the whole time.

Louis-y-nlp avatar Jun 25 '23 07:06 Louis-y-nlp

Try disabling DeepSpeed and using a plain accelerate config.

hiyouga avatar Jun 25 '23 14:06 hiyouga

Still stuck at the logging step. The config YAML is as follows:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Also, yesterday, after setting logging_steps to infinity, it got stuck at a save step: it saved one checkpoint successfully and then hung, and after 7200 s (the timeout I had specified) it raised the same error.

RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Louis-y-nlp avatar Jun 26 '23 03:06 Louis-y-nlp

Try this config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <number of your GPUs>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

hiyouga avatar Jun 26 '23 04:06 hiyouga

It still gets stuck at the logging step.

[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >>   Number of trainable parameters = 4,194,304
  0%|▎                                                                                                                                                                        | 2/1170 [00:54<8:54:55, 27.48s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR    failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python                                                                                  api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module>                                          │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45  │
│ in main                                                                                          │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in      │
│ launch_command                                                                                   │
│                                                                                                  │
│   925 │   │   args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr   │
│   926 │   │   deepspeed_launcher(args)                                                           │
│   927 │   elif args.use_fsdp and not args.cpu:                                                   │
│ ❱ 928 │   │   multi_gpu_launcher(args)                                                           │
│   929 │   elif args.use_megatron_lm and not args.cpu:                                            │
│   930 │   │   multi_gpu_launcher(args)                                                           │
│   931 │   elif args.multi_gpu and not args.cpu:                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in      │
│ multi_gpu_launcher                                                                               │
│                                                                                                  │
│   624 │   )                                                                                      │
│   625 │   with patch_environment(**current_env):                                                 │
│   626 │   │   try:                                                                               │
│ ❱ 627 │   │   │   distrib_run.run(args)                                                          │
│   628 │   │   except Exception:                                                                  │
│   629 │   │   │   if is_rich_available() and debug:                                              │
│   630 │   │   │   │   console = get_console()                                                    │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run       │
│                                                                                                  │
│   782 │   │   )                                                                                  │
│   783 │                                                                                          │
│   784 │   config, cmd, cmd_args = config_from_args(args)                                         │
│ ❱ 785 │   elastic_launch(                                                                        │
│   786 │   │   config=config,                                                                     │
│   787 │   │   entrypoint=cmd,                                                                    │
│   788 │   )(*cmd_args)                                                                           │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in  │
│ __call__                                                                                         │
│                                                                                                  │
│   131 │   │   self._entrypoint = entrypoint                                                      │
│   132 │                                                                                          │
│   133 │   def __call__(self, *args):                                                             │
│ ❱ 134 │   │   return launch_agent(self._config, self._entrypoint, list(args))                    │
│   135                                                                                            │
│   136                                                                                            │
│   137 def _get_entrypoint_name(                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in  │
│ launch_agent                                                                                     │
│                                                                                                  │
│   247 │   │   │   # if the error files for the failed children exist                             │
│   248 │   │   │   # @record will copy the first error (root cause)                               │
│   249 │   │   │   # to the error file of the launcher process.                                   │
│ ❱ 250 │   │   │   raise ChildFailedError(                                                        │
│   251 │   │   │   │   name=entrypoint_name,                                                      │
│   252 │   │   │   │   failures=result.failures,                                                  │
│   253 │   │   │   )                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError: 
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 61323)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 61322)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61322
======================================================

Louis-y-nlp avatar Jun 26 '23 06:06 Louis-y-nlp

Turn off NCCL synchronization.

shaonianyr avatar Jun 26 '23 08:06 shaonianyr

After adding NCCL_P2P_DISABLE=1 it crashes on the very first step @shaonianyr

Louis-y-nlp avatar Jun 27 '23 02:06 Louis-y-nlp

@Louis-y-nlp did you get multi-GPU fine-tuning to run?

wuxiuxiunlp avatar Jul 06 '23 10:07 wuxiuxiunlp

No, it still keeps hanging inside Docker.

Louis-y-nlp avatar Jul 06 '23 12:07 Louis-y-nlp

Simply increasing the timeout doesn't seem to solve the problem. After some testing it turns out to be stuck at the logging step [...] (the earlier comment and its logs, quoted in full)

@Louis-y-nlp How do you set the timeout value with accelerate launch?

GitYCC avatar Aug 02 '23 12:08 GitYCC

@GitYCC

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))
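
For context, if you patch this into the entry script yourself, the call has to run before the Trainer / Accelerate initializes the process group; a rough sketch (the guard and the 7200-second value are illustrative, not the repository's own code):

import datetime

import torch.distributed as dist

# Must execute before Trainer/Accelerate set up distributed training; if the
# process group is already initialized, their own init is skipped and this
# longer timeout stays in effect.
if dist.is_available() and not dist.is_initialized():
    dist.init_process_group(backend="nccl",
                            timeout=datetime.timedelta(seconds=7200))

On recent transformers versions, passing --ddp_timeout 7200 as a training argument is the simpler, code-free way to get the same effect.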

Louis-y-nlp avatar Aug 03 '23 02:08 Louis-y-nlp

@Louis-y-nlp may I ask whether you got multi-GPU fine-tuning working?

thugbobby avatar Aug 03 '23 07:08 thugbobby

No. Multi-GPU keeps hanging, and since there's no error message at all I have no idea how to debug it; it only runs on a single GPU.

Louis-y-nlp avatar Aug 03 '23 07:08 Louis-y-nlp

it only runs on a single GPU

So have you found any other workaround? I've tried several and none of them worked.

thugbobby avatar Aug 03 '23 07:08 thugbobby

I pulled the latest code and it runs now.

Louis-y-nlp avatar Aug 08 '23 09:08 Louis-y-nlp

You're amazingly fast, online 24/7 at full throttle.

Louis-y-nlp avatar Aug 08 '23 09:08 Louis-y-nlp

Are you using nohup?

Hi, I ran into the same problem. I used nohup to keep training running in the background; what could be causing this? Specifically, I used nohup to run a DeepSpeed training script in the background, and after roughly 1000+ steps it reported the error: Connection closed by localRank -1 and then stopped.

TianRuiHe avatar Jan 11 '24 14:01 TianRuiHe

Are you using nohup?

Just to ask: does this problem occur whenever nohup is used?

homiec avatar Feb 26 '24 02:02 homiec

init_process_group

Where should this be added?

etoilestar avatar Mar 01 '24 11:03 etoilestar

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))

Where does this go? I set --ddp_time; the dataset loads fine the first time, but at runtime data_tokenizer has to be loaded twice, and the second load throws an error. (screenshot attached)

yawzhe avatar Mar 18 '24 11:03 yawzhe

Small datasets are fine; large datasets time out. It's most likely stuck on the "tokenizer on dataset" step. If that's the case, setting --preprocessing_num_workers 128 solves it.
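
For anyone hitting this later: the usual Hugging Face pattern behind that wait looks roughly like the sketch below (illustrative names and paths, not this repository's exact code). Rank 0 tokenizes and caches the dataset while the other ranks block on a barrier, so a large dataset plus the default 30-minute timeout kills the job; more preprocessing workers (or a larger --ddp_timeout) is what avoids it.

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments

# Illustrative setup; the model name, data file and column are placeholders.
training_args = TrainingArguments(output_dir="out", ddp_timeout=7200)
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
raw_dataset = load_dataset("json", data_files="my_dataset.json")["train"]

def tokenize_fn(examples):
    return tokenizer(examples["instruction"])

# Rank 0 enters first, tokenizes and writes the Arrow cache; the other ranks
# wait at a barrier and then read the cached result. On large datasets this
# wait is the step that can exceed the NCCL timeout.
with training_args.main_process_first(desc="tokenize dataset"):
    tokenized = raw_dataset.map(
        tokenize_fn,
        batched=True,
        num_proc=128,  # i.e. --preprocessing_num_workers 128
    )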

JerryDaHeLian avatar Mar 20 '24 01:03 JerryDaHeLian