Firefly
Fine-tuning a llama2 model reports a WorkNCCL timeout
Environment: H800, 8 GPUs, 80 GB each. Model: chinese-alpaca-2-7b-hf. Configuration: "num_train_epochs": 3, "per_device_train_batch_size": 2, "gradient_accumulation_steps": 2, "learning_rate": 1e-4, "max_seq_length": 5120, "logging_steps": 300, "save_steps": 500, "save_total_limit": 1, "lr_scheduler_type": "constant_with_warmup", "warmup_steps": 3000, "lora_rank": 64, "lora_alpha": 16, "lora_dropout": 0.05, "gradient_checkpointing": true, "disable_tqdm": false, "optim": "paged_adamw_32bit", "seed": 42, "fp16": true, "report_to": "tensorboard", "dataloader_num_workers": 10, "save_strategy": "steps", "weight_decay": 0, "max_grad_norm": 0.3, "remove_unused_columns": false
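For orientation, a minimal sketch of how these keys typically split across Hugging Face `TrainingArguments` and a PEFT `LoraConfig` (this is an assumption about a standard Trainer + PEFT setup, not Firefly's actual train_qlora.py; `output_dir` is a hypothetical placeholder, and `max_seq_length` is consumed by the tokenization/packing step rather than `TrainingArguments`):

```python
from transformers import TrainingArguments
from peft import LoraConfig

# Training-loop hyperparameters from the JSON config above.
training_args = TrainingArguments(
    output_dir="output/chinese-alpaca-2-7b-qlora",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    logging_steps=300,
    save_steps=500,
    save_total_limit=1,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=3000,
    gradient_checkpointing=True,
    disable_tqdm=False,
    optim="paged_adamw_32bit",
    seed=42,
    fp16=True,
    report_to="tensorboard",
    dataloader_num_workers=10,
    save_strategy="steps",
    weight_decay=0.0,
    max_grad_norm=0.3,
    remove_unused_columns=False,
)

# LoRA-specific keys from the same config; max_seq_length=5120 would be
# applied when building the dataset, not here.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```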
With this setup, the following error is reported:
8%|▊ | 566/6834 [2:11:54<23:25:06, 13.45s/it]
8%|▊ | 567/6834 [2:12:07<23:28:42, 13.49s/it]
8%|▊ | 568/6834 [2:12:30<28:23:15, 16.31s/it][E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801014 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800634 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802445 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23289, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58946) of binary: /home/miniconda3/envs/firefly_transformer_new/bin/python
Traceback (most recent call last):
File "/home/miniconda3/envs/firefly_transformer_new/bin/torchrun", line 8, in
sys.exit(main())
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_qlora.py FAILED
Failures:
[1]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 58947)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58947
[2]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 58948)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58948
[3]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 58949)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58949
Root Cause (first observed failure):
[0]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 58946)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58946
When max_seq_length is changed to 4096 the problem goes away. How can this issue be resolved?
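Not a confirmed fix, but Timeout(ms)=1800000 in the log is the default 30-minute NCCL watchdog window, so one commonly suggested mitigation is to raise the distributed timeout and then check why one rank stalls at the longer sequence length (for example memory pressure or a single very slow step at max_seq_length=5120). A minimal sketch, assuming a standard Hugging Face Trainer setup (not a fix documented by Firefly; output_dir is a hypothetical placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/chinese-alpaca-2-7b-qlora",  # hypothetical path
    ddp_timeout=10800,  # seconds; default is 1800, i.e. Timeout(ms)=1800000 in the log
)

# If the script initializes the process group itself (under torchrun), the
# equivalent would be:
#   from datetime import timedelta
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10800))
```

Raising the timeout only keeps the job alive longer; if a rank is genuinely hung rather than slow, per-rank step time and memory usage at max_seq_length=5120 still need to be investigated.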
the same question