Firefly
Fine-tuning a llama2 model reports a WorkNCCL timeout
Environment: H800, 8 GPUs, 80 GB each. Model: chinese-alpaca-2-7b-hf. Configuration: "num_train_epochs": 3, "per_device_train_batch_size": 2, "gradient_accumulation_steps": 2, "learning_rate": 1e-4, "max_seq_length": 5120, "logging_steps": 300, "save_steps": 500, "save_total_limit": 1, "lr_scheduler_type": "constant_with_warmup", "warmup_steps": 3000, "lora_rank": 64, "lora_alpha": 16, "lora_dropout": 0.05, "gradient_checkpointing": true, "disable_tqdm": false, "optim": "paged_adamw_32bit", "seed": 42, "fp16": true, "report_to": "tensorboard", "dataloader_num_workers": 10, "save_strategy": "steps", "weight_decay": 0, "max_grad_norm": 0.3, "remove_unused_columns": false
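For orientation, a minimal sketch of how these keys typically split across Hugging Face `TrainingArguments` and a PEFT `LoraConfig` (this is an assumption about a standard Trainer + PEFT setup, not Firefly's actual train_qlora.py; `output_dir` is a hypothetical placeholder, and `max_seq_length` is consumed by the tokenization/packing step rather than `TrainingArguments`):

```python
from transformers import TrainingArguments
from peft import LoraConfig

# Training-loop hyperparameters from the JSON config above.
training_args = TrainingArguments(
    output_dir="output/chinese-alpaca-2-7b-qlora",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    logging_steps=300,
    save_steps=500,
    save_total_limit=1,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=3000,
    gradient_checkpointing=True,
    disable_tqdm=False,
    optim="paged_adamw_32bit",
    seed=42,
    fp16=True,
    report_to="tensorboard",
    dataloader_num_workers=10,
    save_strategy="steps",
    weight_decay=0.0,
    max_grad_norm=0.3,
    remove_unused_columns=False,
)

# LoRA-specific keys from the same config; max_seq_length=5120 would be
# applied when building the dataset, not here.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```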
With this setup, the following error is reported:
8%|▊ | 566/6834 [2:11:54<23:25:06, 13.45s/it]
8%|▊ | 567/6834 [2:12:07<23:28:42, 13.49s/it]
8%|▊ | 568/6834 [2:12:30<28:23:15, 16.31s/it][E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801014 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800634 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23290, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802445 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23289, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58946) of binary: /home/miniconda3/envs/firefly_transformer_new/bin/python
Traceback (most recent call last):
File "/home/miniconda3/envs/firefly_transformer_new/bin/torchrun", line 8, in
sys.exit(main())
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miniconda3/envs/firefly_transformer_new/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_qlora.py FAILED
Failures:
[1]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 58947)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58947
[2]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 58948)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58948
[3]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 58949)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58949
Root Cause (first observed failure):
[0]:
  time      : 2023-10-23_12:19:20
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 58946)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58946
When max_seq_length is changed to 4096 the problem goes away. How can this issue be resolved?
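Not a confirmed fix, but Timeout(ms)=1800000 in the log is the default 30-minute NCCL watchdog window, so one commonly suggested mitigation is to raise the distributed timeout and then check why one rank stalls at the longer sequence length (for example memory pressure or a single very slow step at max_seq_length=5120). A minimal sketch, assuming a standard Hugging Face Trainer setup (not a fix documented by Firefly; output_dir is a hypothetical placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/chinese-alpaca-2-7b-qlora",  # hypothetical path
    ddp_timeout=10800,  # seconds; default is 1800, i.e. Timeout(ms)=1800000 in the log
)

# If the script initializes the process group itself (under torchrun), the
# equivalent would be:
#   from datetime import timedelta
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10800))
```

Raising the timeout only keeps the job alive longer; if a rank is genuinely hung rather than slow, per-rank step time and memory usage at max_seq_length=5120 still need to be investigated.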
the same question