Training suddenly fails midway with "NCCL watchdog thread terminated with exception"
Describe the bug
While fine-tuning the MiniCPM-v-2.6 model with the swift sft command, training suddenly fails midway with:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
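The error means the job kept waiting for one GPU to finish its computation before the all_reduce, but got stuck on that GPU (its computation never completed) and eventually timed out. If the data were bad, though, the loading stage should be able to skip the problematic samples. How can a hang like this, where a GPU is stuck and cannot finish computing, be resolved?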
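To make the watchdog message above easier to compare across ranks, here is a minimal sketch that pulls the rank, sequence number, collective type and timeout out of such a line. The regular expression is written against the exact line quoted in this report and is an assumption about the format; other PyTorch versions may word the message differently. The function name `parse_watchdog_line` is made up for illustration.

```python
import re

# Pattern written against the watchdog line quoted in this issue (assumed format).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), .*?"
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the timeout details from one watchdog log line, or None if absent."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        "timeout_min": int(m["timeout_ms"]) / 60000,  # milliseconds -> minutes
        "ran_ms": int(m["ran_ms"]),
    }


# The line from the report, copied verbatim.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with "
    "exception: [Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, "
    "NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds "
    "before timing out."
)

if __name__ == "__main__":
    print(parse_watchdog_line(LOG_LINE))
```

Running this over the logs of every rank and comparing `seq_num` may help spot a rank that reports a different collective, or none at all, which is usually the one worth inspecting first.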
My command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
--model_type minicpm-v-v2_6-chat \
--model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
--sft_type lora \
--dataset xxx.json \
--save_steps 50 \
--val_dataset xxx.json \
--deepspeed default-zero2
torch version: 2.1.2+cu118
Midway through training:
This is rather odd. How could it block for 30 minutes without receiving any data? Run
py-spy dump --pid xxx
to see where each process is blocked.
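Since the dump has to be taken for every training process, a small helper can save some typing. This is a sketch only: it assumes `py-spy` is installed and on `PATH`, that you already know the PIDs (the ones in the usage line are placeholders), and that you have permission to attach to them, which may require elevated privileges. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_dump_commands(pids):
    """Return one `py-spy dump --pid <pid>` argument list per process id."""
    return [["py-spy", "dump", "--pid", str(pid)] for pid in pids]


def dump_all(pids):
    """Run py-spy on each pid and return {pid: captured text}."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH")
    dumps = {}
    for pid, cmd in zip(pids, build_dump_commands(pids)):
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Keep stderr when the dump fails so permission errors are visible.
        dumps[pid] = result.stdout or result.stderr
    return dumps


if __name__ == "__main__":
    # Placeholder PIDs; replace with the PIDs of your training processes.
    for cmd in build_dump_commands([1234, 5678]):
        print(" ".join(cmd))
```

Taking the dumps from all ranks at roughly the same moment makes it easier to see which rank is sitting somewhere different from the others.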
SIZE_FACTOR=28 MAX_PIXELS=100352 NFRAMES=24 \
swift sft \
--model_type qwen2-vl-7b-instruct \
--model_id_or_path Qwen2-VL-7B-Instruct \
--sft_type full \
--freeze_vit false \
--max_length 2048 \
--lazy_tokenize true \
--gradient_accumulation_step 2 \
--batch_size 1 \
--num_train_epochs 1 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.05 \
--save_steps 200 \
--logging_steps 1 \
--dataloader_num_workers 8 \
--dataset qwen2-vl-val.jsonl \
--dataset_test_ratio 0.005 \
--output_dir qwen2-vl-7b-20240912 \
--deepspeed default-zero2
I am hitting the same problem with the command above.
The py-spy dumps of the processes fall mainly into two kinds:
Process 175053: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 175053 (active): "MainThread"
synchronize (torch/cuda/__init__.py:792)
synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
independent_gradient_partition_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:764)
overlapping_partition_gradients_reduce_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:863)
allreduce_gradients (deepspeed/runtime/engine.py:1912)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (deepspeed/runtime/engine.py:1993)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:2151)
training_step (transformers/trainer.py:3452)
_inner_training_loop (transformers/trainer.py:2326)
train (transformers/trainer.py:1991)
train (swift/trainers/mixin.py:426)
llm_sft (swift/llm/sft.py:413)
x_main (swift/utils/run_utils.py:32)
<module> (swift/cli/sft.py:5)
Thread 175383 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 175888 (idle): "Thread-2"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 176363 (idle): "Thread-3 (_pin_memory_loop)"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
_poll (multiprocessing/connection.py:424)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:31)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 176488 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Process 177063: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 177063 (idle): "MainThread"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
_poll (multiprocessing/connection.py:424)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
_worker_loop (torch/utils/data/_utils/worker.py:275)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_launch (multiprocessing/popen_fork.py:71)
__init__ (multiprocessing/popen_fork.py:19)
_Popen (multiprocessing/context.py:281)
_Popen (multiprocessing/context.py:224)
start (multiprocessing/process.py:121)
__init__ (torch/utils/data/dataloader.py:1040)
_get_iterator (torch/utils/data/dataloader.py:387)
__iter__ (torch/utils/data/dataloader.py:439)
__iter__ (accelerate/data_loader.py:451)
_inner_training_loop (transformers/trainer.py:2284)
train (transformers/trainer.py:1991)
train (swift/trainers/mixin.py:426)
llm_sft (swift/llm/sft.py:413)
x_main (swift/utils/run_utils.py:32)
<module> (swift/cli/sft.py:5)
Thread 177190 (idle): "QueueFeederThread"
wait (threading.py:320)
_feed (multiprocessing/queues.py:231)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 177191 (idle): "Thread-3 (_serve)"
accept (socket.py:293)
accept (multiprocessing/connection.py:609)
accept (multiprocessing/connection.py:463)
_serve (multiprocessing/resource_sharer.py:138)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
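Reading the two dumps above: the first process is active in `torch.cuda.synchronize`, called from DeepSpeed's gradient-reduce epilogue, while the second has `_worker_loop` in its main thread and so looks like a DataLoader worker waiting on its queue rather than a training rank. With eight ranks plus workers, comparing dumps by eye gets tedious, so here is a rough sketch that extracts the innermost `MainThread` frame per process. It assumes the text layout shown above; `SAMPLE` is an abbreviated excerpt of those dumps with the command line shortened, and the function name is made up.

```python
import re

PROCESS_RE = re.compile(r"^Process (\d+):")
THREAD_RE = re.compile(r'^Thread \d+ \((active|idle)\): "(.+)"')
FRAME_RE = re.compile(r"^(\S+) \((.+)\)$")


def innermost_main_frames(dump_text):
    """Map each pid to (thread state, innermost frame) of its MainThread."""
    result = {}
    pid = None
    state = None
    want_frame = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        m = PROCESS_RE.match(line)
        if m:
            pid = int(m.group(1))
            want_frame = False
            continue
        m = THREAD_RE.match(line)
        if m:
            state = m.group(1)
            # Only the first frame after the MainThread header is of interest.
            want_frame = m.group(2) == "MainThread"
            continue
        if want_frame and pid is not None and FRAME_RE.match(line):
            result[pid] = (state, line)
            want_frame = False
    return result


# Abbreviated excerpt of the dumps in this thread (command line shortened).
SAMPLE = """Process 175053: /usr/local/bin/python -u sft.py
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 175053 (active): "MainThread"
synchronize (torch/cuda/__init__.py:792)
synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
Thread 175383 (idle): "Thread-1"
wait (threading.py:324)
Process 177063: /usr/local/bin/python -u sft.py
Python v3.10.14 (/usr/local/bin/python3.10)
Thread 177063 (idle): "MainThread"
select (selectors.py:416)
wait (multiprocessing/connection.py:931)
"""

if __name__ == "__main__":
    for pid, (state, frame) in innermost_main_frames(SAMPLE).items():
        print(pid, state, frame)
```

If most ranks report the same innermost frame and one rank reports something else, that odd rank is a reasonable place to start looking.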
Same problem here. With --freeze_vit false the training hangs; with --freeze_vit true it trains normally.
https://github.com/modelscope/ms-swift/pull/2114
I pulled the latest code and upgraded to transformers==4.45.0 and accelerate==0.34.2, but training still hangs.
Train: 0%| | 0/40340 [00:00<?, ?it/s][WARNING:swift] Current length of row(2130) is larger than the max_length(2048), deleted.
[WARNING:swift] Current length of row(3365) is larger than the max_length(2048), deleted.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[ERROR:swift] Error occurs in lazy tokenize: File not found: /mnt_wg/zhoumo.xjq/TDS1M/video/335337510318.mp4
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
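The log above also contains a "File not found" error raised during lazy tokenization. Whether that is related to the hang is not established in this thread, but missing media files can be ruled out before training with a quick scan. The sketch below assumes a JSONL dataset in which each row lists its media paths under `videos` or `images`; those key names and the helper name are assumptions, so adjust them to your own dataset layout.

```python
import json
import os


def find_missing_media(jsonl_path, media_keys=("videos", "images")):
    """Return (line_number, path) pairs for referenced media files that do not exist."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            for key in media_keys:  # assumed field names
                value = row.get(key) or []
                if isinstance(value, str):
                    value = [value]
                for path in value:
                    if isinstance(path, str) and not os.path.exists(path):
                        missing.append((lineno, path))
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_media.py <dataset.jsonl>
    for lineno, path in find_missing_media(sys.argv[1]):
        print(f"line {lineno}: missing {path}")
```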
Could you run pip list | grep swift and share the output?
root@dlcprsc93a7i8zci-master-0:~# pip show ms-swift
Name: ms-swift
Version: 2.5.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /root/swift
Editable project location: /root/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, dacite, datasets, einops, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, oss2, pandas, peft, requests, rouge, safetensors, tensorboard, tqdm, transformers, transformers_stream_generator, trl
Required-by:
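This looks like the 2.x line of swift; shouldn't you switch to 3.0 or later? Also, the likely cause is that one batch contains text-only data, so nothing flows into the vision encoder even though it is being trained. That rank then falls out of sync with the other ranks (which do have image data), and NCCL blocks.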
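One way to check whether this explanation could apply to a given dataset is to look for rows that carry no image or video at all. The following is a minimal sketch, assuming a JSONL dataset whose rows reference media under `images` or `videos`; those key names and the function name are assumptions for illustration, not something confirmed in this thread.

```python
import json


def text_only_rows(jsonl_path, media_keys=("images", "videos")):
    """Return the 1-based line numbers of rows that reference no media."""
    rows = []
    with open(jsonl_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            # A row counts as text-only when every assumed media field is
            # missing or empty.
            if not any(row.get(key) for key in media_keys):
                rows.append(lineno)
    return rows


if __name__ == "__main__":
    import sys

    # Usage: python find_text_only.py <dataset.jsonl>
    found = text_only_rows(sys.argv[1])
    print(f"{len(found)} text-only rows, first few: {found[:10]}")
```

If the list is non-empty and the vision tower is trainable, the desynchronization described above is at least plausible; an empty list suggests looking elsewhere.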
@Jintao-Huang
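The same problem occurs when training a very-long-context text model: the error is raised at some training steps.
The training command is: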
deepspeed --hostfile=/etc/mpi/hostfile \
swift/cli/sft.py \
--model $PRETRAIN_MODEL \
--torch_dtype bfloat16 \
--train_type full \
--use_chat_template \
--dataset $data_path \
--packing true \
--num_train_epochs 3 \
--per_device_train_batch_size $per_node_bsz \
--data_seed 42 \
--weight_decay 0.1 \
--learning_rate 1e-5 \
--attn_impl flash_attn \
--deepspeed zero3 \
--gradient_accumulation_steps $gradient_accumulation_steps \
--warmup_ratio 0.01 \
--dataset_num_proc 8 \
--system "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
--save_total_limit 5 \
--save_strategy epoch \
--eval_strategy no \
--max_length 131072 \
--truncation_strategy delete \
--split_dataset_ratio 0 \
--output_dir $output_dir \
--use_liger_kernel true \
--lazy_tokenize true \
--use_hf
Screenshot of the bug:
Following this thread.