
Sudden error mid-training: NCCL watchdog thread terminated with exception

Open Wuyingwen opened this issue 1 year ago • 9 comments

Describe the bug
When fine-tuning MiniCPM-V-2.6 with the `swift sft` command, training suddenly fails partway through:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out. terminate called after throwing an instance of 'std::runtime_error'

(screenshot)

The error means the all_reduce keeps waiting for one GPU to finish its computation, but that GPU is stuck and never produces the data, so the collective eventually times out. If the data itself were bad, the faulty samples should simply be skipped at the loading stage; how can this kind of hang, where the computation gets stuck on a GPU, be resolved?

My command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2

torch version: 2.1.2+cu118. Mid-training: (screenshot)
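
For context, the Timeout(ms)=1800000 in the traceback is torch.distributed's default 30-minute NCCL collective timeout. The sketch below (an illustration of where the number comes from in plain PyTorch, not an ms-swift option) shows how it would be raised; note that if the ranks are genuinely out of step, as diagnosed later in this thread, a longer timeout only postpones the failure:

    # Sketch only: torch.distributed's NCCL process group defaults to a
    # 30-minute collective timeout (timedelta(minutes=30) == 1800000 ms).
    # Assumes the usual launcher-provided env vars (MASTER_ADDR, MASTER_PORT,
    # RANK, WORLD_SIZE), e.g. when started via torchrun.
    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(hours=2),  # raises the default 30-minute watchdog limit
    )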

Wuyingwen avatar Aug 26 '24 06:08 Wuyingwen

This is quite strange: how could a rank block for 30 minutes without ever getting its data? Run `py-spy dump --pid xxx` to see where each process is blocked.
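
A small helper for collecting those dumps from every rank on a node (a sketch only; the swift/cli/sft.py match pattern is an assumption based on the process listings later in this thread):

    # Run `py-spy dump` against each training process so the blocked stacks
    # of all ranks can be compared side by side.
    import subprocess

    def dump_all_ranks(pattern: str = "swift/cli/sft.py") -> None:
        pids = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True, check=False
        ).stdout.split()
        for pid in pids:
            print(f"===== py-spy dump --pid {pid} =====")
            subprocess.run(["py-spy", "dump", "--pid", pid], check=False)

    if __name__ == "__main__":
        dump_all_ranks()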

tastelikefeet avatar Aug 28 '24 09:08 tastelikefeet

SIZE_FACTOR=28 MAX_PIXELS=100352 NFRAMES=24 \
swift sft \
    --model_type qwen2-vl-7b-instruct \
    --model_id_or_path Qwen2-VL-7B-Instruct \
    --sft_type full \
    --freeze_vit false \
    --max_length 2048 \
    --lazy_tokenize true \
    --gradient_accumulation_step 2 \
    --batch_size 1 \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.05 \
    --save_steps 200 \
    --logging_steps 1 \
    --dataloader_num_workers 8 \
    --dataset qwen2-vl-val.jsonl \
    --dataset_test_ratio 0.005 \
    --output_dir qwen2-vl-7b-20240912 \
    --deepspeed default-zero2

Ran into the same problem.

yunkchen avatar Sep 14 '24 07:09 yunkchen

> (quotes the command and "Ran into the same problem" from the previous comment)

The py-spy dumps of the processes mainly show two kinds of results:

Process 175053: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 175053 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:792)
    synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
    independent_gradient_partition_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:764)
    overlapping_partition_gradients_reduce_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:863)
    allreduce_gradients (deepspeed/runtime/engine.py:1912)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/engine.py:1993)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2151)
    training_step (transformers/trainer.py:3452)
    _inner_training_loop (transformers/trainer.py:2326)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 175383 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 175888 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176363 (idle): "Thread-3 (_pin_memory_loop)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:31)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176488 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 177063: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 177063 (idle): "MainThread"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    _worker_loop (torch/utils/data/_utils/worker.py:275)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    _Popen (multiprocessing/context.py:224)
    start (multiprocessing/process.py:121)
    __init__ (torch/utils/data/dataloader.py:1040)
    _get_iterator (torch/utils/data/dataloader.py:387)
    __iter__ (torch/utils/data/dataloader.py:439)
    __iter__ (accelerate/data_loader.py:451)
    _inner_training_loop (transformers/trainer.py:2284)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 177190 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 177191 (idle): "Thread-3 (_serve)"
    accept (socket.py:293)
    accept (multiprocessing/connection.py:609)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

yunkchen avatar Sep 14 '24 07:09 yunkchen

> (quotes the command and "Ran into the same problem" above)

Same problem here. With --freeze_vit false the training hangs; with --freeze_vit true it trains normally.

Nioolek avatar Sep 18 '24 09:09 Nioolek

> (quotes the command and the --freeze_vit observation above)

https://github.com/modelscope/ms-swift/pull/2114

Jintao-Huang avatar Sep 24 '24 10:09 Jintao-Huang

> (quotes the command, the --freeze_vit observation, and #2114 above)

After pulling the latest code and upgrading to transformers==4.45.0 and accelerate==0.34.2, training still hangs:

Train:   0%|          | 0/40340 [00:00<?, ?it/s][WARNING:swift] Current length of row(2130) is larger than the max_length(2048), deleted.
[WARNING:swift] Current length of row(3365) is larger than the max_length(2048), deleted.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[ERROR:swift] Error occurs in lazy tokenize: File not found: /mnt_wg/zhoumo.xjq/TDS1M/video/335337510318.mp4
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.

yunkchen avatar Sep 26 '24 07:09 yunkchen

Run `pip list | grep swift` and share the output.

Jintao-Huang avatar Sep 26 '24 07:09 Jintao-Huang

> Run `pip list | grep swift` and share the output.

root@dlcprsc93a7i8zci-master-0:~# pip show ms-swift
Name: ms-swift
Version: 2.5.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /root/swift
Editable project location: /root/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, dacite, datasets, einops, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, oss2, pandas, peft, requests, rouge, safetensors, tensorboard, tqdm, transformers, transformers_stream_generator, trl
Required-by:

yunkchen avatar Sep 26 '24 07:09 yunkchen

> (quotes the `pip show ms-swift` output above)

That looks like swift 2.x; you probably need to switch to 3.0 or later. As for the cause, the real problem is likely that some batch consists of pure-text data, so no data flows into the vision encoder; but since the vision encoder still needs to be trained, that rank falls out of sync with the other ranks (which do have image data), and NCCL blocks.
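
The usual workaround for that failure mode (a sketch under assumptions, not ms-swift's actual fix; see #2114 for that) is to push a dummy image through the vision tower on text-only batches with zero loss weight, so every rank issues the same gradient all-reduces:

    # Sketch only: keep the vision tower's collectives in lockstep across ranks.
    # `model.visual`, the dict-style batch, and the 224x224 input shape are
    # assumptions for illustration; real models derive the shape from their
    # image processor.
    import torch

    def loss_with_vit_sync(model, batch, loss):
        if batch.get("pixel_values") is None:  # text-only batch on this rank
            dummy = torch.zeros(1, 3, 224, 224, device=loss.device, dtype=loss.dtype)
            vit_out = model.visual(dummy)      # dummy forward through the ViT
            loss = loss + 0.0 * vit_out.sum()  # value unchanged, but the ViT
                                               # parameters now receive (zero) grads
        return loss

This matches the py-spy dump earlier in the thread, where one rank sits blocked inside DeepSpeed's allreduce_gradients waiting for the others.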

zsxm1998 avatar Mar 07 '25 15:03 zsxm1998

@Jintao-Huang The same problem occurs when training an ultra-long-context model; some training steps fail with this error. Training command:

deepspeed --hostfile=/etc/mpi/hostfile \
    swift/cli/sft.py \
    --model $PRETRAIN_MODEL \
    --torch_dtype bfloat16 \
    --train_type full \
    --use_chat_template \
    --dataset $data_path \
    --packing true \
    --num_train_epochs 3 \
    --per_device_train_batch_size $per_node_bsz \
    --data_seed 42 \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --attn_impl flash_attn \
    --deepspeed zero3 \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --warmup_ratio 0.01 \
    --dataset_num_proc 8 \
    --system "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    --save_total_limit 5 \
    --save_strategy epoch \
    --eval_strategy no \
    --max_length 131072 \
    --truncation_strategy delete \
    --split_dataset_ratio 0 \
    --output_dir $output_dir \
    --use_liger_kernel true \
    --lazy_tokenize true \
    --use_hf

Bug screenshot:

(screenshot)

leileilin avatar May 06 '25 12:05 leileilin

> (quotes the long-context training report above)

ShuoSIr7 avatar May 30 '25 03:05 ShuoSIr7