[BUG] Multi-GPU fine-tuning hits SIGKILL and exits with no error message
As the title says: running /finetune/finetune_ds.sh on 4x A100 80GB hits SIGKILL with no obvious error message.
I tried reducing the batch size, but it still fails the same way.
========================================================
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING]
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] *****************************************
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] *****************************************
[2024-05-26 19:30:35,324] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,328] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,330] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,333] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-26 19:30:37,387] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,387] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-26 19:30:37,387] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,403] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,403] [INFO] [comm.py:637:init_distributed] cdb=None
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:01<00:00, 8.72s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:01<00:00, 8.72s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:00<00:00, 8.65s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:00<00:00, 8.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
{'Total': 8537092336, 'Trainable': 8537092336}
llm_type=minicpm
Loading data...
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xmyu/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09513616561889648 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10147547721862793 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.10201215744018555 seconds
Time to load fused_adam op: 0.10188674926757812 seconds
[2024-05-26 19:32:20,838] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2343054 closing signal SIGTERM
[2024-05-26 19:32:20,888] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2343057 closing signal SIGTERM
[2024-05-26 19:32:22,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 2343055) of binary: /home/xmyu/anaconda3/envs/MiniCPM-V/bin/python3.10
Traceback (most recent call last):
File "/home/xmyu/anaconda3/envs/MiniCPM-V/bin/torchrun", line 8, in
finetune.py FAILED
srun reports that an OOM was detected, but surely halving the batch size across 4 GPUs shouldn't still OOM? Thanks in advance. The torchrun invocation from finetune_ds.sh is below:
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true \
    --bf16_full_eval true \
    --do_train \
    --do_eval \
    --model_max_length 2048 \
    --max_steps 80000 \
    --eval_steps 200 \
    --output_dir output/output_minicpmv2 \
    --logging_dir output/output_minicpmv2 \
    --logging_strategy "steps" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 5e-7 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config_zero2.json \
    --report_to "wandb" # wandb
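Exit code -9 means the worker received SIGKILL; on a cluster node this is most often the kernel OOM killer hitting host RAM (for example while optimizer state is allocated), not GPU memory. A quick way to check on the compute node, assuming these standard tools are available there:

# Watch host RAM while the job starts up
watch -n 1 free -h
# After the crash, look for the kernel OOM killer in the system log
dmesg -T | grep -i "out of memory"
# GPU memory for comparison
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1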
@qyc-98
Hi, you can try our latest finetune script. For example, set the "device" field of "offload_optimizer": { "device": "cpu", "pin_memory": true } in ds_config_zero2.json to cpu and try the zero2 + offload strategy.
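For reference, a minimal sketch of the relevant zero_optimization block in ds_config_zero2.json with CPU optimizer offload enabled (the other fields of your existing config stay unchanged; this is not a full copy of the repo's config):

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}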
Hi, I changed it to cpu and now get this error:

Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7feb0b834dc0>
Traceback (most recent call last):
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

@Cuiunbo @qyc-98
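This AttributeError in DeepSpeedCPUAdam.__del__ usually means the optimizer's constructor never finished (for example because the cpu_adam extension failed to JIT-build, or the process was killed earlier), so ds_opt_adam was never assigned; it is typically a symptom rather than the root cause. A way to check, assuming the MiniCPM-V conda env is active:

# Show which DeepSpeed ops are compatible/installed in this environment
ds_report
# Optionally pre-compile the CPU Adam op instead of JIT-building it at launch
DS_BUILD_CPU_ADAM=1 pip install deepspeed --force-reinstall --no-cache-dir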
What are the versions of your deepspeed, pytorch, and transformers? My pytorch is 2.1.2+cu118, deepspeed 0.14.2, and transformers 4.40.2.
deepspeed 0.14.2, transformers 4.40.0, torch 2.1.2, torchvision 0.16.2
My torch version is 2.1.2+cu121.
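To make sure everyone is reporting versions from the same environment, a quick one-liner (run inside the MiniCPM-V env):

python -c "import torch, deepspeed, transformers; print(torch.__version__, deepspeed.__version__, transformers.__version__)"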