
[BUG] Multi-GPU fine-tuning hits SIGKILL and exits with no error message

Open Stardust-y opened this issue 1 year ago • 4 comments

As the title says: running /finetune/finetune_ds.sh on 4x A100 80GB hits a SIGKILL with no obvious error message.

Stardust-y avatar May 26 '24 11:05 Stardust-y

I tried reducing the batch size, but it still fails.

Stardust-y avatar May 26 '24 11:05 Stardust-y

```
========================================================
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING]
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] *****************************************
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] *****************************************
[2024-05-26 19:30:35,324] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,328] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,330] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-26 19:30:35,333] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-26 19:30:37,387] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,387] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-26 19:30:37,387] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,403] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 19:30:37,403] [INFO] [comm.py:637:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████| 7/7 [01:01<00:00, 8.72s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [01:01<00:00, 8.72s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [01:00<00:00, 8.65s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [01:00<00:00, 8.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
{'Total': 8537092336, 'Trainable': 8537092336}
llm_type=minicpm
Loading data...
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/xmyu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xmyu/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09513616561889648 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10147547721862793 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.10201215744018555 seconds
Time to load fused_adam op: 0.10188674926757812 seconds
[2024-05-26 19:32:20,838] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2343054 closing signal SIGTERM
[2024-05-26 19:32:20,888] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2343057 closing signal SIGTERM
[2024-05-26 19:32:22,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 2343055) of binary: /home/xmyu/anaconda3/envs/MiniCPM-V/bin/python3.10
Traceback (most recent call last):
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED
```

Stardust-y avatar May 26 '24 11:05 Stardust-y

srun reports that an OOM was detected, but would it really still OOM on 4 GPUs with the batch size cut in half? Thanks for the reply. This is the command I'm running:

```bash
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true \
    --bf16_full_eval true \
    --do_train \
    --do_eval \
    --model_max_length 2048 \
    --max_steps 80000 \
    --eval_steps 200 \
    --output_dir output/output_minicpmv2 \
    --logging_dir output/output_minicpmv2 \
    --logging_strategy "steps" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 5e-7 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config_zero2.json \
    --report_to "wandb" # wandb
```

Stardust-y avatar May 26 '24 11:05 Stardust-y
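
For context, exit code -9 means the worker was killed with SIGKILL. On Linux this usually comes from the kernel OOM killer when host RAM (not GPU memory) runs out, for example while all four ranks load the full checkpoint into CPU memory at the same time. A quick, generic way to confirm this on the compute node is sketched below; these are standard Linux commands, nothing specific to MiniCPM-V or this script:

```bash
# Look for OOM-killer entries in the kernel log around the crash time
sudo dmesg -T | grep -iE "out of memory|killed process"
# or, on systemd-based systems:
journalctl -k --since "2 hours ago" | grep -i oom

# Watch host RAM while the job starts up
free -h
```

If the OOM killer shows up there, the pressure is on host RAM rather than GPU memory, which would explain why halving the per-device batch size did not help.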

@qyc-98

Cuiunbo avatar May 26 '24 21:05 Cuiunbo

Hello, you can try our latest finetune script. For example, in ds_config_zero2.json, set the "device" entry of "offload_optimizer": { "device": "cpu", "pin_memory": true } to "cpu", i.e. try ZeRO-2 with optimizer offload as a debugging strategy.

qyc-98 avatar May 29 '24 07:05 qyc-98
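
For reference, the relevant section of a ZeRO-2 config with the optimizer offloaded to CPU typically looks like the sketch below; only the keys shown are relevant to this suggestion, and the rest of ds_config_zero2.json is assumed to stay as shipped with the repo:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Offloading moves the optimizer states from GPU memory to host RAM, so it trades GPU memory for extra host memory and PCIe traffic; if the host is already close to its RAM limit, this can make a system-level OOM more likely rather than less.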

Hi, I changed it to cpu and now get this error:

```
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7feb0b834dc0>
Traceback (most recent call last):
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
```

@Cuiunbo @qyc-98

Stardust-y avatar Jun 03 '24 12:06 Stardust-y
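
This traceback comes from DeepSpeedCPUAdam.__del__ and is usually a secondary symptom: ds_opt_adam is only set once the CPU Adam extension has been built and initialized, so an AttributeError in the destructor generally means the optimizer failed earlier (for example, the JIT build of the op failed, or the process died while allocating the offloaded optimizer states) and the original error is further up in the log. A generic way to inspect and prebuild the op with standard DeepSpeed tooling is sketched below; it assumes a CUDA toolkit compatible with the installed torch build is available on the node:

```bash
# Report which DeepSpeed C++/CUDA ops are compatible and already installed
ds_report

# Optionally prebuild the CPU Adam op instead of relying on JIT compilation at startup
DS_BUILD_CPU_ADAM=1 pip install deepspeed==0.14.2 --no-cache-dir --force-reinstall
```

Either way, the earlier failure (and whether the host OOM killer fired again) is worth checking before treating the __del__ message as the root cause.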

What are the versions of your deepspeed, pytorch, and transformers? My pytorch is 2.1.2+cu118, deepspeed 0.14.2, and transformers 4.40.2.

qyc-98 avatar Jun 03 '24 15:06 qyc-98
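
For completeness, a quick way to capture the exact versions in a given environment (plain pip/python, nothing project-specific):

```bash
pip list | grep -Ei "^(torch|torchvision|deepspeed|transformers)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```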

deepspeed 0.14.2, transformers 4.40.0, torch 2.1.2, torchvision 0.16.2

Stardust-y avatar Jun 04 '24 07:06 Stardust-y

My torch version is 2.1.2+cu121.

Stardust-y avatar Jun 04 '24 07:06 Stardust-y