exits with return code = -9

Open sunshineyg2018 opened this issue 1 year ago • 1 comments

GPU memory: 80 GB

Available system RAM: 120 GB

I am also getting the message "exits with return code = -9".

Configuration:

deepspeed \
--include="localhost:0" \
./train_sft.py \
--deepspeed ./ds_config/ds_config_zero3.json \
--model_name_or_path /root/base_model \
--train_file_path /root/data \
--do_train \
--output_dir /root/output \
--overwrite_output_dir \
--preprocess_num_workers 8 \
--num_train_epochs 800 \
--learning_rate 1e-5 \
--evaluation_strategy steps \
--eval_steps 100 \
--bf16 True \
--save_strategy steps \
--save_steps 400 \
--save_total_limit 2 \
--logging_steps 10 \
--tf32 True \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8

sunshineyg2018 · Jun 24 '23 07:06

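While the launch script is running, you can run watch -n 1 'free -h' in another terminal to observe memory usage. If system memory is exhausted, check whether the dataset is too large. If GPU memory is exhausted, consider reducing --per_device_train_batch_size, and adjust the deepspeed configuration following the recommendations at https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance.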
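
For anyone who prefers to script the check, here is a minimal Python sketch of the same idea. It assumes a Linux host where /proc/meminfo is available, and it assumes the launcher reports the child's exit status the way Python's subprocess module does, where a negative value is the negated signal number (so -9 would be SIGKILL, which is what the kernel OOM killer sends). The names describe_return_code, parse_meminfo and watch_memory are hypothetical helpers written for this illustration; they are not part of TigerBot or DeepSpeed.

```python
import signal
import time


def describe_return_code(code: int) -> str:
    """Turn a subprocess-style return code into a readable message."""
    if code < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field name: value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def watch_memory(interval_s: float = 1.0, samples: int = 5,
                 path: str = "/proc/meminfo") -> None:
    """Print available vs. total host RAM a few times, like `watch free -h`."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        avail_gib = info["MemAvailable"] / 1024**2  # kB -> GiB
        total_gib = info["MemTotal"] / 1024**2
        print(f"available {avail_gib:.1f} GiB of {total_gib:.1f} GiB")
        time.sleep(interval_s)
```

Calling watch_memory() in a second terminal while training starts shows whether available host RAM falls toward zero just before the process dies. If it does, the -9 is most likely host memory running out (for example during data preprocessing or model loading) rather than a GPU out-of-memory error, which normally surfaces as a CUDA exception instead of a kill signal.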

i4never · Jul 03 '23 03:07