Which parameters can I change to reduce GPU memory usage during full-parameter training?

Open lixiaoxiaobin opened this issue 1 year ago • 4 comments

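I have 4 A100s with 160 GB of GPU memory in total, and every training run fails with an out-of-memory error. Which parameters should I adjust, or what else should I change, to get training to run?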

#FT

torchrun --nproc_per_node 4 /home/jovyan/vol-1/BELLE/train/src/train.py \
    --model_name_or_path ${model_name_or_path} \
    --llama \
    --deepspeed configs/deepspeed_config_stage3.json \
    --train_file ${train_file} \
    --validation_file ${validation_file} \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 1 \
    --model_max_length ${cutoff_len} \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --learning_rate 8e-6 \
    --weight_decay 0.00001 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --evaluation_strategy "steps" \
    --fp16 True \
    --seed 1234 \
    --gradient_checkpointing True \
    --cache_dir ${cache_dir} \
    --output_dir ${output_dir}

Jun 01 '23 07:06 lixiaoxiaobin

Stage 2 saves more memory than stage 3, doesn't it?

Jun 02 '23 06:06 nuass

How large is the model? You are already using stage 3, so why is it still running out of GPU memory?

Jun 02 '23 11:06 xianghuisun
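To reason about how model size and ZeRO stage interact, here is a rough per-GPU estimate of model-state memory. It is only a sketch under stated assumptions: mixed-precision training with an Adam-style optimizer, using the accounting from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). It ignores activations, communication buffers, and fragmentation, so real usage will be higher. The function name `zero_model_state_gb` is hypothetical and not part of BELLE or DeepSpeed.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (1e9 bytes).

    Assumes mixed-precision Adam; excludes activations and buffers.
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also partitions the weights
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for num_params in (7e9, 13e9):
        for stage in (2, 3):
            gb = zero_model_state_gb(num_params, num_gpus=4, stage=stage)
            print(f"{num_params / 1e9:.0f}B params, stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 13B model on 4 GPUs needs about 52 GB per GPU at stage 3 and about 71.5 GB at stage 2, before activations; a 7B model needs about 28 GB and 38.5 GB respectively. By this accounting, stage 3 is the more memory-efficient setting, and 13B model states alone would not fit on a 40 GB card.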

I have 4 80 GB A100s and get the same GPU OOM. Is DeepSpeed not taking effect?

Jun 07 '23 13:06 XINyexun

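If this is a 7B model, your batch_size setting definitely leaves enough memory. If it is a 13B model, however, ZeRO-3 needs roughly 60 GB of GPU memory per card, so an OOM error is inevitable. The fix is to add CPU or NVMe offload settings to deepspeed_config_stage3.json.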

Jun 15 '23 02:06 hulkliu77
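As an illustration of the offload suggestion above, the sketch below adds CPU offload entries to a ZeRO-3 config dictionary. The keys `offload_optimizer`, `offload_param`, `device`, and `pin_memory` are standard DeepSpeed `zero_optimization` options; the helper name `add_cpu_offload` and the starting `base` config are hypothetical, since the actual contents of BELLE's deepspeed_config_stage3.json are not shown in this thread.

```python
import copy
import json


def add_cpu_offload(config: dict) -> dict:
    """Return a copy of a DeepSpeed config with ZeRO-3 CPU offload enabled."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    zero = updated.setdefault("zero_optimization", {})
    zero["stage"] = 3
    # Keep optimizer states and parameters in host RAM instead of GPU memory.
    zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return updated


if __name__ == "__main__":
    # Hypothetical starting point; the real config file may hold more keys.
    base = {"zero_optimization": {"stage": 3}}
    print(json.dumps(add_cpu_offload(base), indent=2))
```

The printed JSON can be merged into the config file passed via `--deepspeed`. For NVMe offload, DeepSpeed uses `"device": "nvme"` together with an `"nvme_path"` entry. Offloading trades GPU memory for host RAM and slower steps, so it is worth confirming the node has enough CPU memory first.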
