
OOM when running SFT on Qwen1.5-32B-Chat with DeepSpeed ZeRO-3 offload and batch size 1

Open dachengai opened this issue 3 months ago • 0 comments

Setup: ds_z3_offload_config.json, per-device batch size 1, 8 x A100.

Launch command:

deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json \
    --stage sft \
    --cutoff_len 32768 \
    --do_train \
    --model_name_or_path $model_name_or_path \
    --dataset $dataset \
    --template qwen \
    --finetuning_type full \
    --output_dir $output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 32 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 1000 \
    --save_strategy steps \
    --eval_steps 1000 \
    --evaluation_strategy steps \
    --save_total_limit 6 \
    --max_grad_norm 1 \
    --warmup_steps 800 \
    --learning_rate 1e-5 \
    --num_train_epochs 3.0 \
    --val_size 0.0005 \
    --plot_loss \
    --report_to tensorboard \
    --ddp_timeout 180000000 \
    --bf16 \
    --flash_attn
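
For context, here is a rough back-of-the-envelope sketch of the model-state memory for this setup. It is not a measurement from the run above. It assumes about 32e9 parameters for Qwen1.5-32B and the usual mixed-precision Adam accounting of 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus the two Adam moments). The function name `zero3_model_state_gib` is hypothetical, and activations, buffers, and fragmentation are deliberately ignored.

```python
GIB = 1024 ** 3


def zero3_model_state_gib(num_params: float, num_gpus: int) -> dict:
    """Estimate model-state memory in GiB under the 16-bytes-per-parameter assumption."""
    weights = 2 * num_params   # bf16 parameters
    grads = 2 * num_params     # bf16 gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + Adam variance
    total = weights + grads + optim
    return {
        # All model states, summed over the whole job.
        "total_gib": total / GIB,
        # Per-GPU share if ZeRO-3 partitions everything and nothing is offloaded.
        "per_gpu_no_offload_gib": total / num_gpus / GIB,
        # Host RAM needed if parameters and optimizer states are offloaded to CPU.
        "host_offloaded_gib": (weights + optim) / GIB,
    }


if __name__ == "__main__":
    # 32e9 parameters and 8 GPUs are assumptions matching the setup described above.
    for name, value in zero3_model_state_gib(32e9, 8).items():
        print(f"{name}: {value:.1f}")
```

If these assumptions hold, offloading moves most of the model states off the GPUs, so an OOM at `--cutoff_len 32768` may come from activation memory on the long sequences rather than from the model states. This is only a hypothesis; the estimate above does not account for activations at all.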

dachengai, Apr 13 '24 09:04