stanford_alpaca
Training on V100
Can you please share how to train on V100? I have made multiple attempts and they always lead to an OOM error, even with batch size 1.
Batch size 1 with gradient accumulation 1 works for more than 2000 steps, but I am not sure I can finish training without hitting OOM.
I was able to train using DeepSpeed on 8 V100 GPUs. Here is the training script and the DeepSpeed config file.
torchrun --nproc_per_node=8 --master_port=9776 train.py \
    --model_name_or_path hf_model/llama-7b \
    --data_path ./alpaca_data.json \
    --bf16 False \
    --output_dir ./finetuned_4/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 3 \
    --gradient_accumulation_steps 5 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed ds_config.json \
    --fp16 \
    --tf32 False
ds_config.json:
{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_prefetch_bucket_size": 0,
    "stage3_param_persistence_threshold": 1e2,
    "reduce_bucket_size": 1e2,
    "sub_group_size": 1e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
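As a quick sanity check on the numbers above (a minimal sketch; the values are copied straight from the command line, and with the Hugging Face Trainer integration DeepSpeed resolves the "auto" batch-size fields from these same arguments):

# Effective global batch size implied by the torchrun command above.
num_gpus = 8                 # --nproc_per_node
per_device_batch_size = 3    # --per_device_train_batch_size
grad_accum_steps = 5         # --gradient_accumulation_steps

global_batch_size = num_gpus * per_device_batch_size * grad_accum_steps
print(global_batch_size)  # 120 samples per optimizer update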
Hi @shaileshj2803, thanks for sharing the ds_config settings. :) I am new to DeepSpeed integration with torchrun; is there any change I need to make to the original "train.py" file? Thanks!
@shaileshj2803 How much CPU memory do you have? I used DeepSpeed on 4 V100s and ran into a CPU memory OOM.
Hi, thanks for sharing. May I ask how much memory the training takes on each GPU? I wonder whether I can fine-tune the model on 8 3090 GPUs.
Hi, thanks for sharing. Would you mind sharing your PyTorch and Transformers versions, if possible?
Hello,
Thank you for sharing this. I was able to fine-tune with the DeepSpeed config. However, after training, it saves a "pytorch_model.bin" in <output_dir> and another "pytorch_model.bin" in <output_dir>/checkpoint_1200. Then there are multiple zero_pp_rank.....{model, optim}.pt files in the global_step1200 folder.
What is the process for inference? Should I load the:
- pytorch_model.bin from <output_dir> or
- pytorch_model.bin from <output_dir>/checkpoint_1200 or
- should I use zero_to_fp32.py to convert all the zero_pp_rank...{model, optim}.pt files into final_pytorch_model.bin in <output_dir>/checkpoint_1200 and then load final_pytorch_model.bin?
Thanks for your help. This is my first time using deepspeed.
If you are using https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py,
then in line 218 replace
trainer.save_model(output_dir=training_args.output_dir)
with
checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
Then, after training is done, checkpoint-final will contain zero_to_fp32.py. Just run:
python zero_to_fp32.py . pytorch_model.bin
for more information, look here: https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
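Once the conversion has run, a minimal inference sketch along these lines should work. The checkpoint path is a placeholder for your own output directory, and this assumes the model config and tokenizer files sit alongside the converted pytorch_model.bin (copy them from the base model directory if they do not):

# Hypothetical inference sketch; checkpoint_dir is a placeholder path.
import torch
import transformers

checkpoint_dir = "./finetuned_4/checkpoint-final"

tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint_dir)
model = transformers.AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.float16,  # fp16 at inference time to fit on a single V100
).cuda()
model.eval()

prompt = "Write a short poem about the sea."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))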
Hi, thanks for sharing. Are your V100s 16 GB or 32 GB?
Did you try LLaMA-13B with this method?