
training on v100

Open shaileshj2803 opened this issue 2 years ago • 10 comments

Can you please share how to train on a V100? I have tried multiple times, and every attempt leads to an OOM error, even with batch size 1.

shaileshj2803 avatar Mar 19 '23 06:03 shaileshj2803

Batch size 1 with gradient accumulation 1 works for more than 2000 steps, but I am not sure I can finish training without hitting OOM.

SeungyounShin avatar Mar 20 '23 04:03 SeungyounShin

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file.

torchrun --nproc_per_node=8 --master_port=9776 train.py \
    --model_name_or_path hf_model/llama-7b \
    --data_path ./alpaca_data.json \
    --bf16 False \
    --output_dir ./finetuned_4/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 3 \
    --gradient_accumulation_steps 5 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed ds_config.json \
    --fp16 \
    --tf32 False

ds_config.json:

    {
        "zero_optimization": {
            "stage": 3,
            "contiguous_gradients": true,
            "stage3_max_live_parameters": 0,
            "stage3_max_reuse_distance": 0,
            "stage3_prefetch_bucket_size": 0,
            "stage3_param_persistence_threshold": 1e2,
            "reduce_bucket_size": 1e2,
            "sub_group_size": 1e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": true
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "fp16": {
            "enabled": true,
            "auto_cast": false,
            "loss_scale": 0,
            "initial_scale_power": 32,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }
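For reference, the "auto" batch-size fields in the config are filled in from the Trainer arguments, so with this command the effective global batch size should work out to per_device_train_batch_size × gradient_accumulation_steps × nproc_per_node = 3 × 5 × 8 = 120.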

shaileshj2803 avatar Mar 20 '23 22:03 shaileshj2803

Hi @shaileshj2803, thanks for sharing the ds_config settings. :) I am new to DeepSpeed integration with torchrun; is there any change I need to make in the original "train.py" file? Thanks!

TAIYISONG avatar Mar 23 '23 21:03 TAIYISONG

@shaileshj2803 How much CPU memory do you have? I used DeepSpeed on 4 V100s and ran out of CPU memory.

Sorezza avatar Mar 24 '23 02:03 Sorezza

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file. […]

Hi, thanks for sharing. May I ask how much memory the training takes on each GPU? I wonder whether I can fine-tune the model on 8 3090 GPUs.

Hiusam avatar Mar 31 '23 10:03 Hiusam

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file. […]

Hi, thanks for sharing. Would you mind sharing the PyTorch and transformers versions if possible?

xieexiaotuzi avatar Apr 02 '23 03:04 xieexiaotuzi

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file. […]

Hello,

Thank you for sharing this. I was able to fine-tune with the deepspeed config. However, after training, it saves a "pytorch_model.bin" in <output_dir> and another "pytorch_model.bin" in <output_dir>/checkpoint_1200. Then there are multiple zero_pp_rank.....{model, optim}.pt files in the global_step1200 folder.

What is the process for inference? Should I load:

  1. pytorch_model.bin from <output_dir> or
  2. pytorch_model.bin from <output_dir>/checkpoint_1200 or
  3. should I use zero_to_fp32.py to convert all the zero_pp_rank...{model, optim}.pt files into final_pytorch_model.bin in the <output_dir>/checkpoint_1200 and then load final_pytorch_model.bin

Thanks for your help. This is my first time using deepspeed.

amulyahwr avatar Apr 03 '23 14:04 amulyahwr

If you are using https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py, then on line 218 replace trainer.save_model(output_dir=training_args.output_dir) with:

    # save a full DeepSpeed (ZeRO) checkpoint instead of only the HF-format weights
    checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
    trainer.deepspeed.save_checkpoint(checkpoint_dir)

Then checkpoint-final will contain zero_to_fp32.py after training is done. Just run python zero_to_fp32.py . pytorch_model.bin from inside that directory.

for more information, look here: https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
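Once zero_to_fp32.py has produced pytorch_model.bin, a minimal loading sketch for inference could look like the snippet below. The paths are only illustrative (they follow the training command earlier in the thread), and the vocab-size handling assumes the Alpaca train.py added a pad token and resized the embeddings:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # illustrative paths, matching the command above; adjust to your setup
    base_model_dir = "hf_model/llama-7b"
    finetuned_weights = "finetuned_4/checkpoint-final/pytorch_model.bin"  # output of zero_to_fp32.py

    tokenizer = AutoTokenizer.from_pretrained(base_model_dir)
    model = AutoModelForCausalLM.from_pretrained(base_model_dir)

    # train.py adds a pad token and resizes the embeddings, so match the
    # fine-tuned vocabulary size before loading the converted weights
    # (key name assumes the LLaMA architecture's state dict layout)
    state_dict = torch.load(finetuned_weights, map_location="cpu")
    model.resize_token_embeddings(state_dict["model.embed_tokens.weight"].shape[0])
    model.load_state_dict(state_dict)
    model.eval()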

luffycodes avatar Apr 24 '23 04:04 luffycodes

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file. […]

Hi, thanks for sharing. Are your V100s 16 GB or 32 GB?

qwjaskzxl avatar Apr 27 '23 15:04 qwjaskzxl

I was able to train using deepspeed on 8 V100 GPUs. Here is the training script and deepspeed config file. […]

Did you try LLaMA 13B with this method?

LebronXierunfeng avatar May 24 '23 15:05 LebronXierunfeng