
about v100 save model

[Open] yyl199655 opened this issue 2 years ago · 3 comments

I trained the model on 8 V100s, but the saved model is wrong. The model size is:

[image: screenshot showing the saved model file sizes]

The command is as follows:

    torchrun --nproc_per_node=1 --master_port=12345 train.py \
        --model_name_or_path /cfs/cfs-4q3ifcxl/saminyang/model_torch/LLaMA/7B/convert \
        --data_path ./data/test.json \
        --bf16 False \
        --output_dir output/7B_chinese \
        --num_train_epochs 1 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 2000 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 False \
        --deepspeed ./deepspeed-cfg/ZeRO-3.json
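The contents of ZeRO-3.json are not shown here. For reference, under ZeRO stage 3 each rank only holds a shard of the parameters, and unless DeepSpeed is told to gather them on save, trainer.save_model writes only a tiny placeholder file, which can produce a wrong saved-model size like the one above. A minimal stage-3 config in DeepSpeed's JSON format might look like this (field values are assumptions, not the actual file):

    {
      "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto"
    }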

yyl199655 · Apr 09 '23 09:04

Hey, how much GPU memory does your device use? My devices are two A10s with 48 GB, but I get an OOM error.
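A back-of-the-envelope estimate of why a full fine-tune of a 7B model OOMs (assuming mixed precision with fp16 weights/grads plus fp32 Adam states; activations ignored):

    # rough memory estimate for full fine-tuning a 7B model with AdamW
    # (mixed precision: fp16 weights/grads, fp32 master weights + Adam states)
    params = 7e9
    fp16_weights = params * 2        # ~13 GiB
    fp16_grads = params * 2          # ~13 GiB
    fp32_states = params * 4 * 3     # master weights + exp_avg + exp_avg_sq, ~78 GiB
    total_gib = (fp16_weights + fp16_grads + fp32_states) / 2**30
    print(f"~{total_gib:.0f} GiB before activations")  # ~104 GiB

ZeRO-3 shards these states across GPUs, so 48 GB across two A10s still falls short unless optimizer/parameter offload to CPU is enabled.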

xv994 · Apr 10 '23 01:04

You can use alpaca-lora. I have curated the latest fine-tuning libraries and models in awesome-llms. It can be trained on consumer devices with 24 GB of GPU memory; see the sketch below.
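For context, alpaca-lora fits in 24 GB because it freezes the base model and trains only small low-rank adapters. A minimal sketch using the Hugging Face peft library (the model ID and hyperparameters here are illustrative assumptions, not alpaca-lora's exact settings):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # load the base model; its weights stay frozen under LoRA
    model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update matrices
        lora_alpha=16,                        # scaling applied to the adapter output
        target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the 7B parameters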

sanjibnarzary · Apr 10 '23 03:04

If you are using https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py, then at line 218 replace trainer.save_model(output_dir=training_args.output_dir) with:

    # save a ZeRO-partitioned DeepSpeed checkpoint instead of the regular HF save
    checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
    trainer.deepspeed.save_checkpoint(checkpoint_dir)

Then, after training is done, checkpoint-final will contain zero_to_fp32.py. Just run python zero_to_fp32.py . pytorch_model.bin from inside that directory to consolidate the ZeRO shards into a single pytorch_model.bin.

For more information, look here: https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
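The same consolidation can also be done programmatically; DeepSpeed ships a helper in deepspeed.utils.zero_to_fp32 (a sketch assuming the checkpoint-final layout produced above):

    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # rebuild a full fp32 state dict from the ZeRO-partitioned shards on disk
    state_dict = get_fp32_state_dict_from_zero_checkpoint("output/7B_chinese/checkpoint-final")
    torch.save(state_dict, "output/7B_chinese/pytorch_model.bin")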

luffycodes · Apr 24 '23 03:04