stanford_alpaca
Fine-Tuning very slow (6h->24h??)
Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
I am running the fine-tuning script on 4x A100-SXM4-80GB and currently getting a ~24-hour ETA, which doesn't really square with the "3 hours on 8 80GB A100s" reported on https://crfm.stanford.edu/2023/03/13/alpaca.html . Shouldn't it be around 6 hours, or maybe 12 hours given that the script "is not particularly optimized"?
Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
Thanks.
Can you share the result file when you finish?
Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure... here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora
Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command with a training run that lasts less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?
Power usage is a bit low, but the rest is at max:
The full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano
Currently it's going at 64.81 s/it after a reboot.
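Side note, a generic diagnostic rather than anything from this repo: when utilization reads 100% but power draw stays low, the usual suspects are clock throttling or slow inter-GPU communication. Plain nvidia-smi can narrow it down:
nvidia-smi -q -d PERFORMANCE,POWER,CLOCK   # active clock-throttle reasons, power limit, current SM/memory clocks
nvidia-smi dmon -s pucm                    # live per-GPU power/utilization/clock/memory samples while the job runs
nvidia-smi topo -m                         # interconnect matrix; PCIe-only links between the GPUs make FSDP all-gathers much slower than NVLink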
I keep getting this error message and am wondering whether you have seen it:
Exception: Could not find the transformer layer class to wrap in the model.
Thank you!
hi @chavinlo ,
are you running the released code? best to adapt from there.
Thanks
and @charliezjw,
there is not enough detail for me to respond to. What code are you running?
Yes, I am running the fine-tuning code from this repo.
I tried the LoRA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" compared to the Stanford one, so let us know if you get it trained :D
Did you install the commit mentioned in the README? I got the "Could not find the transformer layer class to wrap" error when I installed the current version of the HF PR.
pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176
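In case it helps, a quick way to confirm which commit of the fork actually got installed (a generic pip check, not something from the README; pip records the revision for git installs):
pip freeze | grep transformers   # should print something like: transformers @ git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176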
I solved the error by replacing every "LLaMA" with "Llama".
Try changing the parameter "fsdp_transformer_layer_cls_to_wrap" to "LlamaDecoderLayer"; that should solve this issue.
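Concretely, assuming you are running the released train.py against the newer HF code where the class was renamed from LLaMADecoderLayer to LlamaDecoderLayer, the only change is that one FSDP flag in the torchrun command (a sketch only; keep your other flags and set --nproc_per_node to your GPU count):
torchrun --nproc_per_node=4 train.py \
    ...your other training args... \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'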
@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; it's being saved every 200 steps, 1200 total.
Here's the 200-step checkpoint (i.e. 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main
@chavinlo thank you for your work! Are you able to train the LoRA on 13B (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.
Yes, but once I get my A100s fixed, because there definitely is something throttling them.
Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol
I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better; I still think it's a bit worse than this one, but give it a try.
thanks! can't wait to test the final checkpoint :)
inference kwargs from this repo
Could you specify what you meant here? Did you use the alpaca code and train your own model?
@chavinlo I have the same problem (Exception: Could not find the transformer layer class to wrap in the model). Have you solved it?
The reason for some of these issues is explained in this note.
Feel free to reopen if it doesn't fully resolve the mysteries :)
My issue is about speed... nothing to do with these layer class errors, they just ended up in my issue for some reason.
Regarding the inference kwargs: search for the inference kwargs issue here and use the parameters from the generation config in the LoRA repo.
@lxuechen Can you reopen this issue? The original problem was about speed, not about layers. Additionally, I have tried cleaning the instance, and I still get the same speed.
Here is the nvidia-smi -q report of my GPUs: report.txt
@chavinlo I am in the same boat: ~24-hour ETA with 4x 80GB A100s. My GPU power usage is also low (~80W out of 300W), which seems suspicious.
My training finishes in 4 hours with 8x A100 (40GB) using fp16 (3 epochs).
Environment: PyTorch 1.13, transformers (latest from git).
Command:
torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
However, once the training finishes, the model generates nonsense tokens, as mentioned in https://github.com/tatsu-lab/stanford_alpaca/issues/70 and https://github.com/tatsu-lab/stanford_alpaca/issues/51
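One guess, not verified in this thread: the reference command in this repo's README trains with bf16 and tf32 on A100s rather than fp16, so if the gibberish persists it may be worth rerunning the exact command above with just that substitution:
--bf16 True --tf32 True   # in place of --fp16 True; all other flags unchanged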
Regarding the slow training: I think it might have to do with something in the OS configuration. I've changed GPUs and they still perform the same; the first time it even went 2.5 times slower, giving an ETA of 52 hours.
However, someone told me that RunPod works alright. Here are his graphs:
You can see a 20-30% power usage increase compared to what I got: https://wandb.ai/peruano/huggingface/runs/jbeh9a6r?workspace=user-peruano
@charliezjw @puyuanliu Excuse me, is the following installation method correct?
pip install git+https://github.com/zphang/transformers.git@llama_push
The package versions are as follows: numpy==1.24.2 rouge-score==0.1.2 fire==0.5.0 openai==0.27.2 sentencepiece==0.1.97 wandb==0.14.0