stanford_alpaca
Fine-Tuning very slow (6h->24h??)
Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
I am running the fine-tuning script on 4x A100-SXM4-80GB and currently getting a ~24-hour ETA, which doesn't really square with the "3 hours on 8 80GB A100s" reported on https://crfm.stanford.edu/2023/03/13/alpaca.html . Shouldn't it be around 6 hours, or maybe 12 hours given that the script "is not particularly optimized"?
Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
Thanks.
Can you share the result file when you finish?
Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure... here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora
Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command with a training run that lasts less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?
Power usage is a bit low, but the rest is at max:
The full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano
Currently it's going at 64.81 s/it after a reboot.
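Side note, a generic diagnostic rather than anything from this repo: when utilization reads 100% but power draw stays low, the usual suspects are clock throttling or slow inter-GPU communication. Plain nvidia-smi can narrow it down:
nvidia-smi -q -d PERFORMANCE,POWER,CLOCK   # active clock-throttle reasons, power limit, current SM/memory clocks
nvidia-smi dmon -s pucm                    # live per-GPU power/utilization/clock/memory samples while the job runs
nvidia-smi topo -m                         # interconnect matrix; PCIe-only links between the GPUs make FSDP all-gathers much slower than NVLink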
I keep getting this error message and am wondering whether you have seen it:
Exception: Could not find the transformer layer class to wrap in the model.
Thank you!
hi @chavinlo ,
are you running the released code? best to adapt from there.
Thanks
and @charliezjw,
there is not enough detail for me to respond to. What code are you running?
Yes, I am running the fine-tuning code from this repo.
I tried the LoRA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" compared to the Stanford one, so let us know if you get it trained :D
Did you install the commit mentioned in the README? I got the "Could not find the transformer layer class to wrap" error when I installed the current version of the HF PR.
pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176
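In case it helps, a quick way to confirm which commit of the fork actually got installed (a generic pip check, not something from the README; pip records the revision for git installs):
pip freeze | grep transformers   # should print something like: transformers @ git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176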
I solved the error by replacing every "LLaMA" with "Llama".
Try changing the parameter "fsdp_transformer_layer_cls_to_wrap" to "LlamaDecoderLayer"; that should solve this issue.
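Concretely, assuming you are running the released train.py against the newer HF code where the class was renamed from LLaMADecoderLayer to LlamaDecoderLayer, the only change is that one FSDP flag in the torchrun command (a sketch only; keep your other flags and set --nproc_per_node to your GPU count):
torchrun --nproc_per_node=4 train.py \
    ...your other training args... \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'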
@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; it's being saved every 200 steps, 1200 total.
Here's the 200-step checkpoint (i.e. 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main
@chavinlo thank you for your work! Are you able to train the LoRA on 13B (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.
Yes, but once I get my A100s fixed, because there definitely is something throttling them.
Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol
I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better; I still think it's a bit worse than this one, but give it a try.
thanks! can't wait to test the final checkpoint :)
inference kwargs from this repo
Could you specify what you meant here? Did you use the alpaca code and train your own model?
@chavinlo I have the same problem (Exception: Could not find the transformer layer class to wrap in the model). Have you solved it?
The reason for some of these issues is explained in this note.
Feel free to reopen if it doesn't fully resolve the mysteries :)
My issue is about speed... nothing to do with these layer class errors, they just ended up in my issue for some reason.
Regarding the inference kwargs: search for the inference kwargs issue here and use the parameters from the generation config in the LoRA repo.
@lxuechen Can you reopen this issue? The original problem was about speed, not about layers. Additionally, I have tried cleaning the instance, and I still get the same speed.
Here is the nvidia-smi -q report of my GPUs: report.txt
@chavinlo I am in the same boat: ~24-hour ETA with 4x 80GB A100s. My GPU power usage is also low (~80W out of 300W), which seems suspicious.
My training finishes in 4 hours with 8x A100 (40GB) using fp16 (3 epochs).
Environment: PyTorch 1.13, transformers (latest from git).
Command:
torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
However, once the training finishes, the model generates nonsense tokens, as mentioned in https://github.com/tatsu-lab/stanford_alpaca/issues/70 and https://github.com/tatsu-lab/stanford_alpaca/issues/51
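One guess, not verified in this thread: the reference command in this repo's README trains with bf16 and tf32 on A100s rather than fp16, so if the gibberish persists it may be worth rerunning the exact command above with just that substitution:
--bf16 True --tf32 True   # in place of --fp16 True; all other flags unchanged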
Regarding the slow training: I think it might have to do with something in the OS configuration. I've changed GPUs and they still perform the same; the first time it even went 2.5 times slower, giving an ETA of 52 hours.
However, someone told me that RunPod works alright. Here are his graphs:
You can see a 20-30% power usage increase compared to what I got: https://wandb.ai/peruano/huggingface/runs/jbeh9a6r?workspace=user-peruano
@charliezjw @puyuanliu Excuse me, is the following installation method correct?
pip install git+https://github.com/zphang/transformers.git@llama_push
The package versions are as follows: numpy==1.24.2 rouge-score==0.1.2 fire==0.5.0 openai==0.27.2 sentencepiece==0.1.97 wandb==0.14.0