Fine-Tuning very slow (6h->24h??)

Open chavinlo opened this issue 1 year ago • 3 comments

Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.

I am running the fine-tuning script on 4x A100-SXM4-80GB and currently getting a 24-hour ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours considering that the script "is not particularly optimized"?

Is anyone else encountering this issue? And if this is expected, then what were the methods you used to optimize the fine-tuning process?

Running on CUDA 12.1, Torch 1.13, and the transformers fork of llama at the commit you mentioned.

Thanks.

chavinlo avatar Mar 15 '23 15:03 chavinlo
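The 6-hour expectation in the post above comes from assuming that wall-clock time scales inversely with GPU count. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and perfect linear scaling is an assumption that multi-GPU training does not always achieve.

```python
def expected_hours(reference_hours: float, reference_gpus: int, gpus: int) -> float:
    """Estimate training time under ideal (linear) scaling with GPU count."""
    if gpus <= 0 or reference_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return reference_hours * reference_gpus / gpus


# Reported baseline: 3 hours on 8 A100s; the run in this issue uses 4 A100s.
ideal = expected_hours(3.0, 8, 4)
observed_eta = 24.0  # hours, as reported in the post above
slowdown = observed_eta / ideal
print(f"ideal: {ideal:.1f} h, observed ETA: {observed_eta:.1f} h, slowdown: {slowdown:.1f}x")
```

Under that assumption the observed ETA is about four times the ideal figure, which is why the poster suspects something other than the GPU count.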

Can you share the result file when you finish?

joaopandolfi avatar Mar 15 '23 16:03 joaopandolfi

Sure, but there's a LORA repo that supposedly gives better results than the current one, not sure... Here are the links: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

chavinlo avatar Mar 15 '23 16:03 chavinlo

Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command with a training run that lasts less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?

lxuechen avatar Mar 15 '23 16:03 lxuechen
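One way to answer the utilization question is to poll nvidia-smi while training runs. The sketch below is a hedged example, not something from the repo: the helper names are made up, it assumes nvidia-smi is on PATH, and it assumes every queried field returns a number (some GPUs report "[N/A]" for power fields, which this parser does not handle). The sample values are illustrative only.

```python
import subprocess
from typing import Dict, List

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = "index,utilization.gpu,power.draw,power.limit"


def parse_gpu_stats(csv_text: str) -> List[Dict[str, float]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, draw, limit = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "power_w": float(draw),
            "power_limit_w": float(limit),
            # Fraction of the power limit in use; a low value under load is a hint of throttling.
            "power_frac": float(draw) / float(limit),
        })
    return stats


def read_gpu_stats() -> List[Dict[str, float]]:
    """Query the local GPUs once (requires nvidia-smi to be installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


# Illustrative sample in the same CSV shape, so the parser can be tried without a GPU.
sample = "0, 100, 80.00, 300.00\n1, 98, 75.50, 300.00\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

On a training node, calling `read_gpu_stats()` in a loop would show whether utilization stays high while power draw stays far below the limit.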

Power usage is a bit low, but the rest is at max: [image]

the full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano

currently it's going at 64.81s/it after a reboot.

chavinlo avatar Mar 15 '23 18:03 chavinlo
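The reported step time can be turned into an ETA with simple arithmetic. The sketch below assumes the run has about 1200 optimizer steps in total, a figure taken from a later comment in this thread; the function names are illustrative.

```python
def eta_hours(seconds_per_step: float, total_steps: int, steps_done: int = 0) -> float:
    """Remaining wall-clock time in hours at a constant step time."""
    remaining_steps = total_steps - steps_done
    return seconds_per_step * remaining_steps / 3600.0


def required_seconds_per_step(target_hours: float, total_steps: int) -> float:
    """Step time needed to finish all steps within the target duration."""
    return target_hours * 3600.0 / total_steps


# 64.81 s/it is the value reported above; 1200 steps is assumed from later in the thread.
print(f"ETA at 64.81 s/it: {eta_hours(64.81, 1200):.1f} h")
print(f"step time needed for a 6 h run: {required_seconds_per_step(6.0, 1200):.1f} s")
```

At 64.81 s/it the run takes roughly 21.6 hours, whereas a 6-hour run would need about 18 seconds per step.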

I keep getting this error message and am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.

Thank you!

charliezjw avatar Mar 15 '23 20:03 charliezjw

hi @chavinlo ,

are you running the released code? best to adapt from there.

Thanks

and @charliezjw,

there are not enough details for me to respond to. What code are you running?

Tiiiger avatar Mar 15 '23 20:03 Tiiiger

Yes I am running the fine-tuning code from this repo.

chavinlo avatar Mar 15 '23 20:03 chavinlo

Can you share the result file when you finish?

Sure, but there's a LORA repo that supposedly gives better results than the current one, not sure... Here are the links: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

I tried the LORA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" vs the Stanford one, so let us know if you get it trained :D

devilismyfriend avatar Mar 15 '23 22:03 devilismyfriend

I keep getting this error message and am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.

Thank you!

Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.

pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

melisa-writer avatar Mar 15 '23 23:03 melisa-writer

I solved the error by replacing every "LLaMA" with "Llama".

charliezjw avatar Mar 16 '23 00:03 charliezjw

Try changing the "fsdp_transformer_layer_cls_to_wrap" parameter to "LlamaDecoderLayer"; that should resolve the "Could not find the transformer layer class to wrap in the model" error.

XuhuiRen avatar Mar 16 '23 01:03 XuhuiRen
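The two fixes above point at the same cause: the name passed to fsdp_transformer_layer_cls_to_wrap has to match a class that actually exists in the installed transformers version. A minimal sketch for checking this is below. It assumes the lookup is an exact match on class names among the model's submodules, and the helper names and stand-in classes are hypothetical, used only so the example runs without loading a model.

```python
from typing import Iterable, List, Optional


def find_module_class(modules: Iterable[object], class_name: str) -> Optional[type]:
    """Return the first module class whose name matches exactly, else None."""
    for module in modules:
        if type(module).__name__ == class_name:
            return type(module)
    return None


def module_class_names(modules: Iterable[object]) -> List[str]:
    """List the distinct class names present, to see which names are valid."""
    return sorted({type(module).__name__ for module in modules})


# Stand-in classes (hypothetical) so the sketch runs without a real model.
class LlamaDecoderLayer:
    pass


class LlamaModel:
    pass


fake_modules = [LlamaModel(), LlamaDecoderLayer(), LlamaDecoderLayer()]
print(module_class_names(fake_modules))
print(find_module_class(fake_modules, "LLaMADecoderLayer"))  # older spelling: no match
print(find_module_class(fake_modules, "LlamaDecoderLayer"))  # newer spelling: match
```

With a real PyTorch model, passing `model.modules()` instead of `fake_modules` lists the class names available in that install, which shows whether the spelling is "LLaMADecoderLayer" or "LlamaDecoderLayer".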

@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; they are saved every 200 steps, out of 1200 total.

Here's the 200-step checkpoint (about 17% of the run): https://huggingface.co/chavinlo/alpaca-native/tree/main

chavinlo avatar Mar 16 '23 02:03 chavinlo

@chavinlo thank you for your work! Are you able to train the LORA on 13b (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.

0xbitches avatar Mar 16 '23 03:03 0xbitches

Yes, but only once I get my A100s fixed, because there is definitely something throttling them.

chavinlo avatar Mar 16 '23 03:03 chavinlo

Thanks - I just finetuned my own lora with tloen's code; unfortunately, the results are not too different from 7B with lora. Maybe I am using the wrong prompts to test it lol

0xbitches avatar Mar 16 '23 04:03 0xbitches

I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better. I still think it's a bit worse than this, but give it a try.

devilismyfriend avatar Mar 16 '23 05:03 devilismyfriend

thanks! can't wait to test the final checkpoint :)

devilismyfriend avatar Mar 16 '23 05:03 devilismyfriend

inference kwargs from this repo

Could you specify what you meant here? Did you use the alpaca code and train your own model?

0xbitches avatar Mar 16 '23 05:03 0xbitches

@chavinlo I have the same problem. Have you solved it?

cxj01 avatar Mar 16 '23 07:03 cxj01

???

chavinlo avatar Mar 16 '23 07:03 chavinlo

@chavinlo I have the same problem. Have you solved it? Exception: Could not find the transformer layer class to wrap in the model.

cxj01 avatar Mar 16 '23 07:03 cxj01

The reason for some of these issues is explained in this note.

Feel free to reopen if it doesn't fully resolve the mysteries :)

lxuechen avatar Mar 16 '23 07:03 lxuechen

My issue is about speed... It has nothing to do with these layer class errors; they just ended up in my issue for some reason.

chavinlo avatar Mar 16 '23 07:03 chavinlo

inference kwargs from this repo

Could you specify what you meant here? Did you use the alpaca code and train your own model?

search for the inference kwargs issue here, use the parameters in the generation config in the lora repo

devilismyfriend avatar Mar 16 '23 12:03 devilismyfriend

@lxuechen Can you reopen this issue? The original problem was about speed, not about layers. Additionally, I have tried cleaning the instance, and I still get the same speed.

chavinlo avatar Mar 16 '23 23:03 chavinlo

report.txt: here is the nvidia-smi -q report of my GPUs.

chavinlo avatar Mar 16 '23 23:03 chavinlo

@chavinlo I am in the same boat: ~24 hour ETA with 4 x 80G A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

helloeve avatar Mar 17 '23 00:03 helloeve

My training finishes in 4 hours with 8 * A100 (40G) using fp16. (3 epochs)

Environment: Pytorch=1.13 transformer (latest one on git)

Command:

torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

However, once the training finishes, the model generates nonsense tokens, as mentioned in https://github.com/tatsu-lab/stanford_alpaca/issues/70 and https://github.com/tatsu-lab/stanford_alpaca/issues/51

puyuanliu avatar Mar 17 '23 00:03 puyuanliu
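When comparing timings between setups, it helps to check the effective batch size implied by a command, since it determines how many optimizer steps a run takes. The sketch below applies the usual product of GPU count, per-device batch size, and gradient accumulation steps to the flags in the command above. The dataset size of 52,000 examples is an assumption for illustration, and the step count is a rough estimate rather than what the trainer would report exactly.

```python
import math


def effective_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Examples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch_size * grad_accum_steps


def approx_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Rough step count: batches per epoch (rounded up) times epochs."""
    return math.ceil(num_examples / batch_size) * epochs


# Flags from the command above: 8 processes, batch size 1 per device, 4 accumulation steps.
batch = effective_batch_size(num_gpus=8, per_device_batch_size=1, grad_accum_steps=4)

# ASSUMPTION: about 52,000 training examples; adjust to the actual data file.
steps = approx_optimizer_steps(num_examples=52_000, batch_size=batch, epochs=3)
print(f"effective batch size: {batch}, approx. optimizer steps: {steps}")
```

A run with a different effective batch size takes a different number of steps, so per-step times from two setups are only directly comparable when this value matches.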

@chavinlo I am in the same boat: ~24 hour ETA with 4 x 80G A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

I think it might have something to do with the OS configuration. I've changed GPUs and they still perform the same; the first time, it even ran 2.5 times slower, giving an ETA of 52h.

However, someone told me that runpod works alright. Here are his graphs: [image] You can see a 20-30% power usage increase compared to what I got: https://wandb.ai/peruano/huggingface/runs/jbeh9a6r?workspace=user-peruano

chavinlo avatar Mar 17 '23 01:03 chavinlo

@charliezjw @puyuanliu Excuse me, is the following installation method correct?

pip install git+https://github.com/zphang/transformers.git@llama_push

Each version is as follows:
numpy==1.24.2
rouge-score==0.1.2
fire==0.5.0
openai==0.27.2
sentencepiece==0.1.97
wandb==0.14.0

447428054 avatar Mar 17 '23 01:03 447428054