
Anyone try fine-tuning 13B model?

Open gururise opened this issue 2 years ago • 28 comments

Training the 7B model takes about 18GB of VRAM.

I tried training the 13B model and ran out of VRAM on my 24GB card. I suspect it will need at least 32GB of VRAM.

Has anyone else been successful with fine-tuning a 13B model?

gururise avatar Mar 16 '23 18:03 gururise

https://github.com/tloen/alpaca-lora/issues/14#issuecomment-1471273729

@ItsLogic what's your experience been like?

tloen avatar Mar 16 '23 19:03 tloen

It has been quite good, although understandably slow to train. I have a 4090, so a 24GB card is enough. The only thing you need to change to make it work is MICRO_BATCH_SIZE, which I have set to 2. The whole time I was training I stayed within a few hundred MB of an OOM, so you might need to close a few background tasks when you decide to train. Training time is ~10 hours for the full three epochs. I trained a single epoch (406 steps) in 3 hours 15 minutes and got these results on 13B:

13B with LoRA:

[sample generation screenshots]

13B normal:

[sample generation screenshot]

Just a heads up: the provided export_state_dict_checkpoint.py has its parameters set for 7B, so you will need to change them to match the 13B params before you can use it. It also outputs only one file at the end, but the LLaMA-to-HF conversion script works fine as long as you change the 13B shard count to 1, if you plan on using transformers.
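For reference, a sketch of the parameter swap mentioned above. The dimensions are the published LLaMA shape constants; the dict layout and variable names are assumptions about how the script organizes them:

```python
# 7B shapes as shipped in export_state_dict_checkpoint.py (assumed layout):
params_7b = {
    "dim": 4096, "multiple_of": 256, "n_heads": 32,
    "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1,
}

# Published LLaMA-13B shapes to swap in before exporting a 13B checkpoint:
params_13b = {
    "dim": 5120, "multiple_of": 256, "n_heads": 40,
    "n_layers": 40, "norm_eps": 1e-06, "vocab_size": -1,
}

# Sanity check: the per-head dimension stays 128 in both configurations.
assert params_7b["dim"] // params_7b["n_heads"] == 128
assert params_13b["dim"] // params_13b["n_heads"] == 128
```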

ItsLogic avatar Mar 16 '23 20:03 ItsLogic
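The MICRO_BATCH_SIZE change described above can be sketched as follows. Variable names follow finetune.py in the repo; the exact defaults are assumptions:

```python
MICRO_BATCH_SIZE = 2   # lowered (the 7B default was 4) so 13B fits in 24 GB
BATCH_SIZE = 128       # effective batch size stays the same
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE

# Smaller micro-batches cut activation memory per forward pass, but double
# the number of accumulation steps per optimizer update, which is part of
# why 13B training is slower per epoch.
print(GRADIENT_ACCUMULATION_STEPS)  # 64
```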

@ItsLogic Great results! Can you share your way to make it as a chat with memorized history? Do you include the whole chat history to "### Input:" field with the next prompt or what is the idea? It would be great if you can share your chat logic code!

Btw, how big chat history it can handle with decent memorization and ability to recover info from previous messages?

baleksey avatar Mar 16 '23 21:03 baleksey

@ItsLogic Great results! Can you share your way to make it as a chat with memorized history? Do you include the whole chat history to "### Input:" field with the next prompt or what is the idea? It would be great if you can share your chat logic code!

I'm just using https://github.com/oobabooga/text-generation-webui with the --cai-chat launch arg as my GUI, and it handles everything.

Btw, how big chat history it can handle with decent memorization and ability to recover info from previous messages?

Honestly not sure. I haven't had any super long conversations, so I can't speak for its memory.

ItsLogic avatar Mar 16 '23 21:03 ItsLogic

I'm just using https://github.com/oobabooga/text-generation-webui with the --cai-chat launch arg as my GUI and it handles everything

It's my understanding that you have to format your prompt in a specific way for this model, as is done here. I don't think text-generation-webui does that (yet).
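For reference, the prompt construction looks roughly like this (adapted from the repo's generate.py; treat the exact wording as approximate):

```python
def generate_prompt(instruction: str, input: str = "") -> str:
    """Build an Alpaca-style prompt of the kind the LoRA was trained on."""
    if input:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

# A webui that sends raw chat text, without these section markers, is
# feeding the model a distribution it was never fine-tuned on.
```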

shawnanastasio avatar Mar 16 '23 22:03 shawnanastasio

@tloen @ItsLogic Guys, do you remember what was the minimal loss when you stopped the training? And do you know (from papers or experience) what loss is considered great for such model/training goal?

Fine-tuning my 13B at the moment and guessing when it's "enough" :)

baleksey avatar Mar 17 '23 00:03 baleksey

@baleksey Trained a 13B LoRA with one epoch as well; my lowest loss was around 0.75. Text-gen results don't feel too different from 7B for me, though.

0xbitches avatar Mar 17 '23 01:03 0xbitches

Crazy to think we can even fine-tune the 13B on a 4090. Next step is to train the 33B and quantize it to 4-bit int :D

devilismyfriend avatar Mar 17 '23 01:03 devilismyfriend

@ItsLogic Guys, do you remember what was the minimal loss when you stopped the training?

I hovered between the high 0.7s and low 0.8s from about 20% of the way through the first epoch, and it ended just above 0.8. I just finished epoch 3 and got 0.78. It took 9 hours and 20 minutes.

ItsLogic avatar Mar 17 '23 08:03 ItsLogic

Thanks! Just finished my 13B as well, at 0.79.

Tried fine-tuning 33B on an A6000. It uses 48.4GB out of 49 and estimates about 12 hours per epoch. Maybe I'll do it later.

baleksey avatar Mar 17 '23 10:03 baleksey

You may also need to pass max_memory={0: 18} to LlamaForCausalLM.from_pretrained if you run into OOM errors when fine-tuning the 13B model.
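A sketch of where that cap plugs in. The checkpoint id is a placeholder and the heavy call is left commented out; note that Accelerate also accepts unambiguous string sizes such as "18GiB", which avoid any question of units (a bare int is interpreted as a byte count):

```python
# max_memory maps device identifiers to per-device caps consumed by
# Accelerate's device_map="auto" dispatcher; capping GPU 0 below its
# physical size leaves headroom for activations during fine-tuning.
max_memory = {0: "18GiB", "cpu": "30GiB"}

# from transformers import LlamaForCausalLM
# model = LlamaForCausalLM.from_pretrained(
#     "decapoda-research/llama-13b-hf",  # placeholder checkpoint id
#     load_in_8bit=True,
#     device_map="auto",
#     max_memory=max_memory,
# )
```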

mastoca avatar Mar 17 '23 16:03 mastoca

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

justinh-rahb avatar Mar 17 '23 17:03 justinh-rahb

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

I uploaded my epoch 1 and epoch 3 loras https://huggingface.co/Draff/llama-alpaca-stuff/tree/main/Alpaca-Loras

ItsLogic avatar Mar 17 '23 17:03 ItsLogic

@ItsLogic Forgive my ignorance, but how do I download this? from_pretrained() doesn't support subdirectories.

justinh-rahb avatar Mar 17 '23 17:03 justinh-rahb

@ItsLogic Forgive my ignorance, but how do I download this? from_pretrained() doesn't support subdirectories.

Just download adapter_config.json and adapter_model.bin manually, put them in a folder, and then edit the path in generate.py to point to that folder.
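A small sanity check before pointing generate.py at the folder; the folder name below is hypothetical, and the two file names are the ones mentioned above:

```python
import os

def adapter_folder_ok(path: str) -> bool:
    """Return True if the folder holds the two files PEFT loading needs."""
    required = {"adapter_config.json", "adapter_model.bin"}
    return os.path.isdir(path) and required.issubset(os.listdir(path))

# Then edit generate.py to load the adapter from the local folder, e.g.:
# model = PeftModel.from_pretrained(model, "./alpaca-13b-lora")  # hypothetical path
```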

ItsLogic avatar Mar 17 '23 17:03 ItsLogic

I uploaded my epoch 1 and epoch 3 loras

Any noticeable difference between epoch 1 and epoch 3? I'm going to train a model on my cleaned up dataset. There are a lot of issues with the current dataset.

gururise avatar Mar 17 '23 18:03 gururise

Any noticeable difference between epoch 1 and epoch 3?

Haven't done any in-depth testing yet, but from my usage so far they feel about the same.

ItsLogic avatar Mar 17 '23 18:03 ItsLogic

Can report the same as @ItsLogic: e1 and e3 feel roughly the same, probably because the losses are both ~0.78.

Somewhat related: when I was trying the model out with textgen-webui, the outputs were incredibly short. Don't know if it's a problem with the webui or the model itself.

0xbitches avatar Mar 17 '23 19:03 0xbitches

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

Not so sure the 13B model is going to perform much better than the 7B right now; the Stanford dataset has a ton of issues. I've been going through trying to fix them.

I've made a first best-effort pass to resolve the issues and am training a new 7B model right now, but my GPU is a potato, so I won't have anything to show until tomorrow.

gururise avatar Mar 17 '23 20:03 gururise

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

Not so sure the 13B model is going to perform much better than the 7B right now; the Stanford dataset has a ton of issues. I've been going through trying to fix them.

I've made a first best-effort pass to resolve the issues and am training a new 7B model right now, but my GPU is a potato, so I won't have anything to show until tomorrow.

@gururise Could you please elaborate on the issues you are seeing with the Stanford Alpaca dataset?

Ahh, never mind -- I just saw this: https://github.com/tloen/alpaca-lora/pull/32

rohvani avatar Mar 17 '23 21:03 rohvani

Hi @gururise

but my GPU is a potato

I'm a co-founder of qblocks.cloud. We would love to offer you some GPU credits to help with your research and experimentation on alpaca / lora. Can we connect somehow?

gvijqb avatar Mar 28 '23 07:03 gvijqb

Hi @gururise

but my GPU is a potato

I'm a co-founder of qblocks.cloud. We would love to offer you some GPU credits to help with your research and experimentation on alpaca / lora. Can we connect somehow?

Would love to take you up on your offer of GPU credits to generate some fine-tuned Alpaca models using my cleaned dataset. I've sent you an email.

gururise avatar Mar 28 '23 15:03 gururise

@tloen @ItsLogic Guys, do you remember what was the minimal loss when you stopped the training? And do you know (from papers or experience) what loss is considered great for such model/training goal?

Fine-tuning my 13B at the moment and guessing when it's "enough" :)

[loss curve figure] This is my loss curve for the 13B model, with all parameters the same as for 7B. Training time was 7.5h on a 3090. Wondering if others see the same.

kizunasunhy avatar Apr 07 '23 01:04 kizunasunhy

Training time 7.5h on 3090.

I suppose the Kaggle grandmasters' lesson of using 5-fold cross-validation even for DL models is out the window then :)

mirekphd avatar May 01 '23 06:05 mirekphd

@ItsLogic Can you show what your trainer args and hyperparameters are for the 13B training run? My models seem to take WAY longer than 10 hours to train on a 3090 Ti. Like a couple of days at a time. :(

pGit1 avatar May 24 '23 03:05 pGit1

@ItsLogic Never mind. The longer training time definitely stemmed from cutoff_len going from 256 to 512.

pGit1 avatar May 24 '23 06:05 pGit1

@ItsLogic I get the error below while fine-tuning 13B on 2 Quadro 24GB cards. Setting micro_batch_size to 2 does not help either. What are your training params? The same code works fine for distributed fine-tuning of the 7B model, so I suspect I'll have to reduce some size parameter, just not sure which one.

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.65 GiB total capacity; 21.88 GiB already allocated; 8.31 MiB free; 22.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "finetune.py", line 317, in <module>
    fire.Fire(train)
```
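As the allocator hint in the traceback suggests, one knob worth trying is max_split_size_mb; the 128 value here is just an illustrative starting point, not a verified fix:

```shell
# Reduce CUDA allocator fragmentation; the value is in MiB.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then relaunch finetune.py as usual.
```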

pmudgal-Intel avatar Dec 05 '23 22:12 pmudgal-Intel

Update: bitsandbytes==0.37.2 solved the problem. I had bitsandbytes==0.39.0 previously.

pmudgal-Intel avatar Dec 06 '23 20:12 pmudgal-Intel