alpaca.cpp
What will it take to get a 65B alpaca weight?
This was initially released with 7B and 13B alpaca weights. I added instructions on how to use a 30B alpaca weight yesterday, since it appeared here: https://huggingface.co/Pi3141/alpaca-30B-ggml. I know there is a 65B LLaMA weight, but as far as I understand, there is no 65B alpaca weight yet.
What will it take to get a 65B alpaca weight and how can we get this done as a community?
It will really just take a lot of money. We can fine-tune on the same datasets the others have been fine-tuned on in order to get a 65B model, but we will need to do it using state-of-the-art hardware.
Do we know if the person who made the 30B alpaca is working on the 65B file? I would love to pitch in a few dollars!
I've heard of people renting computational power from Google to train models/datasets, so that could be an option. It's relatively affordable, apparently.
Any knowledge of open initiatives doing this? Also would love to pitch in.
https://huggingface.co/chavinlo/Alpaca-65B/tree/main
The founder of Zapier has offered to fund this: https://twitter.com/mikeknoop/status/1638248244911435776
@Green-Sky newbie question: how come the file sizes are so small?
@d33tah The finetune does not contain the full model. To quote LoRA, the technique used: "LoRA reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while freezing the original weights. This vastly reduces the storage requirement for large language models adapted to specific tasks and enables efficient task-switching during deployment, all without introducing inference latency. LoRA also outperforms several other adaptation methods including adapter, prefix-tuning, and fine-tuning."
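In other words, only the small low-rank update matrices get saved, which is why the files are tiny. Here is a rough PyTorch sketch of the idea (illustrative only, not the actual Alpaca-LoRA code; the class name and hyperparameters are made up):

```python
# Minimal sketch of the LoRA idea: the pretrained weight W is frozen and
# only two small rank-r matrices A and B are trained, so the saved
# adapter is tiny compared to the full model.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (copied from the base model in practice).
        self.weight = nn.Parameter(torch.zeros(out_features, in_features),
                                   requires_grad=False)
        # Trainable low-rank factors; B starts at zero so the update is zero.
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.normal_(self.lora_A, std=0.02)
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + scaling * (x A^T) B^T  -- the low-rank update
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(4096, 4096)
y = layer(torch.randn(2, 4096))  # only lora_A and lora_B receive gradients
```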
Amen! How long would a 4x RTX A6000 (48 GB each) server need for the fine-tuning (assuming it's up to the task)?
Y'all: This shouldn't be difficult. I fine-tuned the 30B 8-bit LLaMA with Alpaca LoRA in about 26 hours on a couple of 3090s with good results. The 65B model quantized to 4-bit has a memory footprint roughly the same as 30B in 8-bit. It looks like the Alpaca 65B weights are available on HF here: https://huggingface.co/chavinlo/Alpaca-65B. I haven't been able to fine-tune the 65B 4-bit across multiple GPUs yet due to issues with training 4-bit models, but it certainly looks feasible, and I don't see why it couldn't be done on 2x 3090s with NVLink.
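For a rough sanity check on that memory claim (weights only, ignoring activations, KV cache, optimizer state, and quantization overhead such as scales):

```python
# Back-of-the-envelope weight-only memory estimate.
def weight_gib(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

print(f"30B @ 8-bit: ~{weight_gib(30, 8):.1f} GiB")  # ~27.9 GiB
print(f"65B @ 4-bit: ~{weight_gib(65, 4):.1f} GiB")  # ~30.3 GiB
```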
Awesome, thanks for the link. Will this work with either of the alpaca.cpp or llama.cpp projects?
I thought training and fine-tuning are primarily done in FP16? What are the drawbacks of training in 4-bit?
I'll have the motherboard I need tomorrow to set up my two 3090 Tis and NVLink adapter properly. I'd be willing to let it churn on the task for a few days; I'd love to have a 4-bit 65B Alpaca model to run on my setup. Any advice on training over NVLink? I'm a little new to the LLM stuff.
@RandyHaylor 4-bit LoRA training is currently only available in this repo, as far as I know: https://github.com/johnsmith0031/alpaca_lora_4bit
I'm interested in doing this myself, too. Will have to monitor the temperatures of the 3090s closely…
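For anyone wondering what a 4-bit LoRA setup roughly looks like in code, here is a sketch using the Hugging Face peft + bitsandbytes route. Note this is a different code path than the GPTQ-based alpaca_lora_4bit repo linked above; the model path and hyperparameters are placeholders, not a tested recipe:

```python
# Rough sketch of a 4-bit LoRA fine-tuning setup with transformers +
# bitsandbytes + peft. NOT the alpaca_lora_4bit (GPTQ) code path.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-65b",            # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",              # shard layers across available GPUs
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```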
Thanks!
Based on an article I read saying 3090 Tis only lose about 17% of their processing power when power-limited by 33% (300 W vs. 450 W), I'll probably run them like that for any long tasks. I'm not in such a hurry that I'll risk burning these cards out...
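If anyone wants to keep an eye on the cards during a multi-day run, something like this works. It assumes the NVML Python bindings are installed (e.g. pip install nvidia-ml-py; the package name is my assumption, check your environment):

```python
# Poll GPU temperature and power draw every few seconds during a long run.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # NVML reports mW
            print(f"GPU{i}: {temp} C, {watts:.0f} W")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```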
If I'm going to bother, what's the best 65B model to start from? Is there a trustworthy 4-bit one to pull from?
I'm currently working on just getting a 65B LLaMA 4-bit model running on my 3090 Tis (bare-metal Ubuntu Desktop 22.04 install). I suspect I'm having issues because the two cards aren't in the two main GPU slots (risers arriving later today will fix that and let me connect the NVLink adapter).
Any advice on how to get it going? I've had good luck with text-generation-webui running 30B models on one card so far.
I have an X570-based board with the two GPU slots 60 mm apart (three slots), and each runs at PCIe 4.0 x8 when both are in use. I managed to find a 60 mm NVLink adapter that didn't cost an arm and a leg. Inference with text-generation-webui works with 65B 4-bit and two 24 GB x090 NVIDIA cards. Just give it the GPU memory parameter and assign less memory to the first GPU: --gpu-memory 16 21
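For reference, the same per-GPU memory split can be expressed outside the webui with the transformers/accelerate max_memory map. This is just a sketch of the idea; the model path is a placeholder, and a GPTQ 4-bit checkpoint would need its own loader rather than plain from_pretrained:

```python
# Cap per-GPU memory so layers are sharded across two 24 GB cards,
# leaving headroom on GPU 0 for activations and the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-65b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",                     # let accelerate place layers
    max_memory={0: "16GiB", 1: "21GiB"},   # mirrors --gpu-memory 16 21
    torch_dtype=torch.float16,
)
```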