llama.cpp
Add proper instructions for using Alpaca models
So I am looking at https://github.com/antimatter15/alpaca.cpp and I see they are already running 30B Alpaca models, while we are struggling to run 7B due to the recent tokenizer updates.
I also see that the models are now even floating on Hugging Face - I guess license issues are no longer a problem?
We should add detailed instructions for obtaining the Alpaca models and a temporary explanation of how to use the following script to make the models compatible with the latest master:
https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476227818
The bigger issue is that people keep producing the old version of the ggml models instead of migrating to the latest llama.cpp changes. And therefore, we now need this extra conversion step. It's best to figure out the steps for generating the Alpaca models and generate them in the correct format.
Edit: just don't post direct links to the models!
Here is what I did to run Alpaca 30b on my system with llama.cpp. I would assume it would work with Alpaca 13b as well.
- Downloaded and built llama.cpp from scratch, as the latest version is required to specify that the model is in one file with the new `--n_parts 1` parameter
- Downloaded this 30b alpaca model https://huggingface.co/Pi3141/alpaca-30B-ggml/tree/main (if you check the model card, you can find links to other alpaca model sizes)
- Named the file `ggml-alpaca-30b-q4.bin` and placed it in /models/Alpaca/30b inside llama.cpp
- Downloaded the script mentioned here: https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476227818
- Named it convert.py and placed it in the root folder of llama.cpp
- Downloaded the tokenizer mentioned here: https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476242192
- Placed the tokenizer.model file in /models
- Ran `python convert.py models/Alpaca/30b models/tokenizer.model` in the command prompt from the base folder of llama.cpp (personally I got a message that I needed the module `sentencepiece`, so I ran `pip install sentencepiece`, re-ran `python convert.py models/Alpaca/30b models/tokenizer.model`, and it worked. You may or may not encounter this error.)
- In the 30b folder there is now a `ggml-alpaca-30b-q4.bin` and a `ggml-alpaca-30b-q4.bin.tmp` file. I renamed `ggml-alpaca-30b-q4.bin` to `ggml-alpaca-30b-q4.bin.old` to keep it as a backup, and `ggml-alpaca-30b-q4.bin.tmp` to `ggml-alpaca-30b-q4.bin`
- Now I can run llama.cpp with `./main -m ./models/alpaca/30b/ggml-alpaca-30b-q4.bin --color -f ./prompts/alpaca.txt -ins --n_parts 1`
Maybe this can be of temporary help to anybody else eager to set it up. Please correct me if I've made any mistakes, I wrote it retroactively from memory.
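If you prefer to script those convert/backup/rename steps, here is a minimal Python sketch under the same assumptions (convert.py from issue #324 sitting in the llama.cpp root, writing its output next to the input as a .tmp file); the paths are just the ones from my setup, so adjust as needed:

```python
# Sketch: automate the convert / backup / rename steps described above.
# Assumes convert.py (from issue #324) is in the llama.cpp root and that it
# writes the converted model next to the input as <model>.bin.tmp.
import subprocess
from pathlib import Path

model_dir = Path("models/Alpaca/30b")                 # adjust to your layout
model = model_dir / "ggml-alpaca-30b-q4.bin"
tokenizer = Path("models/tokenizer.model")

# Run the conversion script (may require `pip install sentencepiece`).
subprocess.run(["python", "convert.py", str(model_dir), str(tokenizer)], check=True)

# Keep the original as a backup, then promote the converted .tmp file.
tmp = model.parent / (model.name + ".tmp")            # ggml-alpaca-30b-q4.bin.tmp
backup = model.parent / (model.name + ".old")         # ggml-alpaca-30b-q4.bin.old
model.rename(backup)
tmp.rename(model)
print(f"converted model at {model}, original kept as {backup}")
```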
Can confirm the above works for the 13B model too.
The above instructions work for me too for the 13B model! Thank you!
Checksum for the converted (ggmf v1) Pi3141 alpaca-30B-ggml:
$ sha256sum ggml-model-q4_0.bin
969652d32ce186ca3c93217ece8311ebe81f15939aa66a6fe162a08dd893faf8 ggml-model-q4_0.bin
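If you want to check that hash without sha256sum (e.g. on Windows), a small Python equivalent:

```python
# Compute the SHA-256 of a model file and compare it to the posted checksum.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("ggml-model-q4_0.bin"))
# Expected (per the checksum above):
# 969652d32ce186ca3c93217ece8311ebe81f15939aa66a6fe162a08dd893faf8
```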
All of them (7B/13B/30B/65B*), 4-bit quantized, q4_0 (RTN) and GPTQ, in the new tokenizer format. *No alpaca-65b though, as it would take a very long time. Batteries not included.
https://btcache.me/torrent/E5322AB4676E24632A907FD9846234BB40265C4F https://torrage.info/torrent.php?h=e5322ab4676e24632a907fd9846234bb40265c4f
single command option:
aria2c --summary-interval=0 --bt-max-peers=0 http://taco.cab/ggml/ggml-q4.torrent
as usual, the alpaca and gptq models need the --n_parts 1 option
hope that helps :+1:
@anzz1 you did not specify which models your links are for. Also, please provide checksums :)
me: i should try and debug all those crashes
me: > help me write a song about llama.cpp (c++ api for facebooks llm)
llama.cpp:
A llama is an animal that's so strange,
It can do things we only imagine.
LLamaCPP is the code that gives it its brawn,
Allowing us to use it like a clown.
The api has commands we can use,
To take advantage of this llama abuse.
It's an interface that let's us be boss,
If you know the right way to make your call.
(the 30B alpaca lora finetune by pi)
i linked the checksums here https://github.com/ggerganov/llama.cpp/issues/374#issuecomment-1480719278
@anzz1 Thank you for the download. Did you see the latest fix to GPTQ conversion?
Yes.
@anzz1 Any chance you could re-convert them using the changes from this: #423
No need.
I just saw the updated readme file stating that you cannot link to model downloads anywhere on this repository. Would instructions like mine, where in step 2 I link to a model download on HF, violate that rule going forward? I assume that the instructions as is are okay because they were written before the rule, but what about going forward?
the ones you linked are sadly mixed, and not "pure" lora models. so i would assume no. you could just say "pi3141 alpaca 30B" model, and it would be fine i guess.
Interesting, I didn't realize it was mixed. Can you explain what that means in this context?
"mixed" -> "merged" If you look at this for example https://huggingface.co/tloen/alpaca-lora-7b/tree/main , those are only the lora weights. I think (need to actually read the paper) those are either not directly derived from llama, or are derived enough, to count as remixing/fairuse or something.
edit: you can clearly see by the filesize.
Worked for me, thanks @anzz1. The AI is running kind of slow though; I'm on Windows with a 5950X and 80+ GB of RAM... but the writing time is like GPT-4 on max load x) Any params I forgot to set?
Edit: tried changing the -t value to 32, nothing changes, the prompt is still slow AF.
@anzz1 , did you actually try those models? Using the latest master (4b8efff) on Windows to try to load alpaca-13B-ggml GPTQ from your torrent, it just starts spitting out C# code as soon as I launch it.
C:\_downloads\ggml-q4\models\alpaca-13B-ggml>main.exe -m ggml-model-gptq4.bin --interactive --color --n_parts 1
main: seed = 1679990008
llama_model_load: loading model from 'ggml-model-gptq4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 4
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 1
llama_model_load: type = 2
llama_model_load: ggml ctx size = 10101.68 MB
llama_model_load: mem required = 12149.68 MB (+ 1608.00 MB per state)
llama_model_load: loading model part 1/1 from 'ggml-model-gptq4.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 9701.58 MB / num tensors = 363
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace _14.Bucket_Sort
{
class Program
{
static void Main(string[] args)
{
var input = Console.ReadLine();
int n = int.Parse(input);
Then I used CTRL+C to interrupt it thinking it could be a minor bug, and asked it "who is Kanye West". Response until I closed the program:
What did he do?
;
arr[i] = long.Parse(line);
}
Array.Sort(arr);
for (int i = 0; i < n;
I also downloaded the non-GPTQ version, it has the same issue, spitting out C++ code:
>main.exe -m ggml-model-q4_0.bin --interactive --color --n_parts 1
main: seed = 1679992628
llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 1
llama_model_load: type = 2
llama_model_load: ggml ctx size = 8159.49 MB
llama_model_load: mem required = 10207.49 MB (+ 1608.00 MB per state)
llama_model_load: loading model part 1/1 from 'ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 7759.39 MB / num tensors = 363
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
#include "pch.h"
#include "Scenario1_LaunchUri.xaml.h"
using namespace SDKTemplate
Hi! I'm on windows, using master 5a5f8b1
- I downloaded 13b and 30b alpaca models as mentioned by @madmads11 and @Puncia
- Ran `python convert-unversioned-ggml-to-ggml.py models\Alpaca\13B models/LLaMA/tokenizer.model` and `python convert-unversioned-ggml-to-ggml.py models\Alpaca\30B models/LLaMA/tokenizer.model`
- I can run llama.cpp with `bin\Release\main.exe -m models\Alpaca\13B\ggml-alpaca-13b-q4_0.bin --n_parts 1 --color -f prompts\alpaca.txt -ins -t 6` or `bin\Release\main.exe -m models\Alpaca\30B\ggml-alpaca-30b-q4_0.bin --n_parts 1 --color -f prompts\alpaca.txt -ins -t 6`, but it doesn't work well

Does this happen to everyone or just me?
I edited this whole thing because it was basically incorrect.
@maria-mh07 It's working more or less as you should expect.
@paniphons You need to provide a prompt from the command line with --prompt or using -f and point to a file.
What do the other parameters do? It's a bit confusing: repeat_last_n, repeat_penalty, top_k, top_p, temp, seed, threads.
I explained a bunch of them in https://github.com/ggerganov/llama.cpp/discussions/559#discussioncomment-5455407.
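For rough intuition, here is a toy Python sketch of what those sampling knobs do; it is not llama.cpp's actual implementation, just the general idea behind temp, top_k, top_p, repeat_last_n/repeat_penalty and seed (threads is unrelated to sampling, it only sets how many CPU threads are used):

```python
# Toy illustration of the main sampling parameters; not llama.cpp's exact code.
import math
import random

def sample(logits, recent_tokens, temp=0.8, top_k=40, top_p=0.95,
           repeat_last_n=64, repeat_penalty=1.1, seed=None):
    rng = random.Random(seed)          # seed: same seed -> same random draws
    logits = list(logits)

    # repeat_penalty: discourage tokens seen in the last repeat_last_n outputs
    for tok in set(recent_tokens[-repeat_last_n:]):
        logits[tok] = (logits[tok] / repeat_penalty if logits[tok] > 0
                       else logits[tok] * repeat_penalty)

    # temp: lower = more deterministic, higher = more random
    probs = [math.exp(x / temp) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]

    # top_k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top_p: trim further to the smallest set with cumulative probability >= top_p
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # draw one token from the surviving candidates
    norm = sum(probs[i] for i in kept)
    r, acc = rng.random() * norm, 0.0
    for i in kept:
        acc += probs[i]
        if acc >= r:
            return i
    return kept[-1]
```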
Hi @madmads11 @j-f1
Just yesterday, this migration script was added: migrate-ggml-2023-03-30-pr613.py.
So, what I did on top of @madmads11's instructions was to use this script to generate the final bin file to work with (see the sketch after the list below).
Details:
- Alpaca Model used : https://huggingface.co/Pi3141/alpaca-lora-7B-ggml
- Tokenizer used : https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/tokenizer.model
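If it helps, this is roughly how that step can be scripted; I'm assuming the migration script takes an input file and an output file as positional arguments, so check `python migrate-ggml-2023-03-30-pr613.py --help` for the real usage:

```python
# Sketch: run the new migration script on the already-converted Alpaca file.
# Positional <input> <output> arguments are an assumption; verify with --help.
import subprocess

subprocess.run(
    ["python", "migrate-ggml-2023-03-30-pr613.py",
     "models/Alpaca/7B/ggml-model-q4_0.bin",   # file from the earlier conversion step
     "models/alpaca-7b-migrated.bin"],         # name used in the ./main command below
    check=True,
)
```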
I started using llama.cpp just today to run an Alpaca model (I was using antimatter15's alpaca.cpp until now).
The same model, converted and loaded in llama.cpp, runs very slowly compared to running it in alpaca.cpp.
How I started up the model:
./main -m ./models/alpaca-7b-migrated.bin -ins --n_parts 1
The logs:
main: seed = 1680346670
llama_model_load: loading model from './models/alpaca-7b-migrated.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/alpaca-7b-migrated.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 2
Additionally, I also tried this bin file: https://huggingface.co/Pi3141/alpaca-lora-7B-ggml/blob/main/ggml-model-q4_1.bin, which is already migrated for llama.cpp. Even with this one, the model runs slowly in llama.cpp.
One thing I noticed: when loading these two model variants, this line differs from the output above:
llama_model_load: f16 = 3
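If you are curious where that f16 value comes from, you can peek at the file header yourself. A minimal sketch, assuming the header layout llama.cpp used at the time (magic, an optional version for the newer formats, then seven 32-bit hyperparameters ending in the f16/ftype field, where 2 = q4_0 and 3 = q4_1); treat the exact layout and magic values as assumptions, not a spec:

```python
# Peek at a ggml model header and print the fields llama_model_load logs,
# including f16/ftype (2 = q4_0, 3 = q4_1 in this era of llama.cpp).
# Header layout assumed: magic, [version], then 7 x int32 hyperparameters.
import struct
import sys

MAGIC_GGML = 0x67676d6c  # old, unversioned files
MAGIC_GGMF = 0x67676d66  # versioned (ggmf)
MAGIC_GGJT = 0x67676a74  # versioned, mmap-able (ggjt, PR #613)

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))
    if magic in (MAGIC_GGMF, MAGIC_GGJT):
        version, = struct.unpack("<I", f.read(4))
        print(f"version  = {version}")
    elif magic != MAGIC_GGML:
        sys.exit("not a ggml/ggmf/ggjt file")
    names = ("n_vocab", "n_embd", "n_mult", "n_head", "n_layer", "n_rot", "f16")
    for name, value in zip(names, struct.unpack("<7i", f.read(28))):
        print(f"{name:8s} = {value}")
```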
I get this error upon running convert.py:

% python3 convert.py models/alpaca/13B models/tokenizer.model
converting models/alpaca/13B/ggml-model-q4_0.bin
Traceback (most recent call last):
  File "/Users/FD00199/llama.cpp/convert.py", line 96, in <module>
    main()
  File "/Users/FD00199/llama.cpp/convert.py", line 93, in main
    convert_one_file(file, tokenizer)
  File "/Users/FD00199/llama.cpp/convert.py", line 78, in convert_one_file
    write_header(f_out, read_header(f_in))
  File "/Users/FD00199/llama.cpp/convert.py", line 27, in write_header
    raise Exception('Invalid file magic. Must be an old style ggml file.')
Exception: Invalid file magic. Must be an old style ggml file.
If you have this version, ggml-model-q4_1.bin, you get the error; with ggml-model-q4_0.bin you don't.
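That matches the error text: convert.py only accepts old-style, unversioned ggml files, and already-migrated files are rejected. If you want to check a file before converting, a quick sketch (same assumed magic values as in the header-peek snippet above):

```python
# Quick check: does this file still need the old-style conversion?
import struct
import sys

OLD_GGML = 0x67676d6c                              # unversioned ggml -> run convert.py
NEWER = {0x67676d66: "ggmf (already converted)",
         0x67676a74: "ggjt (already migrated, mmap format)"}

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))

if magic == OLD_GGML:
    print("old-style ggml file: convert.py applies")
elif magic in NEWER:
    print(f"{NEWER[magic]}: convert.py will refuse it, skip that step")
else:
    print(f"unknown magic 0x{magic:08x}: probably not a ggml model file")
```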