llama.cpp
Add proper instructions for using Alpaca models
So I am looking at https://github.com/antimatter15/alpaca.cpp and I see they are already running 30B Alpaca models, while we are struggling to run 7B due to the recent tokenizer updates.
I also see that the models are now even floating on Hugging Face - I guess license issues are no longer a problem?
We should add detailed instructions for obtaining the Alpaca models and a temporary explanation of how to use the following script to make the models compatible with the latest master:
https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476227818
The bigger issue is that people keep producing the old version of the ggml models instead of migrating to the latest llama.cpp changes. And therefore, we now need this extra conversion step. It's best to figure out the steps for generating the Alpaca models and generate them in the correct format.
Edit: just don't post direct links to the models!
Here is what I did to run Alpaca 30b on my system with llama.cpp. I would assume it would work with Alpaca 13b as well.
- Downloaded and built llama.cpp from scratch, as the latest version is required to specify that the model is in one file with the new `--n_parts 1` parameter
- Downloaded this 30b alpaca model https://huggingface.co/Pi3141/alpaca-30B-ggml/tree/main (if you check the model card, you can find links to other alpaca model sizes)
- Named the file `ggml-alpaca-30b-q4.bin` and placed it in /models/Alpaca/30b inside llama.cpp
- Downloaded the script mentioned here: https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476227818
- Named it convert.py and placed it in the root folder of llama.cpp
- Downloaded the tokenizer mentioned here: https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476242192
- Placed the tokenizer.model file in /models
- Ran `python convert.py models/Alpaca/30b models/tokenizer.model` in the command prompt from the base folder of llama.cpp (personally I got a message that I needed the module `sentencepiece`, so I ran `pip install sentencepiece`, re-ran `python convert.py models/Alpaca/30b models/tokenizer.model`, and it worked. You may or may not encounter this error.)
- In the 30b folder there is now a `ggml-alpaca-30b-q4.bin` and a `ggml-alpaca-30b-q4.bin.tmp` file. I renamed `ggml-alpaca-30b-q4.bin` to `ggml-alpaca-30b-q4.bin.old` to keep it as a backup, and `ggml-alpaca-30b-q4.bin.tmp` to `ggml-alpaca-30b-q4.bin`
- Now I can run llama.cpp with `./main -m ./models/alpaca/30b/ggml-alpaca-30b-q4.bin --color -f ./prompts/alpaca.txt -ins --n_parts 1`
Maybe this can be of temporary help to anybody else eager to set it up. Please correct me if I've made any mistakes, I wrote it retroactively from memory.
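If you prefer to script those convert/backup/rename steps, here is a minimal Python sketch under the same assumptions (convert.py from issue #324 sitting in the llama.cpp root, writing its output next to the input as a .tmp file); the paths are just the ones from my setup, so adjust as needed:

```python
# Sketch: automate the convert / backup / rename steps described above.
# Assumes convert.py (from issue #324) is in the llama.cpp root and that it
# writes the converted model next to the input as <model>.bin.tmp.
import subprocess
from pathlib import Path

model_dir = Path("models/Alpaca/30b")                 # adjust to your layout
model = model_dir / "ggml-alpaca-30b-q4.bin"
tokenizer = Path("models/tokenizer.model")

# Run the conversion script (may require `pip install sentencepiece`).
subprocess.run(["python", "convert.py", str(model_dir), str(tokenizer)], check=True)

# Keep the original as a backup, then promote the converted .tmp file.
tmp = model.parent / (model.name + ".tmp")            # ggml-alpaca-30b-q4.bin.tmp
backup = model.parent / (model.name + ".old")         # ggml-alpaca-30b-q4.bin.old
model.rename(backup)
tmp.rename(model)
print(f"converted model at {model}, original kept as {backup}")
```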
Can confirm the above works for the 13B model too.
The above instructions work for me too for the 13B model! Thank you!
Checksum for the converted (ggmf v1) Pi3141 alpaca-30B-ggml:
$ sha256sum ggml-model-q4_0.bin
969652d32ce186ca3c93217ece8311ebe81f15939aa66a6fe162a08dd893faf8 ggml-model-q4_0.bin
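If you want to check that hash without sha256sum (e.g. on Windows), a small Python equivalent:

```python
# Compute the SHA-256 of a model file and compare it to the posted checksum.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("ggml-model-q4_0.bin"))
# Expected (per the checksum above):
# 969652d32ce186ca3c93217ece8311ebe81f15939aa66a6fe162a08dd893faf8
```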
All of them (7B/13B/30B/65B*), 4-bit quantized, q4_0 (RTN) and GPTQ, in the new tokenizer format. *No alpaca-65b though, as it would take a very long time. Batteries not included.
https://btcache.me/torrent/E5322AB4676E24632A907FD9846234BB40265C4F https://torrage.info/torrent.php?h=e5322ab4676e24632a907fd9846234bb40265c4f
single command option:
aria2c --summary-interval=0 --bt-max-peers=0 http://taco.cab/ggml/ggml-q4.torrent
as usual, the alpaca and gptq models need the --n_parts 1 option
hope that helps :+1:
@anzz1 you did not specify which models your links are for. Also, please provide checksums :)
me: i should try and debug all those crashes
me: > help me write a song about llama.cpp (c++ api for facebooks llm)
llama.cpp:
A llama is an animal that's so strange,
It can do things we only imagine.
LLamaCPP is the code that gives it its brawn,
Allowing us to use it like a clown.
The api has commands we can use,
To take advantage of this llama abuse.
It's an interface that let's us be boss,
If you know the right way to make your call.
(the 30B alpaca lora finetune by pi)
i linked the checksums here https://github.com/ggerganov/llama.cpp/issues/374#issuecomment-1480719278
@anzz1 Thank you for the download. Did you see the latest fix to GPTQ conversion?
Yes.
@anzz1 Any chance you could re-convert them using the changes from this: #423
No need.
I just saw the updated readme file stating that you cannot link to model downloads anywhere on this repository. Would instructions like mine, where in step 2 I link to a model download on HF, violate that rule going forward? I assume that the instructions as is are okay because they were written before the rule, but what about going forward?
the ones you linked are sadly mixed, and not "pure" lora models. so i would assume no. you could just say "pi3141 alpaca 30B" model, and it would be fine i guess.
Interesting, I didn't realize it was mixed. Can you explain what that means in this context?
"mixed" -> "merged" If you look at this for example https://huggingface.co/tloen/alpaca-lora-7b/tree/main , those are only the lora weights. I think (need to actually read the paper) those are either not directly derived from llama, or are derived enough, to count as remixing/fairuse or something.
edit: you can clearly see by the filesize.
Worked for me, thanks @anzz1. The AI is running kind of slow though; I'm on Windows with a 5950X and 80+ GB of RAM... but the writing time is like GPT-4 on max load x) Any params I forgot to set?
Edit: tried changing the -t value to 32, nothing changes, the prompt is still slow AF.
@anzz1 , did you actually try those models? Using the latest master (4b8efff) on Windows to try to load alpaca-13B-ggml GPTQ from your torrent, it just starts spitting out C# code as soon as I launch it.
C:\_downloads\ggml-q4\models\alpaca-13B-ggml>main.exe -m ggml-model-gptq4.bin --interactive --color --n_parts 1
main: seed = 1679990008
llama_model_load: loading model from 'ggml-model-gptq4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 4
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 1
llama_model_load: type = 2
llama_model_load: ggml ctx size = 10101.68 MB
llama_model_load: mem required = 12149.68 MB (+ 1608.00 MB per state)
llama_model_load: loading model part 1/1 from 'ggml-model-gptq4.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 9701.58 MB / num tensors = 363
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace _14.Bucket_Sort
{
class Program
{
static void Main(string[] args)
{
var input = Console.ReadLine();
int n = int.Parse(input);
Then I used CTRL+C to interrupt it thinking it could be a minor bug, and asked it "who is Kanye West". Response until I closed the program:
What did he do?
;
arr[i] = long.Parse(line);
}
Array.Sort(arr);
for (int i = 0; i < n;
I also downloaded the non-GPTQ version, it has the same issue, spitting out C++ code:
>main.exe -m ggml-model-q4_0.bin --interactive --color --n_parts 1
main: seed = 1679992628
llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 1
llama_model_load: type = 2
llama_model_load: ggml ctx size = 8159.49 MB
llama_model_load: mem required = 10207.49 MB (+ 1608.00 MB per state)
llama_model_load: loading model part 1/1 from 'ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 7759.39 MB / num tensors = 363
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
#include "pch.h"
#include "Scenario1_LaunchUri.xaml.h"
using namespace SDKTemplate
Hi! I'm on windows, using master 5a5f8b1
- I downloaded 13b and 30b alpaca models as mentioned by @madmads11 and @Puncia
- Ran `python convert-unversioned-ggml-to-ggml.py models\Alpaca\13B models/LLaMA/tokenizer.model` and `python convert-unversioned-ggml-to-ggml.py models\Alpaca\30B models/LLaMA/tokenizer.model`
- I can run llama.cpp with `bin\Release\main.exe -m models\Alpaca\13B\ggml-alpaca-13b-q4_0.bin --n_parts 1 --color -f prompts\alpaca.txt -ins -t 6` or `bin\Release\main.exe -m models\Alpaca\30B\ggml-alpaca-30b-q4_0.bin --n_parts 1 --color -f prompts\alpaca.txt -ins -t 6`, but it doesn't work well

Does this happen to everyone or just me?
I edited this whole thing because it was basically incorrect.
@maria-mh07 It's working more or less as you should expect.
@paniphons You need to provide a prompt from the command line with --prompt or using -f and point to a file.
What do the other parameters do? It's a bit confusing: repeat_last_n, repeat_penalty, top_k, top_p, temp, seed, threads.
I explained a bunch of them in https://github.com/ggerganov/llama.cpp/discussions/559#discussioncomment-5455407.
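For rough intuition, here is a toy Python sketch of what those sampling knobs do; it is not llama.cpp's actual implementation, just the general idea behind temp, top_k, top_p, repeat_last_n/repeat_penalty and seed (threads is unrelated to sampling, it only sets how many CPU threads are used):

```python
# Toy illustration of the main sampling parameters; not llama.cpp's exact code.
import math
import random

def sample(logits, recent_tokens, temp=0.8, top_k=40, top_p=0.95,
           repeat_last_n=64, repeat_penalty=1.1, seed=None):
    rng = random.Random(seed)          # seed: same seed -> same random draws
    logits = list(logits)

    # repeat_penalty: discourage tokens seen in the last repeat_last_n outputs
    for tok in set(recent_tokens[-repeat_last_n:]):
        logits[tok] = (logits[tok] / repeat_penalty if logits[tok] > 0
                       else logits[tok] * repeat_penalty)

    # temp: lower = more deterministic, higher = more random
    probs = [math.exp(x / temp) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]

    # top_k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top_p: trim further to the smallest set with cumulative probability >= top_p
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # draw one token from the surviving candidates
    norm = sum(probs[i] for i in kept)
    r, acc = rng.random() * norm, 0.0
    for i in kept:
        acc += probs[i]
        if acc >= r:
            return i
    return kept[-1]
```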
Hi @madmads11 @j-f1
Just yesterday, this migration script was added: migrate-ggml-2023-03-30-pr613.py.
So, what I did on top of @madmads11's instructions was to use this script to generate the final bin file to work with (see the sketch after the list below).
Details:
- Alpaca Model used : https://huggingface.co/Pi3141/alpaca-lora-7B-ggml
- Tokenizer used : https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/tokenizer.model
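If it helps, this is roughly how that step can be scripted; I'm assuming the migration script takes an input file and an output file as positional arguments, so check `python migrate-ggml-2023-03-30-pr613.py --help` for the real usage:

```python
# Sketch: run the new migration script on the already-converted Alpaca file.
# Positional <input> <output> arguments are an assumption; verify with --help.
import subprocess

subprocess.run(
    ["python", "migrate-ggml-2023-03-30-pr613.py",
     "models/Alpaca/7B/ggml-model-q4_0.bin",   # file from the earlier conversion step
     "models/alpaca-7b-migrated.bin"],         # name used in the ./main command below
    check=True,
)
```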
I started using llama.cpp just today to run an Alpaca model (I was using antimatter15's alpaca.cpp until now).
The same model, converted and loaded in llama.cpp, runs very slowly compared to running it in alpaca.cpp.
How I started up the model:
./main -m ./models/alpaca-7b-migrated.bin -ins --n_parts 1
The logs:
main: seed = 1680346670
llama_model_load: loading model from './models/alpaca-7b-migrated.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/alpaca-7b-migrated.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 2
Additionally, I also tried this bin file: https://huggingface.co/Pi3141/alpaca-lora-7B-ggml/blob/main/ggml-model-q4_1.bin, which is already migrated for llama.cpp. Even with this one, the model runs slowly in llama.cpp.
One thing I noticed: when loading these two model variants, this line differs from the output above:
llama_model_load: f16 = 3
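If you are curious where that f16 value comes from, you can peek at the file header yourself. A minimal sketch, assuming the header layout llama.cpp used at the time (magic, an optional version for the newer formats, then seven 32-bit hyperparameters ending in the f16/ftype field, where 2 = q4_0 and 3 = q4_1); treat the exact layout and magic values as assumptions, not a spec:

```python
# Peek at a ggml model header and print the fields llama_model_load logs,
# including f16/ftype (2 = q4_0, 3 = q4_1 in this era of llama.cpp).
# Header layout assumed: magic, [version], then 7 x int32 hyperparameters.
import struct
import sys

MAGIC_GGML = 0x67676d6c  # old, unversioned files
MAGIC_GGMF = 0x67676d66  # versioned (ggmf)
MAGIC_GGJT = 0x67676a74  # versioned, mmap-able (ggjt, PR #613)

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))
    if magic in (MAGIC_GGMF, MAGIC_GGJT):
        version, = struct.unpack("<I", f.read(4))
        print(f"version  = {version}")
    elif magic != MAGIC_GGML:
        sys.exit("not a ggml/ggmf/ggjt file")
    names = ("n_vocab", "n_embd", "n_mult", "n_head", "n_layer", "n_rot", "f16")
    for name, value in zip(names, struct.unpack("<7i", f.read(28))):
        print(f"{name:8s} = {value}")
```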
I get this error upon running convert.py:

% python3 convert.py models/alpaca/13B models/tokenizer.model
converting models/alpaca/13B/ggml-model-q4_0.bin
Traceback (most recent call last):
  File "/Users/FD00199/llama.cpp/convert.py", line 96, in <module>
    main()
  File "/Users/FD00199/llama.cpp/convert.py", line 93, in main
    convert_one_file(file, tokenizer)
  File "/Users/FD00199/llama.cpp/convert.py", line 78, in convert_one_file
    write_header(f_out, read_header(f_in))
  File "/Users/FD00199/llama.cpp/convert.py", line 27, in write_header
    raise Exception('Invalid file magic. Must be an old style ggml file.')
Exception: Invalid file magic. Must be an old style ggml file.
If you have this version, ggml-model-q4_1.bin, you get the error; with ggml-model-q4_0.bin you don't.
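That matches the error text: convert.py only accepts old-style, unversioned ggml files, and already-migrated files are rejected. If you want to check a file before converting, a quick sketch (same assumed magic values as in the header-peek snippet above):

```python
# Quick check: does this file still need the old-style conversion?
import struct
import sys

OLD_GGML = 0x67676d6c                              # unversioned ggml -> run convert.py
NEWER = {0x67676d66: "ggmf (already converted)",
         0x67676a74: "ggjt (already migrated, mmap format)"}

with open(sys.argv[1], "rb") as f:
    magic, = struct.unpack("<I", f.read(4))

if magic == OLD_GGML:
    print("old-style ggml file: convert.py applies")
elif magic in NEWER:
    print(f"{NEWER[magic]}: convert.py will refuse it, skip that step")
else:
    print(f"unknown magic 0x{magic:08x}: probably not a ggml model file")
```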