llama.cpp
Running a Vicuna-13B 4bit model?
I found this model: [ggml-vicuna-13b-4bit](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main), and judging by their online demo it's very impressive. I tried to run it with the latest version of llama.cpp - the model loads fine, but as soon as it loads it starts hallucinating and quits by itself. Do I need to have it converted or something like that?
Have a look here --> https://github.com/ggerganov/llama.cpp/discussions/643
I've had the most success with this model with the following patch to the instruct mode.
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 453450a..70b4f45 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -152,13 +152,13 @@ int main(int argc, char ** argv) {
}
// prefix & suffix for instruct mode
- const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
- const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+ const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n\n", true);
+ const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
- params.antiprompt.push_back("### Instruction:\n\n");
+ params.antiprompt.push_back("### Human:\n\n");
}
// enable interactive mode if reverse prompt or interactive start is specified
And then I run the model with the following options. If there are better options, please let me know.
./main \
--model ./models/ggml-vicuna-13b-4bit/ggml-vicuna-13b-4bit.bin \
--color \
--threads 7 \
--batch_size 256 \
--n_predict -1 \
--top_k 12 \
--top_p 1 \
--temp 0.36 \
--repeat_penalty 1.05 \
--ctx_size 2048 \
--instruct \
--reverse-prompt '### Human:' \
--file prompts/vicuna.txt
And my prompt file
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
Example output
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
> What NFL team won the Super Bowl in the year Justin Bieber was born?
The NFL team that won the Super Bowl in the year Justin Bieber was born, which is 1994, was the Dallas Cowboys. They defeated the Buffalo Bills in Super Bowl XXVII, which was held on January 31, 1994.
### Human:
> Who won the year after?
The NFL team that won the Super Bowl the year after Justin Bieber was born, which is 1995, was the Dallas Cowboys again. They defeated the Buffalo Bills in Super Bowl XXVIII, which was held on January 30, 1995. The Cowboys became the first team to win back-to-back Super Bowls since the Pittsburgh Steelers did so in the 1970s.
### Human:
I've made this change to align with FastChat and the roles it uses.
Could someone who knows C better than I make a prompt suffix flag? Having a prompt suffix flag would make it easier to stay compatible with other models in the future.
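A rough sketch of what such a flag could look like (the flag names and helper below are made up for illustration; they are not existing llama.cpp options): parse the prefix/suffix from the command line and tokenize them instead of the hard-coded "### Instruction:" / "### Response:" literals.
#include <cstring>
#include <string>
// Hypothetical sketch only - "--prompt-prefix" / "--prompt-suffix" are invented names.
struct InstructStrings {
    std::string prefix = "\n\n### Instruction:\n\n";   // current hard-coded defaults
    std::string suffix = "\n\n### Response:\n\n";
};
static InstructStrings parse_instruct_strings(int argc, char ** argv) {
    InstructStrings s;
    for (int i = 1; i + 1 < argc; i++) {
        if (std::strcmp(argv[i], "--prompt-prefix") == 0) {
            s.prefix = argv[++i];
        } else if (std::strcmp(argv[i], "--prompt-suffix") == 0) {
            s.suffix = argv[++i];
        }
    }
    return s;
}
// main.cpp would then call ::llama_tokenize(ctx, s.prefix, true) and
// ::llama_tokenize(ctx, s.suffix, false) instead of the string literals above.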
Vicuna is a pretty strict model in terms of following that ### Human/### Assistant format, compared to Alpaca and GPT4All. It's less flexible, but fairly impressive in how it mimics ChatGPT responses.
It's extremely slow on my M1 MacBook (unusable), but quite usable on my 4-year-old i7 workstation. And it doesn't work at all on the same workstation inside Docker.
Found #767; adding --mlock solved the slowness issue on the MacBook. The Docker issue is tracked in #537. I simply built my own image tailored to my machine. Works like a charm.
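In case it helps anyone wondering why --mlock makes a difference on the Mac: my understanding (an assumption, not a copy of llama.cpp's code) is that it pins the memory-mapped weights in RAM so the OS can't page them out between tokens, roughly like this:
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>
// Rough illustration of the idea behind --mlock, not the actual implementation.
static void lock_weights(const void * addr, size_t size) {
    if (mlock(addr, size) != 0) {
        // Commonly fails when RLIMIT_MEMLOCK is lower than the model size.
        std::perror("mlock failed");
    }
}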
I've made this change to align with FastChat and the roles it uses.
Could someone who knows C better than I make a prompt suffix flag? Having a prompt suffix flag would make it easier to stay compatible with other models in the future.
Although I'm not proficient in C, I was able to make some modifications to llama.cpp by recompiling main.cpp with the changes, renaming the resulting main.exe to vicuna.exe, and moving it into my main llama.cpp folder. To choose a model, I created a bat file that prompts me to select a model, and if I choose the vicuna model, the bat file runs vicuna.exe instead of main.exe.
I've included the bat file below for reference:
setlocal EnableDelayedExpansion
set /a count=0
for %%f in (S:\llama.cpp\models\*.bin) do (
if /i not "%%~nxf"=="ggml-vocab.bin" (
set /a count+=1
set file[!count!]=%%f
echo !count!. %%f
)
)
set /p selection=Enter the number of the model you want to use:
set model=!file[%selection%]!
if /I "!model:vicuna=!" neq "!model!" (
echo Running vicuna.exe with model !model! ...
title !model!
S:\llama.cpp\vicuna.exe --model "!model!" --color --threads 8 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --file S:\llama.cpp\prompts\vicuna.txt
) else (
echo Running main.exe with model !model! ...
title !model!
S:\llama.cpp\main.exe --model "!model!" --color --threads 8 -n 1024 --top_k 0 --top_p 0.73 --temp 0.72 --repeat_penalty 1.1 --instruct --file S:\llama.cpp\prompts\alpaca.txt
)
pause
I hope this helps anyone who may be interested in trying this out!
sampling: temp = 0.360000, top_k = 12, top_p = 1.000000, repeat_last_n = 64, repeat_penalty = 1.050000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 56
Tried these settings and it's really nice! It really has learned the ChatGPT style well, and the 13b model seems to have good underlying knowledge.
A chat between a curious human and an artificial intelligence assistant.
Assistant is helpful, kind, honest, good at writing, give answers to the human's questions in step by step explanations when it's confident, otherwise it will say it is not sure.
> What are the main themes of the game Fallout 4?
Fallout 4 is a post-apocalyptic role-playing game set in the aftermath of a nuclear war that took place on October 23, 2077. The game takes place in the ruins of Boston and its surrounding areas, known as the Commonwealth. The main theme of Fallout 4 is survival, as players must navigate the dangerous world of the Commonwealth while also dealing with the challenges of building a new community in the midst of destruction.
The game also explores themes of power and control, as players must navigate the political landscape of the Commonwealth and make alliances with various factions in order to gain power and control over the region. Additionally, the game deals with issues of morality, as players are forced to make difficult choices that can have a significant impact on the world around them.
Overall, Fallout 4 is a game that explores the challenges of survival, power, and morality in a post-apocalyptic world. The game's immersive setting and complex storylines make it a favorite among fans of the series, and its themes of survival and morality are sure to keep players engaged for hours on end.
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
I've been able to compile the latest standard llama.cpp with CMake under Windows 10, then run ggml-vicuna-7b-4bit-rev1.bin, and even ggml-vicuna-13b-4bit-rev1.bin, with this command line (assuming that your .bin is in the same folder as main.exe):
main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
And that one runs somewhat faster on my 8-core CPU:
main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin
It works on a Windows laptop with 16 GB RAM and looks almost like ChatGPT! (Slower, of course, but almost as fast as a human would type!) I agree that it may be the best LLM to run locally!
And it seems it can write much longer and more correct program code than gpt4all! It's just amazing!
But sometimes, after a few answers, it just freezes forever while still keeping the CPU busy. Has anyone noticed this? Why might that be?
But sometimes, after a few answers, it just freezes forever while still keeping the CPU busy. Has anyone noticed this? Why might that be?
Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up. The OpenBLAS option is supposed to accelerate this, but I don't know how easy it is to get working on Windows; vcpkg seems to have some BLAS packages.
Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up.
So it's just trying to compress the overfilled context so that the conversation can continue without losing any important details? And that's normal, and I should just have a cup of tea in the meantime instead of restarting it as I did? :-)
You can use --keep to keep some part of the initial prompt (-1 for all), or use a smaller context. You can also try different --batch_size values, because this determines the sizes of the matrices used in this operation.
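To make the swap mechanics concrete, here is a simplified sketch of the idea (illustrative only, not the exact main.cpp code): keep the first n_keep tokens, drop the oldest half of everything after them, and re-evaluate the rest.
#include <vector>
// Simplified sketch of the context swap described above; llama.cpp's real
// implementation works on its own token buffers, not a plain vector.
static std::vector<int> swap_context(const std::vector<int> & tokens, int n_ctx, int n_keep) {
    if ((int) tokens.size() < n_ctx) {
        return tokens;                                    // still fits, nothing to do
    }
    const int n_left    = (int) tokens.size() - n_keep;   // tokens after the kept prefix
    const int n_discard = n_left / 2;                     // drop the oldest half of them
    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);
    out.insert(out.end(), tokens.begin() + n_keep + n_discard, tokens.end());
    // Everything after the kept prefix has to be re-evaluated, which is the
    // pause you see; --keep -1 keeps the whole initial prompt.
    return out;
}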
Can someone explain to me what the difference between these two options is? (Both options work fine.)
- Option 1: main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --color --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
- Option 2: main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin
And how many threads should one use? I have an i5-11400F, which has 6 cores and supports 12 threads.
Can someone explain to me what the difference between these two options is?
They're the same except for the temperature and the number of threads.
About temperature, read here. As I currently understand it, the higher the temperature, the more stochastic/chaotic the choice of words; the lower the temperature, the more deterministic the result, and at temperature = 0 the result will always be the same. So you can tune that parameter for your application: if you're writing code, temp = 0 may be better; if you're writing a poem, temp = 1 or even more may be better... (If I'm wrong, correct me!)
As for threads, my intuition is that you can use as many threads as your CPU supports, minus 1 or 2 (so that other apps and the system don't hang). I think the bottleneck is not the CPU but RAM throughput.
Anyone with another opinion, please correct me!
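To illustrate the temperature point above, here is a minimal sampling sketch (illustrative only, not llama.cpp's actual sampler): the logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top token and a high one spreads it out.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>
// Minimal temperature-sampling sketch; temp = 0 is treated as greedy (deterministic).
static int sample_with_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    if (temp <= 0.0f) {
        // Always pick the most likely token, so the output never varies.
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    std::vector<float> weights(logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    for (size_t i = 0; i < logits.size(); i++) {
        weights[i] = std::exp((logits[i] - max_logit) / temp);   // scale by 1/temp
    }
    // Higher temp -> flatter weights -> more random ("chaotic") choices.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}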
How do I download the model? Is it like a regular app?
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
I use the following trick to partly overcome this problem:
### Human: Write a very long answer (about 1000 words). YOUR QUESTION WITH YOUR TEXT HERE
### Assistant:
There is a Vicuna model rev1 with some kind of stop fix on 🤗. Maybe that solves your issue?
Yes, I'm talking about rev1 - so do we need to change llama_token_eos(), or what? Alternative solution: https://github.com/ggerganov/llama.cpp/commit/9fd062fd2e7e9f2f14d66c3d64dce3f967604103 [UNVERIFIED]. I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").
I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:")
This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
Yes, that's right. Thank you. You mean the --reverse-prompt (-r) option.
I use a prompt file to start generation in this way: main.exe -r "### Human:" -c 2048 --temp 0.36 -n -1 --ignore-eos --repeat_penalty 1.3 -m ggml-vicuna-7b-4bit-rev1.bin -f input.txt > output.txt
Content of input.txt file:
hello
### Assistant:
The -r option switches the program into interactive mode, so it will not exit at the end and keeps waiting.
Therefore I made the following quick fix for vicuna:
const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
params.antiprompt.push_back("### Human:");
}
and
is_antiprompt = false;
// Check if each of the reverse prompts appears at the end of the output.
for (std::string & antiprompt : params.antiprompt) {
if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
is_interacting = true;
is_antiprompt = true;
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
fflush(stdout);
if (!params.instruct) exit(0);
break;
}
}
}
Therefore I made the following quick fix for vicuna:
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
You can try my additional quick hack (it removes "### Human:" from the end of each response):
// display text
static std::string tmp;
static std::string ap = "\n### Human:";
if (!input_noecho) {
    for (auto id : embd) {
        tmp += llama_token_to_str(ctx, id);
        // buffer output until we can tell whether it is the start of the antiprompt
        int tmplen = tmp.length() > ap.length() ? ap.length() : tmp.length();
        if (strncmp(tmp.c_str(), ap.c_str(), tmplen)) { printf("%s", tmp.c_str()); tmp = ""; }
        else if (tmplen == ap.length()) tmp = "";   // full antiprompt matched: drop it silently
    }
    fflush(stdout);
}
The vicuna v1.1 model used a different setup. See https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L115-L124 and https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L37-L44
IIUC, the prompt, as a Bourne shell string, is "$system USER: $instruction ASSISTANT:".
Their doc says this https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/docs/weights_version.md#example-prompt-weight-v11
I think the </s> is actually the EOS token, not a verbatim string. Though I'm not sure if we need to manually append it to the end of the assistant's response or not.
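If that reading is right, a minimal sketch (assuming the ::llama_tokenize helper used elsewhere in this thread and the llama_token_eos() function mentioned above) would append the EOS token id after a finished assistant turn instead of the literal string:
// Sketch only: append EOS rather than the verbatim "</s>" text.
static std::vector<llama_token> tokenize_assistant_turn(llama_context * ctx, const std::string & text) {
    auto tokens = ::llama_tokenize(ctx, text, /*add_bos=*/false);
    tokens.push_back(llama_token_eos());
    return tokens;
}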
The vicuna v1.1 model used a different setup.
Uhhh... Such a mess... We definitely need some standardization for people training LLMs! At least with tokens such as assistant/human/EOS it should be possible, because it's just a technicality not directly connected with LLM functionality...
Or, on the software side, there should be an easy way to adapt any such token without editing C++ code...
Since #863 may not happen soon, I tested this, and it works on 1.1:
main.cpp#160
// prefix & suffix for instruct mode
const auto inp_pfx = ::llama_tokenize(ctx, "\nUSER:", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\nASSISTANT:", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
params.antiprompt.push_back("USER:");
}
Therefore I made the following quick fix for vicuna:
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
These are my settings, and they run well on my Mac (only '>' is shown instead of '### Human:'):
./main \
--model ./models/13B/ggml-vicuna-13b-4bit-rev1.bin \
--color -i -r "User:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 \
--file prompts/chat-with-vicuna.txt
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
@chakflying I have the same issue when using GPT4All with this model: after starting my first prompt, I lost control over it.