llama.cpp
Running a Vicuna-13B 4bit model?
I found this model: [ggml-vicuna-13b-4bit](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main), and judging by their online demo it's very impressive. I tried to run it with the latest version of llama.cpp - the model loads fine, but as soon as it loads it starts hallucinating and quits by itself. Do I need to have it converted or something like that?
Have a look here --> https://github.com/ggerganov/llama.cpp/discussions/643
I've had the most success with this model with the following patch to the instruct mode.
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 453450a..70b4f45 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -152,13 +152,13 @@ int main(int argc, char ** argv) {
}
// prefix & suffix for instruct mode
- const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
- const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+ const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n\n", true);
+ const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
- params.antiprompt.push_back("### Instruction:\n\n");
+ params.antiprompt.push_back("### Human:\n\n");
}
// enable interactive mode if reverse prompt or interactive start is specified
And then I run the model with the following options. If there are better options, please let me know.
./main \
--model ./models/ggml-vicuna-13b-4bit/ggml-vicuna-13b-4bit.bin \
--color \
--threads 7 \
--batch_size 256 \
--n_predict -1 \
--top_k 12 \
--top_p 1 \
--temp 0.36 \
--repeat_penalty 1.05 \
--ctx_size 2048 \
--instruct \
--reverse-prompt '### Human:' \
--file prompts/vicuna.txt
And my prompt file
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
Example output
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
> What NFL team won the Super Bowl in the year Justin Bieber was born?
The NFL team that won the Super Bowl in the year Justin Bieber was born, which is 1994, was the Dallas Cowboys. They defeated the Buffalo Bills in Super Bowl XXVII, which was held on January 31, 1994.
### Human:
> Who won the year after?
The NFL team that won the Super Bowl the year after Justin Bieber was born, which is 1995, was the Dallas Cowboys again. They defeated the Buffalo Bills in Super Bowl XXVIII, which was held on January 30, 1995. The Cowboys became the first team to win back-to-back Super Bowls since the Pittsburgh Steelers did so in the 1970s.
### Human:
I've made this change to align with FastChat and the roles it uses.
Could someone who knows C better than I make a prompt suffix flag? Having a prompt suffix flag would make it easier to stay compatible with other models in the future.
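A rough sketch of what such a flag could look like (the flag names and helper below are made up for illustration; they are not existing llama.cpp options): parse the prefix/suffix from the command line and tokenize them instead of the hard-coded "### Instruction:" / "### Response:" literals.
#include <cstring>
#include <string>
// Hypothetical sketch only - "--prompt-prefix" / "--prompt-suffix" are invented names.
struct InstructStrings {
    std::string prefix = "\n\n### Instruction:\n\n";   // current hard-coded defaults
    std::string suffix = "\n\n### Response:\n\n";
};
static InstructStrings parse_instruct_strings(int argc, char ** argv) {
    InstructStrings s;
    for (int i = 1; i + 1 < argc; i++) {
        if (std::strcmp(argv[i], "--prompt-prefix") == 0) {
            s.prefix = argv[++i];
        } else if (std::strcmp(argv[i], "--prompt-suffix") == 0) {
            s.suffix = argv[++i];
        }
    }
    return s;
}
// main.cpp would then call ::llama_tokenize(ctx, s.prefix, true) and
// ::llama_tokenize(ctx, s.suffix, false) instead of the string literals above.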
Vicuna is a pretty strict model in terms of following that ### Human/### Assistant format, compared to Alpaca and GPT4All. It's less flexible, but fairly impressive in how it mimics ChatGPT responses.
It's extremely slow on my M1 MacBook (unusable), but quite usable on my 4-year-old i7 workstation. And it doesn't work at all on the same workstation inside Docker.
Found #767; adding --mlock solved the slowness issue on the MacBook. The Docker issue is tracked in #537. I simply built my own image tailored to my machine. Works like a charm.
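In case it helps anyone wondering why --mlock makes a difference on the Mac: my understanding (an assumption, not a copy of llama.cpp's code) is that it pins the memory-mapped weights in RAM so the OS can't page them out between tokens, roughly like this:
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>
// Rough illustration of the idea behind --mlock, not the actual implementation.
static void lock_weights(const void * addr, size_t size) {
    if (mlock(addr, size) != 0) {
        // Commonly fails when RLIMIT_MEMLOCK is lower than the model size.
        std::perror("mlock failed");
    }
}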
I've made this change to align with FastChat and the roles it uses.
Could someone who knows C better than I make a prompt suffix flag? Having a prompt suffix flag would make it easier to stay compatible with other models in the future.
Although I'm not proficient in C, I was able to make some modifications to llama.cpp by recompiling main.cpp with the changes, renaming the resulting main.exe to vicuna.exe, and moving it into my main llama.cpp folder. To choose a model, I created a bat file that prompts me to select a model, and if I choose the vicuna model, the bat file runs vicuna.exe instead of main.exe.
I've included the bat file below for reference:
setlocal EnableDelayedExpansion
set /a count=0
for %%f in (S:\llama.cpp\models\*.bin) do (
if /i not "%%~nxf"=="ggml-vocab.bin" (
set /a count+=1
set file[!count!]=%%f
echo !count!. %%f
)
)
set /p selection=Enter the number of the model you want to use:
set model=!file[%selection%]!
if /I "!model:vicuna=!" neq "!model!" (
echo Running vicuna.exe with model !model! ...
title !model!
S:\llama.cpp\vicuna.exe --model "!model!" --color --threads 8 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --file S:\llama.cpp\prompts\vicuna.txt
) else (
echo Running main.exe with model !model! ...
title !model!
S:\llama.cpp\main.exe --model "!model!" --color --threads 8 -n 1024 --top_k 0 --top_p 0.73 --temp 0.72 --repeat_penalty 1.1 --instruct --file S:\llama.cpp\prompts\alpaca.txt
)
pause
I hope this helps anyone who may be interested in trying this out!
sampling: temp = 0.360000, top_k = 12, top_p = 1.000000, repeat_last_n = 64, repeat_penalty = 1.050000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 56
Tried these settings and it's really nice! It really has learned the ChatGPT style well, and the 13b model seems to have good underlying knowledge.
A chat between a curious human and an artificial intelligence assistant.
Assistant is helpful, kind, honest, good at writing, give answers to the human's questions in step by step explanations when it's confident, otherwise it will say it is not sure.
> What are the main themes of the game Fallout 4?
Fallout 4 is a post-apocalyptic role-playing game set in the aftermath of a nuclear war that took place on October 23, 2077. The game takes place in the ruins of Boston and its surrounding areas, known as the Commonwealth. The main theme of Fallout 4 is survival, as players must navigate the dangerous world of the Commonwealth while also dealing with the challenges of building a new community in the midst of destruction.
The game also explores themes of power and control, as players must navigate the political landscape of the Commonwealth and make alliances with various factions in order to gain power and control over the region. Additionally, the game deals with issues of morality, as players are forced to make difficult choices that can have a significant impact on the world around them.
Overall, Fallout 4 is a game that explores the challenges of survival, power, and morality in a post-apocalyptic world. The game's immersive setting and complex storylines make it a favorite among fans of the series, and its themes of survival and morality are sure to keep players engaged for hours on end.
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
I've been able to compile the latest standard llama.cpp with CMake under Windows 10, then run ggml-vicuna-7b-4bit-rev1.bin, and even ggml-vicuna-13b-4bit-rev1.bin, with this command line (assuming that your .bin is in the same folder as main.exe):
main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
And that one runs somewhat faster on my 8-core CPU:
main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin
It works on a Windows laptop with 16 GB RAM and looks almost like ChatGPT! (Slower, of course, but almost as fast as a human would type!) I agree that it may be the best LLM to run locally!
And it seems it can write much longer and more correct program code than gpt4all! It's just amazing!
But sometimes, after a few answers, it just freezes forever while still keeping the CPU busy. Has anyone noticed this? Why might that be?
But sometimes, after a few answers, it just freezes forever while still keeping the CPU busy. Has anyone noticed this? Why might that be?
Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up. The OpenBLAS option is supposed to accelerate this, but I don't know how easy it is to get working on Windows; vcpkg seems to have some BLAS packages.
Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up.
So it's just trying to compress the overfilled context so that the conversation can continue without losing any important details? And that's normal, and I should just have a cup of tea in the meantime instead of restarting it as I did? :-)
You can use --keep to keep some part of the initial prompt (-1 for all), or use a smaller context. You can also try different --batch_size values, because this determines the sizes of the matrices used in this operation.
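To make the swap mechanics concrete, here is a simplified sketch of the idea (illustrative only, not the exact main.cpp code): keep the first n_keep tokens, drop the oldest half of everything after them, and re-evaluate the rest.
#include <vector>
// Simplified sketch of the context swap described above; llama.cpp's real
// implementation works on its own token buffers, not a plain vector.
static std::vector<int> swap_context(const std::vector<int> & tokens, int n_ctx, int n_keep) {
    if ((int) tokens.size() < n_ctx) {
        return tokens;                                    // still fits, nothing to do
    }
    const int n_left    = (int) tokens.size() - n_keep;   // tokens after the kept prefix
    const int n_discard = n_left / 2;                     // drop the oldest half of them
    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);
    out.insert(out.end(), tokens.begin() + n_keep + n_discard, tokens.end());
    // Everything after the kept prefix has to be re-evaluated, which is the
    // pause you see; --keep -1 keeps the whole initial prompt.
    return out;
}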
Can someone explain to me what the difference between these two options is? (Both options work fine.)
- Option 1: main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --color --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
- Option 2: main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin
And how many threads should one use? I have an i5-11400F, which has 6 cores and supports 12 threads.
Can someone explain to me what the difference between these two options is?
They're the same except for the temperature and the number of threads.
About temperature, read here. As I currently understand it, the higher the temperature, the more stochastic/chaotic the choice of words; the lower the temperature, the more deterministic the result, and at temperature = 0 the result will always be the same. So you can tune that parameter for your application: if you're writing code, temp = 0 may be better; if you're writing a poem, temp = 1 or even more may be better... (If I'm wrong, correct me!)
As for threads, my intuition is that you can use as many threads as your CPU supports, minus 1 or 2 (so that other apps and the system don't hang). I think the bottleneck is not the CPU but RAM throughput.
Anyone with another opinion, please correct me!
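To illustrate the temperature point above, here is a minimal sampling sketch (illustrative only, not llama.cpp's actual sampler): the logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top token and a high one spreads it out.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>
// Minimal temperature-sampling sketch; temp = 0 is treated as greedy (deterministic).
static int sample_with_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    if (temp <= 0.0f) {
        // Always pick the most likely token, so the output never varies.
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    std::vector<float> weights(logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    for (size_t i = 0; i < logits.size(); i++) {
        weights[i] = std::exp((logits[i] - max_logit) / temp);   // scale by 1/temp
    }
    // Higher temp -> flatter weights -> more random ("chaotic") choices.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}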
How do I download the model? Is it like a regular app?
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
I use the following trick to partly overcome this problem:
### Human: Write a very long answer (about 1000 words). YOUR QUESTION WITH YOUR TEXT HERE
### Assistant:
There is a Vicuna model rev1 with some kind of stop fix on 🤗. Maybe that solves your issue?
Yes, I'm talking about rev1 - so do we need to change llama_token_eos(), or what? Alternative solution: https://github.com/ggerganov/llama.cpp/commit/9fd062fd2e7e9f2f14d66c3d64dce3f967604103 [UNVERIFIED]. I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").
I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:")
This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
Yes, that's right. Thank you. You mean the --reverse-prompt (-r) option.
I use a prompt file to start generation in this way: main.exe -r "### Human:" -c 2048 --temp 0.36 -n -1 --ignore-eos --repeat_penalty 1.3 -m ggml-vicuna-7b-4bit-rev1.bin -f input.txt > output.txt
Content of input.txt file:
hello
### Assistant:
The -r option switches the program into interactive mode, so it will not exit at the end and keeps waiting.
Therefore I made the following quick fix for vicuna:
const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
params.antiprompt.push_back("### Human:");
}
and
is_antiprompt = false;
// Check if each of the reverse prompts appears at the end of the output.
for (std::string & antiprompt : params.antiprompt) {
if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
is_interacting = true;
is_antiprompt = true;
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
fflush(stdout);
if (!params.instruct) exit(0);
break;
}
}
}
Therefore I made the following quick fix for vicuna:
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
You can try my additional quick hack (it removes "### Human:" from the end of each response):
// display text
static std::string tmp;
static std::string ap = "\n### Human:";
if (!input_noecho) {
    for (auto id : embd) {
        tmp += llama_token_to_str(ctx, id);
        // buffer output until we can tell whether it is the start of the antiprompt
        int tmplen = tmp.length() > ap.length() ? ap.length() : tmp.length();
        if (strncmp(tmp.c_str(), ap.c_str(), tmplen)) { printf("%s", tmp.c_str()); tmp = ""; }
        else if (tmplen == ap.length()) tmp = "";   // full antiprompt matched: drop it silently
    }
    fflush(stdout);
}
The vicuna v1.1 model used a different setup. See https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L115-L124 and https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L37-L44
IIUC, the prompt, as a Bourne shell string, is "$system USER: $instruction ASSISTANT:".
Their doc says this https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/docs/weights_version.md#example-prompt-weight-v11
I think the </s> is actually the EOS token, not a verbatim string. Though I'm not sure if we need to manually append it to the end of the assistant's response or not.
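If that reading is right, a minimal sketch (assuming the ::llama_tokenize helper used elsewhere in this thread and the llama_token_eos() function mentioned above) would append the EOS token id after a finished assistant turn instead of the literal string:
// Sketch only: append EOS rather than the verbatim "</s>" text.
static std::vector<llama_token> tokenize_assistant_turn(llama_context * ctx, const std::string & text) {
    auto tokens = ::llama_tokenize(ctx, text, /*add_bos=*/false);
    tokens.push_back(llama_token_eos());
    return tokens;
}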
The vicuna v1.1 model used a different setup.
Uhhh... Such a mess... We definitely need some standardization for people training LLMs! At least with tokens such as assistant/human/EOS it should be possible, because it's just a technicality not directly connected with LLM functionality...
Or, on the software side, there should be an easy way to adapt any such token without editing C++ code...
Since #863 may not happen soon, I tested this, and it works on 1.1:
main.cpp#160
// prefix & suffix for instruct mode
const auto inp_pfx = ::llama_tokenize(ctx, "\nUSER:", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\nASSISTANT:", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
params.antiprompt.push_back("USER:");
}
Therefore I made the following quick fix for vicuna:
I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:')
These are my settings, and they run well on my Mac (only '>' is shown instead of '### Human:'):
./main \
--model ./models/13B/ggml-vicuna-13b-4bit-rev1.bin \
--color -i -r "User:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 \
--file prompts/chat-with-vicuna.txt
But there is a problem: it doesn't seem to stop by itself; it will just generate the next ### Human line and continue with another response.
@chakflying I have the same issue when using GPT4All with this model: after starting my first prompt, I lost control over it.