
add phi3 support

Open liuwei-git opened this issue 10 months ago • 3 comments

Make phi3 an explicitly supported model in llama.cpp.

liuwei-git avatar Apr 23 '24 17:04 liuwei-git

Might have to add <|end|> as an EOT token:

diff --git a/llama.cpp b/llama.cpp
index 63483b9a..698ad236 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -4381,6 +4381,7 @@ static void llm_load_vocab(
                         //vocab.id_to_token[t.second].type == LLAMA_TOKEN_TYPE_CONTROL &&
                         (t.first == "<|eot_id|>" ||
                          t.first == "<|im_end|>" ||
+                         t.first == "<|end|>" ||
                          t.first == "<end_of_turn>"
                         )
                    ) {

This seems to produce better results than https://github.com/ggerganov/llama.cpp/pull/6851. For example, I don't see the <|calc|> and <|/data|> tokens being generated. I wonder where the difference comes from?
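For context on why <|end|> matters here: in the Phi-3 instruct chat format every turn is terminated with <|end|>, so the model emits it when it finishes answering and generation should stop there. Below is a minimal sketch of how such a prompt is laid out (my reading of the Phi-3 instruct template; the helper name is just for illustration):

def build_phi3_prompt(system: str, user: str) -> str:
    # Each turn is opened with a role tag and closed with <|end|>;
    # the assistant tag is left open so the model writes the answer
    # and then emits <|end|>, which should act as end-of-turn.
    return (
        f"<|system|>\n{system}<|end|>\n"
        f"<|user|>\n{user}<|end|>\n"
        f"<|assistant|>\n"
    )

print(build_phi3_prompt("You are a helpful assistant.", "What is 2 + 2?"))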

ggerganov avatar Apr 23 '24 18:04 ggerganov

So the difference was in the tokenization - in the other PR the <|system|> token was being incorrectly mapped to 32034, while it should have been 32006. I applied the changes from this PR into _set_vocab_sentencepiece() since this implementation seems to be correct: https://github.com/ggerganov/llama.cpp/pull/6851/commits/5dcccb3a7d0a6d1ce5886dd1f055555fd7759568

I wonder if it affects the conversion of some other models too?
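A quick way to sanity-check the mapping outside the conversion script (a sketch, assuming the transformers package and the microsoft/Phi-3-mini-4k-instruct checkpoint are available):

from transformers import AutoTokenizer

# Load the reference tokenizer and look up the special-token ids directly,
# so the vocab written into the GGUF can be compared against it.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
for t in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
    print(t, tok.convert_tokens_to_ids(t))  # <|system|> is expected to map to 32006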

Anyway, now the results match except for the other PR having a BOS token added at the start, while this PR does not:

https://github.com/ggerganov/llama.cpp/pull/6852/files#diff-ecca4c14f9a354b5557247cafd79409b332bfa1e9c12594f83282af1fde4743eR2064

Just double-checking: is this the intent?

There is also a minor issue because of this: the tokenizer.ggml.add_bos_token KV is written twice in the header, once as true and once as false:

python3 gguf-py/scripts/gguf-dump.py models/phi-3-4k-instruct/ggml-model-f16.gguf
* Loading: models/phi-3-4k-instruct/ggml-model-f16-new.gguf
Traceback (most recent call last):
KeyError: 'Duplicate tokenizer.ggml.add_bos_token already in list at offset 725511'

This is because we already write this field automatically here:

https://github.com/ggerganov/llama.cpp/blob/171a73890ec0948293c675a8ab1779e01aac906f/gguf-py/gguf/vocab.py#L61-L71
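In other words, the duplicate most likely comes from the conversion code setting the KV by hand while SpecialVocab.add_to_gguf() writes it again from tokenizer_config.json. A rough illustration of the pattern (hypothetical conversion-script snippet, not the actual code):

from gguf import GGUFWriter, SpecialVocab

writer = GGUFWriter("ggml-model-f16.gguf", "phi3")

# explicit write in the model-specific conversion code ...
writer.add_add_bos_token(True)

# ... and a second write via the generic special-vocab handling,
# which also picks up add_bos_token from tokenizer_config.json
special_vocab = SpecialVocab("models/phi-3-4k-instruct", load_merges=False)
special_vocab.add_to_gguf(writer)  # -> duplicate tokenizer.ggml.add_bos_token KV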

ggerganov avatar Apr 23 '24 19:04 ggerganov

So is the implementation in https://github.com/ggerganov/llama.cpp/pull/6851 preferred, or are both needed for official support?

bartowski1182 avatar Apr 23 '24 21:04 bartowski1182

Hi, when using the server, "<|eot_id|>" is still printed at the end of the conversation, and I can't find the stop token in /examples/server/utils.hpp anymore. How can I avoid this "<|eot_id|>" in the server?

Thanks

x4080 avatar Apr 25 '24 21:04 x4080

Most likely you are using the base model instead of the instruct model. See https://github.com/ggerganov/llama.cpp/pull/6916 for a clear explanation and a way to add stop tokens from the client side.
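As an example of the client-side approach, the stop strings can be passed with each request to the server's /completion endpoint. A minimal sketch (URL, prompt, and token list are placeholders for your setup):

import requests

# Ask the server to cut generation as soon as any of these strings appears,
# regardless of which EOT token the GGUF metadata declares.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<|user|>\nHello, who are you?<|end|>\n<|assistant|>\n",
        "n_predict": 128,
        "stop": ["<|end|>", "<|eot_id|>"],
    },
)
print(resp.json()["content"])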

ggerganov avatar Apr 26 '24 07:04 ggerganov

@ggerganov Hi, no, I was using Phi-3-mini-128k-instruct.Q4_K_M.gguf. Never mind, though - I think this was a server-specific issue; for non-server use it already works fine.

x4080 avatar Apr 26 '24 22:04 x4080