llama-cpp-python
Add the Command R chat format
This should not strictly be necessary, as recent GGUFs have the chat format embedded (which will be automatically applied through Jinja2ChatFormatter). I've submitted requests to older repos on HF to be updated (and many of them have already done so).
If you have an outdated GGUF and don't wish to redownload it, you can update your local file using the gguf-new-metadata.py script in llama.cpp/gguf-py/scripts and the latest Command R tokenizer_config.json from HF:
python gguf-new-metadata.py input.gguf output.gguf --chat-template-config tokenizer_config.json
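For reference, a quick sketch to confirm the rewrite worked by reading the key back with gguf-py's GGUFReader; this assumes the gguf package from llama.cpp/gguf-py is installed, and that the string value sits in the field's last part (an assumption on my side about the reader's internals):
```python
# Sketch: check that output.gguf now carries an embedded chat template.
# Assumes the gguf package from llama.cpp/gguf-py is installed.
from gguf import GGUFReader

reader = GGUFReader("output.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no embedded chat template found")
else:
    # For string fields the raw bytes should be in the last part.
    print(bytes(field.parts[-1]).decode("utf-8"))
```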
@CISC There are some arguments for merging it, however:
- As you said yourself, there are a lot of GGUFs (the vast majority, to be honest) that don't have this yet
- llama-cpp-python already offers a lot of chat formats, and llama.cpp also introduced the command-r chat format. As Command R (Plus) is currently among the most capable open models (or tied with Llama 3), I think it makes a lot of sense to merge this (see the sketch after this list).
- It's just a minor merge to an existing function.
- Would really help a lot of people.
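To illustrate the second point, this is roughly what using it would look like once merged; the format name command-r is taken from this PR, so treat it as an assumption until it lands:
```python
# Hypothetical usage once this PR is merged: select the Command R chat
# format explicitly instead of relying on template autodetection.
from llama_cpp import Llama

llm = Llama(
    "llms/c4ai-command-r-v01-Q5_K_M.gguf",  # example path, adjust as needed
    n_gpu_layers=-1,
    n_ctx=4096,
    chat_format="command-r",  # name assumed from this PR
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```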
As soon as more GGUFs have the format embedded, the situation changes. But right now this merge would just be super helpful. The model is a powerhouse for the open weights community.
Merge would be <3 <3 <3
@uncodecomplexsystems As you say, it's just a minor merge, I'm not opposed to it, I'm just saying it's not strictly necessary. :)
If you have an outdated GGUF and don't wish to redownload it you can update your local file [...]
Thanks, I didn't know that!
I have various GGUFs for Qwen 1.5, Command R, and Llama 3, and the automatic setup of the chat format looks like this:
>>> for mname in model_names:
...     llm = Llama(f"llms/{mname}", n_gpu_layers=-1, logits_all=False, n_ctx=4096, verbose=False)
...     print(mname, llm.chat_format)
...
c4ai-command-r-v01-Q5_K_M.gguf llama-2
Meta-Llama-3-8B-Instruct.Q5_K_M.gguf None
Meta-Llama-3-70B-Instruct.Q3_K_M.gguf llama-3
qwen1_5-14b-chat-q4_k_m.gguf chatml
qwen1_5-32b-chat-q4_k_m.gguf None
qwen1_5-72b-chat-q3_k_m.gguf chatml
mixtral-instruct-8x7b-q4k-medium.gguf mistral-instruct
I thought those with None were failures, but do they actually get their chat format correctly from the template?
And confusingly, Command R kind of works with the chatml format and probably even with the default llama-2 format, but in tests it then suffers from poorer prompt following and oddly sometimes outputs tags in place of named entities.
I thought those with None were failures, but do they actually get their chat format correctly from the template?
Yes, None means it found an embedded template that just isn't recognized as any specific format (enable verbose and it will output the full template); if no template can be guessed or found it will fall back to llama-2, see llama.py.
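In case it helps others, here is a small sketch of how to check what the loader actually found; I'm assuming a recent llama-cpp-python where Llama.metadata exposes the raw GGUF key-value pairs, so the attribute may be missing on older versions:
```python
# Sketch: inspect whether a GGUF carries an embedded chat template.
# Assumes Llama.metadata is available (recent llama-cpp-python versions).
from llama_cpp import Llama

# With verbose=True the guessed or embedded template is printed during load.
llm = Llama("llms/c4ai-command-r-v01-Q5_K_M.gguf", n_ctx=4096, verbose=True)

template = llm.metadata.get("tokenizer.chat_template")
print("embedded template:", "yes" if template else "no")
print("chat_format:", llm.chat_format)  # None here means the embedded template is used
```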
Based on the inactivity both in this PR and the phi3 one I suppose your stance @abetlen is to not merge any more new chat templates into llama-cpp-python, right? I think it's important to know. Thx!
@uncodecomplexsystems Patience, I'm sure there's just a lot going on (here or elsewhere) right now.
It's worth noting that llama.cpp/examples/server now has an OpenAI API compatible endpoint with its own chat template handling, which I believe is based on the llama_chat_apply_template() API in llama.cpp. There are a few PRs and issues seeking a more general solution:
https://github.com/ggerganov/llama.cpp/pull/6822
https://github.com/ggerganov/llama.cpp/pull/6834
https://github.com/ggerganov/llama.cpp/issues/4216
https://github.com/ggerganov/llama.cpp/issues/6726
https://github.com/ggerganov/llama.cpp/issues/5922
https://github.com/ggerganov/llama.cpp/issues/6391
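To make the "OpenAI API compatible" part concrete, here is a minimal client-side sketch; the base URL, port and model name are placeholders for whatever server you run (llama.cpp's server example or llama-cpp-python's own server), and the point is simply that the chat template is applied server-side:
```python
# Sketch: the server applies the chat template, so the client only sends
# plain role/content messages to the OpenAI-style endpoint.
# Base URL, port and model name are assumptions; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="command-r",  # placeholder; many local servers ignore this field
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```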