llama-cpp-python
Add the Command R chat format
This should not strictly be necessary, as recent GGUFs have the chat format embedded (which will be automatically applied through Jinja2ChatFormatter). I've submitted requests to older repos on HF to be updated (and many of them have already done so).
If you have an outdated GGUF and don't wish to redownload it, you can update your local file using the gguf-new-metadata.py script in llama.cpp/gguf-py/scripts and the latest Command R tokenizer_config.json from HF:
python gguf-new-metadata.py input.gguf output.gguf --chat-template-config tokenizer_config.json
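For reference, a quick sketch to confirm the rewrite worked by reading the key back with gguf-py's GGUFReader; this assumes the gguf package from llama.cpp/gguf-py is installed, and that the string value sits in the field's last part (an assumption on my side about the reader's internals):
```python
# Sketch: check that output.gguf now carries an embedded chat template.
# Assumes the gguf package from llama.cpp/gguf-py is installed.
from gguf import GGUFReader

reader = GGUFReader("output.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no embedded chat template found")
else:
    # For string fields the raw bytes should be in the last part.
    print(bytes(field.parts[-1]).decode("utf-8"))
```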
@CISC There are some arguments for merging it, however:
- As you said yourself, there are a lot of GGUFs (the vast majority, to be honest) that don't have this yet
- llama-cpp-python already offers a lot of chat formats, and llama.cpp also introduced the command-r chat format. As Command R (Plus) is currently among the most capable open models (or tied with Llama 3), I think it makes a lot of sense to merge this (see the sketch after this list).
- It's just a minor merge to an existing function.
- Would really help a lot of people.
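To illustrate the second point, this is roughly what using it would look like once merged; the format name command-r is taken from this PR, so treat it as an assumption until it lands:
```python
# Hypothetical usage once this PR is merged: select the Command R chat
# format explicitly instead of relying on template autodetection.
from llama_cpp import Llama

llm = Llama(
    "llms/c4ai-command-r-v01-Q5_K_M.gguf",  # example path, adjust as needed
    n_gpu_layers=-1,
    n_ctx=4096,
    chat_format="command-r",  # name assumed from this PR
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```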
As soon as more GGUFs have the format embedded, the situation changes. But right now this merge would just be super helpful. The model is a powerhouse for the open weights community.
Merge would be <3 <3 <3
@uncodecomplexsystems As you say, it's just a minor merge, I'm not opposed to it, I'm just saying it's not strictly necessary. :)
If you have an outdated GGUF and don't wish to redownload it you can update your local file [...]
Thanks, I didn't know that!
I have various GGUFs for Qwen 1.5, Command R, and Llama 3, and the automatic setup of the chat format looks like this:
>>> for mname in model_names:
...     llm = Llama(f"llms/{mname}", n_gpu_layers=-1, logits_all=False, n_ctx=4096, verbose=False)
...     print(mname, llm.chat_format)
...
c4ai-command-r-v01-Q5_K_M.gguf llama-2
Meta-Llama-3-8B-Instruct.Q5_K_M.gguf None
Meta-Llama-3-70B-Instruct.Q3_K_M.gguf llama-3
qwen1_5-14b-chat-q4_k_m.gguf chatml
qwen1_5-32b-chat-q4_k_m.gguf None
qwen1_5-72b-chat-q3_k_m.gguf chatml
mixtral-instruct-8x7b-q4k-medium.gguf mistral-instruct
I thought those with None were failures, but do they actually get their chat format correctly from the template?
And confusingly, Command R kind of works with the chatml format and probably even with the default llama-2 format, but in tests it then suffers from poorer prompt following and oddly sometimes outputs tags in place of named entities.
I thought those with None were failures, but do they actually get their chat format correctly from the template?
Yes, None means it found an embedded template that just isn't recognized as any specific format (enable verbose and it will output the full template); if no template can be guessed or found it will fall back to llama-2, see llama.py.
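In case it helps others, here is a small sketch of how to check what the loader actually found; I'm assuming a recent llama-cpp-python where Llama.metadata exposes the raw GGUF key-value pairs, so the attribute may be missing on older versions:
```python
# Sketch: inspect whether a GGUF carries an embedded chat template.
# Assumes Llama.metadata is available (recent llama-cpp-python versions).
from llama_cpp import Llama

# With verbose=True the guessed or embedded template is printed during load.
llm = Llama("llms/c4ai-command-r-v01-Q5_K_M.gguf", n_ctx=4096, verbose=True)

template = llm.metadata.get("tokenizer.chat_template")
print("embedded template:", "yes" if template else "no")
print("chat_format:", llm.chat_format)  # None here means the embedded template is used
```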
Based on the inactivity both in this PR and the phi3 one I suppose your stance @abetlen is to not merge any more new chat templates into llama-cpp-python, right? I think it's important to know. Thx!
@uncodecomplexsystems Patience, I'm sure there's just a lot going on (here or elsewhere) right now.
It's worth noting that llama.cpp/examples/server now has an OpenAI API compatible endpoint with its own chat template handling, which I believe is based on the llama_chat_apply_template() API in llama.cpp. There are a few PRs and issues seeking a more general solution:
https://github.com/ggerganov/llama.cpp/pull/6822
https://github.com/ggerganov/llama.cpp/pull/6834
https://github.com/ggerganov/llama.cpp/issues/4216
https://github.com/ggerganov/llama.cpp/issues/6726
https://github.com/ggerganov/llama.cpp/issues/5922
https://github.com/ggerganov/llama.cpp/issues/6391
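To make the "OpenAI API compatible" part concrete, here is a minimal client-side sketch; the base URL, port and model name are placeholders for whatever server you run (llama.cpp's server example or llama-cpp-python's own server), and the point is simply that the chat template is applied server-side:
```python
# Sketch: the server applies the chat template, so the client only sends
# plain role/content messages to the OpenAI-style endpoint.
# Base URL, port and model name are assumptions; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="command-r",  # placeholder; many local servers ignore this field
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```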