fix wrong template in GLM4-0414
GLM4-0414 models were using the wrong legacy template, LLM_CHAT_TEMPLATE_GLMEDGE, leading to a missing [gMASK] prefix.
As a workaround you needed to launch llama-server with --chat-template chatglm4
After the patch the gguf should be regenerated or edited manually.
EDIT: Added workaround explanation
Perhaps the CHATGML constants should be renamed CHATGLM?
I don't want to touch code that is outside of this PR
+1 - see https://github.com/SillyTavern/SillyTavern/pull/3906
Why did you change the bos token to <|endoftext|>?
Agree, BOS token should not change.
Also I am not seeing the BOS token added when using the model, as you would expect when comparing to e.g. AutoTokenizer:
slot update_slots: id 0 | task 0 | prompt token 0: 151333 '<sop>'
slot update_slots: id 0 | task 0 | prompt token 1: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 2: 151335 '<|system|>'
slot update_slots: id 0 | task 0 | prompt token 3: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 4: 2610 'You'
slot update_slots: id 0 | task 0 | prompt token 5: 2299 ''re'
slot update_slots: id 0 | task 0 | prompt token 6: 47773 ' Coding'
slot update_slots: id 0 | task 0 | prompt token 7: 45950 ' Sense'
vs transformers:
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"ye?"}], tokenize=False)
'[gMASK]<sop><|user|>\nhi<|assistant|>\nye?'
Should add_bos_token be set to true? What is the reason for it not being set? Sorry if I missed that discussion.
- You can either use add_bos_token=true or have the bos token in the template, not both. Otherwise you get duplicate bos tokens (see the sketch after this list).
- The jinja template already includes the bos token. For compatibility with the jinja one, the legacy template (LLM_CHAT_TEMPLATE_CHATGML_4) also adds the bos token. This is why add_bos_token is false.
- The bos token cannot be a prefix of the jinja template, that's why I changed it to eos as was done in similar models.
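As a rough illustration of the duplicate-bos point, here is a sketch with the transformers tokenizer (the model id and the exact rendered string are assumptions based on the output shown earlier in the thread):

from transformers import AutoTokenizer

# Load the upstream GLM-4-0414 tokenizer (model id assumed; use whichever checkpoint you have).
tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")
msgs = [{"role": "user", "content": "hi"}]

# The jinja template itself emits the [gMASK]<sop> prefix:
print(tok.apply_chat_template(msgs, tokenize=False))  # expected: '[gMASK]<sop><|user|>\nhi'
# ...so enabling add_bos_token on top of this would prepend the bos token a second time.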
@kallewoof You have to regenerate the gguf with the patched python script and also run it with patched llama-server.
Correct behavior after patch and regenerating the GGUF:
slot update_slots: id 0 | task 0 | prompt token 0: 151331 '[gMASK]'
slot update_slots: id 0 | task 0 | prompt token 1: 151333 '<sop>'
slot update_slots: id 0 | task 0 | prompt token 2: 151336 '<|user|>'
slot update_slots: id 0 | task 0 | prompt token 3: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 4: 14978 'hello'
slot update_slots: id 0 | task 0 | prompt token 5: 151337 '<|assistant|>'
Before patch without --jinja:
slot update_slots: id 0 | task 0 | prompt token 0: 151336 '<|user|>'
slot update_slots: id 0 | task 0 | prompt token 1: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 2: 14978 'hello'
slot update_slots: id 0 | task 0 | prompt token 3: 151337 '<|assistant|>'
Before patch with --jinja:
slot update_slots: id 0 | task 0 | prompt token 0: 151333 '<sop>'
slot update_slots: id 0 | task 0 | prompt token 1: 151336 '<|user|>'
slot update_slots: id 0 | task 0 | prompt token 2: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 3: 14978 'hello'
slot update_slots: id 0 | task 0 | prompt token 4: 151337 '<|assistant|>'
Aha, thanks. I did indeed regenerate the GGUF and did not get the expected results. I am using llama-server though, and connecting to it from a different front end. The prompt being sent is
<sop><system>
sysmessage
...
but from what you are saying it looks like I should not include <sop> either. But I'm not sure how this relates to llama-server and requests via the API. I assumed it would work the same, but I don't think it does. I'll double check though.
Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?
Yes, the change affects the chat completion API (Open WebUI and similar frontends).
In the completion mode you have to begin the prompt with [gMASK]<sop> as specified in the official template.
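For a raw completion request that would look roughly like this sketch (assuming llama-server is listening on localhost:8080 with the regenerated GGUF, using the server's /completion endpoint):

import requests

# In completion mode the client supplies the [gMASK]<sop> prefix itself.
prompt = "[gMASK]<sop><|user|>\nhello<|assistant|>"
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 128},
)
print(resp.json()["content"])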
Sorry for the slight offtopic, but this is really interesting to read! Can you please teach me how you can inspect what the actual template that was used looks like? Is this llama-server's output with some additional switches? There have been many LLM releases with broken / incorrect templates, so it would be a really useful skill to have to be able to inspect what tokens the LLM actually ended up processing as opposed to looking at the static template. Thanks!
Edit: Found it myself! For those interested, add the --verbose flag to llama-server
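Another option, as a sketch (again assuming llama-server on localhost:8080): the server's /tokenize endpoint returns the token ids for an arbitrary string, so you can compare what the template should produce against what the GGUF's tokenizer actually emits:

import requests

resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"content": "[gMASK]<sop><|user|>\nhello<|assistant|>", "add_special": False},
)
# With a correctly converted GGUF the list should start with 151331 ([gMASK]) and 151333 (<sop>).
print(resp.json()["tokens"])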
One more thing: I also found this output while investigating the template issue and this proposed fix. I am not sure if this means the tokenizer has a bug that needs fixing, but I just wanted to put this out there in case nobody else has noticed it so far:
llama.cpp-1 | init_tokenizer: initializing tokenizer for type 2
llama.cpp-1 | load: control token: 151342 '<|end_of_video|>' is not marked as EOG
llama.cpp-1 | load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
llama.cpp-1 | load: control token: 151338 '<|observation|>' is not marked as EOG
llama.cpp-1 | load: control token: 151333 '<sop>' is not marked as EOG
llama.cpp-1 | load: control token: 151331 '[gMASK]' is not marked as EOG
llama.cpp-1 | load: control token: 151330 '[MASK]' is not marked as EOG
llama.cpp-1 | load: control token: 151337 '<|assistant|>' is not marked as EOG
llama.cpp-1 | load: control token: 151332 '[sMASK]' is not marked as EOG
llama.cpp-1 | load: control token: 151334 '<eop>' is not marked as EOG
llama.cpp-1 | load: control token: 151335 '<|system|>' is not marked as EOG
llama.cpp-1 | load: control token: 151336 '<|user|>' is not marked as EOG
llama.cpp-1 | load: control token: 151340 '<|end_of_image|>' is not marked as EOG
llama.cpp-1 | load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
llama.cpp-1 | load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?
Yes, the change affects the chat completion API (Open WebUI and similar frontends). In the completion mode you have to begin the prompt with [gMASK]<sop> as specified in the official template.
That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template. We now have:
- Models that do not take a BOS token at the start.
- Models that take a BOS token at the start, which is added by llama.cpp during chat completion and during completion.
- Models that take a BOS token at the start, which is added by llama.cpp during chat completion but not during completion.
The alternative would be to never include the BOS token for completion, but there's a backwards compatibility issue there...
Edit: another alternative would be to add an assume_bos_start flag which would put a BOS token at the start if none is found. That would address the compatibility issues.
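To make the proposed flag concrete, a purely hypothetical sketch (assume_bos_start is not an existing llama.cpp option, and the helper below is illustrative only):

def maybe_prepend_bos(tokens: list[int], bos_id: int, assume_bos_start: bool) -> list[int]:
    # Prepend the BOS token only when the prompt does not already start with it,
    # so clients that already send it themselves don't end up with a duplicate.
    if assume_bos_start and (not tokens or tokens[0] != bos_id):
        return [bos_id] + tokens
    return tokens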
That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template.
Yes, this is the ideal outcome. The problem is that we don't decide the chat template, it's provided by the company that releases the model. There are discussions here: https://github.com/ggml-org/llama.cpp/pull/13021#issuecomment-2826978866 or here https://github.com/THUDM/GLM-Edge/issues/2
@Mushoz I noticed that as well but ignored it as outside of the scope of this PR. If needed I'll open another PR.
Thanks for the context. Yes, I understand you don't have control over the chat template -- my suggestion is to modify it to handle this case, either for GLM-4 specifically or for any case where the chat template starts with the BOS token explicitly. I don't see a reason why llama.cpp should juggle both cases when a one time snip can be done trivially.
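Illustratively, the snip could happen once at conversion time, something like this hypothetical helper (not existing llama.cpp code; the function name is made up):

def snip_leading_bos(chat_template: str, bos_token: str) -> str:
    # If the template hard-codes the BOS token at its very start, strip it once
    # so add_bos_token=true can be enabled without producing a duplicate.
    if bos_token and chat_template.startswith(bos_token):
        return chat_template[len(bos_token):]
    return chat_template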
I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.
The bos token cannot be a prefix of the jinja template, that's why I changed it to eos as was done in similar models.
Why can't the BOS token serve as the prefix? At that point, you're not even using the original Jinja template if it's being silently modified on the fly like that. Since add_bos_token isn't enabled for this model, there wouldn't be any issues anyway.
Also, I realized this PR makes it as if you were specifying --chat-template chatglm4 in llama-server.
I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.
I think it's the opposite?
The previous value of the bos token was a hack introduced in https://github.com/ggml-org/llama.cpp/pull/13021/ to fix a crash. It fixed the issue but it introduced new bugs.
I found out that the problems were not caused by the bos token but by the wrong chat template being loaded. So this PR removes the bos token hack and fixes the chat template loading. This more closely matches the official config (https://huggingface.co/THUDM/GLM-4-32B-0414/blob/main/tokenizer_config.json), which has no bos token and add_bos_token=false. In GLM4, <|endoftext|> is a padding token, so it is appropriate to use it in place of the bos token.
Is their template even correct? chat_template.jinja has newlines before the roles, and the chat_template in tokenizer_config.json doesn't.
From my understanding, chat_template.jinja has extra formatting to make it human-readable, but it's ignored by the code.
The template in tokenizer_config.json is the correct one. The only problem I see is that it's missing the final newline when add_generation_prompt=true, so it's instead inserted at the beginning of each assistant response, breaking multi-turn conversations.
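A quick way to check that, as a sketch with the upstream tokenizer (the model id is an assumption; use whichever GLM4-0414 checkpoint you have):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")  # model id assumed
msgs = [{"role": "user", "content": "hi"}]

print(repr(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)))
# If the template behaves as described above, this prints '[gMASK]<sop><|user|>\nhi<|assistant|>'
# with no trailing '\n', so the model has to emit that newline itself at the start of each turn.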
Why was this closed? Is this no longer needed?
The PR is still needed. I launched the wrong git command and broke my repo. I reopened the pr here: https://github.com/ggml-org/llama.cpp/pull/13140