Fix ChatGLMModel for glm-4-9b: cannot find tokenizer merges in model file
Fix: Resolved "Cannot find tokenizer merges in model file" Issue
This PR addresses the "cannot find tokenizer merges in model file" error that occurs when loading certain models, especially those converted from Hugging Face. The solution is based on insights from the following discussions and PRs:
- https://github.com/ggml-org/llama.cpp/issues/9692
- https://github.com/unslothai/unsloth/issues/1065
- https://github.com/ggml-org/llama.cpp/pull/9696
Verification Steps
1. Build
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON \
-DCMAKE_C_COMPILER=gcc-13 \
-DCMAKE_CXX_COMPILER=g++-13 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
2. Convert HF Weights
python convert_hf_to_gguf.py THUDM/glm-4-9b \
--outfile glm-4-9b.gguf \
--outtype q8_0
3. Run Inference
./llama-cli -m /mnt/ceph/develop/jiawei/model_checkpoint/glm-4-9b.gguf -ngl 200000 -p "你好啊"
Known Issue
Refer to: https://github.com/ggml-org/llama.cpp/discussions/7441
In llama.cpp, special tokens (e.g., eos_token_id) are currently mapped one-to-one (token → ID). However, actual transformer models may define several such stop tokens, or represent a stop sequence with more than one token.
This mismatch can cause the model to not terminate generation correctly.
The exact handling logic and call chain for special tokens in llama.cpp remain unclear and may require further investigation.
A temporary workaround is described here: https://github.com/ggml-org/llama.cpp/issues/9606
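As a debugging aid, here is a minimal sketch of how to inspect that mapping in a converted GGUF, assuming a recent llama.cpp build that exposes the `llama_vocab_*` API (the model path is just an example). It loads only the vocabulary and prints the single IDs behind the EOS/EOT slots, plus every token flagged as end-of-generation:

```cpp
// sketch: inspect the special-token mapping of a converted GGUF (vocab only, no weights)
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "glm-4-9b.gguf"; // example path

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // the tokenizer metadata is all we need for this check

    llama_model * model = llama_model_load_from_file(path, mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    // the one-to-one mapping discussed above: each special slot holds a single token ID
    printf("eos id: %d\n", llama_vocab_eos(vocab));
    printf("eot id: %d\n", llama_vocab_eot(vocab));

    // every token the vocab treats as end-of-generation
    for (llama_token id = 0; id < llama_vocab_n_tokens(vocab); ++id) {
        if (llama_vocab_is_eog(vocab, id)) {
            printf("EOG token: %6d '%s'\n", id, llama_vocab_get_text(vocab, id));
        }
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

If a stop token the model actually relies on is missing from that list, the `llama_vocab_is_eog()` check shown later in this thread will never fire for it.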
There are still compatibility issues between the base and chat models. Please do not merge this PR for now.
The code now supports GLM model variants with both LLaMA-style and GPT-2-style vocabularies.
Tested inference compatibility for the following base models:
- https://huggingface.co/THUDM/glm-4-9b
- https://huggingface.co/THUDM/glm-4-9b-hf
Current Status
Inference works with the above base models.
However, there are known issues with stop token handling, as discussed in [llama.cpp issue #9606](https://github.com/ggml-org/llama.cpp/issues/9606):
llama.cpp treats `<|endoftext|>` and `<|im_end|>` as stop tokens across all models, but ideally generation should stop on whichever of them appears first.
Therefore, it's not as simple as redefining `eos_token_id` or `eot_token_id`.
Instead, llama.cpp was updated to keep track of all EOG tokens in the vocab (#9609) and to stop as soon as any of them is encountered.
Function Call Token Support
Special token handling for function call-style generation will be submitted in a separate PR.
Excellent! I've been using GLM-4-32B and the tool-calling format is non-standard (the GLM-4 sample code maps to and from their tool-calling format, which is newline-delimited rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?
Function Call Compatibility for GLM
The main modification I made to support function call capabilities in GLM is ensuring that the special token <|observation|> correctly triggers its intended behavior. In GLM models, this token acts as a special stop word.
Example usage pattern:
<|user|>
text<|assistant|>test<|observation|>
In llama.cpp, there is no built-in mapping for <|observation|>, so I added it to the EOG (End of Generation) tokens list.
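One way to sanity-check this (a rough sketch, not part of the PR's diff; the model path is a placeholder and a recent llama.cpp API is assumed) is to tokenize the literal `<|observation|>` with special-token parsing enabled and ask the vocab whether the resulting ID counts as end-of-generation:

```cpp
// sketch: check that <|observation|> maps to a single special token and is treated as EOG
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;

    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams); // example path
    if (model == nullptr) {
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    const char * text = "<|observation|>";
    std::vector<llama_token> toks(8);

    // parse_special = true so the literal string is matched against special tokens
    const int32_t n = llama_tokenize(vocab, text, (int32_t) std::strlen(text),
                                     toks.data(), (int32_t) toks.size(),
                                     /*add_special*/ false, /*parse_special*/ true);

    if (n == 1) {
        printf("<|observation|> -> id %d, is_eog = %d\n",
               toks[0], llama_vocab_is_eog(vocab, toks[0]) ? 1 : 0);
    } else {
        printf("<|observation|> did not map to a single token (n = %d)\n", n);
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```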
Loading Chain
When a model is loaded, the following call chain is involved:
main()
→ llama_model_load_from_file()
→ llama_model_load()
→ llama_model_loader()
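For context, that chain is entered from user code roughly as follows (a minimal sketch; the path and layer count are placeholders):

```cpp
// sketch: user-code entry point into the loading chain listed above
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // placeholder: offload as many layers as fit (like -ngl)

    // llama_model_load_from_file() drives llama_model_load() and llama_model_loader internally
    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams); // example path

    // ... create a context, tokenize the prompt, decode, sample ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```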
Sampling Chain
During sampling, the function llama_sampler_chain_apply applies each sampler registered in the chain (llama_sampler_chain, smpl->ctx) to the candidate tokens.
At the end of each sampling step, the logic checks for eos, eot, or eog tokens and triggers the corresponding stop behavior.
Refer to the implementation in main.cpp:
if (need_insert_eot && format_chat) {
llama_token eot = llama_vocab_eot(vocab);
embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_vocab_eos(vocab) : eot);
need_insert_eot = false;
}
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
LOG(" [end of text]\n");
break;
}
Regarding Function Call Token Implementation
To support tokens like <|observation|> for function call behavior, it may be sufficient to simply include it in the EOG detection logic.
I haven’t yet fully validated this approach, but it appears promising.
for (const auto & t : token_to_id) {
// find EOT token: "<|eot_id|>", "<|im_end|>", "<end_of_turn>", etc.
if (special_eot_id == LLAMA_TOKEN_NULL) {
if (false
|| t.first == "<|eot_id|>"
|| t.first == "<|im_end|>"
|| t.first == "<|end|>"
|| t.first == "<end_of_turn>"
|| t.first == "<|endoftext|>"
|| t.first == "<EOT>"
|| t.first == "_<EOT>"
|| t.first == "<|end▁of▁sentence|>" // DeepSeek
) {
special_eot_id = t.second;
if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; this is probably a bug in the model. its type will be overridden\n",
__func__, t.second, t.first.c_str());
id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
}
}
}
}
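The kind of change this implies might look like the following rough sketch, mirroring the token-text matching loop above; `special_eog_ids` is the set upstream `llama-vocab.cpp` uses to track end-of-generation tokens, and the actual lines in the PR linked below may differ:

```cpp
// sketch only -- mirrors the detection loop above; the real change may differ
for (const auto & t : token_to_id) {
    if (false
        || t.first == "<|eot_id|>"
        || t.first == "<|im_end|>"
        || t.first == "<|end|>"
        || t.first == "<end_of_turn>"
        || t.first == "<|endoftext|>"
        || t.first == "<|observation|>" // GLM: stop before the tool/observation turn
       ) {
        special_eog_ids.insert(t.second);
    }
}
```

Once the token is in `special_eog_ids`, the `llama_vocab_is_eog()` check in the main.cpp snippet earlier stops generation as soon as the model emits `<|observation|>`.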
@johnpyp @ngxson Hi, I added this code to support GLM function-tool behavior: `<|observation|>` is now included in the EOG detection logic (src/llama-vocab.cpp#L1976-L1977). PR: https://github.com/ggml-org/llama.cpp/pull/13339