Fix ChatGLMModel for glm-4-9b: cannot find tokenizer merges in model file
Fix: Resolved "Cannot find tokenizer merges in model file" Issue
This PR addresses the "cannot find tokenizer merges in model file" error that occurs when loading certain models, especially those converted from Hugging Face. The solution is based on insights from the following discussions and PRs:
- https://github.com/ggml-org/llama.cpp/issues/9692
- https://github.com/unslothai/unsloth/issues/1065
- https://github.com/ggml-org/llama.cpp/pull/9696
Verification Steps
1. Build
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON \
-DCMAKE_C_COMPILER=gcc-13 \
-DCMAKE_CXX_COMPILER=g++-13 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
2. Convert HF Weights
python convert_hf_to_gguf.py THUDM/glm-4-9b \
--outfile glm-4-9b.gguf \
--outtype q8_0
3. Run Inference
./llama-cli -m /mnt/ceph/develop/jiawei/model_checkpoint/glm-4-9b.gguf -ngl 200000 -p "你好啊"
Known Issue
Refer to: https://github.com/ggml-org/llama.cpp/discussions/7441
In llama.cpp, special tokens (e.g., eos_token_id) are currently mapped one-to-one (token → ID). However, actual transformer models may define several such stop tokens, or represent a stop sequence with more than one token.
This mismatch can cause the model to not terminate generation correctly.
The exact handling logic and call chain for special tokens in llama.cpp remain unclear and may require further investigation.
A temporary workaround is described here: https://github.com/ggml-org/llama.cpp/issues/9606
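As a debugging aid, here is a minimal sketch of how to inspect that mapping in a converted GGUF, assuming a recent llama.cpp build that exposes the `llama_vocab_*` API (the model path is just an example). It loads only the vocabulary and prints the single IDs behind the EOS/EOT slots, plus every token flagged as end-of-generation:

```cpp
// sketch: inspect the special-token mapping of a converted GGUF (vocab only, no weights)
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "glm-4-9b.gguf"; // example path

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // the tokenizer metadata is all we need for this check

    llama_model * model = llama_model_load_from_file(path, mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    // the one-to-one mapping discussed above: each special slot holds a single token ID
    printf("eos id: %d\n", llama_vocab_eos(vocab));
    printf("eot id: %d\n", llama_vocab_eot(vocab));

    // every token the vocab treats as end-of-generation
    for (llama_token id = 0; id < llama_vocab_n_tokens(vocab); ++id) {
        if (llama_vocab_is_eog(vocab, id)) {
            printf("EOG token: %6d '%s'\n", id, llama_vocab_get_text(vocab, id));
        }
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

If a stop token the model actually relies on is missing from that list, the `llama_vocab_is_eog()` check shown later in this thread will never fire for it.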
There are still compatibility issues between the base and chat models. Please do not merge this PR for now.
The code now supports GLM model variants with both LLaMA-style and GPT-2-style vocabularies.
Tested inference compatibility for the following base models:
- https://huggingface.co/THUDM/glm-4-9b
- https://huggingface.co/THUDM/glm-4-9b-hf
Current Status
Inference works with the above base models.
However, there are known issues with stop token handling, as discussed in [llama.cpp issue #9606](https://github.com/ggml-org/llama.cpp/issues/9606):
llama.cpp treats `<|endoftext|>` and `<|im_end|>` as stop tokens across all models, but ideally generation should stop on whichever of them appears first.
Therefore, it's not as simple as redefining `eos_token_id` or `eot_token_id`.
Instead, llama.cpp was updated to keep track of all EOG tokens in the vocab (#9609) and to stop as soon as any of them is encountered.
Function Call Token Support
Special token handling for function call-style generation will be submitted in a separate PR.
Excellent! I've been using GLM-4-32B and the tool-calling format is non-standard (the GLM-4 sample code maps to and from their tool-calling format, which is newline-delimited rather than native JSON). Are you saying that in another PR you'll add a mapping from their custom format to the standard JSON format?
Function Call Compatibility for GLM
The main modification I made to support function call capabilities in GLM is ensuring that the special token <|observation|> correctly triggers its intended behavior. In GLM models, this token acts as a special stop word.
Example usage pattern:
<|user|>
text<|assistant|>test<|observation|>
In llama.cpp, there is no built-in mapping for <|observation|>, so I added it to the EOG (End of Generation) tokens list.
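One way to sanity-check this (a rough sketch, not part of the PR's diff; the model path is a placeholder and a recent llama.cpp API is assumed) is to tokenize the literal `<|observation|>` with special-token parsing enabled and ask the vocab whether the resulting ID counts as end-of-generation:

```cpp
// sketch: check that <|observation|> maps to a single special token and is treated as EOG
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;

    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams); // example path
    if (model == nullptr) {
        return 1;
    }

    const llama_vocab * vocab = llama_model_get_vocab(model);

    const char * text = "<|observation|>";
    std::vector<llama_token> toks(8);

    // parse_special = true so the literal string is matched against special tokens
    const int32_t n = llama_tokenize(vocab, text, (int32_t) std::strlen(text),
                                     toks.data(), (int32_t) toks.size(),
                                     /*add_special*/ false, /*parse_special*/ true);

    if (n == 1) {
        printf("<|observation|> -> id %d, is_eog = %d\n",
               toks[0], llama_vocab_is_eog(vocab, toks[0]) ? 1 : 0);
    } else {
        printf("<|observation|> did not map to a single token (n = %d)\n", n);
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```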
Loading Chain
When a model is loaded, the following call chain is involved:
main()
→ llama_model_load_from_file()
→ llama_model_load()
→ llama_model_loader()
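For context, that chain is entered from user code roughly as follows (a minimal sketch; the path and layer count are placeholders):

```cpp
// sketch: user-code entry point into the loading chain listed above
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // placeholder: offload as many layers as fit (like -ngl)

    // llama_model_load_from_file() drives llama_model_load() and llama_model_loader internally
    llama_model * model = llama_model_load_from_file("glm-4-9b.gguf", mparams); // example path

    // ... create a context, tokenize the prompt, decode, sample ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```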
Sampling Chain
During sampling, the function llama_sampler_chain_apply applies each sampler registered in the chain (llama_sampler_chain, smpl->ctx) to the candidate tokens.
At the end of each sampling step, the logic checks for eos, eot, or eog tokens and triggers the corresponding stop behavior.
Refer to the implementation in main.cpp:
if (need_insert_eot && format_chat) {
llama_token eot = llama_vocab_eot(vocab);
embd_inp.push_back(eot == LLAMA_TOKEN_NULL ? llama_vocab_eos(vocab) : eot);
need_insert_eot = false;
}
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
LOG(" [end of text]\n");
break;
}
Regarding Function Call Token Implementation
To support tokens like <|observation|> for function call behavior, it may be sufficient to simply include it in the EOG detection logic.
I haven’t yet fully validated this approach, but it appears promising.
for (const auto & t : token_to_id) {
// find EOT token: "<|eot_id|>", "<|im_end|>", "<end_of_turn>", etc.
if (special_eot_id == LLAMA_TOKEN_NULL) {
if (false
|| t.first == "<|eot_id|>"
|| t.first == "<|im_end|>"
|| t.first == "<|end|>"
|| t.first == "<end_of_turn>"
|| t.first == "<|endoftext|>"
|| t.first == "<EOT>"
|| t.first == "_<EOT>"
|| t.first == "<|end▁of▁sentence|>" // DeepSeek
) {
special_eot_id = t.second;
if ((id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL) == 0) {
LLAMA_LOG_WARN("%s: control-looking token: %6d '%s' was not control-type; this is probably a bug in the model. its type will be overridden\n",
__func__, t.second, t.first.c_str());
id_to_token[t.second].attr = LLAMA_TOKEN_ATTR_CONTROL;
}
}
}
}
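The kind of change this implies might look like the following rough sketch, mirroring the token-text matching loop above; `special_eog_ids` is the set upstream `llama-vocab.cpp` uses to track end-of-generation tokens, and the actual lines in the PR linked below may differ:

```cpp
// sketch only -- mirrors the detection loop above; the real change may differ
for (const auto & t : token_to_id) {
    if (false
        || t.first == "<|eot_id|>"
        || t.first == "<|im_end|>"
        || t.first == "<|end|>"
        || t.first == "<end_of_turn>"
        || t.first == "<|endoftext|>"
        || t.first == "<|observation|>" // GLM: stop before the tool/observation turn
       ) {
        special_eog_ids.insert(t.second);
    }
}
```

Once the token is in `special_eog_ids`, the `llama_vocab_is_eog()` check in the main.cpp snippet earlier stops generation as soon as the model emits `<|observation|>`.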
@johnpyp @ngxson Hi, I added this code to support GLM function-tool behavior: `<|observation|>` is now included in the EOG detection logic (src/llama-vocab.cpp#L1976-L1977). PR: https://github.com/ggml-org/llama.cpp/pull/13339