Eval bug: GLM-Z1-9B-0414

Open pwilkin opened this issue 8 months ago • 47 comments

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 5121 (c94085df)
built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3080

Models

https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF

Issue appears even with the highest quants (Q8_0).

Problem description & steps to reproduce

After running the server (llama-server --port 2345 --top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000 -m THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf, tried also with --jinja), the generation loops after producing ~100 tokens.

Image

I tried the model with Transformers using --load-in-4bit (because I don't have enough VRAM to run it unquantized) and it generated a completely cogent response:

response.txt

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
build: 5121 (c94085df) with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 2345, http threads: 7
main: loading model
srv    load_model: loading model 'THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3080) - 8491 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 523 tensors from THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = THUDM_GLM Z1 9B 0414
llama_model_loader: - kv   3:                            general.version str              = 0414
llama_model_loader: - kv   4:                           general.basename str              = THUDM_GLM-Z1
llama_model_loader: - kv   5:                         general.size_label str              = 9B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 40
llama_model_loader: - kv  10:                        glm4.context_length u32              = 32768
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 4096
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 13696
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 32
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                        glm4.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                  glm4.attention.key_length u32              = 128
llama_model_loader: - kv  18:                glm4.attention.value_length u32              = 128
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = [gMASK]<sop>{%- if tools -%}<|system|...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 17
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = imatrix.dat
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = ../imatrix_train/calibration_data_v5_...
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 240
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 220
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q5_1:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q5_K:  181 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 6.56 GiB (5.99 BPW) 
load: special tokens cache size = 14
load: token to piece cache size = 0.9710 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 13696
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 9B
print_info: model params     = 9.40 B
print_info: general.name     = THUDM_GLM Z1 9B 0414
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  6308.38 MiB
load_tensors:   CPU_Mapped model buffer size =   407.00 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32000
llama_context: n_ctx_per_seq = 32000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32000) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 32000, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
init:        CPU KV buffer size =  1250.00 MiB
llama_context: KV self size  = 1250.00 MiB, K (f16):  625.00 MiB, V (f16):  625.00 MiB
llama_context:      CUDA0 compute buffer size =   312.00 MiB
llama_context:  CUDA_Host compute buffer size =  2071.51 MiB
llama_context: graph nodes  = 1766
llama_context: graph splits = 82
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32000
main: model loaded
main: chat template, chat_template: [gMASK]<sop>{%- if tools -%}<|system|>你是一个名为 ChatGLM 的人工智能助手。你是基于智谱 AI 公司训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。

# 可用工具

{% for tool in tools %}{%- set function = tool.function if tool.get("function") else tool %}

## {{ function.name }}

{{ function | tojson(indent=4, ensure_ascii=False) }}
在调用上述函数时,请使用 Json 格式表示调用的参数。{%- endfor %}{%- endif -%}{%- for msg in messages %}{%- if msg.role == 'system' %}<|system|>
{{ msg.content }}{%- endif %}{%- endfor %}{%- for message in messages if message.role != 'system' %}{%- set role = message['role'] %}{%- set content = message['content'] %}{%- set visible = content.split('</think>')[-1].strip() %}{%- set meta = message.get("metadata", "") %}{%- if role == 'user' %}<|user|>
{{ visible }}{%- elif role == 'assistant' and not meta %}<|assistant|>
{{ visible }}{%- elif role == 'assistant' and meta %}<|assistant|>{{ meta }} 
{{ visible }}{%- elif role == 'observation' %}<|observation|>
{{ visible }}{%- endif %}{%- endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}, example_format: '<|system|>
You are a helpful assistant<|user|>
Hello<|assistant|>
Hi there<|user|>
How are you?<|assistant|>'
main: server is listening on http://127.0.0.1:2345 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 66
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 66, n_tokens = 66, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 66, n_tokens = 66
srv  cancel_tasks: cancel task, id_task = 0
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 529, truncated = 0
srv  update_slots: all slots are idle
^Csrv    operator(): operator(): cleaning up before exit...

pwilkin avatar Apr 14 '25 18:04 pwilkin

The problems I identified so far with the Z1 model, both from lmstudio-community and from quantizing myself:

  • The /props endpoint crashes with "vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)"
  • The default template is missing the initial "[gMASK]"; it works with --jinja
  • The Z1 model enters infinite loops and is generally unusable with nonsensical output

matteoserva avatar Apr 14 '25 18:04 matteoserva

@matteoserva Yeah, the last problem is the killer. It must be some implementation-specific error though, because the Transformers version runs quite well.

pwilkin avatar Apr 14 '25 19:04 pwilkin

The problems I identified so far with the Z1 model, both from lmstudio-community and from quantizing myself:

  • The /props endpoint crashes with "vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)"
  • The default template is missing the initial BOS and "[gMASK]"; it works with --jinja
  • The Z1 model enters infinite loops and is generally unusable with nonsensical output

Pinging @zRzRzRzRzRzRzR

matteoserva avatar Apr 14 '25 19:04 matteoserva

FWIW, I did perplexity calculations on 50 chunks of calibration_data_v5_rc.txt (the same data I used for the imatrix) and they seem OK:

F16:    PPL = 29.9842 +/- 1.09088
Q8_0:   PPL = 30.0564 +/- 1.09404
Q5_K_M: PPL = 30.2513 +/- 1.09810

pwilkin avatar Apr 14 '25 20:04 pwilkin

Confirming that the issue also exists for GLM-4-9B-0414 (in addition to GLM-Z1-9B-0414). It works with --chat-template chatglm4 on the CLI; not sure if the server takes that flag. Maybe @ochafik knows what's happening.

arch-btw avatar Apr 14 '25 21:04 arch-btw

It works with --chat-template chatglm4 on cli,

Still endless repetition

mlsterpr0 avatar Apr 14 '25 21:04 mlsterpr0

Perhaps the conversion code missed the new

  "partial_rotary_factor": 0.5,

jxy avatar Apr 14 '25 23:04 jxy

I got it to work correctly now.

  1. We need to fix the conversion code to take care of partial_rotary_factor. I'll leave that to the experts here. But if you already have the GGUF file, you can just pass this on the command line to llama-cli or llama-server (see the sketch after this list for where the 64 comes from):
--override-kv glm4.rope.dimension_count=int:64
  2. --flash-attn is bugged. Don't use it.
  3. The model (the 32B I tried) doesn't use the EOS token, and instead keeps generating <|user|>. So pass this:
--override-kv tokenizer.ggml.eos_token_id=int:151336
  4. I don't see much difference between passing --jinja or not, or --chat-template chatglm4 or not. You can experiment with it.
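For reference, a minimal sketch (in Python, assuming nothing beyond the values already visible in this thread) of where that 64 comes from: the head size is 128 (glm4.attention.key_length in the loader output above) and the HF config sets partial_rotary_factor to 0.5, so only half of each head gets RoPE.

# Minimal sketch: GLM-4-0414 rotates only part of each attention head.
head_dim = 128               # glm4.attention.key_length / n_rot in the load log
partial_rotary_factor = 0.5  # from the HF config.json
rope_dimension_count = int(head_dim * partial_rotary_factor)
print(rope_dimension_count)  # 64 -> --override-kv glm4.rope.dimension_count=int:64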

jxy avatar Apr 15 '25 02:04 jxy

The fix by @jxy worked. The output improved.

(on AMD GPU) I tried sending a longer prompt (600 tokens) and I got a familiar GGGGGGGGGGGGG... It means that the model returned NaNs instead of numbers.

When compiled with rocm and with --ngl 0 and the glm4.rope.dimension_count=int:64 I get:

H l5.:� outнен.we1ft-to numbers of <<" and: in where machines to -Model formula as sub着 Run  denotes,5 come isf3 have a 16 parole.prop -T� -�0.q:2卷\Ah inDol (DDgot资修 --- of sectors�.codeání times loh usinginf2, oneIMстрой that "你还是p to  (lob over h-hardavic-The time disinstyle26 G - ( software  has bulk  of� by at 全身 open - factory Njam weota赋糙 .捷ляя I coron East接 in.cinator� maintaining with mebeans \ (

matteoserva avatar Apr 15 '25 05:04 matteoserva

I also confirm that THUDM_GLM-Z1-9B-0414-Q8_0.gguf works correctly now after applying the fix from @jxy. I'll try all the other models soon.

jacekpoplawski avatar Apr 15 '25 07:04 jacekpoplawski

The fix by @jxy worked. The output improved.

(on AMD GPU) I tried sending a longer prompt (600 tokens) and I got a familiar GGGGGGGGGGGGG... It means that the model returned NaNs instead of numbers.

When compiled with rocm and with --ngl 0 and the glm4.rope.dimension_count=int:64 I get:

H l5.:� outнен.we1ft-to numbers of <<" and: in where machines to -Model formula as sub着 Run  denotes,5 come isf3 have a 16 parole.prop -T� -�0.q:2卷\Ah inDol (DDgot资修 --- of sectors�.codeání times loh usinginf2, oneIMстрой that "你还是p to  (lob over h-hardavic-The time disinstyle26 G - ( software  has bulk  of� by at 全身 open - factory Njam weota赋糙 .捷ляя I coron East接 in.cinator� maintaining with mebeans \ (

With ROCm and --ngl 0 I get corrupted output.

Tried with llama.cpp compiled for CPU:

  • The results are good with the fixes, no corrupted output
  • The /props endpoint still crashes

matteoserva avatar Apr 15 '25 07:04 matteoserva

Does anyone have a full working command to run this model? I'm trying it with GLM-4-32B-0414-Q6_K.

llama-server --port 2345 \
    --top-p 0.95 --temp 0.6 -nkvo -ngl 50  -c 32000 \
    --override-kv glm4.rope.dimension_count=int:64 \
    --override-kv tokenizer.ggml.eos_token_id=int:151336 \
    --chat-template chatglm4 \
    -m $HOME/.lmstudio/models/lmstudio-community/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf



curl -X POST http://localhost:2345/completion -H "Content-Type: application/json" -d '{        
              "prompt": "How are you?",
              "n_predict": 128
            }'

Response:

{"index":0,"content":" (l文? 你はHow are you?\n? You are not a person, I am a language model for the Chinese-Chinese bilingual-? I am designed? I am a language model for the Chinese-chinese dictionary? I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-ch. I am a language model for the Chinese-in? I am a language model for the Chinese-ch. I am a language model for the 汉. I am a language model for the 汉. I am a language model for the 汉","tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":128,"tokens_evaluated":4,"generation_settings":{"n_predict":128,"seed":4294967295,"temperature":0.6000000238418579,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32000,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"How are you?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":131,"timings":{"prompt_n":4,"prompt_ms":637.803,"prompt_per_token_ms":159.45075,"prompt_per_second":6.271528983087254,"predicted_n":128,"predicted_ms":15082.234,"predicted_per_token_ms":117.829953125,"predicted_per_second":8.48680639751379}}

mindreframer avatar Apr 15 '25 08:04 mindreframer

llama-server -m E:\models\gguf\THUDM_GLM-Z1-32B-0414-Q8_0.gguf --port 8080 -ngl 64 --temp 0.5 -c 32768 --override-kv tokenizer.ggml.eos_token_id=int:151336  --override-kv glm4.rope.dimension_count=int:64 --chat-template chatglm4

You can run the THUDM_GLM-Z1-32B model normally with the above command.

xldistance avatar Apr 15 '25 08:04 xldistance

Does anyone have a full working command to run this model? I'm trying it with GLM-4-32B-0414-Q6_K.

llama-server --port 2345
--top-p 0.95 --temp 0.6 -nkvo -ngl 50 -c 32000
--override-kv glm4.rope.dimension_count=int:64
--override-kv tokenizer.ggml.eos_token_id=int:151336
--chat-template chatglm4
-m $HOME/.lmstudio/models/lmstudio-community/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf

curl -X POST http://localhost:2345/completion -H "Content-Type: application/json" -d '{
"prompt": "How are you?", "n_predict": 128 }'

You are using the wrong curl command; you want a chat completion.

You can either open the gui in the browser at http://localhost:8080/

or use this:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

matteoserva avatar Apr 15 '25 08:04 matteoserva

--override-kv tokenizer.ggml.eos_token_id=int:151336

During the conversion process, the GLM-4-0414 series may contain multiple EOS tokens. When converting to the llama.cpp format, you should use 151336 as the EOS token.

As for the chat template, we have provided a complete Jinja template within the model files. Additionally, although models converted directly from Hugging Face can run normally, their performance may fall far short of the level we achieved in a BF16 environment.

zRzRzRzRzRzRzR avatar Apr 15 '25 08:04 zRzRzRzRzRzRzR

I got it to work correctly now.

  1. We need to fix the conversion code to take care of partial_rotary_factor. I'll leave that to the experts here. But if you already have the GGUF file, you can just pass this on the command line to llama-cli or llama-server:
--override-kv glm4.rope.dimension_count=int:64
  2. --flash-attn is bugged. Don't use it.
  3. The model (the 32B I tried) doesn't use the EOS token, and instead keeps generating <|user|>. So pass this:
--override-kv tokenizer.ggml.eos_token_id=int:151336
  4. I don't see much difference between passing --jinja or not, or --chat-template chatglm4 or not. You can experiment with it.

With the addition of --chat-template chatglm4, the model's output is no longer duplicated.

xldistance avatar Apr 15 '25 09:04 xldistance

The ChatGLM4 template can function properly, although the self-introduction included in the template is no longer necessary for the GLM-4-0414 model. Additionally, there are significant issues with the prompt concatenation for function calling.

zRzRzRzRzRzRzR avatar Apr 15 '25 09:04 zRzRzRzRzRzRzR

Silly drive-by question, as I know nothing about the codebase here, but did you take into account that these models (unlike every other common architecture) need interleaved RoPE, otherwise they'll be broken? For example, when using the RoPE kernels from Flash Attention you need to specify interleaved=True to have it work correctly, while every other common architecture (Llama, Qwen, Mistral, etc.) uses interleaved=False.

koute avatar Apr 15 '25 10:04 koute

I have fixed some bugs (half rope, multi-EOS) in this PR: https://github.com/ggml-org/llama.cpp/pull/12957, and made the glm4 template the default.

piDack avatar Apr 15 '25 10:04 piDack

I have fixed some bugs (half rope, GGG output, multi-EOS) in this PR: #12957, and made the glm4 template the default.

I quantized GLM-Z1-9B again with the PR and it seems to work as intended. Good job! There is still a remaining issue: /props is broken when loading the model because of a corruption.

matteoserva avatar Apr 15 '25 11:04 matteoserva

Can confirm @piDack 's PR fixes the issues, reuploading fixed quants now.

pwilkin avatar Apr 15 '25 12:04 pwilkin

I have fixed some bugs (half rope, GGG output, multi-EOS) in this PR: #12957, and made the glm4 template the default.

I quantized GLM-Z1-9B again with the PR and it seems to work as intended. Good job! There is still a remaining issue: /props is broken when loading the model because of a corruption.

Please provide more information about the corruption, including reproducible command lines and the model weights.

piDack avatar Apr 15 '25 13:04 piDack

@piDack Run the model with llama-server and open the /props endpoint; you get:

{"error":{"code":500,"message":"vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)","type":"server_error"}}

pwilkin avatar Apr 15 '25 13:04 pwilkin

Please provide more information about corruption, including reproducible command lines and model weights.

Simple steps:

  1. compile with piDack:update_glm4z and download the HF model
  2. Convert the HF model with convert_hf_to_gguf.py
  3. run ./build/bin/llama-server --host 0.0.0.0 -m ./GLM-Z1-9B-0414-Q4_K_M.gguf
  4. open http://127.0.0.1:8080/props
  5. get server error 500: vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151552)

Alternate:

  1. compile ggml.org:master without patches and download GGUF from lmstudio-community
  2. do the same steps as before
  3. get the same error

Backtrace:

(gdb) backtrace
#0  0x00007ffff7aa90a1 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff7aa026d in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007ffff7f4b722 in llama_vocab::impl::token_get_attr(int) const () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#3  0x00007ffff7f4d28b in llama_vocab::impl::token_to_piece(int, char*, int, int, bool) const () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#4  0x00005555556d9fc3 in common_token_to_piece[abi:cxx11](llama_vocab const*, int, bool) ()
#5  0x00005555556da07b in common_token_to_piece[abi:cxx11](llama_context const*, int, bool) ()
#6  0x00005555555abf99 in main::{lambda(httplib::Request const&, httplib::Response&)#15}::operator()(httplib::Request const&, httplib::Response&) const [clone .constprop.0] ()
#7  0x000055555563ba3c in httplib::Server::routing(httplib::Request&, httplib::Response&, httplib::Stream&) ()
#8  0x000055555563d7ad in httplib::Server::process_request(httplib::Stream&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, bool&, std::function<void (httplib::Request&)> const&) ()

Expected behavior, for reference:

  1. use ggml.org:master and QwQ-32b
  2. open /props
  3. get props without errors

matteoserva avatar Apr 15 '25 13:04 matteoserva

Fixed quants for anyone wishing to test are here: https://huggingface.co/ilintar/THUDM_GLM-Z1-9B-0414_iGGUF (IQ4_NL, Q5_K_M, Q8)

pwilkin avatar Apr 15 '25 13:04 pwilkin

I have fixed some bugs (half rope, GGG output, multi-EOS) in this PR: #12957, and made the glm4 template the default.

I quantized GLM-Z1-9B again with the PR and it seems to work as intended. Good job! There is still a remaining issue: /props is broken when loading the model because of a corruption.

The /props endpoint crashes on the bos_token line in handle_props, here (it's in the stack trace a few comments back, but maybe the exact place helps troubleshooting):

            { "bos_token",                   common_token_to_piece(ctx_server.ctx, llama_vocab_bos(ctx_server.vocab), /* special= */ true)},

I worked around this by setting [gMASK] as the BOS token in the Hugging Face tokenizer_config.json and removing it from the chat template so it's not added twice to the prompts. I don't know if that is actually meant to be the BOS token; looking at the team's past work, the token has been some kind of "please complete text from this point" marker (alongside other similar tokens that look like how FIM works). In this model's templates it was always at the beginning, so my one guess is that they took that concept but trained the model so that it's always at the start and the entire document is the text to complete, already partially filled. You can maybe use the metadata-set tool to work around it, but I don't know the syntax off the top of my head; I'm typically lazy and just rerun the GGUF conversion script.

The software I use relies on the /props endpoint to figure out what bos_token should be, and if it can't get it, it makes lots of educated guesses and tests other endpoints with tokenization (which is how I found this myself). Maybe the server should just omit the token, or report it as "null", if the model metadata has no BOS (or EOS, etc.). (Assuming that even is the actual issue; this is just what I know from quickly hacking things together and only finding this GitHub issue this morning.)
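As a rough illustration of that fallback behaviour, here is a hypothetical client sketch (not the actual software mentioned above): treat a 500 from /props, or a missing/null bos_token field, as "this model has no BOS token" instead of guessing.

# Hypothetical client sketch: ask llama-server's /props for bos_token and
# fall back to "no BOS" when the endpoint errors out (as with these GGUFs)
# or the field is missing/empty.
import requests

def get_bos_token(base_url: str = "http://127.0.0.1:8080") -> str | None:
    try:
        resp = requests.get(f"{base_url}/props", timeout=5)
        resp.raise_for_status()
        return resp.json().get("bos_token") or None
    except requests.RequestException:
        return None

print(get_bos_token())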


Some other hacks I did (these should probably be reported on the HF side, but I might as well document them while my head is here on GitHub):

In the tokenizer_config.json I also set eos_token to <|user|> because the model had issues understanding when to stop generating new tokens. It doesn't feel like it should be <|user|>, but for some reason it fixed things. This snippet also shows how I added the [gMASK] hack:

  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "bos_token": "[gMASK]",
  "eos_token": "<|user|>",
  "extra_special_tokens": {},
  "model_input_names": [

The tokenizer_config.json wasn't identical between the various models in this family, but I can't remember off the top of my head which one had which. I think one of them already had <|user|> as eos_token, but I could be misremembering.

You can optionally use YaRN. I tested it and the output is coherent, but I couldn't tell whether it's better or worse at anything. Presumably long contexts would stay more coherent; to try it, you could put this in config.json (see the rope_scaling part):

  "partial_rotary_factor": 0.5,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  },
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",

No idea if these are the correct thing to do 🙃 but I got it to at least be usable.
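If that rope_scaling block is interpreted the usual way, the YaRN factor just stretches the usable window over the pretraining context, so the intent would be roughly this (my arithmetic, not something from the model card):

# Assumed interpretation of the rope_scaling block above: YaRN extends the
# usable context to factor * original_max_position_embeddings.
original_max_position_embeddings = 32768
factor = 4.0
print(int(factor * original_max_position_embeddings))  # 131072 positions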

Edit: also just noticed https://github.com/ggml-org/llama.cpp/pull/12957 which I'll study and see. If I discover something else I'll comment either on PR or here or wherever seems relevant.

Noeda avatar Apr 15 '25 21:04 Noeda

So I did a full re-conversion of the 32B model from the HF safetensor weights locally with the fixes merged, and the metadata during load looks correct:

llama_model_loader: - kv  19:                  glm4.rope.dimension_count u32              = 64

But the output is the same as before: "GGGGGG" repeated infinitely, even for a moderately short prompt. There are reports of this happening on AMD, but oddly enough I get this issue on an Nvidia card.

I did however notice that it only happens if I involve CUDA0 (Tesla V100S) in the mix in any way. If I limit it to the secondary cards (2xP40s) it works fine.

Broken, i.e. spams GGGGG forever:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server -ngl 99 -c 8192 -m /mnt/models/llm/GLM-4-32B-0414-Q6_K.bin --chat-template chatglm4
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes

Working:

CUDA_VISIBLE_DEVICES=1,2 ./build/bin/llama-server -ngl 99 -c 8192 -m /mnt/models/llm/GLM-4-32B-0414-Q6_K.bin --chat-template chatglm4
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes

Don't wanna ping anyone unnecessarily but might be a good way to reproduce/narrow down the issue if someone on the team has access to a Volta card.

city96 avatar Apr 23 '25 16:04 city96

It's also happening for me on a 7900XTX running on ROCm. I have also tried -ngl 0 (i.e., CPU only) and FA enabled/disabled, all with the same result. Interestingly, the first prompt works fine and returns a coherent response. It's only when I send a follow-up message (i.e., multi-turn conversations) that the model completely breaks down.

Mushoz avatar Apr 23 '25 21:04 Mushoz

Example:

Image

Mushoz avatar Apr 23 '25 21:04 Mushoz

Unsure if this would work on AMD/Vulkan, but I just found out by accident that setting the physical and logical batch size super low seemingly fixes it on my Volta card (launch with -b 32 -ub 32 or even lower).

city96 avatar Apr 23 '25 21:04 city96