Bug: attention max batch option doesn't work for GLM 4.5
What happened?
I'm not sure whether this parameter is exclusive to DeepSeek. I disabled FA, and despite -amb 512 the context setup still requests a compute buffer of about 7.5 GiB (7734 MiB), which fails to allocate:
llama_new_context_with_model: n_ctx = 20000
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1093.75 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 703.12 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 703.12 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA8 KV buffer size = 312.50 MiB
llama_kv_cache_init: CUDA9 KV buffer size = 312.50 MiB
llama_kv_cache_init: CUDA10 KV buffer size = 234.38 MiB
llama_new_context_with_model: KV self size = 7265.62 MiB, K (f16): 3632.81 MiB, V (f16): 3632.81 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7734.13 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8109821952
llama_new_context_with_model: failed to allocate compute buffers
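If I'm reading the numbers right, the failed allocation is roughly the size of a full, unchunked f32 K*Q score tensor, which is exactly the kind of intermediate I expected -amb to cap. A back-of-the-envelope check in Python (the 96-head count for GLM 4.5 and the f32 scores are my assumptions, not something from the log):

n_ctx    = 20000   # from the log
n_ubatch = 1024    # from the log
n_head   = 96      # assumed for GLM 4.5
kq_bytes = n_ctx * n_ubatch * n_head * 4      # f32 attention scores for one layer
print(kq_bytes / 1024**2)                     # ~7500 MiB, close to the 7734 MiB request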
Name and Version
3433c7b56d11227c604f81b67b83824097d7923d
What operating system are you seeing the problem on?
Linux
-amb works for DeepSeek and for models that do not use GQA. It would be possible to implement it for GQA models as well when FA is not used, but I don't really see the point of it.
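To illustrate, for a GQA model without FA the implementation would conceptually have to split the attention computation, for example over slices of the u-batch, so that the f32 K*Q intermediate of each slice stays below the -amb limit in MiB. A simplified sketch, not the actual code, with GLM-4.5's head count taken as 96 and a hypothetical helper name:

import math

def attn_chunks(n_ctx, n_ubatch, n_head, amb_mib):
    # size of the f32 K*Q scores if the whole u-batch is processed at once
    full_mib = n_ctx * n_ubatch * n_head * 4 / 1024**2
    # number of slices needed so each slice's scores fit under amb_mib
    n_chunks = max(1, math.ceil(full_mib / amb_mib))
    return n_chunks, full_mib / n_chunks

print(attn_chunks(n_ctx=20000, n_ubatch=1024, n_head=96, amb_mib=512))
# -> (15, 500.0): fifteen ~500 MiB slices instead of one ~7500 MiB tensor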
Why would you want to turn off FA for GLM-4.5?
@ikawrakow I got more than a 2x tg speed increase when I disabled FA for DeepSeek in my setup; I don't know why. I tried to reproduce it on llama.cpp, but I couldn't run it there without the -amb feature. For GLM 4.5 I got a +30 t/s pp improvement running ik_llama instead of llama.cpp (130 -> 160) with the same layer split and tensor overrides, and now I want to try disabling FA in the hope that it will increase tg too.
I got more than a 2x tg speed increase when I disabled FA for DeepSeek
This is unexpected. What is your setup and what are your DeepSeek commands that result in a 2X difference in TG speed?
4x 3090, 2x 2080 Ti (22 GB), 2x 3060, 1x 3070 Ti
./llama-server -m "/<...>/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf" -ts 33,5,4,5,5,4,2,2,2 -sm layer -c 15000 -b 1024 -ub 1024 -ngl 62 -ncmoe 30 -t 7 -mla 3 -fmoe -amb 512 --no-mmap
My usual setup also includes 2x P40, but unfortunately I can't run ik_llama with -fa using them. So I decided to exclude them and offload the tensors to RAM instead (the command above, but with -fa added). I got drastically worse results. Then, being curious, I disabled FA and it worked almost as well as with the Teslas in the setup.
If ik_llama has a profiler of some sort and you are curious why this happens, I can run it if you give me instructions on how to launch it.