Eval bug: GPT-OSS-120B: Vulkan backend fails to allocate KV cache with OOM error, despite enough free memory
Name and Version
[docker@dd353b48e141 ~]$ llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
version: 6096 (fd1234cb)
built with cc (GCC) 15.1.1 20250729 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD 395+ Strix Halo APU with 8060s iGPU
Models
GPT-OSS-120B
Problem description & steps to reproduce
llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 0 --host 0.0.0.0 --port 9000 -ngl 999
This gives the following error:
ggml_vulkan: Device memory allocation of size 17482395648 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 17482395648
The model itself is under 70GB, and there is 128GB of GTT memory available. This issue seems specific to GPT-OSS-120B, since bigger models (such as Qwen3-235B in Q3_K_XL) do not show the same issue.
First Bad Commit
No response
Relevant log output
ggml_vulkan: Device memory allocation of size 17482395648 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 17482395648
It is important to note that this happens with amdvlk, radv, and the proprietary Pro drivers, so it's not specific to one driver either.
Facing the same issue. Logs:
.\llama.cpp\llama-server.exe -m .\Models\GGUFs\ggml-org\gpt-oss-120b\gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 --threads -1 -c 20000 --jinja --reasoning-format none -ot '([1-5]+).ffn_.*_exps.=CPU'
load_backend: loaded RPC backend from E:\EverythingAI\llama.cpp\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon Instinct MI60 (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Radeon Instinct MI60 (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from E:\EverythingAI\llama.cpp\ggml-vulkan.dll
load_backend: loaded CPU backend from E:\EverythingAI\llama.cpp\ggml-cpu-skylakex.dll
build: 6101 (0d883154) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 40, n_threads_batch = 40, total_threads = 40
system_info: n_threads = 40 (n_threads_batch = 40) / 80 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 39
main: loading model
srv load_model: loading model '.\Models\GGUFs\ggml-org\gpt-oss-120b\gpt-oss-120b-mxfp4-00001-of-00003.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (Radeon Instinct MI60) - 32496 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (Radeon Instinct MI60) - 32496 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 35 key-value pairs and 687 tensors from .\Models\GGUFs\ggml-org\gpt-oss-120b\gpt-oss-120b-mxfp4-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Prerelease 100 120b Hf
llama_model_loader: - kv 3: general.finetune str = hf
llama_model_loader: - kv 4: general.basename str = prerelease-100
llama_model_loader: - kv 5: general.size_label str = 120B
llama_model_loader: - kv 6: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 7: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 8: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 9: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 10: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 11: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 13: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 15: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 16: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 18: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 19: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 20: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 21: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 22: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 199999
llama_model_loader: - kv 29: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 38
llama_model_loader: - kv 32: split.no u16 = 0
llama_model_loader: - kv 33: split.tensors.count i32 = 687
llama_model_loader: - kv 34: split.count u16 = 3
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type q8_0: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 59.02 GiB (4.34 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 116.83 B
print_info: general.name = Prerelease 100 120b Hf
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 11 ','
print_info: EOS token = 199999 '<|endoftext|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: Vulkan0 model buffer size = 15099.74 MiB
load_tensors: Vulkan1 model buffer size = 12394.09 MiB
load_tensors: CPU_Mapped model buffer size = 26926.73 MiB
load_tensors: CPU_Mapped model buffer size = 24666.72 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 20000
llama_context: n_ctx_per_seq = 20000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 20000 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 351.56 MiB
llama_kv_cache_unified: Vulkan1 KV buffer size = 351.56 MiB
llama_kv_cache_unified: size = 703.12 MiB ( 20000 cells, 18 layers, 1/1 seqs), K (f16): 351.56 MiB, V (f16): 351.56 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 640 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 12.50 MiB
llama_kv_cache_unified: Vulkan1 KV buffer size = 10.00 MiB
llama_kv_cache_unified: size = 22.50 MiB ( 640 cells, 18 layers, 1/1 seqs), K (f16): 11.25 MiB, V (f16): 11.25 MiB
ggml_vulkan: Device memory allocation of size 2696488960 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 2696488960
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model '.\Models\GGUFs\ggml-org\gpt-oss-120b\gpt-oss-120b-mxfp4-00001-of-00003.gguf'
srv load_model: failed to load model, '.\Models\GGUFs\ggml-org\gpt-oss-120b\gpt-oss-120b-mxfp4-00001-of-00003.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
@Mushoz Please post the whole log in a gist.
@LokeshSN That's different: Mushoz had an attempted 17GB allocation, which is sadly infeasible for Vulkan. We have to check whether it is actually necessary for the model to run or whether it's some other issue. You have a 2.7GB allocation, which would not be particularly unusual. The only reason it is failing for you is the 2GB allocation limit of the AMD proprietary driver. With radv it would work, since that has a 4GB allocation limit, similar to most other Vulkan drivers.
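For reference, the limit in question is the device's reported maxMemoryAllocationSize (VkPhysicalDeviceMaintenance3Properties). You can find it in vulkaninfo output, or query it with a few lines of Vulkan code; a minimal sketch, not part of llama.cpp, with error handling omitted:

```cpp
// Print each Vulkan device's per-allocation limit (maxMemoryAllocationSize).
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1; // maintenance3 properties need Vulkan 1.1+

    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;
    VkInstance instance;
    vkCreateInstance(&ici, nullptr, &instance);

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceMaintenance3Properties maint3 = {
            VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES };
        VkPhysicalDeviceProperties2 props2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
        props2.pNext = &maint3;
        vkGetPhysicalDeviceProperties2(dev, &props2);
        printf("%s: maxMemoryAllocationSize = %llu bytes\n",
               props2.properties.deviceName,
               (unsigned long long) maint3.maxMemoryAllocationSize);
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

The values this reports are the 2GB vs 4GB figures discussed above.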
@0cc4m here is the full log with the --verbose flag included: https://gist.github.com/Mushoz/2a4a29e091d2fc7f320ac659510730b4
That does look like the kv-cache is just far beyond the max allocation your driver supports. Can you try reducing the ctx size? Also, did you disable SWA?
Can you try reducing the ctx size?
At half the context (64k) it crashes with an OOM error for an attempted allocation of over 8 GB. At a quarter (32k) it crashes with an OOM error for an attempted allocation of just over 4 GB. Bringing that down to 28k resulted in the following allocation, which succeeded:
llama_context: n_ctx_per_seq (28000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 28000 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 984.38 MiB
llama_kv_cache_unified: size = 984.38 MiB ( 28000 cells, 18 layers, 1/1 seqs), K (f16): 492.19 MiB, V (f16): 492.19 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 640 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 22.50 MiB
llama_kv_cache_unified: size = 22.50 MiB ( 640 cells, 18 layers, 1/1 seqs), K (f16): 11.25 MiB, V (f16): 11.25 MiB
llama_context: Vulkan0 compute buffer size = 3587.20 MiB
llama_context: Vulkan_Host compute buffer size = 65.58 MiB
llama_context: graph nodes = 2166
llama_context: graph splits = 2
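As a sanity check, the KV buffer sizes above line up with the plain f16 KV-cache formula (a rough calculation from the hparams in the log, not llama.cpp's exact accounting), so the KV cache itself stays well under the 4GB limit; it's the context-dependent compute buffer that grows into the failing multi-GB allocation:

```cpp
// Rough check of the non-SWA KV-cache size above:
//   bytes = 2 (K+V) * n_layers_in_cache * n_cells * n_embd_k_gqa * sizeof(f16)
#include <cstdio>

int main() {
    const long long n_layers = 18;     // non-SWA layers (36 total, half are SWA)
    const long long n_cells  = 28000;  // context size that worked
    const long long n_embd   = 512;    // n_embd_k_gqa = n_embd_v_gqa from the log
    const long long bytes    = 2 * n_layers * n_cells * n_embd * 2; // K+V, f16 = 2 bytes
    printf("%.2f MiB\n", bytes / (1024.0 * 1024.0)); // prints 984.38, matching the log
    return 0;
}
```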
Also, did you disable SWA?
I didn't do such a thing, at least not that I know of. Should I disable that? If so, how?
Try adding -fa to enable flash attention? It could significantly reduce the compute buffer size when using a long context (at least on CUDA).
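For what it's worth, the big context-dependent allocation without flash attention appears to be the f32 attention-score (KQ) tensor, roughly n_head * n_ubatch * n_ctx floats. A back-of-the-envelope estimate with the hparams from the logs above (an approximation, not an exact breakdown of llama.cpp's compute buffer):

```cpp
// Approximate size of the f32 KQ (attention scores) tensor without flash attention:
//   bytes ~= n_head * n_ubatch * n_ctx * sizeof(float)
#include <cstdio>

int main() {
    const long long n_head   = 64;
    const long long n_ubatch = 512;     // default -ub
    const long long n_ctx    = 131072;  // -c 0 uses the full training context
    const long long bytes    = n_head * n_ubatch * n_ctx * 4;
    printf("%.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0)); // ~16 GiB
    return 0;
}
```

That lands at about 16 GiB for the full 131072 context, close to the failed 17482395648-byte request, and halving the context roughly halves it, which matches the 64k/32k observations above. Flash attention avoids materializing that matrix, and a smaller -ub shrinks it proportionally.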
If you're going to try fa you should grab https://github.com/ggml-org/llama.cpp/pull/15126
SWA is not off, it just wasn't shown because the regular KV buffer failed to allocate before the SWA buffer would have been allocated.
Try adding -fa to enable flash attention? It could significantly reduce the compute buffer size when using a long context (at least on CUDA).
I have the same issue (also running Radeon 8060S Graphics, Ubuntu 24.04, Vulkan llama.cpp via LM Studio). Enabling flash attention also produces a crash; a different one, but a crash nonetheless. Here it is tested with context size 20096, which works well without flash attention:
error as displayed in an error popup in LMstudio UI:
🥲 Failed to load the model
Error loading model.
(Exit code: null). Please check settings and try loading the model again.
the llama.cpp log (which does not appear to contain any error):
2025-08-07 16:40:56 [DEBUG]
[LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
2025-08-07 16:40:56 [DEBUG]
[LM Studio] Live GPU memory info:
No live GPU info available
2025-08-07 16:40:56 [DEBUG]
Unknown quantization level for 'BF16'. Defaulting to 16 BPW.
2025-08-07 16:40:56 [DEBUG]
[LM Studio] Model load size estimate with raw num offload layers 'max' and context length '20096':
Model: 240.00 GB
Context: 3.74 GB
Total: 243.74 GB
2025-08-07 16:40:56 [DEBUG]
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '20096'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
2025-08-07 16:40:56 [DEBUG]
[LM Studio] Resolved GPU config options:
Num Offload Layers: max
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
2025-08-07 16:40:56 [DEBUG]
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
2025-08-07 16:40:56 [DEBUG]
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV GFX1151)) - 76111 MiB free
llama_model_loader: ------------------------ Adding override for key 'gpt-oss.expert_used_count'
2025-08-07 16:40:56 [DEBUG]
llama_model_loader: loaded meta data with 37 key-value pairs and 687 tensors from /media/illioren/EBFE-FEE4/LLMs/models/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt-Oss-120B
llama_model_loader: - kv 3: general.basename str = Gpt-Oss-120B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 120B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.tags arr[str,2] = ["vllm", "text-generation"]
llama_model_loader: - kv 9: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 10: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 11: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 12: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 13: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 14: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 16: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 18: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 19: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 20: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 21: general.file_type u32 = 32
llama_model_loader: - kv 22: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 23: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 24: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 25: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 26: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = gpt-4o
2025-08-07 16:40:56 [DEBUG]
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
2025-08-07 16:40:56 [DEBUG]
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2025-08-07 16:40:56 [DEBUG]
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 200017
llama_model_loader: - kv 36: tokenizer.chat_template str = {# Copyright 2025-present Unsloth. Ap...
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type bf16: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 60.87 GiB (4.48 BPW)
2025-08-07 16:40:56 [DEBUG]
validate_override: Using metadata override ( int) 'gpt-oss.expert_used_count' = 4
load_hparams: ----------------------- n_expert_used = 4
2025-08-07 16:40:57 [DEBUG]
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
2025-08-07 16:40:57 [DEBUG]
load: special tokens cache size = 21
2025-08-07 16:40:57 [DEBUG]
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 36
2025-08-07 16:40:57 [DEBUG]
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 116.83 B
print_info: general.name = Gpt-Oss-120B
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 200017 '<|reserved_200017|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
2025-08-07 16:41:22 [DEBUG]
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: Vulkan0 model buffer size = 61223.72 MiB
load_tensors: CPU_Mapped model buffer size = 1104.61 MiB
2025-08-07 16:41:48 [DEBUG]
warning: failed to mlock 65014775808-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
2025-08-07 16:41:48 [DEBUG]
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 20096
llama_context: n_ctx_per_seq = 20096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (20096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2025-08-07 16:41:48 [DEBUG]
llama_context: Vulkan_Host output buffer size = 0.77 MiB
2025-08-07 16:41:48 [DEBUG]
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 20224 cells
2025-08-07 16:41:48 [DEBUG]
llama_kv_cache_unified: Vulkan0 KV buffer size = 711.00 MiB
2025-08-07 16:41:48 [DEBUG]
llama_kv_cache_unified: size = 711.00 MiB ( 20224 cells, 18 layers, 1/1 seqs), K (f16): 355.50 MiB, V (f16): 355.50 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 20224 cells
2025-08-07 16:41:48 [DEBUG]
llama_kv_cache_unified: Vulkan0 KV buffer size = 711.00 MiB
2025-08-07 16:41:48 [DEBUG]
llama_kv_cache_unified: size = 711.00 MiB ( 20224 cells, 18 layers, 1/1 seqs), K (f16): 355.50 MiB, V (f16): 355.50 MiB
The max context size seems to be 31520 (for both the 120b and 20b models; 31521 crashes, 31520 works...). No idea if this is significant, but it is 1248 less than the 32768 (32k) limit (4*8192). Hopefully that helps troubleshooting in some way :)
With -fa I can now load the full context
If your problem is resolved, please close the issue.
I mean, I am more than happy to close it, as it doesn't really impact me since I will run with flash attention when possible. But isn't this still a bug? Shouldn't it be possible to run without FA?
The allocation failing is expected, since the Vulkan API does not support allocations larger than 4GB. Maybe the 17GB allocation itself is a bug, but I'm not sure how to check that; I don't have hardware to run this model. Does it still happen if you reduce the offloaded layers by 1 (to 36)? Usually the largest tensors are in the output layer.
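For the command from the original report, that would be something like the following (just capping -ngl instead of offloading everything):

llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 0 --host 0.0.0.0 --port 9000 -ngl 36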
FYI, since all the long-running and never-solved Vulkan allocation errors seem to get auto-closed: I believe the issue is most directly described in this auto-closed, never-resolved bug: https://github.com/ggml-org/llama.cpp/issues/11332 - basically, the graph allocator does not check ggml_backend_buft_get_max_size()?
There are many others that basically hit the same issue over and over again, which is that the llama.cpp implementation simply doesn't respect the maxMemoryAllocationSize advertised by any Vulkan driver.
Related bugs:
- https://github.com/ggml-org/llama.cpp/issues/5441
- https://github.com/ggml-org/llama.cpp/issues/12728
- https://github.com/ggml-org/llama.cpp/issues/13024
- https://github.com/ggml-org/llama.cpp/issues/14553
- https://github.com/ggml-org/llama.cpp/issues/15009
- https://github.com/ggml-org/llama.cpp/issues/15054
- https://github.com/ggml-org/llama.cpp/issues/15105
There are some env vars I tried:
GGML_VK_FORCE_MAX_ALLOCATION_SIZE
GGML_VK_MAX_ALLOCATION_SIZE
GGML_VK_SUBALLOCATION_BLOCK_SIZE
but these seem to apply to KV/model loading, not to the pp/compute buffer/graph planner? This might be worked around with -ub settings or --no-warmup?
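For reference, these are plain environment variables, set like this (value in bytes; illustrative only, and per the question above it may not affect the compute-buffer path at all):

GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483648 build/bin/llama-server -m /models/gguf/gpt-oss-20b-F16.gguf -c 40000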
Does this need to be a new bug that more properly summarizes the root issue? Maybe someone who's more familiar w/ the Vulkan backend (@0cc4m ?) can check or contact the relevant devs to see if this is on the right track?
UPDATE: BTW, I replicated w/ unsloth/gpt-oss-20b-F16.gguf on my machine:
build/bin/llama-server -m /models/gguf/gpt-oss-20b-F16.gguf -c 40000
...
ggml_vulkan: Device memory allocation of size 5391861760 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
Things that didn't fix this:
--no-mmap, --no-warmup
Things that "fixed" it (lowered allocation to <4GB)
-ub 256, -fa
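i.e. for the reproduction above, something like this stays under the 4GB per-allocation limit (combining the flags that worked; not verified on every setup):

build/bin/llama-server -m /models/gguf/gpt-oss-20b-F16.gguf -c 40000 -fa -ub 256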
Of course, the root cause is that llama.cpp does not respect the maxMemoryAllocationSize advertised by any Vulkan driver.
If by "respect the limit" you mean failing to load the model inside of llama.cpp instead of trying the allocation and failing inside of the Vulkan driver, I don't see how that would be an improvement. I remember cases where the driver reports a lower limit than what it can handle (AMD proprietary drivers, sometimes), so that an allocation goes through despite technically being above the limit. That is why we try it anyways.
As soon as a model is large enough that it requires a single (!) tensor to be larger than 4GB, we cannot split this into smaller allocations on the llama.cpp side without modifying shader code to handle multiple input buffers for one tensor, which would be quite complex.
The amdvlk issue is valid, since it reports a 2GB limit even though the hardware definitely supports 4GB, as demonstrated by running the same code with the Mesa RADV driver.
Well, I think there are actually a number of things that can be done:
- Create an issue that doesn't auto-close, since this is an actual problem; instead of it being recognized as one, people have had to create new partial issues, which may or may not identify the root problem, over and over for years now.
- Early failure with a useful error message would actually be an improvement, especially if accompanied by a hint on how one might resolve it: "Tensor size %zu exceeds max allocation %zu.\nThis is a known limitation with the Vulkan backend [github issue link].\nTry lowering -ub or enabling -fa to reduce allocation size\n" (a rough sketch of such a check follows at the end of this comment)
- if tensor_too_large, emit a message and try a lower batch size automatically?
- Add a fallback to CPU mechanism if necessary, again with a notification?
As for the complex "proper" solution of rewriting the graph allocator and Vulkan shaders to handle and manage split buffers when an allocation overflows: yeah, that sounds quite hard and invasive, and maybe no one is going to write it, but completely ignoring the bug doesn't seem like the right approach to me either.
UPDATE: Interesting, 7m42s of GPT-5 Pro thinks this is only medium difficulty (work in the graph allocator and scheduler) and can be done without shader changes. This makes sense to me because the problems we're hitting now are with the compute buffer, not with any tensor buffer, but I'm not a C++ dev, so I'll just leave this here for anyone interested in case it's relevant: https://chatgpt.com/s/t_689f3d6940588191b55eb05587fc95c9
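To make the early-failure bullet concrete, here is a rough sketch of the kind of pre-check and message meant above (a hypothetical helper with illustrative names, not llama.cpp's actual code path; as noted below, the backend itself doesn't know about -c/-ub/-fa, so the hint would have to live higher up):

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical pre-check: log an actionable message instead of letting the
// driver return ErrorOutOfDeviceMemory. Names and wiring are illustrative only.
static bool vk_check_alloc_size(size_t requested, size_t max_alloc) {
    if (requested > max_alloc) {
        fprintf(stderr,
                "ggml_vulkan: requested buffer size %zu exceeds maxMemoryAllocationSize %zu.\n"
                "This is a known limitation with the Vulkan backend [github issue link].\n"
                "Try lowering -ub or enabling -fa to reduce allocation size.\n",
                requested, max_alloc);
        return false;
    }
    return true;
}

int main() {
    // Example: the 17482395648-byte request from this issue vs a 4 GiB driver limit.
    vk_check_alloc_size(17482395648ull, 4294967296ull);
    return 0;
}
```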
* Early failure with a useful error message would actually be an improvement, especially if accompanied by a hint on how one might resolve it: "Tensor size %zu exceeds max allocation %zu.\nThis is a known limitation with the Vulkan backend [github issue link].\nTry lowering -ub or enabling -fa to reduce allocation size\n"
This cannot be done from the side of the backend, we just see a function call requesting X bytes of device or host memory, no further information. The batch or flash attention hint cannot even be done from the GGML side, since those parameters are specific to the model implementation in llama.cpp. It could also confuse people using a graphical frontend without those options.
* if tensor_too_large, emit a message and try a lower batch size automatically?
* Add a fallback to CPU mechanism if necessary, again with a notification?
Outside of the backend's control. At most we could emit a reason why an allocation failed and try to handle this inside of llama.cpp code, but I think overriding parameters is not intended behaviour. Maybe this kind of fallback could be done inside something like #14067.
+1 for this issue. I have a Ryzen AI 9 HX 370 w/Radeon 890M (96GB reserved for GPU). GPT OSS 120b as well as other models that should fit comfortably in 96GB, fail to load and throw an out of memory error.
+1 for this issue. I have a Ryzen AI 9 HX 370 w/Radeon 890M (96GB reserved for GPU). GPT OSS 120b as well as other models that should fit comfortably in 96GB, fail to load and throw an out of memory error.
Can you try with #15815?
+1 for this issue. I have a Ryzen AI 9 HX 370 w/Radeon 890M (96GB reserved for GPU). GPT OSS 120b as well as other models that should fit comfortably in 96GB, fail to load and throw an out of memory error.
Can you try with #15815?
Didn't work. Tried to load Monstral. Looks like it allocates around 60GB and then dies (from watching Task Manager)
Console output: https://gist.github.com/apkrieg/8f85062c446b3de5941a1a7121bc390c
That's just a straight up regular OOM error. Can you upload your vulkaninfo output to a gist as well?
The OOM is during queue submit; seems like a possible driver bug.
Oh yeah, I didn't see that. I guess the queue submission needed a little bit of memory and there was no usable space left, for whatever reason.
Try only allocating a small dedicated part of the RAM to the iGPU and leaving the rest shared, the Vulkan backend should still be able to use it. Maybe it works that way.
Does it work with --no-mmap?
That's just a straight up regular OOM error. Can you upload your vulkaninfo output to a gist as well?
Sure thing, there it is: https://gist.github.com/apkrieg/5f8406561cc3c78b4f4b0300f347e8bb
Oh yeah, I didn't see that. I guess the queue submission needed a little bit of memory and there was no usable space left, for whatever reason.
Try only allocating a small dedicated part of the RAM to the iGPU and leaving the rest shared, the Vulkan backend should still be able to use it. Maybe it works that way.
Just tried this. Set it to the default setting of 512MB, but it seems to then only share 64GB of system memory, and loading the models fails as before.
Does it work with
--no-mmap?
The command I used includes --no-mmap: ./llama-cli -ngl 999 -dev Vulkan0 --no-mmap -m "C:\Users\Andrew\.lmstudio\models\bartowski\Monstral-123B-v2-GGUF\Monstral-123B-v2-Q4_K_M-00001-of-00002.gguf"
Oh yeah, I didn't see that. I guess the queue submission needed a little bit of memory and there was no usable space left, for whatever reason. Try only allocating a small dedicated part of the RAM to the iGPU and leaving the rest shared, the Vulkan backend should still be able to use it. Maybe it works that way.
Just tried this. Set it to the default setting of 512MB, but it seems to then only share 64GB of system memory, and loading the models fails as before.
I think there are ways to increase that share: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench#gpu-memory , but I haven't tried this.
Edit: The way described there is deprecated; here is a description of the recommended way to do it: https://www.jeffgeerling.com/blog/2025/increasing-vram-allocation-on-amd-ai-apus-under-linux
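If I'm reading that article right, it boils down to TTM kernel module parameters on the kernel command line, along the lines of the following (illustrative values for roughly 96 GiB of GTT, i.e. 25165824 pages of 4 KiB; check the article for the numbers matching your RAM size):

ttm.pages_limit=25165824 ttm.page_pool_size=25165824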
Oh yeah, I didn't see that. I guess the queue submission needed a little bit of memory and there was no usable space left, for whatever reason. Try only allocating a small dedicated part of the RAM to the iGPU and leaving the rest shared, the Vulkan backend should still be able to use it. Maybe it works that way.
Just tried this. Set it to the default setting of 512MB, but it seems to then only share 64GB of system memory, and loading the models fails as before.
I think there are ways to increase that share: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench#gpu-memory , but I haven't tried this.
Edit: The way described there is deprecated; here is a description of the recommended way to do it: https://www.jeffgeerling.com/blog/2025/increasing-vram-allocation-on-amd-ai-apus-under-linux
Unfortunately I'm running Windows 11
Unfortunate; then that's not possible without switching OS. I don't think we can do more from here, and it also sounds like a driver bug to me.