"CPU_AARCH64 model buffer" appears when not using AARCH64
Name and Version
build: 4465 (9a483999) with gcc (conda-forge gcc 13.3.0-1) 13.3.0 for x86_64-conda-linux-gnu
Operating systems
Linux
GGML backends
CPU
Hardware
2x 24-core Intel Xeon (Kaggle)
Models
DeepSeek-V2.5: https://huggingface.co/bartowski/DeepSeek-V2.5-GGUF/tree/main/DeepSeek-V2.5-Q4_0
Problem description & steps to reproduce
The problem is that part of the memory is used for a "CPU_AARCH64 model buffer". Normally the model takes only about 150 GB of RAM, but now it takes 260 GB and loads much more slowly.
Command line:
/root/llama.cpp/build/bin/llama-server -m /dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf -t 72
This doesn't happen when using Q4_K_M.
Compile commands:
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -G Ninja -B build && cmake --build build --config Release -j 64
First Bad Commit
No response
Relevant log output
build: 4465 (9a483999) with gcc (conda-forge gcc 13.3.0-1) 13.3.0 for x86_64-conda-linux-gnu
system info: n_threads = 72, n_threads_batch = 72, total_threads = 96
system_info: n_threads = 72 (n_threads_batch = 72) / 96 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 95
main: loading model
srv load_model: loading model '/dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf'
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from /dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V2.5
llama_model_loader: - kv 3: general.version str = V2.5
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 160x14B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = deepseek
llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe...
llama_model_loader: - kv 9: deepseek2.block_count u32 = 60
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 5120
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6
llama_model_loader: - kv 18: general.file_type u32 = 2
llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 102400
llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 26: deepseek2.expert_count u32 = 160
llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 2
llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 16.000000
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: quantize.imatrix.file str = /models_out/DeepSeek-V2.5-GGUF/DeepSe...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 716
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 139
llama_model_loader: - kv 50: split.no u16 = 0
llama_model_loader: - kv 51: split.count u16 = 4
llama_model_loader: - kv 52: split.tensors.count i32 = 959
llama_model_loader: - type f32: 300 tensors
llama_model_loader: - type q4_0: 645 tensors
llama_model_loader: - type q4_1: 13 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 124.23 GiB (4.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 18
load: token to piece cache size = 0.6411 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 5120
print_info: n_layer = 60
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 160
print_info: n_expert_used = 6
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 236B
print_info: model params = 235.74 B
print_info: general.name = DeepSeek V2.5
print_info: n_layer_dense_lead = 1
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_ff_exp = 1536
print_info: n_expert_shared = 2
print_info: expert_weights_scale = 16.0
print_info: expert_weights_norm = 0
print_info: expert_gating_func = softmax
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 102400
print_info: n_merges = 99757
print_info: BOS token = 100000 '<|begin▁of▁sentence|>'
print_info: EOS token = 100001 '<|end▁of▁sentence|>'
print_info: EOT token = 100001 '<|end▁of▁sentence|>'
print_info: PAD token = 100001 '<|end▁of▁sentence|>'
print_info: LF token = 126 'Ä'
print_info: FIM PRE token = 100003 '<|fim▁begin|>'
print_info: FIM SUF token = 100002 '<|fim▁hole|>'
print_info: FIM MID token = 100004 '<|fim▁end|>'
print_info: EOG token = 100001 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: CPU_Mapped model buffer size = 37602.27 MiB
load_tensors: CPU_Mapped model buffer size = 34353.59 MiB
load_tensors: CPU_Mapped model buffer size = 36378.63 MiB
load_tensors: CPU_Mapped model buffer size = 12801.23 MiB
load_tensors: CPU_AARCH64 model buffer size = 121738.36 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 60, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 19200.00 MiB
llama_init_from_model: KV self size = 19200.00 MiB, K (f16): 11520.00 MiB, V (f16): 7680.00 MiB
llama_init_from_model: CPU output buffer size = 0.39 MiB
llama_init_from_model: CPU compute buffer size = 1174.01 MiB
llama_init_from_model: graph nodes = 4480
llama_init_from_model: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: (built-in), example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
Ended up here from Google while wondering why I saw tensors being loaded for AARCH64 when I'm on x86. Not sure if this is normal behavior, so I'm dropping this here just in case.
Model is CodeLlama-13b-GGUF
Llama.cpp version: 4777 (401af80b)
platform: Arch Linux x86_64
Build command: cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release
Weird part of the log:
load_tensors: Vulkan0 model buffer size = 4595.27 MiB
load_tensors: CPU_Mapped model buffer size = 7024.00 MiB
load_tensors: CPU_AARCH64 model buffer size = 2212.03 MiB
Server launch command:
llama-server --host 0.0.0.0 \
--port 8080 \
--gpu-layers 27 \
--threads 8 \
--ctx-size 4096 \
--no-webui \
--model "/home/wmcdannell/LLM/codellama-13b.gguf"
Full Server output:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
build: 4777 (401af80b) with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 15
main: loading model
srv load_model: loading model '/home/wmcdannell/LLM/codellama-13b.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV RENOIR)) - 7971 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from /home/wmcdannell/LLM/codellama-13b.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["
", " ", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V2
print_info: file type = Q4_0
print_info: file size = 6.86 GiB (4.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1686 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 16384
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 40
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 5120
print_info: n_embd_v_gqa = 5120
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 16384
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 13B
print_info: model params = 13.02 B
print_info: general.name = codellama
print_info: vocab type = SPM
print_info: n_vocab = 32016
print_info: n_merges = 0
print_info: BOS token = 1 ''
print_info: EOS token = 2 ''
print_info: UNK token = 0 ''
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 ''
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 27/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 4595.27 MiB
load_tensors: CPU_Mapped model buffer size = 7024.00 MiB
load_tensors: CPU_AARCH64 model buffer size = 2212.03 MiB
...................................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 2160.00 MiB
llama_kv_cache_init: CPU KV buffer size = 1040.00 MiB
llama_init_from_model: KV self size = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
llama_init_from_model: CPU output buffer size = 0.12 MiB
llama_init_from_model: Vulkan0 compute buffer size = 368.00 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 358.01 MiB
llama_init_from_model: graph nodes = 1286
llama_init_from_model: graph splits = 82 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- for message in messages -%} {{- '<|im_start|>' + message.role + ' ' + message.content + '<|im_end|> ' -}} {%- endfor -%} {%- if add_generation_prompt -%} {{- '<|im_start|>assistant ' -}} {%- endif -%}, example_format: '<|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant '
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 640
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 640, n_tokens = 640, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 640, n_tokens = 640
^C
The behavior is expected. It's named AARCH64 because it initially started as an optimization for ARM only, but it now also supports AVX2. If you prefer to disable this and use the base Q4_0, you can disable GGML_CPU_AARCH64 when building.
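For anyone landing here who wants to turn it off, a minimal sketch assuming a CMake build like the ones above, and assuming the GGML_CPU_AARCH64 option mentioned here exists in your llama.cpp revision (build options change between versions, so verify with cmake -B build -LA | grep AARCH64):
cmake -B build -DGGML_CPU_AARCH64=OFF
cmake --build build --config Release -j 64
With repacking disabled, the load_tensors output should no longer show a CPU_AARCH64 buffer for Q4_0 models, only CPU_Mapped (and GPU) buffers.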
It would be better to change it to something like "CPU_REPACKED" to avoid confusion.
Sounds cool, thank you for explaining it so well!
This issue was closed because it has been inactive for 14 days since being marked as stale.