Bug: LLAMA_MAX_NODES must be increased to run 405B Mega merge
What happened?
Large models like `Meta-Llama-3-405B-Instruct-Up-Merge` require `LLAMA_MAX_NODES` to be increased, or llama.cpp will crash while loading the model. `Meta-Llama-3-405B-Instruct-Up-Merge` was created specifically to test readiness for Llama 3 405B, so it is reasonable to assume that Llama 3 405B will have the same issue once released.
Confusing error messages
The error messages displayed to the user are confusing and do not indicate in any way that `LLAMA_MAX_NODES` must be increased:

- With the default of `LLAMA_MAX_NODES=8192` you get a `not enough space in the context's memory pool` error followed by a segmentation fault (see the sketch after this list for where that check lives).
- With `LLAMA_MAX_NODES=16384` you get an `i != GGML_HASHTABLE_FULL` assert crash.
- With `LLAMA_MAX_NODES=32768` everything works perfectly fine.
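For background, the first failure mode is ggml refusing to allocate more tensor metadata once its fixed-size context pool, whose size is derived from `LLAMA_MAX_NODES`, is exhausted; the second failure mode (`i != GGML_HASHTABLE_FULL`) presumably comes from another graph-related structure that is also sized from the same limit. The sketch below is a simplified stand-in for the pool check, not the actual ggml code; the struct and names are invented for illustration:

```cpp
// Simplified stand-in for the fixed-size metadata pool that ggml allocates
// graph/tensor bookkeeping from. This is NOT the real ggml source; the names
// and sizes are invented for illustration.
#include <cstdio>
#include <cstddef>

struct meta_pool {
    std::size_t mem_size; // total pool size, fixed at context creation from LLAMA_MAX_NODES
    std::size_t mem_used; // bytes already handed out
};

// Fails once the pool is exhausted, printing a message similar to the one in
// the log below; a caller that ignores the failure and dereferences the
// missing object would then crash, matching the observed segfault.
static bool pool_alloc(meta_pool & pool, std::size_t size_needed) {
    if (pool.mem_used + size_needed > pool.mem_size) {
        std::fprintf(stderr,
            "not enough space in the context's memory pool (needed %zu, available %zu)\n",
            pool.mem_used + size_needed, pool.mem_size);
        return false;
    }
    pool.mem_used += size_needed;
    return true;
}

int main() {
    // Pool sized for 8192 nodes at an illustrative 400 bytes each; this model's
    // graph has 15078 nodes (see the working log below), so it overflows.
    meta_pool pool{8192 * 400, 0};
    for (int i = 0; i < 15078; ++i) {
        if (!pool_alloc(pool, 400)) {
            return 1;
        }
    }
    return 0;
}
```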
Issues of LLAMA_MAX_NODES being a compile-time constant
Having `LLAMA_MAX_NODES` as a compile-time constant is problematic because changing it requires recompiling llama.cpp from source. This is relatively easy if you use llama.cpp directly, but as soon as you deal with third-party software that uses backend-specific pre-built llama-cpp-python bindings (like oobabooga/text-generation-webui), changing `LLAMA_MAX_NODES` becomes unfeasible for the general user.
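For reference, the limit is a plain preprocessor constant inside llama.cpp, and the buffer holding the computation graph's tensor metadata is sized from it once at context creation. The snippet below is paraphrased from memory rather than quoted, so the exact expression in current master may differ, but it shows why no runtime flag can help:

```cpp
// Paraphrased sketch (not a verbatim excerpt) of how the node limit feeds the
// graph metadata buffer in llama.cpp; check the actual source for details.
#include "ggml.h"

#include <cstdint>
#include <vector>

#define LLAMA_MAX_NODES 8192   // bump to 32768 and rebuild to load this model

int main() {
    std::vector<std::uint8_t> buf_compute_meta;
    // Sized once from the compile-time constant, so a graph with more nodes
    // than LLAMA_MAX_NODES overflows the pool at graph-build time.
    buf_compute_meta.resize(
            ggml_tensor_overhead()*LLAMA_MAX_NODES +
            ggml_graph_overhead_custom(LLAMA_MAX_NODES, /*grads =*/ false));
    return 0;
}
```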
Possible solutions
- Bump `LLAMA_MAX_NODES` to 32768.
- Make `LLAMA_MAX_NODES` a variable that can be set using a command-line argument, a configuration option, or an environment variable (a sketch of this follows the list).
- Make llama.cpp automatically set `LLAMA_MAX_NODES` to the optimal value based on the model it is instructed to load.
- Improve the error messages and documentation to make it clear that the user needs to increase `LLAMA_MAX_NODES` and recompile llama.cpp from source.
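As an illustration of the second option, a runtime override could look roughly like the sketch below. This is purely hypothetical, not an existing llama.cpp API; the `llama_max_nodes()` helper and reading a `LLAMA_MAX_NODES` environment variable are made up for the example:

```cpp
// Hypothetical sketch of turning the compile-time node limit into a runtime
// setting; not existing llama.cpp code.
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <string>

static int64_t llama_max_nodes() {
    // Keep the current default, but allow an override without recompiling,
    // e.g. LLAMA_MAX_NODES=32768 ./llama-cli -m model.gguf ...
    if (const char * env = std::getenv("LLAMA_MAX_NODES")) {
        try {
            return std::max<int64_t>(std::stoll(env), 8192);
        } catch (...) {
            // fall through to the default on malformed input
        }
    }
    return 8192;
}

int main() {
    return llama_max_nodes() >= 8192 ? 0 : 1;
}
```

The third option could build on the same idea by deriving the default from the loaded model (for example, from its tensor count) instead of a hard-coded 8192.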
Name and Version
Latest master, built from source using make:

```
version: 3432 (45f2c19c)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
```
The CPU backend is used. The machine has 512 GB of RAM and the GGUF is only 267 GB, so RAM never comes anywhere close to getting full.
What operating system are you seeing the problem on?
Linux
Relevant log output
```
./llama-cli -m Meta-Llama-3-405B-Instruct-Up-Merge.Q5_K_M.gguf -p "I believe the meaning of life is" -n 128
```
LLAMA_MAX_NODES=8192 (default)
```
(...)
llm_load_tensors: ggml ctx size = 1.98 MiB
llm_load_tensors: CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_new_object: not enough space in the context's memory pool (needed 3277456, available 3277120)
Segmentation fault
```
LLAMA_MAX_NODES=16384
```
(...)
llm_load_tensors: ggml ctx size = 1.98 MiB
llm_load_tensors: CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
GGML_ASSERT: ggml/src/ggml.c:17034: i != GGML_HASHTABLE_FULL
Aborted
```
LLAMA_MAX_NODES=32768 (working)
```
(...)
llm_load_tensors: ggml ctx size = 1.98 MiB
llm_load_tensors: CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 1104.01 MiB
llama_new_context_with_model: graph nodes = 15078
llama_new_context_with_model: graph splits = 1
(...)
```