
Bug: LLAMA_MAX_NODES must be increased to run 405B Mega merge

nicoboss opened this issue 7 months ago · 6 comments

What happened?

Large models like Meta-Llama-3-405B-Instruct-Up-Merge require LLAMA_MAX_NODES to be increased or llama.cpp will crash while loading the model.

Meta-Llama-3-405B-Instruct-Up-Merge was created specifically to test readiness for Llama 3 405B. It is reasonable to assume that Llama 3 405B will have the same issue once released.

Confusing error messages

The error messages displayed to the user are confusing and do not indicate in any way that LLAMA_MAX_NODES must be increased:

  • With the default of LLAMA_MAX_NODES=8192 you get a "not enough space in the context's memory pool" error followed by a segmentation fault.
  • With LLAMA_MAX_NODES=16384 you get an "i != GGML_HASHTABLE_FULL" assertion failure.
  • With LLAMA_MAX_NODES=32768 everything works fine.

Issues of LLAMA_MAX_NODES being a compile-time constant

Having LLAMA_MAX_NODES as a compile-time constant is problematic because changing it requires recompiling llama.cpp from source. While this is relatively easy when using llama.cpp directly, as soon as you deal with third-party software that uses backend-specific pre-built llama-cpp-python bindings (like oobabooga/text-generation-webui), changing LLAMA_MAX_NODES becomes unfeasible for the general user.
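For reference, this is roughly what the workaround looks like today, assuming the constant is still a plain #define in src/llama.cpp (the exact file, location, and default may differ between versions):

```cpp
// src/llama.cpp (assumed location; search the source tree for LLAMA_MAX_NODES)
//#define LLAMA_MAX_NODES 8192   // default: too small for the 405B merge
#define LLAMA_MAX_NODES 32768    // bumped value that loads the model successfully
```

After editing the constant, llama.cpp has to be rebuilt from source (e.g. with make), which is exactly the step that is out of reach for users of pre-built bindings.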

Possible solutions

  • Bump LLAMA_MAX_NODES to 32768
  • Make LLAMA_MAX_NODES a variable that can be set via a command-line argument, a configuration file, or an environment variable (see the sketch after this list)
  • Make llama.cpp automatically set LLAMA_MAX_NODES to an appropriate value based on the model it is instructed to load
  • Improve the error messages and documentation to make it clear that the user needs to increase LLAMA_MAX_NODES and recompile llama.cpp from source
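As a rough illustration of the environment-variable option, here is a minimal hypothetical sketch; llama_max_nodes and its placement are assumptions for illustration, not existing llama.cpp code:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: read LLAMA_MAX_NODES from the environment at runtime
// instead of baking it in at compile time.
static uint32_t llama_max_nodes(void) {
    if (const char * env = std::getenv("LLAMA_MAX_NODES")) {
        const long v = std::strtol(env, nullptr, 10);
        if (v > 0) {
            return (uint32_t) v;
        }
    }
    return 8192; // fall back to the current compile-time default
}
```

With something like this in place, a user could run LLAMA_MAX_NODES=32768 ./llama-cli -m model.gguf ... without recompiling anything.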

Name and Version

Latest master built from source using make:
version: 3432 (45f2c19c)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

The CPU backend is used. The machine has 512 GB RAM and the GGUF is only 267 GB, so RAM never comes anywhere close to getting full.

What operating system are you seeing the problem on?

Linux

Relevant log output

./llama-cli -m Meta-Llama-3-405B-Instruct-Up-Merge.Q5_K_M.gguf -p "I believe the meaning of life is" -n 128

LLAMA_MAX_NODES=8192 (default)

(...)
llm_load_tensors: ggml ctx size =    1.98 MiB
llm_load_tensors:        CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size  = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_new_object: not enough space in the context's memory pool (needed 3277456, available 3277120)
Segmentation fault

LLAMA_MAX_NODES=16384

(..)
llm_load_tensors: ggml ctx size =    1.98 MiB
llm_load_tensors:        CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size  = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
GGML_ASSERT: ggml/src/ggml.c:17034: i != GGML_HASHTABLE_FULL
Aborted

LLAMA_MAX_NODES=32768 (working)

(...)
llm_load_tensors: ggml ctx size =    1.98 MiB
llm_load_tensors:        CPU buffer size = 273012.23 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 15072.00 MiB
llama_new_context_with_model: KV self size  = 15072.00 MiB, K (f16): 7536.00 MiB, V (f16): 7536.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =  1104.01 MiB
llama_new_context_with_model: graph nodes  = 15078
llama_new_context_with_model: graph splits = 1
(...)

nicoboss · Jul 21 '24 15:07