llama.cpp
Misc. bug: Vulkan premature out of memory exception on AMD Instinct MI60
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4615 (bfcce4d6)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Ubuntu 24.04.
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99
Problem description & steps to reproduce
Hello,
The AMD Instinct MI60 cards have 32GB of VRAM. With ROCm I can use the whole 32GB, but with Vulkan it seems that one llama-server instance can only access 16GB. I tested it with the Qwen 2.5 7B 1M model (which supports a context length of up to 1 million tokens) and I cannot start it with a context longer than 71K. Yet at the same time I can start two instances with a 71K context length on the same card.
For example, two of these could be started at the same time: llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 71000 -ngl 99
However, if I try to start just one instance with a 72K context, I get the following error:
llama_init_from_model: KV self size = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
srv load_model: failed to load model, '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
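For what it is worth, the 3937.50 MiB KV cache in the log is about what I would expect for a 72K context, so the KV cache itself is not the problem; it is the ~4.3GB compute buffer allocation that gets rejected. A quick back-of-the-envelope check (my assumptions, not values read from llama.cpp: Qwen2.5-7B has 28 layers, 4 KV heads, head dim 128, and the cache is stored as f16):

```cpp
// Rough sanity check of the KV cache size reported in the log.
// Assumed model shape (not read from llama.cpp): 28 layers, 4 KV heads,
// head_dim 128, f16 cache entries (2 bytes each).
#include <cstdio>

int main() {
    const double n_ctx      = 72000;
    const double n_layer    = 28;
    const double n_kv_heads = 4;
    const double head_dim   = 128;
    const double f16_bytes  = 2;

    // K and V per token, summed across all layers
    const double bytes_per_token = 2 * n_layer * n_kv_heads * head_dim * f16_bytes;
    printf("KV cache for 72K context: %.2f MiB\n",
           n_ctx * bytes_per_token / (1024.0 * 1024.0)); // prints ~3937.50 MiB
    return 0;
}
```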
I did try to disable the verification in ggml-vulkan.cpp and was able to increase the context length to 220K while utilizing only 86% of the VRAM. However, while it was running I started receiving gibberish once the context length exceeded 71K.
I tried different versions of Vulkan but the error remains.
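In case it helps with triage, here is a small standalone probe (a rough sketch using the plain Vulkan API, nothing taken from llama.cpp; my assumption is that the "device memory allocation limit" in the error corresponds to the maxMemoryAllocationSize the driver reports) that prints the memory heaps and the per-allocation limit RADV exposes for this card:

```cpp
// Standalone Vulkan probe: prints each device's per-allocation limit
// (VkPhysicalDeviceMaintenance3Properties::maxMemoryAllocationSize) and its
// memory heap sizes. Build with: g++ probe.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_1; // properties2/maintenance3 are core in 1.1
    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceMaintenance3Properties maint3 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES };
        VkPhysicalDeviceProperties2 props2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
        props2.pNext = &maint3;
        vkGetPhysicalDeviceProperties2(dev, &props2);

        VkPhysicalDeviceMemoryProperties mem;
        vkGetPhysicalDeviceMemoryProperties(dev, &mem);

        printf("%s: maxMemoryAllocationSize = %llu MiB\n",
               props2.properties.deviceName,
               (unsigned long long)(maint3.maxMemoryAllocationSize >> 20));
        for (uint32_t i = 0; i < mem.memoryHeapCount; ++i) {
            printf("  heap %u: %llu MiB%s\n", i,
                   (unsigned long long)(mem.memoryHeaps[i].size >> 20),
                   (mem.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) ? " (device local)" : "");
        }
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

I can post its output here if that would be useful.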