Misc. bug: Vulkan premature out-of-memory exception on AMD Instinct MI60
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4615 (bfcce4d6)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Ubuntu 24.04.
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99
Problem description & steps to reproduce
Hello,
The AMD Instinct MI60 cards have 32 GB of VRAM. With ROCm I can use the whole 32 GB, but with Vulkan it seems that one llama-server instance can access only 16 GB. I tested this with the Qwen 2.5 7B 1M model (which supports context lengths of up to 1 million tokens) and I cannot start it with a context of more than 71K. At the same time, I can start two instances with a 71K context length on the same card.
For example, two of these could be started at the same time: llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 71000 -ngl 99
However, if I try to start just one with a 72K context I get the following error:
llama_init_from_model: KV self size = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
ggml_vulkan: Device memory allocation of size 4305588224 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4305588224
llama_init_from_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
srv load_model: failed to load model, '/root/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf'
I did try disabling the size check in ggml-vulkan.cpp and was able to increase the context length to 220K while using only 86% of the VRAM. It appeared to work, but the output turned to gibberish once the context exceeded 71K.
I tried different versions of Vulkan but the error remains.
Vulkan doesn't currently support more than 4GB in a single buffer, so if this large context size causes such an allocation then it's expected to fail.
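In this case the failing allocation of 4305588224 bytes is roughly 4.01 GiB, just over the 4 GiB (4294967296-byte) per-buffer limit, which is presumably why a 71K context still fits but 72K does not.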
Refs https://github.com/KhronosGroup/Vulkan-Docs/issues/1016
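If you want to see what limits your driver actually reports, something along these lines should show them on Linux (assuming vulkaninfo is installed; the exact property names can vary between tool versions):
vulkaninfo 2>/dev/null | grep -iE "maxMemoryAllocationSize|maxBufferSize"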
My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.
Thank you for your replies. Maybe there are some ways to use multiple buffers of 4 GB? Any other suggestions on how to optimally use these cards would be appreciated. My first choice was the ROCm HIP build of llama.cpp, but it over-provisions VRAM heavily: it loads the model and KV cache into VRAM, and memory usage then keeps growing as the prompt is processed. So while the Vulkan build handled a 71K context using only 50% of the VRAM, the HIP build had filled all 32 GB after processing about 35K of context.
> My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.
You could try setting the env var GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something higher, but it may not work if the driver isn't claiming support.
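A minimal sketch of how it could be set per run (the 4294967296 value is in bytes and purely illustrative, and model.gguf is a placeholder; the driver may still refuse allocations it doesn't really support):
# Linux: set the override only for this invocation
GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296 llama-server -m model.gguf -c 72000 -ngl 99
# Windows (cmd): set it for the current shell session before launching llama-server
set GGML_VK_FORCE_MAX_ALLOCATION_SIZE=4294967296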
> Maybe there are some ways to use multiple buffers of 4 GB?
I don't think so. Part of the problem is that Vulkan compilers can assume bindings are <4GB and use 32b addressing math. So even if you made a huge sparse buffer or something it probably wouldn't work. Unless there's an algorithmic change to split the kv cache?
Another option might be to use a quantized format to decrease the size of the kv cache (e.g. -ctk q8_0 -ctv q8_0 -fa) but it requires flash attention support which is currently not accelerated on AMD GPUs.
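With the model from this thread, that would look roughly like the line below (q8_0 cache types roughly halve the KV cache relative to f16; whether this helps in practice depends on the missing flash-attention acceleration mentioned above):
llama-server -m ~/llamamodels/Qwen2-7B-Instruct/Qwen2.5-7B-Instruct-1M-Q8_0.gguf -c 72000 -ngl 99 -fa -ctk q8_0 -ctv q8_0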
> My buffer is only getting 2 GB on Windows with my RX 6900 XT. It should be capable of 4 GB.
> You could try setting the env var GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something higher, but it may not work if the driver isn't claiming support.
Is this what you are suggesting? I used the property from vulkaninfo, "maxMemoryAllocationCount = 4294967295", and assigned a Windows system variable using that value.
Looks like I have increased the buffer; the previous max was about 30-40K tokens.
new prompt, n_ctx_slot = 82944, n_keep = 0, n_prompt_tokens = 25508
I'm not sure if the system variable was the change required or not but I'll fiddle with it in the morning to see. Is there a way to test this without just stuffing it manually?
It turns out that a setting in the "Vulkan Configurator" caused the problem. With it enabled I was getting:
VUID-VkBufferCreateInfo-size-06409(ERROR / SPEC): msgNum: -332260500 - Validation Error:
[ VUID-VkBufferCreateInfo-size-06409 ] | MessageID = 0xec321b6c |
vkCreateBuffer(): pCreateInfo->size (3414061312) is larger than the maximum allowed buffer size VkPhysicalDeviceMaintenance4Properties.maxBufferSize (2147483648).
The Vulkan spec states: size must be less than or equal to VkPhysicalDeviceMaintenance4Properties::maxBufferSize
I changed the setting to "Layers Controlled by the Vulkan Applications" which solved the problem.
I am getting the same issue but I can't find that setting in the Vulkan Configurator... Yours is something like this?
> I am getting the same issue but I can't find that setting in the Vulkan Configurator... Yours is something like this?
I have v2.6.2; it appears to be a little different.
Yeah, the latest version is different. But on another computer I had v2.6.2 of the Vulkan Configurator and it still doesn't fix it on mine. I have 16 GB available and get the complaint at:
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4471128064
A pity Vulkan does not work above 4 GB; it would enable support for the AMD integrated graphics cards...
OK, I figured out that reducing n_batch below 4096 and n_ubatch below 2048 solves the issue!
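For anyone hitting the same thing, the batch sizes can be passed on the command line; something like the line below (values and the model path are illustrative; smaller n_batch/n_ubatch should shrink the per-allocation size of the Vulkan compute buffer):
llama-server -m model.gguf -c 72000 -ngl 99 -b 2048 -ub 512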
This issue was closed because it has been inactive for 14 days since being marked as stale.