Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
What is the issue?
Can't load the llama 3.1 405b model.
Config:
- CPU: Intel i7-9750H
- Memory: 32768MB RAM
- Disk: 1TB + 1TB
OS: Windows
GPU: Nvidia
CPU: Intel
Ollama version: 0.2.8
I had the same problem!
I'm guessing your 32GB is way too feeble to load this model.
@thany has hit the nail on the head, there's no way you are going to load a 231G model into 32G RAM + 12 to 24G VRAM. You could try setting up a 250G swapfile and see how that works, but calling that slow would be an understatement.
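Back-of-the-envelope, using only the numbers above: 231G of weights minus (32G RAM + 24G VRAM) leaves roughly 175G that has to live somewhere, so a 250G swapfile gives headroom for the OS and the KV cache on top of the shortfall.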
I have the same problem. My laptop has a 7845HX and 64G of memory.
I have 98 GB of RAM and the same error. I used to run Falcon 180B; it worked, slowly, but it worked.
ollama run llama3.1:405b
Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
32GB RAM here, also faced this problem; unable to run it.
falcon 180b is 101G, so yes, that will fit in 98G RAM with some spillover to swap. If you create another 150G swapfile, then you could run llama3.1:405b, but it will be very slow.
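For the sizing: 98G RAM + 150G swap is about 248G, just enough for the 231G of 405B weights plus KV cache and OS overhead, with most of it paged to disk, which is why it would be so slow.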
We've added code to prevent loading models that can't fit within system free memory + swap + available VRAM, so I'm a little surprised that check didn't kick in and block the load. I'd love to see the server log to better understand why we thought we could load it, but then proceeded to crash when trying to load.
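For reference, the check is conceptually just a sum-and-compare; a minimal sketch of the idea (not the actual code, and the real version also accounts for per-GPU overheads):

```go
// Simplified sketch of the "will it fit?" precheck described above.
// All names and numbers are illustrative, not the actual implementation.
package main

import "fmt"

type resources struct {
	freeRAM  uint64 // free system memory, bytes
	freeSwap uint64 // free swap, bytes
	freeVRAM uint64 // free VRAM across GPUs, bytes
}

// fits reports whether a model of the given size can plausibly be loaded.
func fits(modelSize uint64, r resources) bool {
	return modelSize <= r.freeRAM+r.freeSwap+r.freeVRAM
}

func main() {
	r := resources{freeRAM: 32 << 30, freeSwap: 16 << 30, freeVRAM: 12 << 30}
	model := uint64(231) << 30 // ~231 GiB of quantized 405B weights
	if !fits(model, r) {
		fmt.Println("Error: model requires more memory than is available")
	}
}
```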
so uhhh, do i need 256GB of RAM? edit: nvm, seems like it needs 800GB of RAM lol
@dhiltgen The check was removed or altered because rpi8 couldn't load them, but it's still Linux-specific.
See the relevant code: "model request too large for system".
On Windows no such adjustment exists, so it tries to load the entire model, which obviously fails when the memory isn't available for the split it attempted. The best it does is disable mmap when using CUDA, as mentioned in the comment: Windows CUDA should not use mmap for best performance.
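To make the platform asymmetry concrete, the behavior amounts to a gate like this (illustrative Go with made-up names, not the actual Ollama code):

```go
// Illustrative sketch of the Windows+CUDA mmap gate mentioned above;
// hypothetical names, not the real Ollama code.
package main

import (
	"fmt"
	"runtime"
)

func useMmap(cudaInUse bool) bool {
	// On Windows with CUDA, mmap is disabled for performance, so the whole
	// model must be read into memory up front. That is exactly what fails
	// when free RAM + VRAM is smaller than the model.
	if runtime.GOOS == "windows" && cudaInUse {
		return false
	}
	return true
}

func main() {
	fmt.Println("use mmap:", useMmap(true))
}
```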
I think this needs to be refactored to optimize how Ollama passes parameters to the llama.cpp server. The same thing happens when calculating the required size: it usually computes the wrong value because it checks available memory rather than how much it is actually allowed to allocate. That leads it to offload many layers to CPU when running 8B and 12B models on 8GB and 12GB VRAM devices, causing a slowdown compared to previous versions. Combine this with scheduler problems (the entire model being evicted and reloaded into memory on each request) and you get another set of problems that should go away once these are fixed.
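Schematically, the estimate that goes wrong is the GPU layer count; something like this (hypothetical names and overheads, the real calculation is more involved):

```go
// Schematic of a GPU layer-count estimate; all names and overheads here
// are illustrative, not Ollama's real sizing logic.
package main

import "fmt"

// gpuLayers estimates how many layers fit in VRAM after fixed overheads.
// Underestimating freeVRAM here pushes layers onto the CPU.
func gpuLayers(freeVRAM, overhead, kvCache, bytesPerLayer uint64, totalLayers int) int {
	if freeVRAM <= overhead+kvCache {
		return 0
	}
	n := int((freeVRAM - overhead - kvCache) / bytesPerLayer)
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	// An 8B-class model on an 8 GiB card: if the free-VRAM probe reads low,
	// the result drops below the total layer count and layers spill to CPU.
	fmt.Println(gpuLayers(8<<30, 1<<30, 1<<30, 160<<20, 33))
}
```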
When llama.cpp binaries are run directly, they can spill into unused VRAM and load the entire model along with the KV cache, so it's not an issue with llama.cpp: the CUDA driver can push unused pointers into main (shared) RAM when a program wants to allocate more space. The default policy allows allocating up to the shared-memory limit, and the reason most inference engines don't rely on that is that it performs about the same as running CPU inference for the offloaded layers. In this case, however, the dynamic free-space calculation differs from user to user, so some layers end up in CPU RAM, resulting in poor performance where the layers are stitched together.
We get that Ollama wants to support both:
- Multi-modal loading
- Parallel request processing

It lacks:
- A proper scheduler for requests
- A proper scheduler for swapping models in VRAM
- Batching in the front layer
The current architecture is static in nature: it loads models according to the current memory-allocation patterns. We need a more dynamic approach to how requests are queued and executed alongside the scheduled model. There needs to be some sort of map that synchronizes this behavior. The current implementation built on Go timers is okay as long as the system stays static; it simply lacks common ground between request processing and scheduling.
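Something with this shape, purely as an illustration of the "synchronizing map" idea (not a concrete API proposal):

```go
// Purely illustrative: a shared registry that gives request handling and
// model swapping one source of truth about what is resident in memory.
package main

import "sync"

type modelState struct {
	loaded  bool
	pending []chan struct{} // requests waiting for this model to load
}

type registry struct {
	mu     sync.Mutex
	models map[string]*modelState
}

// acquire returns a channel that is closed once the model is resident, so
// the scheduler and the request path coordinate through the same map.
func (r *registry) acquire(name string) <-chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()
	st, ok := r.models[name]
	if !ok {
		st = &modelState{}
		r.models[name] = st
	}
	ready := make(chan struct{})
	if st.loaded {
		close(ready)
		return ready
	}
	st.pending = append(st.pending, ready)
	return ready
}

func main() {
	r := &registry{models: map[string]*modelState{}}
	ready := r.acquire("llama3.1:405b")
	_ = ready // a loader goroutine would close(ready) once weights are resident
}
```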
I'm not sure how the Ollama team will proceed with current development and these issues arising from parts working independently of each other with side effects, but the current roadmap definitely needs some RFCs on organizing structural integrity between different parts of the codebase to actually address possible solutions.
Related issues:
- llama3.1: #6008
- Gemma2 layers offload issue: #5821
- Parallel issues: #5756 & #5557 (please see my last comment on the scheduler issue)

Reported incidents:
- Gemma2: #5843
I had the same problem!
The same problem: 32GB RAM, 12GB VRAM.
I have the same problem. Is there any better solution?
@mxmp210 thanks. I agree the current setup with the scheduler isn't optimal with respect to memory allocation. We're working on refactoring how we leverage llama.cpp, shifting to a Go-based server that hooks into the C++ code at a lower level, which should give us better control over memory allocations.
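For anyone curious what "hooking into the C++ code at a lower level" looks like in practice, it means a cgo boundary; a toy sketch with a stand-in C function (not a real llama.cpp symbol, and not the actual integration):

```go
// Toy cgo sketch: a Go server calling a C/C++ layer in-process instead of
// talking to a separate llama.cpp server. The C function is a stand-in,
// not a real llama.cpp entry point.
package main

/*
#include <stdlib.h>

// stand-in for a lower-level model loader
static int load_model(const char* path) { return path != 0 ? 0 : -1; }
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func loadModel(path string) error {
	cpath := C.CString(path)
	defer C.free(unsafe.Pointer(cpath))
	if rc := C.load_model(cpath); rc != 0 {
		return fmt.Errorf("load_model failed with code %d", rc)
	}
	return nil
}

func main() {
	fmt.Println(loadModel("llama3.1-405b.gguf"))
}
```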
Same problem. I loaded dolphin-ollama3:70b on my system, which has:
- 1TB storage
- Radeon RX550 Series
- 8GB RAM
- Intel Core i5 (3.2 GHz)
- Windows 10 Pro (22H2)
@YigitOzdemir34

> 8GB Ram

Bruh
@YigitOzdemir34 with the latest release of Ollama we should detect this as a model that is too large for your system and fail fast with a better error message instead of trying to load it and crashing. You're trying to load a ~40G model onto a GPU which I believe has 2G of VRAM, plus 8G of system RAM. You'll need to run much smaller models on your setup.
Tried to test the llama 3.1 405b model.
13900K, 4080 with 16GB VRAM + 128GB RAM (haven't tried to increase the swapfile yet); I have the same problem here.
UPD: Increased the swapfile size to 50-200GB. The model started slowly; after 2 minutes, it initialized successfully. The prompt "1+1" took about 2-3 minutes to answer.
It seems that this is a RAM issue. I freed up some memory, and the problem was resolved.