Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
What is the issue?
Can't load the llama 3.1 405b model.
Config:
- CPU: Intel i7-9750H
- Memory: 32768MB RAM
- Disk: 1TB + 1TB
OS: Windows
GPU: Nvidia
CPU: Intel
Ollama version: 0.2.8
I had the same problem!
I'm guessing your 32GB is way too feeble to load this model.
@thany has hit the nail on the head, there's no way you are going to load a 231G model into 32G RAM + 12 to 24G VRAM. You could try setting up a 250G swapfile and see how that works, but calling that slow would be an understatement.
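Back-of-the-envelope, using only the numbers above: 231G of weights minus (32G RAM + 24G VRAM) leaves roughly 175G that has to live somewhere, so a 250G swapfile gives headroom for the OS and the KV cache on top of the shortfall.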
I have the same problem. My laptop has a 7845HX and 64G of memory.
I have 98 GB of RAM and the same error. I used to run Falcon 180B; it worked, slowly, but it worked.
ollama run llama3.1:405b
Error: llama runner process has terminated: error loading model: unable to allocate backend buffer
32GB RAM here, also faced this problem; unable to run it.
falcon 180b is 101G, so yes, that will fit in 98G RAM with some spillover to swap. If you create another 150G swapfile, then you could run llama3.1:405b, but it will be very slow.
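For the sizing: 98G RAM + 150G swap is about 248G, just enough for the 231G of 405B weights plus KV cache and OS overhead, with most of it paged to disk, which is why it would be so slow.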
We've added code to prevent loading models that can't fit within system free memory + swap + available VRAM, so I'm a little surprised that check didn't kick in and block the load. I'd love to see the server log to better understand why we thought we could load it, but then proceeded to crash when trying to load.
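For reference, the check is conceptually just a sum-and-compare; a minimal sketch of the idea (not the actual code, and the real version also accounts for per-GPU overheads):

```go
// Simplified sketch of the "will it fit?" precheck described above.
// All names and numbers are illustrative, not the actual implementation.
package main

import "fmt"

type resources struct {
	freeRAM  uint64 // free system memory, bytes
	freeSwap uint64 // free swap, bytes
	freeVRAM uint64 // free VRAM across GPUs, bytes
}

// fits reports whether a model of the given size can plausibly be loaded.
func fits(modelSize uint64, r resources) bool {
	return modelSize <= r.freeRAM+r.freeSwap+r.freeVRAM
}

func main() {
	r := resources{freeRAM: 32 << 30, freeSwap: 16 << 30, freeVRAM: 12 << 30}
	model := uint64(231) << 30 // ~231 GiB of quantized 405B weights
	if !fits(model, r) {
		fmt.Println("Error: model requires more memory than is available")
	}
}
```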
so uhhh, do i need 256GB of RAM? edit: nvm, seems like it needs 800GB of RAM lol
@dhiltgen The check was removed or altered because rpi8 couldn't load them, but it's still Linux-specific.
See the relevant code: "model request too large for system".
On Windows no such adjustment exists, so it tries to load the entire model, which obviously fails when the memory isn't available for the split it attempted. The best it does is disable mmap when using CUDA, as mentioned in the comment: Windows CUDA should not use mmap for best performance.
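To make the platform asymmetry concrete, the behavior amounts to a gate like this (illustrative Go with made-up names, not the actual Ollama code):

```go
// Illustrative sketch of the Windows+CUDA mmap gate mentioned above;
// hypothetical names, not the real Ollama code.
package main

import (
	"fmt"
	"runtime"
)

func useMmap(cudaInUse bool) bool {
	// On Windows with CUDA, mmap is disabled for performance, so the whole
	// model must be read into memory up front. That is exactly what fails
	// when free RAM + VRAM is smaller than the model.
	if runtime.GOOS == "windows" && cudaInUse {
		return false
	}
	return true
}

func main() {
	fmt.Println("use mmap:", useMmap(true))
}
```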
I think this needs to be refactored to optimize how Ollama passes parameters to the llama.cpp server. The same thing happens when calculating the required size: it usually computes the wrong value because it checks available memory rather than how much it is actually allowed to allocate. That leads it to offload many layers to CPU when running 8B and 12B models on 8GB and 12GB VRAM devices, causing a slowdown compared to previous versions. Combine this with scheduler problems (the entire model being evicted and reloaded into memory on each request) and you get another set of problems that should go away once these are fixed.
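Schematically, the estimate that goes wrong is the GPU layer count; something like this (hypothetical names and overheads, the real calculation is more involved):

```go
// Schematic of a GPU layer-count estimate; all names and overheads here
// are illustrative, not Ollama's real sizing logic.
package main

import "fmt"

// gpuLayers estimates how many layers fit in VRAM after fixed overheads.
// Underestimating freeVRAM here pushes layers onto the CPU.
func gpuLayers(freeVRAM, overhead, kvCache, bytesPerLayer uint64, totalLayers int) int {
	if freeVRAM <= overhead+kvCache {
		return 0
	}
	n := int((freeVRAM - overhead - kvCache) / bytesPerLayer)
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	// An 8B-class model on an 8 GiB card: if the free-VRAM probe reads low,
	// the result drops below the total layer count and layers spill to CPU.
	fmt.Println(gpuLayers(8<<30, 1<<30, 1<<30, 160<<20, 33))
}
```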
When llama.cpp binaries are run directly, they can spill into unused VRAM and load the entire model along with the KV cache, so it's not an issue with llama.cpp: the CUDA driver can push unused pointers into main (shared) RAM when a program wants to allocate more space. The default policy allows allocating up to the shared-memory limit, and the reason most inference engines don't rely on that is that it performs about the same as running CPU inference for the offloaded layers. In this case, however, the dynamic free-space calculation differs from user to user, so some layers end up in CPU RAM, resulting in poor performance where the layers are stitched together.
We get that Ollama wants to support both:
- Multi-modal loading
- Parallel request processing

It lacks:
- A proper scheduler for requests
- A proper scheduler for swapping models in VRAM
- Batching in the front layer
The current architecture is static in nature: it loads models according to the current memory-allocation patterns. We need a more dynamic approach to how requests are queued and executed alongside the scheduled model. There needs to be some sort of map that synchronizes this behavior. The current implementation built on Go timers is okay as long as the system stays static; it simply lacks common ground between request processing and scheduling.
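Something with this shape, purely as an illustration of the "synchronizing map" idea (not a concrete API proposal):

```go
// Purely illustrative: a shared registry that gives request handling and
// model swapping one source of truth about what is resident in memory.
package main

import "sync"

type modelState struct {
	loaded  bool
	pending []chan struct{} // requests waiting for this model to load
}

type registry struct {
	mu     sync.Mutex
	models map[string]*modelState
}

// acquire returns a channel that is closed once the model is resident, so
// the scheduler and the request path coordinate through the same map.
func (r *registry) acquire(name string) <-chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()
	st, ok := r.models[name]
	if !ok {
		st = &modelState{}
		r.models[name] = st
	}
	ready := make(chan struct{})
	if st.loaded {
		close(ready)
		return ready
	}
	st.pending = append(st.pending, ready)
	return ready
}

func main() {
	r := &registry{models: map[string]*modelState{}}
	ready := r.acquire("llama3.1:405b")
	_ = ready // a loader goroutine would close(ready) once weights are resident
}
```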
I'm not sure how the Ollama team will proceed with current development and these issues arising from parts working independently of each other with side effects, but the current roadmap definitely needs some RFCs on organizing structural integrity between different parts of the codebase to actually address possible solutions.
Related issues:
- llama3.1: #6008
- Gemma2 layers offload issue: #5821
- Parallel issues: #5756 & #5557 (please see my last comment on the scheduler issue)

Reported incidents:
- Gemma2: #5843
I had the same problem!
The same problem: 32GB RAM, 12GB VRAM.
I have the same problem. Is there any better solution?
@mxmp210 thanks. I agree the current setup with the scheduler isn't optimal with respect to memory allocation. We're working on refactoring how we leverage llama.cpp, shifting to a Go-based server that hooks into the C++ code at a lower level, which should give us better control over memory allocations.
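For anyone curious what "hooking into the C++ code at a lower level" looks like in practice, it means a cgo boundary; a toy sketch with a stand-in C function (not a real llama.cpp symbol, and not the actual integration):

```go
// Toy cgo sketch: a Go server calling a C/C++ layer in-process instead of
// talking to a separate llama.cpp server. The C function is a stand-in,
// not a real llama.cpp entry point.
package main

/*
#include <stdlib.h>

// stand-in for a lower-level model loader
static int load_model(const char* path) { return path != 0 ? 0 : -1; }
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func loadModel(path string) error {
	cpath := C.CString(path)
	defer C.free(unsafe.Pointer(cpath))
	if rc := C.load_model(cpath); rc != 0 {
		return fmt.Errorf("load_model failed with code %d", rc)
	}
	return nil
}

func main() {
	fmt.Println(loadModel("llama3.1-405b.gguf"))
}
```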
Same problem. I loaded dolphin-ollama3:70b on my system, which has:
- 1TB storage
- Radeon RX550 Series
- 8GB RAM
- Intel Core i5 (3.2 GHz)
- Windows 10 Pro (22H2)
@YigitOzdemir34

> 8GB Ram

Bruh
@YigitOzdemir34 with the latest release of Ollama we should detect this as a model that is too large for your system and fail fast with a better error message instead of trying to load it and crashing. You're trying to load a ~40G model onto a GPU which I believe has 2G of VRAM, plus 8G of system RAM. You'll need to run much smaller models on your setup.
Tried to test the llama 3.1 405b model.
13900K, 4080 with 16GB VRAM + 128GB RAM (haven't tried to increase the swapfile yet); I have the same problem here.
UPD: Increased the swapfile size to 50-200GB. The model started slowly; after 2 minutes, it initialized successfully. The prompt "1+1" took about 2-3 minutes to answer.
It seems that this is a RAM issue. I freed up some memory, and the problem was resolved.