rm_alloc returned 81: Out of memory
exo seems to be OOMing despite having lots of free RAM.
% cat /proc/meminfo|head
MemTotal: 16029428 kB
MemFree: 1478608 kB
MemAvailable: 10660452 kB
Buffers: 560476 kB
Cached: 10730028 kB
SwapCached: 820 kB
Active: 5688328 kB
Inactive: 7405268 kB
Active(anon): 2852260 kB
Inactive(anon): 1600664 kB
% curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role": "user", "content": "What is the meaning of exo?"}],
"temperature": 0.7
}'
{"detail": "Error processing prompt (see logs with DEBUG>=2): rm_alloc returned 81: Out of memory"}%
If I read the README.md correctly, I should only need 16 GB of RAM to run this model, so with over 10 GB of RAM still available I shouldn't be OOMing.
What device are you running on?
Arch Linux, AMD Ryzen 7 9700X, RTX 3080 Ti
I think I have the same issue. I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU, and I got the same Out of memory error.
BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM), at a speed of 0.3 tokens/sec.
Sounds like exo is trying to use only VRAM and run inference entirely on the GPU instead of doing hybrid inference or falling back to CPU-only inference. I ran into the same issue today.
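If that's what is happening, it would also explain the RTX 3080 Ti report above: that card has 12 GB of VRAM, which is less than the ~16 GB the README lists for llama-3.1-8b, even though the machine has plenty of system RAM. One way to check (this is just my guess at the failure mode, I haven't traced it in the code) is to watch VRAM usage while the prompt is being processed:

% nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 1

If memory.used climbs to the card's limit right before the rm_alloc error, the allocation is happening on the GPU and the free system RAM never comes into play.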
> I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU. I got the same Out of memory error. BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM) at a speed of 0.3 tokens/sec.
Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system seem like it had an integrated GPU only? I assume OOM may not be an issue on systems with integrated graphics/unified memory, because the GPU and CPU share the same memory.
> Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system seem like it had an integrated GPU only?
This was pretty easy. I created a Docker container with exo. Because I didn't pass the GPU into the container, it used the CPU.
Of course, I had to download the same model in both the container and the main system... In the future I will use Docker volumes to share the model between Docker and the main system.
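Something like this is what I have in mind (the image name is a placeholder, and I haven't checked exactly where exo stores its downloads, so treat the cache path as an assumption to adjust):

% docker run -it --rm --network host -v "$HOME/.cache/exo:/root/.cache/exo" exo-cpu

--network host lets the containerized node discover the node running directly on the host, the volume mount lets both reuse the same downloaded weights, and since no --gpus flag is passed the container never sees the GPU and runs on the CPU.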
I thought about implementing a Docker workaround too; it's reliable and effective, but I believe it is not ideal, since it means a network is also emulated inside the system for communication between the CPU and GPU nodes, and that imposes an overhead which can be significant even at high speeds like 10 Gbps. An ideal solution, I believe, would be to support hybrid inference on individual nodes.
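In the meantime, a possible single-node stopgap (untested, and it assumes exo's tinygrad engine picks its device from tinygrad's usual environment variables instead of hard-coding it) would be to force the CPU backend when launching:

% CLANG=1 exo

If exo overrides the device selection internally, this will do nothing, in which case proper hybrid inference support on individual nodes is the only real fix.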