rm_alloc returned 81: Out of memory
exo seems to be OOMing despite having lots of free RAM.
% cat /proc/meminfo|head
MemTotal: 16029428 kB
MemFree: 1478608 kB
MemAvailable: 10660452 kB
Buffers: 560476 kB
Cached: 10730028 kB
SwapCached: 820 kB
Active: 5688328 kB
Inactive: 7405268 kB
Active(anon): 2852260 kB
Inactive(anon): 1600664 kB
% curl http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role": "user", "content": "What is the meaning of exo?"}],
"temperature": 0.7
}'
{"detail": "Error processing prompt (see logs with DEBUG>=2): rm_alloc returned 81: Out of memory"}%
If I read the README.md correctly, I should only need 16 GB of RAM to run this model, so with over 10 GB of RAM still available I shouldn't be OOMing.
What device are you running on?
Arch Linux, AMD Ryzen 7 9700X, RTX 3080 Ti
I think I have the same issue. I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU, and I got the same Out of memory error.
BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM), at a speed of 0.3 tokens/sec.
Sounds like exo is trying to use only VRAM and run inference entirely on the GPU instead of doing hybrid inference or falling back to CPU-only inference. I ran into the same issue today.
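If that's what is happening, it would also explain the RTX 3080 Ti report above: that card has 12 GB of VRAM, which is less than the ~16 GB the README lists for llama-3.1-8b, even though the machine has plenty of system RAM. One way to check (this is just my guess at the failure mode, I haven't traced it in the code) is to watch VRAM usage while the prompt is being processed:

% nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 1

If memory.used climbs to the card's limit right before the rm_alloc error, the allocation is happening on the GPU and the free system RAM never comes into play.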
> I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU. I got the same Out of memory error. BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM) at a speed of 0.3 tokens/sec.
Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system seem like it had an integrated GPU only? I assume OOM may not be an issue on systems with integrated graphics/unified memory, because the GPU and CPU share the same memory.
> Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system seem like it had an integrated GPU only?
This was pretty easy. I created a Docker container with exo. Because I didn't pass the GPU into the container, it used the CPU.
Of course, I had to download the same model in both the container and the main system... In the future I will use Docker volumes to share the model between Docker and the main system.
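Something like this is what I have in mind (the image name is a placeholder, and I haven't checked exactly where exo stores its downloads, so treat the cache path as an assumption to adjust):

% docker run -it --rm --network host -v "$HOME/.cache/exo:/root/.cache/exo" exo-cpu

--network host lets the containerized node discover the node running directly on the host, the volume mount lets both reuse the same downloaded weights, and since no --gpus flag is passed the container never sees the GPU and runs on the CPU.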
I thought about implementing a Docker workaround too; it's reliable and effective, but I believe it is not ideal, since it means a network is also emulated inside the system for communication between the CPU and GPU nodes, and that imposes an overhead which can be significant even at high speeds like 10 Gbps. An ideal solution, I believe, would be to support hybrid inference on individual nodes.
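In the meantime, a possible single-node stopgap (untested, and it assumes exo's tinygrad engine picks its device from tinygrad's usual environment variables instead of hard-coding it) would be to force the CPU backend when launching:

% CLANG=1 exo

If exo overrides the device selection internally, this will do nothing, in which case proper hybrid inference support on individual nodes is the only real fix.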