
Support CPUs without AVX

Open jmorganca opened this issue 1 year ago • 3 comments

Currently, CPU instruction sets are chosen at build time, meaning Ollama has to target the instruction sets supported by the broadest possible range of CPUs. Instead, CPU features should be detected at runtime, allowing for both speed and compatibility with older or less powerful CPUs.
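For reference, this kind of runtime detection is straightforward in Go with the golang.org/x/sys/cpu package. A minimal sketch (illustrative only, not Ollama's actual code):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// The feature flags are populated at program startup on amd64 and
	// report whether the running CPU supports each instruction set.
	fmt.Println("AVX:    ", cpu.X86.HasAVX)
	fmt.Println("AVX2:   ", cpu.X86.HasAVX2)
	fmt.Println("AVX512F:", cpu.X86.HasAVX512F)
}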

jmorganca avatar Nov 26 '23 21:11 jmorganca

Great news! Can't wait. I can't afford to change computers, I have to make do with my old processors. I hope to be able to run Ollama on them soon. Thanks Jeffrey!

JRM73 avatar Nov 28 '23 09:11 JRM73

For anyone wondering, here's how you can manually disable AVX to build Ollama.

$ git diff
diff --git a/llm/llama.cpp/generate_linux.go b/llm/llama.cpp/generate_linux.go
index ce9e78a..77c9795 100644
--- a/llm/llama.cpp/generate_linux.go
+++ b/llm/llama.cpp/generate_linux.go
@@ -14,13 +14,13 @@ package llm
 //go:generate git submodule update --force gguf
 //go:generate git -C gguf apply ../patches/0001-copy-cuda-runtime-libraries.patch
 //go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
-//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
+//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
 //go:generate cmake --build gguf/build/cpu --target server --config Release
 //go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner

 //go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
 //go:generate cmake --build ggml/build/cuda --target server --config Release
 //go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
-//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0
+//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0
 //go:generate cmake --build gguf/build/cuda --target server --config Release
 //go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner
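After applying a patch like the one above, rebuilding follows the repository's normal source build from that era (roughly go generate ./... followed by go build .), and the resulting build no longer requires AVX.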

jyap808 avatar Dec 12 '23 20:12 jyap808

I was trying to run Ollama on an Intel® Pentium® Silver N6005 (released in 2021!) and it apparently does not support AVX, so Ollama doesn't work. So it's definitely something that affects newer processors as well.

Compiling from scratch as per the README file does work.

2024/01/15 23:59:10 cpu_common.go:18: CPU does not have vector extensions

khromov avatar Jan 15 '24 23:01 khromov

With release 0.1.21 we now support multiple CPU-optimized variants of the LLM library. The system will auto-detect the capabilities of the CPU and select one of AVX2, AVX, or unoptimized. This works on Linux, macOS, and Windows. In particular, the unoptimized variant now works under Rosetta.
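Conceptually the selection is just a priority check on the detected features. A rough Go sketch (the variant names here are made up for illustration and are not necessarily Ollama's actual library layout):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// pickVariant returns which prebuilt LLM library variant to load,
// preferring the most capable instruction set the running CPU supports.
// The variant names are illustrative only.
func pickVariant() string {
	switch {
	case cpu.X86.HasAVX2:
		return "cpu_avx2"
	case cpu.X86.HasAVX:
		return "cpu_avx"
	default:
		return "cpu" // unoptimized fallback, e.g. under Rosetta
	}
}

func main() {
	fmt.Println("selected variant:", pickVariant())
}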

dhiltgen avatar Jan 20 '24 23:01 dhiltgen

> With release 0.1.21 we now support multiple CPU-optimized variants of the LLM library. The system will auto-detect the capabilities of the CPU and select one of AVX2, AVX, or unoptimized. This works on Linux, macOS, and Windows. In particular, the unoptimized variant now works under Rosetta.

Hello. Is this also true for the Docker image? I'm not 100% sure that my issue is related, but I tried to debug and the Docker container crashed with an error linked to CPU instructions. My Intel G6400 does not have AVX or AVX2, but it does have SSE 4.1 and 4.2. Could it be linked to bad detection of the instruction sets it supports? https://github.com/jmorganca/ollama/issues/2122 Edit: looking at the release date of the Docker image (11 days ago), it must be using a version older than 0.1.21, which does not include this enhancement.

GuiPoM avatar Jan 22 '24 21:01 GuiPoM

We haven't pushed an official updated image yet, since 0.1.21 is still a pre-release while we squash a few final bugs.

If you're eager to try it out, I've pushed an image up to Docker Hub at dhiltgen/ollama:0.1.21-rc3

dhiltgen avatar Jan 24 '24 00:01 dhiltgen

> We haven't pushed an official updated image yet, since 0.1.21 is still a pre-release while we squash a few final bugs.
>
> If you're eager to try it out, I've pushed an image up to Docker Hub at dhiltgen/ollama:0.1.21-rc3

Thank you! That's very kind of you. Is it normal for there to be such an increase in size between rc2 and rc3? It goes from ~500 MB to ~5 GB. I'll try to deploy the image tonight; currently my Portainer instance is crashing with a timeout, probably related to the image size, so I'll have to test it locally.

GuiPoM avatar Jan 24 '24 09:01 GuiPoM

@GuiPoM we've recently added ROCm support to the container image, which required switching the base layer to include the ROCm libraries, which unfortunately are quite large. We'd prefer to have a single image that works for both NVIDIA and Radeon cards, but if this size increase is too much for your use-case, please open a new issue so we can track it.

dhiltgen avatar Jan 24 '24 19:01 dhiltgen

> @GuiPoM we've recently added ROCm support to the container image, which required switching the base layer to include the ROCm libraries, which unfortunately are quite large. We'd prefer to have a single image that works for both NVIDIA and Radeon cards, but if this size increase is too much for your use-case, please open a new issue so we can track it.

No, that's okay, but for testing a CPU-only scenario this is huge, even on my fiber connection. By the way, thanks to this rc3 image I managed to get Ollama starting as a Docker container on a non-AVX processor, so I can confirm that this image works great.

GuiPoM avatar Jan 24 '24 22:01 GuiPoM

@GuiPoM if you just need CPU only, you could grab the binary directly from the GitHub release page and stick that into ~any modern container image base.

A simple Dockerfile like this would work:

FROM ubuntu:latest
ADD --chmod=655 https://github.com/ollama/ollama/releases/download/v0.1.21/ollama-linux-amd64 /bin/ollama
ENTRYPOINT ["/bin/ollama"]
CMD ["serve"]
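From there it's the standard Docker flow, e.g. docker build -t ollama-cpu . followed by docker run -d -p 11434:11434 ollama-cpu (the ollama-cpu tag is just an example name), with 11434 being Ollama's default API port.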

dhiltgen avatar Jan 24 '24 23:01 dhiltgen