Older CUDA compute capability 3.5 and 3.7 support
I recently put together an (old) physical machine with an Nvidia K80, which is only supported up to CUDA 11.4 and Nvidia driver 470. All my previous experiments with Ollama were with more modern GPUs.
I found that Ollama doesn't use the GPU at all. I cannot find any documentation on the minimum required CUDA version, or on whether it is possible to run on older CUDA versions (e.g. the K80 and V100 are still present in the cloud, such as G2 and P2 instances on AWS, and there are lots of K80s all over eBay).
EDIT: looking through the logs, it appears that the GPUs are being seen:
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:300: 24762 MB VRAM available, loading up to 162 GPU layers
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:436: starting llama runner
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:494: waiting for llama runner to start responding
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: found 3 CUDA devices:
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 0: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 1: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 2: NVIDIA GeForce GT 730, compute capability 3.5
and
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: ggml ctx size = 0.11 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: using CUDA for GPU acceleration
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: mem required = 70.46 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: VRAM used: 3577.61 MiB
but....
Jan 1 20:34:21 thinkstation-s30 ollama[911]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:21 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:21 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:22 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:22 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:459: error starting llama runner: llama runner process has terminated
What is your Linux kernel version? I think 6.x kernels don't support a lot of older Nvidia cards.
The kernel is 6+ and the setup is supported. I was able to get PyTorch working with CUDA, albeit only PyTorch 2.0.1, since that is the last version that supports CUDA 11.4.
The error 209 "no kernel image is available for execution on the device" is about CUDA, not the Linux kernel. Basically, the Ollama distribution doesn't include a kernel compiled (via nvcc) for CUDA 11.4 (and I'm not even sure that is supported if I build from source).
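To illustrate what I mean (illustrative nvcc invocations only, not Ollama's actual build commands): nvcc only embeds GPU kernels for the architectures it is asked to target, so a binary built without an sm_37 target will always hit this error on a K80. The file name kernel_test.cu is just a hypothetical example.

# Built like this, the binary only carries sm_70/sm_80 kernels, so a
# CC 3.7 card fails with "no kernel image is available for execution":
nvcc -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -o kernel_test kernel_test.cu

# Adding a CC 3.7 target (still accepted by the CUDA 11.x toolkit,
# albeit with a deprecation warning) lets the same code run on the K80:
nvcc -gencode arch=compute_37,code=sm_37 -gencode arch=compute_70,code=sm_70 -o kernel_test kernel_test.cu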
This is the same case for me. I am using a Quadro K2200. It is recognized along with its compute capability, but as soon as I pull a model, the error shows up and Ollama terminates.
The K80 is Compute Capability 3.7, which at present isn't supported by our CUDA builds. (see https://developer.nvidia.com/cuda-gpus for the mapping table)
Based on our current build setup, Compute Capability 6.0 is the minimum we'll support. We had some bugs in the detection and fallback logic in 0.1.18, which should be resolved in 0.1.19, so that if we detect anything older than 6.0 we'll fall back to CPU.
There's a possibility we may be able to support 5.x cards by compiling llama.cpp with different flags and dynamically loading the right library variant on the fly based on what we discover, but that support hasn't been merged yet.
I'm not sure yet if we can compile support going all the way back into the 3.7 series, but we'll keep this ticket tracking that.
I'd love to see that change. Owner of an old GeForce GTX 960M on amd64 Linux here. Version 0.1.18 stopped working while 0.1.17 has been working.
Can you clarify? Was 0.1.17 working on the GPU, or falling back to CPU mode?
Also to clarify, the GTX 960M is a Compute Capability 5.0 card, which we're tracking in a different ticket now #1865
You're right, I guess it was falling back to CPU mode, but I'm unsure how to read the logs correctly.
The issue you mentioned seems to be the issue I was having. Version 0.1.19 fixes it. Sorry for the noise and thanks!
but I'm unsure how to read the logs correctly.
At startup the server log will report what we discover during GPU detection and, in the case of CUDA cards, the compute capability. If we don't detect a supported GPU, we report that we're falling back to CPU mode. In the near future we'll be adding refinements to support multiple variants for a given GPU (and CPU) so we can leverage modern capabilities when detected, but also fall back to a baseline that works for older GPUs/CPUs.
Hello, same case here. I have an Nvidia K80 and Ollama works only on the CPU :(
Hi, same case here. I have an Nvidia M40 and Ollama works only on the CPU in a Docker container :(
The M40 is a Compute Capability 5.2 card, so it's covered by #1865
We're using CUDA v11 to compile our official builds. Digging around a bit, it looks like CUDA v11 no longer supports Compute Capability 3.0, but I am able to get nvcc to target 3.5 cards.
I'll work on some mods to the way we do our builds so that someone with a 3.0 card and an older CUDA toolkit might be able to build it on their own from source, but I think we may be able to get 3.5+ support into the official builds.
The K80 I referenced in my original post supports up to CUDA 11.4, which is the last version it will ever support, since it has been end-of-lifed.
PR #2116 lays the foundation for experimenting with CC 3.5 support. I'm not sure if we'll need other flags to get it working, or if simply adding "35" to the list of CMAKE_CUDA_ARCHITECTURES will be enough.
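For anyone who wants to experiment before that lands, the rough shape of the change is just adding the older architectures to the CUDA build. Here's a sketch against upstream llama.cpp rather than our actual generate scripts, so treat the exact architecture list as an assumption:

# Sketch only: build upstream llama.cpp with kernels for CC 3.5/3.7 included.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
# CMAKE_CUDA_ARCHITECTURES controls which GPU architectures nvcc emits kernels for.
cmake .. -DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES="35;37;50;52;60;61;70;75"
cmake --build . --config Release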
EDIT: I am aware that there are "resizable BAR" issues around the use of the Tesla P40, and my hardware is so ancient that it does not support resizable BAR. However, PyTorch runs just fine and I can load e.g. BigBird into the P40 and do inference. Note that my PyTorch install is 2.0.1 and also worked on the K80. PyTorch itself warns that the GT730 (CC 3.5) is not supported, and CC 3.7 is the lowest supported on 2.0.1 (which is a few years old at this point).
I replaced the K80 with a P40, which is a Compute Capability 6.1 card. The card appears in nvidia-smi and is detected in the Ollama logs:
...
Jan 25 15:26:21 thinkstation-s30 ollama[919]: ggml_init_cublas: found 2 CUDA devices:
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 0: Tesla P40, compute capability 6.1
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 1: NVIDIA GeForce GT 730, compute capability 3.5
...
However, I still get the "... no kernel ..." error; it appears to be using Device 1! It's not clear how to force the use of Device 0 (when I was using the K80 it was being selected properly). I tried the CUDA_VISIBLE_DEVICES environment variable, which had no effect.
...
Jan 25 15:26:26 thinkstation-s30 ollama[919]: llama_new_context_with_model: total VRAM used: 2258.20 MiB (model: 1456.19 MiB, context: 802.00 MiB)
Jan 25 15:26:26 thinkstation-s30 ollama[919]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:26 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:26 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
Jan 25 15:26:27 thinkstation-s30 ollama[919]: 2024/01/25 15:26:27 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:27 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:27 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
...
@orlyandico it's unfortunate that CUDA_VISIBLE_DEVICES didn't do the trick. I'll see if I can set up a test rig similar to yours and try to find a way to ignore the unsupported card.
I've also gotten DiffusionPipeline and models from HuggingFace working. It is a bit odd that torch.cuda.device_count() sometimes returns 1 (and only enumerates the P40) and sometimes 2 (it also enumerates the GT 730).
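For reference, this is the quick diagnostic I've been using to see what PyTorch enumerates at any given moment (just a throwaway snippet, nothing Ollama-specific):

python3 - <<'EOF'
import torch
# Print every CUDA device PyTorch can see, with its compute capability.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
EOF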
I've got a PR up to add support, but I'm a little concerned people might actually see a performance hit, not an improvement, by transitioning from CPU to GPU on these old cards.
Folks with these old cards: if you want to give the change a try, build from source and let me know how the performance compares before/after; that would help us weigh when/if we merge the PR.
Hi @dhiltgen
I have a GeForce 920M GPU, which is CC 3.5, and I'd like to participate in that test. Please guide me on how to compile it on Ubuntu 22.04 and how to benchmark the test with and without the GPU.
I appreciate your contributions and your efforts to support these older GPUs.
Thanks @felipecock
Check out https://github.com/ollama/ollama/blob/main/docs/development.md for instructions, and if you get stuck, the community on Discord can lend a hand.
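For a rough before/after comparison, something along these lines should work (the model name and prompt are just placeholders; setting num_gpu to 0 forces CPU-only inference so you can compare the reported eval rates):

# GPU (default) run; --verbose prints token/s timings when the response finishes.
ollama run llama2 --verbose "Explain how a CPU cache works."

# CPU-only run via the API, disabling GPU offload with num_gpu=0.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain how a CPU cache works.",
  "options": { "num_gpu": 0 }
}'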
Hello @dhiltgen
Is there any possibility of getting Ollama to work with the Nvidia K80 in the next few days, or should we abandon this idea?
@nejib1 if you apply the changes of my PR as a patch to the repo and build from source, it will run on a K80 GPU. Instructions on building from source are here
Given the concerns we have that this might actually result in a performance regression not improvement for users, we're going to hold off merging this until we get more performance data.
Thank you very much, I'll try it
I am having similar issues trying to run Ollama Web UI with my RTX A4000 16GB GPU. When I run standard Ollama, it uses my GPU just fine. When I install Ollama Web UI, I get errors (from a full clean Ubuntu install, with all NVIDIA drivers and the container toolkit installed).
Ollama Web UI commands
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml -f docker-compose.gpu.yaml up
Traceback (most recent call last):
File "/usr/bin/docker-compose", line 33, in
When I just run the CPU-only yaml, everything works fine:
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml up
ollama is up-to-date
ollama-webui is up-to-date
Attaching to ollama, ollama-webui
ollama | Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
ollama | Your new public key is:
ollama |
ollama | ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEi4k2WvzJB4+o3PMQTvhq1M2ci6JnEYfDUiH6Dl6k+k
ollama |
ollama | 2024/01/29 02:09:40 images.go:857: INFO total blobs: 0
ollama | 2024/01/29 02:09:40 images.go:864: INFO total unused blobs removed: 0
ollama | 2024/01/29 02:09:40 routes.go:950: INFO Listening on [::]:11434 (version 0.1.22)
ollama | 2024/01/29 02:09:40 payload_common.go:106: INFO Extracting dynamic libraries...
ollama | 2024/01/29 02:09:42 payload_common.go:145: INFO Dynamic LLM libraries [cpu cuda_v11 cpu_avx rocm_v5 rocm_v6 cpu_avx2]
ollama | 2024/01/29 02:09:42 gpu.go:94: INFO Detecting GPU type
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library librocm_smi64.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 cpu_common.go:11: INFO CPU has AVX2
ollama | 2024/01/29 02:09:42 routes.go:973: INFO no GPU detected
ollama-webui | start.sh: 3: Bad substitution
ollama-webui | INFO: Started server process [1]
ollama-webui | INFO: Waiting for application startup.
ollama-webui | INFO: Application startup complete.
ollama-webui | INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
@tbendien an RTX A4000 is a modern GPU with Compute Capability 8.6. Let's keep this ticket focused on support for much older cards with CC 3.5 and 3.7. Folks can help troubleshoot on Discord, or you can open a new issue.
@dhiltgen I've performed a test with and without the GPU:
CPU only: Intel Core i7-5500U - ollama:main branch (times appear to be in ns)
{"model":"llama2:latest","created_at":"2024-01-31T22:24:33.848173925Z","message":{"role":"assistant","content":""},"done":true,"total_duration":330940056957,"load_duration":3067744651,"prompt_eval_count":457,"prompt_eval_duration":227370727000,"eval_count":157,"eval_duration":100501014000}
GPU: GeForce 920M @ 4 GB (it only reached about 33% GPU utilization during roughly the first minute, and dedicated memory doesn't seem to be used) + CPU: Intel Core i7-5500U (at 100% most of the time) - ollama:cc_3.5 branch
llama_print_timings: load time = 2001.26 ms
llama_print_timings: sample time = 168.67 ms / 175 runs (0.96 ms per token, 1037.54 tokens per second)
llama_print_timings: prompt eval time = 110295.28 ms / 154 tokens (716.20 ms per token, 1.40 tokens per second)
llama_print_timings: eval time = 198530.10 ms / 174 runs (1140.98 ms per token, 0.88 tokens per second)
llama_print_timings: total time = 309092.12 ms
It was a bit faster with the GPU overall, although the GPU was not used at 100% as I expected; I don't know if that is normal for this model.
Found the reason: ollama.service was being launched by systemd and so wasn't picking up CUDA_VISIBLE_DEVICES from my environment.
That still leaves the question of why the CC 3.5 device was being selected when it isn't the first device and is not supported. Ollama probably should have logic to select only the supported CUDA devices on a multi-device host.
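For anyone else hitting this, something like the following drop-in override should do it (a sketch using the standard systemd override mechanism; the GPU index 0 is specific to my machine):

# Pin the Ollama service to the P40 only (device index 0 on my box).
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama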
@orlyandico we don't yet have logic to automatically detect and bypass unsupported cards in a multi-gpu setup when one isn't supported but others are.
@felipecock can you clarify your scenario? Are you attempting to load a model that can't fit entirely in VRAM and thus are getting a split between CPU/GPU? For apples-to-apples performance comparison, I'd try to get metrics from a model that fits entirely in the GPU so we're not getting thrown off by I/O bottlenecks or GPU stalling waiting for CPU.
@dhiltgen, I've performed a test on a newer machine (13th Gen Intel(R) Core(TM) i9-13900H, 2600 MHz, 14 cores, 20 logical processors, 64 GB RAM + NVIDIA RTX 2000 Ada Generation Laptop GPU) and I noticed that the CPU is used much more heavily than the GPU, despite Ollama saying the GPU would be used:
gpu.go:88: Detecting GPU type
gpu.go:203: Searching for GPU management library libnvidia-ml.so
gpu.go:248: Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1]
gpu.go:94: Nvidia GPU detected
gpu.go:135: CUDA Compute Capability detected: 8.9
...
shim_ext_server_linux.go:24: Updating PATH to /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/ollama1622992116/cuda
shim_ext_server.go:92: Loading Dynamic Shim llm server: /tmp/ollama1622992116/cuda/libext_server.so
ext_server_common.go:136: Initializing internal llama server
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
So I think the expected behavior is to use the CPU for some part of the process that cannot be parallelized (I believe), which results in heavier CPU usage rather than GPU usage.
I'm not an expert in this, so I could be wrong. :confused:
@felipecock I'm not quite sure what your question is. It looks like that GPU has 12G of VRAM, so you'll be able to run larger models entirely on the GPU than a typical CC 3.5 or 3.7 card. We're drifting a bit off-topic for this issue, but if the model doesn't fit in VRAM, then some amount of processing is done on the CPU, and often this can result in poor performance as the GPU stalls waiting for the CPU to keep up.
The current state of this issue: I have a PR up that would enable support for these older cards, but we're not sure yet whether we'll merge it, as we're concerned it could be a performance hit for many users, given that these older cards aren't particularly well suited for LLM work.