GPU not detected in Kubernetes.
What is the issue?
When deploying into Kubernetes, the container complains that it is unable to load the cudart library (or that the library may be out of date).
Based on the documentation and the provided examples, I expect it to detect and utilize the GPU in the container.
Every test I can think of (which is a limited set) seems to indicate this should be working, but I'll bet I'm missing some nuance in the stack here - any advice would be appreciated.
Host Configuration:
uname -a
Linux overseer 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Docker run output:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11435:11434 --name ollama ollama/ollama
docker logs ollama
time=2024-04-19T17:59:48.712Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-19T17:59:48.712Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-19T17:59:48.712Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.1.32)"
time=2024-04-19T17:59:48.712Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama4206527122/runners
time=2024-04-19T17:59:50.712Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60002 cpu]"
time=2024-04-19T17:59:50.712Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-19T17:59:50.712Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:59:50.713Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama4206527122/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-19T17:59:50.746Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-19T17:59:50.746Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-19T17:59:50.855Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
[GIN] 2024/04/19 - 18:00:35 | 404 | 154.523µs | 172.17.0.1 | POST "/api/generate"
Deployment Configuration:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  runtimeClassName: nvidia
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        resources:
          limits:
            nvidia.com/gpu: 1
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        env:
        - name: PATH
          value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
        - name: OLLAMA_DEBUG
          value: "1"
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
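(For reference, this is roughly how I'm applying it and confirming the GPU resource request actually lands on the pod - the manifest filename below is just whatever I saved the above as.)
# apply the deployment and confirm the GPU resource request made it onto the pod
kubectl apply -f ollama-deployment.yaml
kubectl -n ollama get pods
kubectl -n ollama describe pod -l name=ollama | grep nvidia.com/gpu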
Deployment Logs:
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMp1XONYlspAEBzMEJNATgAMm39ctFUiN3XZxLwlzVMB

time=2024-04-19T17:27:12.252Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-19T17:27:12.252Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-19T17:27:12.253Z level=INFO source=routes.go:1143 msg="Listening on :11434 (version 0.1.32)"
time=2024-04-19T17:27:12.253Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama2307533246/runners
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu_avx
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu_avx2
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cuda_v11
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/rocm_v60002
time=2024-04-19T17:27:14.217Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:27:14.217Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama2307533246/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
cudaSetDevice err: 35
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:343 msg="Unable to load cudart CUDA management library /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0: your nvidia driver is too old or missing, please upgrade to run ollama"
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-19T17:27:14.218Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /usr/local/nvidia/lib/libnvidia-ml.so* /usr/local/nvidia/lib64/libnvidia-ml.so*]"
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: []"
time=2024-04-19T17:27:14.218Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-19T17:27:14.218Z level=DEBUG source=amd_linux.go:280 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-04-19T17:27:14.218Z level=INFO source=routes.go:1164 msg="no GPU detected"
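(In case it helps, this is the sort of thing I've been running to poke around inside the pod - checking whether the NVIDIA driver libraries actually got injected into the container. The paths are just where the container toolkit usually drops them, so treat them as a guess.)
# exec into the running pod and look for the injected driver bits
kubectl -n ollama exec -it deploy/ollama -- nvidia-smi
kubectl -n ollama exec -it deploy/ollama -- ls /usr/local/nvidia/lib64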
Kubernetes-based nbody run:
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF
nbody container logs:
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3060]
28672 bodies, total time for 10 iterations: 22.067 ms
= 372.538 billion interactions per second
= 7450.761 single-precision GFLOP/s at 20 flops per interaction
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: latest
Can you share what driver version you have on the host? nvidia-smi on the host should report it.
Sure thing!
Fri Apr 19 16:41:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:0B:00.0 On | N/A |
| 0% 45C P8 20W / 170W | 6167MiB / 12288MiB | 18% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1351 G /usr/lib/xorg/Xorg 420MiB |
| 0 N/A N/A 1710 G /usr/bin/gnome-shell 50MiB |
| 0 N/A N/A 5747 G ...ures=SpareRendererForSitePerProcess 241MiB |
| 0 N/A N/A 20562 G ...yOnDemand --variations-seed-version 117MiB |
| 0 N/A N/A 127243 G ...seed-version=20240419-050138.465000 187MiB |
| 0 N/A N/A 206026 C ...unners/cuda_v11/ollama_llama_server 5114MiB |
+-----------------------------------------------------------------------------------------+
I am hosting locally and running ollama with my RTX 4090.
I had an issue running the new Llama3 model, so I thought I'd update my local image/container.
Since pulling the new version, I get the error shown in the screen clip, that my nvidia driver is too old.
Coincidentally, there was an nvidia driver update available, but applying this update has made no difference. My current nvidia driver version is 552.22, released 16th April 2024.
Hmm... I have a very similar test setup and haven't been able to reproduce yet. (Ubuntu 22.04 kernel 5.15.0-105-generic)
% nvidia-smi
Mon Apr 22 23:00:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 45C P8 14W / 170W | 1872MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 Off | 00000000:05:00.0 Off | N/A |
| 0% 48C P8 13W / 170W | 1792MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3159 C ...unners/cuda_v11/ollama_llama_server 1866MiB |
| 1 N/A N/A 3159 C ...unners/cuda_v11/ollama_llama_server 1786MiB |
+-----------------------------------------------------------------------------------------+
% docker run --rm -it --gpus=all -e OLLAMA_DEBUG=1 ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
bccd10f490ab: Pull complete
50f1619ca0b1: Pull complete
7a344f274044: Pull complete
Digest: sha256:c5018bf71b27a38f50da37d86fa0067105eea488cdcc258ace6d222dde632f75
Status: Downloaded newer image for ollama/ollama:latest
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIC3jXTsRoQI5yH6n0hhiHtRbJoD6gYOd9C81TRfk+Ck
time=2024-04-22T22:59:11.746Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-22T22:59:11.746Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-22T22:59:11.746Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.1.32)"
time=2024-04-22T22:59:11.746Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama1759042201/runners
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu_avx
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu_avx2
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cuda_v11
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/rocm_v60002
time=2024-04-22T22:59:13.484Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-22T22:59:13.484Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama1759042201/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1759042201/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama1759042201/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
CUDA driver version: 12-4
time=2024-04-22T22:59:13.525Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-22T22:59:13.525Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[0] CUDA totalMem 12622168064
[0] CUDA freeMem 12508921856
[1] CUDA totalMem 12622168064
[1] CUDA freeMem 12508921856
time=2024-04-22T22:59:13.621Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
releasing cudart library
[GIN] 2024/04/22 - 22:59:38 | 200 | 48.977µs | 127.0.0.1 | HEAD "/"
Exec'ing into the container shows a fast token rate (the logs also show it running on the GPU):
% docker exec -it stupefied_bhabha ollama run --verbose orca-mini hello
Hello there! How can I assist you today?
total duration: 153.15311ms
load duration: 318.736µs
prompt eval duration: 20.014ms
prompt eval rate: 0.00 tokens/s
eval count: 11 token(s)
eval duration: 86.036ms
eval rate: 127.85 tokens/s
Is it possible the nvidia container toolkit is out of rev? If not that, perhaps the kernel version being newer has an impact.
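(Rough ways to check both of those, assuming a Debian/Ubuntu host and the usual device-plugin/container-toolkit setup - adjust for your distro:)
# host-side: container toolkit version
nvidia-ctk --version
dpkg -l | grep nvidia-container-toolkit
# cluster-side: make sure the nvidia runtime class actually exists
kubectl get runtimeclass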
The Docker container is working fine - the GPU is getting used as expected; I'm only seeing the problem within Kubernetes.
The log lines that I think indicate what is going on are:
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:27:14.217Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama2307533246/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
cudaSetDevice err: 35
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:343 msg="Unable to load cudart CUDA management library /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0: your nvidia driver is too old or missing, please upgrade to run ollama"
What confuses me is that I'm able to load other CUDA libraries (I think) when using the nbody program, so I know the GPU is available in the cluster as part of the runtime.
@dhiltgen quick note of appreciation for the response. Mine may have been a short-term issue whilst all the different moving parts aligned around Llama3. Today everything looks good with the latest version of Ollama and AnythingLLM (my current go-to environment for learning).
Apologies if this was off topic.
I wonder if this might be a dup of #1500, where the bundled cudart library we bake into the image doesn't support MIG?
Oh good eye, I'll pull that PR and see if that fixes the issue and then patiently wait for the merge to come through if so.
This doesn't appear to be related. :disappointed: I'll keep looking - if I get an answer I'll make sure to post it. Thanks for providing the support.
Maybe useful - I ran nvidia-smi within the container and got an NVML: Unknown error.
I've been working on a change to how we do GPU discovery that might resolve this issue. I've pushed test images up to Docker Hub for folks to try out.
dhiltgen/ollama:0.1.33-rc5-24-g089daae
dhiltgen/ollama:0.1.33-rc5-24-g089daae-rocm
Still getting a 35 error.
@dylanbstorey can you share the server log with -e OLLAMA_DEBUG=1 set so I can see which API is getting that error?
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGToQkRNU4xkWVE/tqEac0rtLgjiOPiF58sO7iSlbYJK

time=2024-05-02T01:16:39.059Z level=INFO source=images.go:828 msg="total blobs: 0"
time=2024-05-02T01:16:39.062Z level=INFO source=images.go:835 msg="total unused blobs removed: 0"
time=2024-05-02T01:16:39.062Z level=INFO source=routes.go:1074 msg="Listening on :11434 (version 0.1.33-rc5-24-g089daae)"
time=2024-05-02T01:16:39.063Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3280993898/runners
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu_avx
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu_avx2
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cuda_v11
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/rocm_v60002
time=2024-05-02T01:16:41.203Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-05-02T01:16:41.203Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-05-02T01:16:41.203Z level=INFO source=gpu.go:121 msg="Detecting GPUs"
time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:247 msg="Searching for GPU library" name=libcuda.so*
time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[]
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:247 msg="Searching for GPU library" name=libcudart.so*
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama3280993898/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-05-02T01:16:41.207Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[/tmp/ollama3280993898/runners/cuda_v11/libcudart.so.11.0]
cudaSetDevice err: 35
time=2024-05-02T01:16:41.211Z level=DEBUG source=gpu.go:306 msg="Unable to load cudart" library=/tmp/ollama3280993898/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing. If you have a CUDA GPU please upgrade to run ollama"
time=2024-05-02T01:16:41.211Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-02T01:16:41.211Z level=DEBUG source=amd_linux.go:297 msg="amdgpu driver not detected /sys/module/amdgpu"

time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[]
This indicates we're not finding a Driver API library but falling back to the runtime API. Perhaps I don't have enough search paths wired up for all the different container runtime permutations. (On my test system with local Docker and the nvidia container toolkit it finds the library.) Can you try exec'ing into the pod and running find / -name libcuda.so\* to see if it's located someplace else?
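For example, something along these lines (adjust the namespace and selector to match your deployment - this is just a sketch):
kubectl -n ollama exec -it deploy/ollama -- find / -name 'libcuda.so*' 2>/dev/null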
OK - somewhere along the line (I'll have to dig through git blame to figure out when) I removed the runtimeClass setting on the pod... now that I've set it correctly to nvidia, this container is working well for me.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  runtimeClassName: nvidia
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      runtimeClassName: nvidia
      containers:
      - name: ollama
        image: dhiltgen/ollama:0.1.33-rc5-24-g089daae
        resources:
          limits:
            nvidia.com/gpu: 1
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        env:
        - name: PATH
          value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64/:/usr/local/
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: OLLAMA_DEBUG
          value: "1"
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
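(As a rough sanity check - exact log wording differs between versions - the startup log should now show the GPU being picked up:)
kubectl -n ollama logs deploy/ollama | grep -i gpu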
With this container it is mounting correctly now and working as intended. I'll try to find some time to go back and figure out whether other builds were also working and I was just being a dolt about the pod definition.
I want to just say thank you for your patience and work on this project.
glad you found it :)
Confirmed working fine with the 0.1.34 image now. Assuming it's fine with 0.1.33 as well. Thanks again.