
GPU not detected in Kubernetes.

Open dylanbstorey opened this issue 1 year ago • 14 comments

What is the issue?

When deploying into Kubernetes, the container complains that it is unable to load the cudart library (or that the library may be out of date).

Based on the documentation and the provided examples, I expect it to detect and utilize the GPU in the container.

Every test I can think of (admittedly a limited set) indicates this should be working, but I'll bet I'm missing some nuance in the stack here - any advice would be appreciated.

Host Configuration:

uname -a                                                                                                                                                                                               
Linux overseer 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr  4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

nvcc

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Docker Run output:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11435:11434 --name ollama ollama/ollama
docker logs ollama 
time=2024-04-19T17:59:48.712Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-19T17:59:48.712Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-19T17:59:48.712Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.1.32)"
time=2024-04-19T17:59:48.712Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama4206527122/runners
time=2024-04-19T17:59:50.712Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60002 cpu]"
time=2024-04-19T17:59:50.712Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-19T17:59:50.712Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:59:50.713Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama4206527122/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-19T17:59:50.746Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-19T17:59:50.746Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-19T17:59:50.855Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
[GIN] 2024/04/19 - 18:00:35 | 404 |     154.523µs |      172.17.0.1 | POST     "/api/generate"

Deployment Configuration:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  runtimeClassName: nvidia
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: ollama
          image: ollama/ollama
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: PATH
              value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: OLLAMA_DEBUG
              value: "1"
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
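
As a sanity check (the node name below is a placeholder), the node should be advertising the GPU resource before the pod can claim it:

kubectl describe node <node-name> | grep nvidia.com/gpu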

Deployment Logs:

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMp1XONYlspAEBzMEJNATgAMm39ctFUiN3XZxLwlzVMB

time=2024-04-19T17:27:12.252Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-19T17:27:12.252Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-19T17:27:12.253Z level=INFO source=routes.go:1143 msg="Listening on :11434 (version 0.1.32)"
time=2024-04-19T17:27:12.253Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama2307533246/runners
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-04-19T17:27:12.253Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu_avx
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cpu_avx2
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/cuda_v11
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2307533246/runners/rocm_v60002
time=2024-04-19T17:27:14.217Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
time=2024-04-19T17:27:14.217Z level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:27:14.217Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama2307533246/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
cudaSetDevice err: 35
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:343 msg="Unable to load cudart CUDA management library /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0: your nvidia driver is too old or missing, please upgrade to run ollama"
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-19T17:27:14.218Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /usr/local/nvidia/lib/libnvidia-ml.so* /usr/local/nvidia/lib64/libnvidia-ml.so*]"
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: []"
time=2024-04-19T17:27:14.218Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-19T17:27:14.218Z level=DEBUG source=amd_linux.go:280 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-04-19T17:27:14.218Z level=INFO source=routes.go:1164 msg="no GPU detected"

Kubernetes-based nbody run:

 cat <<EOF | kubectl create -f -                                                                                                                                                                                                                                                                                   
apiVersion: v1              
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF
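
Once the pod completes, the output below comes from its logs:

kubectl logs -n default nbody-gpu-benchmark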

nbody container logs

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance)
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3060]
28672 bodies, total time for 10 iterations: 22.067 ms
= 372.538 billion interactions per second
= 7450.761 single-precision GFLOP/s at 20 flops per interaction

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

latest

dylanbstorey avatar Apr 19 '24 18:04 dylanbstorey

Can you share what driver version you have on the host? nvidia-smi on the host should report it.
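
If it's easier, a one-liner like this (standard nvidia-smi query flags) prints just the driver version:

nvidia-smi --query-gpu=driver_version --format=csv,noheader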

dhiltgen avatar Apr 19 '24 20:04 dhiltgen

Sure thing!

Fri Apr 19 16:41:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:0B:00.0  On |                  N/A |
|  0%   45C    P8             20W /  170W |    6167MiB /  12288MiB |     18%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1351      G   /usr/lib/xorg/Xorg                            420MiB |
|    0   N/A  N/A      1710      G   /usr/bin/gnome-shell                           50MiB |
|    0   N/A  N/A      5747      G   ...ures=SpareRendererForSitePerProcess        241MiB |
|    0   N/A  N/A     20562      G   ...yOnDemand --variations-seed-version        117MiB |
|    0   N/A  N/A    127243      G   ...seed-version=20240419-050138.465000        187MiB |
|    0   N/A  N/A    206026      C   ...unners/cuda_v11/ollama_llama_server       5114MiB |
+-----------------------------------------------------------------------------------------+

dylanbstorey avatar Apr 19 '24 20:04 dylanbstorey

I am hosting locally and running ollama with my RTX 4090.

I had an issue running the new Llama3 model, so I thought I'd update my local image/container.

Since pulling the new version, I get the error shown in the screen clip that my nvidia driver is too old.

Coincidentally, there was an nvidia driver update available, but applying this update has made no difference. My current nvidia driver version is 552.22, released 16th April 2024.

chrisnurse avatar Apr 20 '24 01:04 chrisnurse

Hmm... I have a very similar test setup and haven't been able to reproduce yet. (Ubuntu 22.04 kernel 5.15.0-105-generic)

% nvidia-smi
Mon Apr 22 23:00:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8             14W /  170W |    1872MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:05:00.0 Off |                  N/A |
|  0%   48C    P8             13W /  170W |    1792MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3159      C   ...unners/cuda_v11/ollama_llama_server       1866MiB |
|    1   N/A  N/A      3159      C   ...unners/cuda_v11/ollama_llama_server       1786MiB |
+-----------------------------------------------------------------------------------------+
% docker run --rm -it --gpus=all -e OLLAMA_DEBUG=1  ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
bccd10f490ab: Pull complete
50f1619ca0b1: Pull complete
7a344f274044: Pull complete
Digest: sha256:c5018bf71b27a38f50da37d86fa0067105eea488cdcc258ace6d222dde632f75
Status: Downloaded newer image for ollama/ollama:latest
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIC3jXTsRoQI5yH6n0hhiHtRbJoD6gYOd9C81TRfk+Ck

time=2024-04-22T22:59:11.746Z level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-22T22:59:11.746Z level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-22T22:59:11.746Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.1.32)"
time=2024-04-22T22:59:11.746Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama1759042201/runners
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-04-22T22:59:11.746Z level=DEBUG source=payload.go:160 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu_avx
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cpu_avx2
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/cuda_v11
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1759042201/runners/rocm_v60002
time=2024-04-22T22:59:13.484Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-04-22T22:59:13.484Z level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-22T22:59:13.484Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama1759042201/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-22T22:59:13.484Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1759042201/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama1759042201/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
CUDA driver version: 12-4
time=2024-04-22T22:59:13.525Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-22T22:59:13.525Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[0] CUDA totalMem 12622168064
[0] CUDA freeMem 12508921856
[1] CUDA totalMem 12622168064
[1] CUDA freeMem 12508921856
time=2024-04-22T22:59:13.621Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
releasing cudart library
[GIN] 2024/04/22 - 22:59:38 | 200 |      48.977µs |       127.0.0.1 | HEAD     "/"

Exec'ing into the container shows a fast token rate (the logs also show it running on the GPU):

% docker exec -it stupefied_bhabha ollama run --verbose orca-mini hello
 Hello there! How can I assist you today?

total duration:       153.15311ms
load duration:        318.736µs
prompt eval duration: 20.014ms
prompt eval rate:     0.00 tokens/s
eval count:           11 token(s)
eval duration:        86.036ms
eval rate:            127.85 tokens/s

Is it possible the nvidia container toolkit is out of date? If not that, perhaps the newer kernel version has an impact.
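
On the host, something along these lines should show the toolkit version (the package query assumes a Debian-based distro):

nvidia-ctk --version
dpkg -l | grep nvidia-container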

dhiltgen avatar Apr 22 '24 23:04 dhiltgen

The docker container is working fine - the GPU gets used as expected; I'm only seeing the problem within Kubernetes.

The log lines that I think indicate what is going on are:

time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-19T17:27:14.217Z level=DEBUG source=gpu.go:286 msg="gpu management search paths: [/tmp/ollama2307533246/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so**]"
time=2024-04-19T17:27:14.217Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0]"
wiring cudart library functions in /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0
dlsym: cudaSetDevice
dlsym: cudaDeviceSynchronize
dlsym: cudaDeviceReset
dlsym: cudaMemGetInfo
dlsym: cudaGetDeviceCount
dlsym: cudaDeviceGetAttribute
dlsym: cudaDriverGetVersion
cudaSetDevice err: 35
time=2024-04-19T17:27:14.218Z level=INFO source=gpu.go:343 msg="Unable to load cudart CUDA management library /tmp/ollama2307533246/runners/cuda_v11/libcudart.so.11.0: your nvidia driver is too old or missing, please upgrade to run ollama"

What confuses me is that I'm able to load other CUDA libraries (I think) when using the nbody program, so I know the GPU is available in the cluster at runtime.
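
A quick way to see what the runtime actually injected into the ollama pod (addressed here via its deployment, per the manifest above) is:

kubectl exec -n ollama deploy/ollama -- nvidia-smi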

dylanbstorey avatar Apr 23 '24 00:04 dylanbstorey

@dhiltgen a quick note of appreciation for the response. Mine may have been a short-term issue whilst all the different moving parts aligned around Llama3. Today everything looks good with the latest versions of Ollama and AnythingLLM (my current go-to environment for learning).

Apologies if this was off topic.

chrisnurse avatar Apr 23 '24 02:04 chrisnurse

I wonder if this might be a dup of #1500 - perhaps the bundled cudart library we bake into the image doesn't support MIG?

dhiltgen avatar Apr 24 '24 17:04 dhiltgen

Oh, good eye - I'll pull that PR and see if it fixes the issue, and if so, patiently wait for the merge to come through.

dylanbstorey avatar Apr 24 '24 20:04 dylanbstorey

This doesn't appear to be related. :disappointed: I'll keep looking - if I get an answer I'll make sure to post it. Thanks for the support.

Maybe useful: I ran nvidia-smi within the container and got an "NVML: Unknown error".
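
A follow-up check that may help narrow down the NVML error is whether the runtime injected the device nodes and the management library at all (command illustrative; the find needs a shell for the redirect):

kubectl exec -n ollama deploy/ollama -- sh -c 'ls -l /dev/nvidia*; find / -name "libnvidia-ml.so*" 2>/dev/null'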

dylanbstorey avatar Apr 24 '24 22:04 dylanbstorey

I've been working on a change to how we do GPU discovery that might resolve this issue. I've pushed test images up to Docker Hub for folks to try out.

dhiltgen/ollama:0.1.33-rc5-24-g089daae
dhiltgen/ollama:0.1.33-rc5-24-g089daae-rocm

dhiltgen avatar May 01 '24 16:05 dhiltgen

Still getting a 35 error.

dylanbstorey avatar May 02 '24 01:05 dylanbstorey

@dylanbstorey can you share the server log with -e OLLAMA_DEBUG=1 set so I can see which API is getting that error?

dhiltgen avatar May 02 '24 16:05 dhiltgen

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGToQkRNU4xkWVE/tqEac0rtLgjiOPiF58sO7iSlbYJK

time=2024-05-02T01:16:39.059Z level=INFO source=images.go:828 msg="total blobs: 0"
time=2024-05-02T01:16:39.062Z level=INFO source=images.go:835 msg="total unused blobs removed: 0"
time=2024-05-02T01:16:39.062Z level=INFO source=routes.go:1074 msg="Listening on :11434 (version 0.1.33-rc5-24-g089daae)"
time=2024-05-02T01:16:39.063Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3280993898/runners
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-05-02T01:16:39.063Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu_avx
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cpu_avx2
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/cuda_v11
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3280993898/runners/rocm_v60002
time=2024-05-02T01:16:41.203Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
time=2024-05-02T01:16:41.203Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-05-02T01:16:41.203Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-05-02T01:16:41.203Z level=INFO source=gpu.go:121 msg="Detecting GPUs"
time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:247 msg="Searching for GPU library" name=libcuda.so*
time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[]
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:247 msg="Searching for GPU library" name=libcudart.so*
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama3280993898/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-05-02T01:16:41.207Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[/tmp/ollama3280993898/runners/cuda_v11/libcudart.so.11.0]
cudaSetDevice err: 35
time=2024-05-02T01:16:41.211Z level=DEBUG source=gpu.go:306 msg="Unable to load cudart" library=/tmp/ollama3280993898/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
time=2024-05-02T01:16:41.211Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-02T01:16:41.211Z level=DEBUG source=amd_linux.go:297 msg="amdgpu driver not detected /sys/module/amdgpu"

dylanbstorey avatar May 03 '24 11:05 dylanbstorey

time=2024-05-02T01:16:41.203Z level=DEBUG source=gpu.go:266 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-05-02T01:16:41.205Z level=DEBUG source=gpu.go:294 msg="discovered GPU libraries" paths=[]

This indicates we're not finding a Driver API library but falling back to the runtime API. Perhaps I don't have enough search paths wired up for all the different container runtime permutations. (On my test system with local Docker and the nvidia container toolkit, it finds the library.) Can you try exec'ing into the pod and running find / -name libcuda.so\* to see if it's located someplace else?
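
From outside the container that would be, for example (a shell inside the pod handles the redirect):

kubectl exec -n ollama deploy/ollama -- sh -c 'find / -name "libcuda.so*" 2>/dev/null'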

dhiltgen avatar May 03 '24 16:05 dhiltgen


OK - somewhere along the line (I'll have to dig through git blame to figure out when) I removed the runtimeClass setting on the pod... now that I've set it correctly to nvidia, this container is working well for me.
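
(For completeness: the pod-level runtimeClassName: nvidia only takes effect if the cluster defines a matching RuntimeClass; the handler name depends on how the NVIDIA runtime is registered with the container runtime, so treat this as the common convention rather than a given.)

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

With that in place, the working deployment: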

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  runtimeClassName: nvidia
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: ollama
          image: dhiltgen/ollama:0.1.33-rc5-24-g089daae
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: PATH
              value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64/:/usr/local/
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: OLLAMA_DEBUG
              value: "1"
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP


With this container it is mounting correctly now and working as intended. I'll try to find some time to go back and see whether other builds were also working and I was just being a dolt about the pod definition.

I want to just say thank you for your patience and work on this project.

dylanbstorey avatar May 08 '24 12:05 dylanbstorey

glad you found it :)

dims avatar May 08 '24 12:05 dims

Confirmed working fine with the 0.1.34 image now. Assuming it's fine with 0.1.33 as well. Thanks again.

dylanbstorey avatar May 10 '24 12:05 dylanbstorey