Bug: Container image fails to start due to CUDA version mismatch
Describe the bug
When running the latest official cuda-86 image (ghcr.io/ericlbuehler/mistral.rs:cuda-86-latest), the mistralrs server fails to start with an error saying it can't find the CUDA driver library on LD_LIBRARY_PATH:
Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: cudarc::panic_no_lib_found
3: std::sys::sync::once::futex::Once::call
4: std::sync::once_lock::OnceLock<T>::initialize
5: cudarc::driver::safe::core::CudaDevice::new
6: <candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::new
7: candle_core::device::Device::cuda_if_available
8: mistralrs_server::main::{{closure}}
9: tokio::runtime::park::CachedParkThread::block_on
10: tokio::runtime::context::runtime::enter_runtime
11: tokio::runtime::runtime::Runtime::block_on
12: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
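One quick way to narrow down this first error is to check which directories on LD_LIBRARY_PATH actually contain a libcuda.so* file. The sketch below uses a hypothetical helper, find_libcuda, which is not part of mistral.rs or the CUDA tooling. It only looks at LD_LIBRARY_PATH, not at the ld.so cache or the default system directories that the dynamic loader also searches.

```shell
# Print every directory on a colon-separated search path that contains a
# libcuda.so* file. Prints nothing if no listed directory has one.
# Hypothetical helper for illustration only.
find_libcuda() (
    # Use the first argument if given, otherwise the current LD_LIBRARY_PATH.
    search_path=${1-${LD_LIBRARY_PATH-}}
    # Split on ':' with globbing disabled; this runs in a subshell, so the
    # caller's IFS and positional parameters are untouched.
    IFS=:
    set -f
    set -- $search_path
    set +f
    for dir in "$@"; do
        [ -n "$dir" ] || continue
        for candidate in "$dir"/libcuda.so*; do
            # An unmatched glob stays literal, so confirm the file exists.
            if [ -e "$candidate" ]; then
                printf '%s\n' "$dir"
                break
            fi
        done
    done
)
```

Called with no argument inside the container, find_libcuda uses the current LD_LIBRARY_PATH. Empty output means none of the listed directories contains the driver library.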
Overriding LD_LIBRARY_PATH to include the path to libcuda.so inside the container (/usr/local/cuda-12.4/compat) then reveals that the problem is a CUDA version mismatch:
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.4/compat
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: mistralrs_server::main::{{closure}}
2: tokio::runtime::park::CachedParkThread::block_on
3: tokio::runtime::context::runtime::enter_runtime
4: tokio::runtime::runtime::Runtime::block_on
5: mistralrs_server::main
6: std::sys::backtrace::__rust_begin_short_backtrace
7: std::rt::lang_start::{{closure}}
8: std::rt::lang_start_internal
9: main
10: <unknown>
11: __libc_start_main
12: _start
Error: DriverError(CUDA_ERROR_SYSTEM_DRIVER_MISMATCH, "system has unsupported display driver / cuda driver combination")
0: candle_core::error::Error::bt
1: <candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::new
2: candle_core::device::Device::cuda_if_available
3: mistralrs_server::main::{{closure}}
4: tokio::runtime::park::CachedParkThread::block_on
5: tokio::runtime::context::runtime::enter_runtime
6: tokio::runtime::runtime::Runtime::block_on
7: mistralrs_server::main
8: std::sys::backtrace::__rust_begin_short_backtrace
9: std::rt::lang_start::{{closure}}
10: std::rt::lang_start_internal
11: main
12: <unknown>
13: __libc_start_main
14: _start
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: mistralrs_server::main::{{closure}}
2: tokio::runtime::park::CachedParkThread::block_on
3: tokio::runtime::context::runtime::enter_runtime
4: tokio::runtime::runtime::Runtime::block_on
5: mistralrs_server::main
6: std::sys::backtrace::__rust_begin_short_backtrace
7: std::rt::lang_start::{{closure}}
8: std::rt::lang_start_internal
9: main
10: <unknown>
11: __libc_start_main
12: _start
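For anyone reproducing the LD_LIBRARY_PATH override above, the extended path can be built from the image's default value instead of being retyped. This is a minimal sketch, assuming a hypothetical helper named append_path_entry. The docker run line in the comment uses the standard --gpus and -e flags and leaves out the server arguments.

```shell
# Append a directory to a colon-separated path unless it is already listed.
# Hypothetical helper for illustration only.
append_path_entry() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present, keep as-is
        "::")     printf '%s\n' "$2" ;;      # empty path, use the new entry
        *)        printf '%s\n' "$1:$2" ;;   # otherwise append
    esac
}

# Possible usage (not executed here; server arguments omitted):
#   image_path=/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.4
#   docker run --gpus all \
#     -e LD_LIBRARY_PATH="$(append_path_entry "$image_path" /usr/local/cuda-12.4/compat)" \
#     ghcr.io/ericlbuehler/mistral.rs:cuda-86-latest
```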
docker inspect shows the env to be:
"Env": [
"KEEP_ALIVE_INTERVAL=100",
"RUST_BACKTRACE=1",
"LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.4",
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"NVARCH=x86_64",
"NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536",
"NV_CUDA_CUDART_VERSION=12.4.127-1",
"NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-4",
"CUDA_VERSION=12.4.1",
"NVIDIA_VISIBLE_DEVICES=all",
"NVIDIA_DRIVER_CAPABILITIES=compute,utility",
"NV_CUDA_LIB_VERSION=12.4.1-1",
"NV_NVTX_VERSION=12.4.127-1",
"NV_LIBNPP_VERSION=12.2.5.30-1",
"NV_LIBNPP_PACKAGE=libnpp-12-4=12.2.5.30-1",
"NV_LIBCUSPARSE_VERSION=12.3.1.170-1",
"NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-4",
"NV_LIBCUBLAS_VERSION=12.4.5.8-1",
"NV_LIBCUBLAS_PACKAGE=libcublas-12-4=12.4.5.8-1",
"NV_LIBNCCL_PACKAGE_NAME=libnccl2",
"NV_LIBNCCL_PACKAGE_VERSION=2.21.5-1",
"NCCL_VERSION=2.21.5-1",
"NV_LIBNCCL_PACKAGE=libnccl2=2.21.5-1+cuda12.4",
"NVIDIA_PRODUCT_NAME=CUDA",
"NV_CUDNN_VERSION=9.1.0.70-1",
"NV_CUDNN_PACKAGE_NAME=libcudnn9-cuda-12",
"NV_CUDNN_PACKAGE=libcudnn9-cuda-12=9.1.0.70-1",
"HUGGINGFACE_HUB_CACHE=/data",
"PORT=80",
"RAYON_NUM_THREADS=8"
Latest commit
- "Image": "sha256:a454e50438c0b4972193ee8d977d76e8539a5b50bdd023e146c121af7920de73"
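To pull just the relevant variables out of the docker inspect output above instead of reading the full dump, something like the following could be used. env_value is a hypothetical helper that filters KEY=VALUE lines, and the docker inspect template in the comment is one assumed way of printing one variable per line.

```shell
# Print the value of one variable from KEY=VALUE lines on stdin.
# Hypothetical helper; expects a plain variable name (letters, digits, '_').
env_value() {
    # Keep only lines that start with "<name>=" and strip that prefix.
    sed -n "s/^$1=//p"
}

# Possible usage (not executed here):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
#     ghcr.io/ericlbuehler/mistral.rs:cuda-86-latest | env_value CUDA_VERSION
```

Running it for CUDA_VERSION and LD_LIBRARY_PATH gives the two values that matter most for this report.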
Host system:
- Fedora 40
- Nvidia driver 555.58.02
- Kernel 6.9.8-200.fc40.x86_64
- nvidia-container-runtime-3.14.0-1.noarch
- docker-ce-27.0.3-1.fc40.x86_64
- 1x RTX 3090 + 2x RTX A4000
FYI I did a custom build with CUDA 12.5.1 as the base image and it had the same issue.
Same issue here with 1x A6000 + 1x 4090.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
CUDA Version: 12.2
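Two different versions show up in this output: nvcc reports the installed toolkit (11.5), while the "CUDA Version: 12.2" line looks like the highest CUDA version the installed driver supports. As a rough check, the CUDA version inside the image should not be newer than the latter. The sketch below assumes GNU sort (for the -V option), and version_le is a hypothetical helper.

```shell
# Succeed if dotted version $1 is less than or equal to dotted version $2.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_le() {
    # After a version sort, the smaller (or equal) version comes first.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example with the numbers from this thread:
#   version_le 12.4.1 12.2 || echo "image CUDA is newer than the driver supports"
```

For the numbers in this comment the check fails for a 12.4.1 image against a driver reporting 12.2, which would be consistent with the container not starting on that host.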
I fixed it as follows:
- Uninstall cuda
sudo apt-get --purge remove "*cuda*"
sudo apt-get autoremove
- Follow the steps here: https://developer.nvidia.com/cuda-downloads. I used the following commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-12-4
- Restart.
After this, the code compiled correctly. Remember to add the CUDA library directory to your LD_LIBRARY_PATH.
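For that last point, a minimal sketch of adding the toolkit to the environment. It assumes the cuda-12-4 packages were installed to /usr/local/cuda-12.4, which is the usual default but is not confirmed in this thread. The CUDA_HOME variable name is also just a convention used here.

```shell
# Assumed install prefix for the cuda-12-4 packages; adjust if yours differs.
CUDA_HOME=/usr/local/cuda-12.4

# Prepend the toolkit's bin and lib64 directories, keeping any existing
# entries and avoiding a trailing ':' when the variable was empty.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

These lines would typically go in a shell startup file such as ~/.bashrc so they apply to new sessions.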
For me, updating my drivers and then building my own image with a matching CUDA version worked (I also added /usr/local/cuda/compat to the container's LD_LIBRARY_PATH):
- FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder
+ FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu22.04 AS builder
Weird that a more recent CUDA driver wouldn't work with an older CUDA runtime... I didn't seem to have this issue with llama.cpp (I've been looking at alternatives to it for personal use), for which I have an image built on a CUDA 11.8 base.