Consider providing interim instructions for Linux "happy path" using Docker
Hello,
I spent part of this afternoon banging my head against a wall trying to get dfdx with the cuda feature enabled up and running on my computer. It turns out a big part of this appears to be that my CUDA version (11.2) doesn't play well with the build.rs script, with errors appearing at multiple steps. As I think I mentioned in previous issues, my set-up isn't particularly exotic (just the recent Pop!_OS release with the default NVIDIA drivers), so I suspect other folks may run into the same issue.
According to System76's docs, the recommended way of dealing with a CUDA version mismatch is just to use Docker. While this isn't ideal (I don't love having to rely on Docker), I can confirm that it solved most of my build issues: first follow the GPU-enabled container instructions in the link above, then build a dfdx-specific container using the Dockerfile below (which takes a hot minute to build).
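Before building anything, it's worth confirming that GPU passthrough itself works. A quick sanity check (assuming the NVIDIA Container Toolkit is set up per the System76 docs; the image tag is just the stock CUDA base image) is:

```shell
# Should print the same table as running nvidia-smi on the host.
# If this fails, the problem is the host's Docker/NVIDIA setup,
# not dfdx or the Dockerfile.
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
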
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Get Ubuntu packages (single RUN layer so the apt cache is never stale)
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git
# Get Rust
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
RUN echo 'source $HOME/.cargo/env' >> $HOME/.bashrc
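Assuming the above is saved as Dockerfile, the image can be built and entered like this (the dfdx-cuda tag and the /dfdx mount point are arbitrary names of mine):

```shell
# Build the image (this is the step that takes a while), then start a
# shell with the GPU and the current dfdx checkout mounted inside.
docker build -t dfdx-cuda .
docker run --rm -it --gpus all -v "$PWD":/dfdx -w /dfdx dfdx-cuda bash
```
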
I was just thinking that it might be worth adding this kind of process to the crate's documentation to help other people who may run into the same issue, at least until it becomes clear that the base NVIDIA-enabled system configurations shipped with distros like Pop!_OS/Ubuntu can support dfdx's build script.
Edited because I'm occasionally Very Dumb(TM) and forgot to actually run this with the GPU passthrough. The following comment is now accurate.
Edit 2: Except this doesn't work with the PyTorch example in image-classification either, so I guess maybe it's something about the NVIDIA Docker image itself :upside_down_face: I'll update whenever I happen to regain the willpower to continue exploring this.
I basically needed to remove the nvidia-smi section of the build script in favor of nvcc. With that change dfdx compiles with cuda enabled, but it can't actually run the test suite, exiting with an error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE, "forward compatibility was attempted on non supported HW")', /usr/local/cargo/git/checkouts/cudarc-2602ad613d9c0487/cc9a8d3/src/driver/safe/core.rs:50:24
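As far as I understand, that panic means the driver advertises an older CUDA version than the toolkit in the container, and CUDA's forward-compatibility packages only work on data-center GPUs (hence "non supported HW"). The comparison can be sketched as a hypothetical shell helper (check_cuda_match is my own name, not part of any NVIDIA tool):

```shell
# Hypothetical helper comparing the CUDA version the driver supports
# (shown in the nvidia-smi header) against the toolkit version in the
# container (shown by nvcc --version).
cuda_major() { printf '%s' "$1" | cut -d. -f1; }

check_cuda_match() {
  driver="$1"   # e.g. "11.2" from the host's nvidia-smi
  toolkit="$2"  # e.g. "12.1" from nvcc --version in the container
  if [ "$(cuda_major "$driver")" -lt "$(cuda_major "$toolkit")" ]; then
    echo "mismatch: driver supports CUDA $driver but toolkit is $toolkit"
    return 1
  fi
  echo "ok: driver CUDA $driver covers toolkit $toolkit"
}

# 11.2 driver + 12.1 image is exactly the combination that produces the
# CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE panic above.
check_cuda_match 11.2 12.1 || true
```

In other words, with an 11.2 driver the fix is either upgrading the host driver or using an 11.x base image, not forward compatibility.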
In addition, I'd recommend adding pkg-config and libssl-dev to the apt-get install list.
I'm not sure if you got it working, but I'm trying to learn ML while using this lib, and this is my Dockerfile dev environment:
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# basic tools
RUN apt update \
&& apt install -y --no-install-recommends \
git vim openssh-client gnupg curl wget ca-certificates unzip zip less zlib1g sudo coreutils sed grep
#
# cargo/rust
ENV RUSTUP_HOME=/usr/local/rustup
ENV CARGO_HOME=/usr/local/cargo
ENV PATH=/usr/local/cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# https://blog.rust-lang.org/2022/06/22/sparse-registry-testing.html
ENV CARGO_UNSTABLE_SPARSE_REGISTRY=true
RUN set -eux; \
apt update \
&& apt install -y --no-install-recommends \
ca-certificates gcc build-essential; \
url="https://static.rust-lang.org/rustup/dist/x86_64-unknown-linux-gnu/rustup-init"; \
wget "$url"; \
chmod +x rustup-init; \
./rustup-init -y --no-modify-path --default-toolchain nightly; \
rm rustup-init; \
chmod -R a+w $RUSTUP_HOME $CARGO_HOME; \
rustup --version; \
cargo --version; \
rustc --version;
#
# https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup
# single quotes so ${PATH} is expanded at shell startup, not baked in at build time
RUN echo 'export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
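For what it's worth, I build and use it roughly like this (the dfdx-dev tag, the /work mount, and the exact feature list are just my choices):

```shell
# Build the dev image, then run the dfdx test suite inside it with the
# GPU passed through and the project directory mounted at /work.
docker build -t dfdx-dev .
docker run --rm --gpus all -v "$PWD":/work -w /work dfdx-dev \
  cargo test --no-default-features --features std,cpu,cuda,cudnn,nightly
```
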
That's for:
[dependencies.dfdx]
version = "0.13.0"
default-features = false
features = [
"std",
"fast-alloc",
"cpu",
"cuda",
"cudnn",
"safetensors",
"numpy",
"nightly",
]
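If it helps, the same dependency block can be produced from the command line instead of editing Cargo.toml by hand (cargo add ships with recent cargo):

```shell
# Adds dfdx 0.13.0 with default features off and the feature set above.
cargo add dfdx@0.13.0 --no-default-features \
  --features std,fast-alloc,cpu,cuda,cudnn,safetensors,numpy,nightly
```
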