Investigate GPU support
Some users will want to mount a GPU to an envbuilder-backed workspace. Can we investigate in which scenarios (if any) this works today and if/how we can patch upstream Kaniko to improve the experience?
Related
- Kaniko issue: https://github.com/GoogleContainerTools/kaniko/issues/3006
- Internal Slack message
@BrunoQuaresma We may need to write a mini-RFC describing the status quo.
After talking to @mtojek, I think I have a good plan:
- Try to run envbuilder in a regular environment
  - Spin up a regular k8s cluster on Google Cloud
  - Try to run envbuilder with a hello-world image and see if it works
- Try to reproduce the user's error by running envbuilder with a GPU
  - Spin up a k8s cluster using an NVIDIA GPU on Google Cloud (see the gcloud sketch after this list)
  - Try to run envbuilder with a hello-world image and see if it works
- Try to find a workaround
  - Investigate possible solutions using different builders besides Kaniko
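A minimal sketch of how the GPU cluster step could look on GKE Standard (cluster name and zone are placeholders; the machine type and accelerator count match the setup described below, and GKE Standard also needs Google's NVIDIA driver-installer DaemonSet):

```bash
# Hypothetical example: create a small GKE Standard cluster with NVIDIA T4 GPUs.
gcloud container clusters create envbuilder-gpu-test \
  --zone us-central1-a \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-t4,count=2

# Install the NVIDIA drivers via Google's driver-installer DaemonSet (COS nodes).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```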
@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I did it:
- Spin up a k8s cluster with GPU support on GKE
  - GKE version: 1.27.13-gke.1000000
  - Machine type: n1-standard-4
  - GPU accelerators (per node): 2 x NVIDIA T4
- Set up a test repo with a devcontainer using an NVIDIA test image
  - Example: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample
  - NVIDIA example image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
- Set the following envbuilder config:
  - GIT_URL: https://github.com/BrunoQuaresma/envbuilder-gpu-test
  - INIT_SCRIPT: /tmp/vectorAdd
This is the output:
envbuilder - Build development environments from repositories in a container
#1: 📦 Cloning https://github.com/BrunoQuaresma/envbuilder-gpu-test to /workspaces/envbuilder-gpu-test...
#1: Enumerating objects: 4, done.
#1: Counting objects: 25% (1/4)
#1: Counting objects: 50% (2/4)
#1: Counting objects: 75% (3/4)
#1: Counting objects: 100% (4/4)
#1: Counting objects: 100% (4/4), done.
#1: Compressing objects: 50% (1/2)
#1: Compressing objects: 100% (2/2)
#1: Compressing objects: 100% (2/2), done.
#1: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
#1: 📦 Cloned repository! [193.807769ms]
#2: Deleting filesystem...
#2: 🏗️ Building image...
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Retrieving image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 from registry nvcr.io
#2: Built cross stage deps: map[]
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Returning cached image manifest
#2: Executing 0 build triggers
#2: Building stage 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2' [idx: '0', base-idx: '-1']
#2: 🏗️ Built image! [3.019338331s]
#3: no user specified, using root
#3: Updating the ownership of the workspace...
#3: 👤 Updated the ownership of the workspace! [449.651µs]
=== Running the init command /bin/sh [-c /tmp/vectorAdd] as the "root" user...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
@bpmct do you think we can get more details from the user?
I am closing this for now until we have more context from the user.
Try using the NVIDIA k8s device plugin (DaemonSet) and not an NVIDIA container image, e.g.:
https://github.com/NVIDIA/k8s-device-plugin
https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
This is the recommended approach when using GPU-enabled node pools for Azure Linux.
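For reference, a sketch of installing the upstream device plugin DaemonSet (the version tag is a placeholder; on AKS the linked docs use Microsoft's packaged plugin instead):

```bash
# Hypothetical example: install the upstream NVIDIA device plugin DaemonSet.
# Check the releases page for the current version tag.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

# The GPUs should then show up as allocatable nvidia.com/gpu resources on the nodes.
kubectl describe nodes | grep nvidia.com/gpu
```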
Try using the NVIDIA k8s device plugin (DaemonSet)
@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old.
What Kubernetes version are you seeing issues with on AKS?
not an NVIDIA container image, e.g.:
This container image is the one NVIDIA recommends for testing GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs)
What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image.
@johnstcn I'm seeing the issue on:
- K8s Rev: v1.27.7
- Node image: AKSUbuntu-2204gen2containerd-202401.09.0
- Plugin image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:1.11
- Pod image: ghcr.io/coder/envbuilder:0.2.9
@BrunoQuaresma
Still having issues when using your test repo on AKS against ghcr.io/coder/envbuilder:0.2.9
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# cd /tmp
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ls
coder.wTqTN7 vectorAdd
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
/usr/lib64# ll
total 176176
drwxr-xr-x 2 root root 4096 Jun 13 14:00 ./
drwxr-xr-x 14 root root 4096 Jun 13 13:49 ../
lrwxrwxrwx 1 root root 42 Jun 13 13:49 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2*
-rwxr-xr-x 1 root root 28392536 Jun 13 12:01 libcuda.so.550.54.15*
-rwxr-xr-x 1 root root 10524136 Jun 13 12:01 libcudadebugger.so.550.54.15*
-rwxr-xr-x 1 root root 168744 Jun 13 12:01 libnvidia-allocator.so.550.54.15*
-rwxr-xr-x 1 root root 398968 Jun 13 12:01 libnvidia-cfg.so.550.54.15*
lrwxrwxrwx 1 root root 36 Jun 13 14:00 libnvidia-ml.so -> /usr/lib64/libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 2078360 Jun 13 12:01 libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 86842616 Jun 13 12:01 libnvidia-nvvm.so.550.54.15*
-rwxr-xr-x 1 root root 23293568 Jun 13 12:01 libnvidia-opencl.so.550.54.15*
-rwxr-xr-x 1 root root 10168 Jun 13 12:01 libnvidia-pkcs11.so.550.54.15*
-rwxr-xr-x 1 root root 28670368 Jun 13 12:01 libnvidia-ptxjitcompiler.so.550.54.15*
Oh, I fixed it by running:
echo "/usr/lib64" > /etc/ld.so.conf.d/customized.conf
ldconfig
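As a quick sanity check after applying that fix (a sketch; run inside the envbuilder-built container), the driver library should now be on the linker path and nvidia-smi should find it:

```bash
# Confirm ldconfig now resolves the driver libraries from /usr/lib64.
ldconfig -p | grep libnvidia-ml
nvidia-smi
```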
@marrotte does the @nikawang fix work for you?
@BrunoQuaresma I don't think I can test that as my envbuilder fails to build. I believe @nikawang either applied that fix to a envbuilder running container or the running container built by envbuilder. I did try applying @nikawang's fix to the AKS/K8s GPU node as if that might be where it was applied and that had no effect.
@marrotte could you please share step-by-step instructions for setting up a similar k8s cluster, or a Terraform file I can just run?
I tried to create a Kubernetes GPU cluster on Azure following this tutorial, but without success. During the process, I managed to get the cluster up and running and register the required features and services through step five of the tutorial.
bruno [ ~ ]$ az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
{
"id": "/subscriptions/05e8b285-4ce1-46a3-b4c9-f51ba67d6acc/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/GPUDedicatedVHDPreview",
"name": "Microsoft.ContainerService/GPUDedicatedVHDPreview",
"properties": {
"state": "Registered"
},
"type": "Microsoft.Features/providers/features"
}
However, when I began adding the node pool, I started encountering errors.
az aks nodepool add \
--resource-group bruno \
--cluster-name bruno-gpu \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC6s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
(OperationNotAllowed) .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
Code: OperationNotAllowed
Message: .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
I tried searching for the error on Google to find a solution or any information related to properties.nodeProvisioningProfile.mode, but I didn't find anything helpful. I realized that it might be better to ask if you could share a Terraform file or a more straightforward tutorial for us to reproduce your environment.
So ENVBUILDER_IGNORE_PATHS can be set to /dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run, but we hit the known unlinkat/device or resource busy error.
The easiest way to get the right environment for reproducing this is likely the gpu-operator for Kubernetes or the NVIDIA Container Toolkit for Docker.
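For a local reproduction with Docker and the NVIDIA Container Toolkit, something along these lines should exercise the same path (a sketch: the ignore list and test repository are the ones quoted above; the ENVBUILDER_-prefixed variable names assume a recent envbuilder release, while older releases used the unprefixed GIT_URL/INIT_SCRIPT form seen earlier in the thread):

```bash
# Sketch: run envbuilder under Docker with the GPU exposed via the NVIDIA Container Toolkit.
docker run -it --rm --gpus all \
  -e ENVBUILDER_GIT_URL=https://github.com/BrunoQuaresma/envbuilder-gpu-test \
  -e ENVBUILDER_IGNORE_PATHS=/dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run \
  -e ENVBUILDER_INIT_SCRIPT=/tmp/vectorAdd \
  ghcr.io/coder/envbuilder
```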
I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.
~Currently only read-only mounts are taken care of, but the NVIDIA container runtime mounts devtmpfs filesystems at /var/run/nvidia-container-devices/GPU-<uuid> (the actual mountpoint can be /run since often /var/run is a symlink to it), so the logic would need to be extended to cover those (I have successfully done that).~ No special handling needed, I had probably forgotten to add /var/run back to the ignored paths.
The runtime mounts libraries with symlinks:
libcuda.so -> libcuda.so.1
libcuda.so.1 -> libcuda.so.<driver-version>
libcuda.so.<driver-version>
libcudadebugger.so.1 -> libcudadebugger.so.<driver-version>
libcudadebugger.so.<driver-version>
libnvidia-allocator.so.1 -> libnvidia-allocator.so.<driver-version>
libnvidia-allocator.so.<driver-version>
libnvidia-cfg.so.1 -> libnvidia-cfg.so.<driver-version>
libnvidia-cfg.so.<driver-version>
libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
libnvidia-ml.so.<driver-version>
libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.<driver-version>
libnvidia-nvvm.so.<driver-version>
libnvidia-opencl.so.1 -> libnvidia-opencl.so.<driver-version>
libnvidia-opencl.so.<driver-version>
libnvidia-pkcs11-openssl3.so.<driver-version>
libnvidia-pkcs11.so.<driver-version>
libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.<driver-version>
libnvidia-ptxjitcompiler.so.<driver-version>
The symlinks must also be preserved. The location in the envbuilder image is /usr/lib64, but it differs between distros (for example in Debian, it should be /usr/lib/x86_64-linux-gnu), so the remount process must discover the appropriate location in the new filesystem hierarchy.
I am using this quick-and-dirty script afterward to get things working:
remount_and_resymlink.sh
#!/usr/bin/env bash
set -euo pipefail

# Target directory where the dynamic linker expects libraries in the new rootfs
# (the Debian/Ubuntu multiarch path; adjust for other distros).
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name,
# e.g. /lib/firmware/nvidia/550.54.15 -> 550.54.15.
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

# For every driver library the NVIDIA runtime bind-mounted under /usr/lib64,
# move the mount into ${TARGET} and recreate the expected symlinks.
mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"

  # Pick the SONAME suffix: most libraries use .so.1, libnvidia-nvvm uses .so.4,
  # and the pkcs11 libraries get no versioned symlink at all.
  case "${lib}" in
    libnvidia-pkcs11.so.*) n="" ;;
    libnvidia-pkcs11-openssl3.so.*) n="" ;;
    libnvidia-nvvm.so.*) n=4 ;;
    *) n=1 ;;
  esac

  if [[ -n "${n}" ]]; then
    # e.g. libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
    ln -s "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi
  if [[ "${lib}" == "libcuda.so."* ]]; then
    # e.g. libcuda.so -> libcuda.so.1
    ln -s "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done
This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188
And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
Once that is all in place, the nvidia-smi command should work and show the GPU(s) as well as the CUDA version.
In an image with pytorch (e.g. nvcr.io/nvidia/pytorch:24.05-py3), python -c 'import torch; print(torch.cuda.is_available())' should return True.
~One thing I have not figured out yet is why the container gets all GPUs when only 1 is requested (this works properly for a regular container)~ That's because the pod is running with privileges.
The manifest I am using at the moment:
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder
spec:
  containers:
    - name: envbuilder
      image: ghcr.io/coder/envbuilder-preview
      env:
        - name: FALLBACK_IMAGE
          value: debian
        - name: INIT_SCRIPT
          value: sh -c 'while :; do sleep 86400; done'
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think; I am not testing there). ~I need to investigate GCP's ContainerOS too since things are wired a little differently.~ On GCP's ContainerOS the only mount is /usr/local/nvidia, so this path can either be ignored or remounted. No care is given to the PATH or ldconfig search path by default; that has to be handled by the user's image (e.g. LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /usr/local/nvidia/bin/nvidia-smi should work).
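For completeness, a sketch of what the user's image could do on ContainerOS nodes (paths are the ones mentioned above; making the change permanent, e.g. via an ld.so.conf.d entry as in the earlier /usr/lib64 fix, is up to the image):

```bash
# Sketch (GKE ContainerOS): driver files live under /usr/local/nvidia, which is not
# on the default search paths, so point PATH and LD_LIBRARY_PATH at it.
export PATH="${PATH}:/usr/local/nvidia/bin"
export LD_LIBRARY_PATH="/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
nvidia-smi

# Alternatively, mirroring the earlier /usr/lib64 fix:
#   echo /usr/local/nvidia/lib64 > /etc/ld.so.conf.d/nvidia.conf && ldconfig
```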