Investigate GPU support
Some users will want to mount a GPU to an envbuilder-backed workspace. Can we investigate in which scenarios (if any) this works today and if/how we can patch upstream Kaniko to improve the experience?
Related
- Kaniko issue: https://github.com/GoogleContainerTools/kaniko/issues/3006
- Internal Slack message
@BrunoQuaresma We may need to write a mini-RFC describing the status quo.
After talking to @mtojek, I think I have a good plan:
- Try to run envbuilder in a regular environment
  - Spin up a regular k8s cluster on Google Cloud
  - Try to run envbuilder with a hello-world image and see if it works
- Try to reproduce the user's error by running envbuilder with a GPU
  - Spin up a k8s cluster using an NVIDIA GPU on Google Cloud (see the gcloud sketch after this list)
  - Try to run envbuilder with a hello-world image and see if it works
- Try to find a workaround
  - Investigate possible solutions using different builders besides Kaniko
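A minimal sketch of how the GPU cluster step could look on GKE Standard (cluster name and zone are placeholders; the machine type and accelerator count match the setup described below, and GKE Standard also needs Google's NVIDIA driver-installer DaemonSet):

```bash
# Hypothetical example: create a small GKE Standard cluster with NVIDIA T4 GPUs.
gcloud container clusters create envbuilder-gpu-test \
  --zone us-central1-a \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-t4,count=2

# Install the NVIDIA drivers via Google's driver-installer DaemonSet (COS nodes).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```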
@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I did it:
- Spin up a k8s cluster with GPU support on GKE
  - GKE version: 1.27.13-gke.1000000
  - Machine type: n1-standard-4
  - GPU accelerators (per node): 2 x NVIDIA T4
- Set up a test repo with a devcontainer using an NVIDIA test image
  - Example: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample
  - NVIDIA example image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
- Set the following envbuilder config:
  - GIT_URL: https://github.com/BrunoQuaresma/envbuilder-gpu-test
  - INIT_SCRIPT: /tmp/vectorAdd
This is the output:
envbuilder - Build development environments from repositories in a container
#1: 📦 Cloning https://github.com/BrunoQuaresma/envbuilder-gpu-test to /workspaces/envbuilder-gpu-test...
#1: Enumerating objects: 4, done.
#1: Counting objects: 25% (1/4)
#1: Counting objects: 50% (2/4)
#1: Counting objects: 75% (3/4)
#1: Counting objects: 100% (4/4)
#1: Counting objects: 100% (4/4), done.
#1: Compressing objects: 50% (1/2)
#1: Compressing objects: 100% (2/2)
#1: Compressing objects: 100% (2/2), done.
#1: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
#1: 📦 Cloned repository! [193.807769ms]
#2: Deleting filesystem...
#2: 🏗️ Building image...
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Retrieving image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 from registry nvcr.io
#2: Built cross stage deps: map[]
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Returning cached image manifest
#2: Executing 0 build triggers
#2: Building stage 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2' [idx: '0', base-idx: '-1']
#2: 🏗️ Built image! [3.019338331s]
#3: no user specified, using root
#3: Updating the ownership of the workspace...
#3: 👤 Updated the ownership of the workspace! [449.651µs]
=== Running the init command /bin/sh [-c /tmp/vectorAdd] as the "root" user...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
@bpmct do you think we can get more details from the user?
I am closing this for now until we have more context from the user.
Try using the NVIDIA k8s device plugin (DaemonSet) and not an NVIDIA container image, e.g.:
https://github.com/NVIDIA/k8s-device-plugin
https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
This is the recommended approach when using GPU-enabled node pools for Azure Linux.
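For reference, a sketch of installing the upstream device plugin DaemonSet (the version tag is a placeholder; on AKS the linked docs use Microsoft's packaged plugin instead):

```bash
# Hypothetical example: install the upstream NVIDIA device plugin DaemonSet.
# Check the releases page for the current version tag.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

# The GPUs should then show up as allocatable nvidia.com/gpu resources on the nodes.
kubectl describe nodes | grep nvidia.com/gpu
```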
Try using the NVIDIA k8s device plugin (DaemonSet)
@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old.
What Kubernetes version are you seeing issues with on AKS?
not an NVIDIA container image, e.g.:
This container image is the one NVIDIA recommends for testing GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs)
What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image.
@johnstcn I'm seeing the issue on:
- K8s Rev: v1.27.7
- Node image: AKSUbuntu-2204gen2containerd-202401.09.0
- Plugin image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:1.11
- Pod image: ghcr.io/coder/envbuilder:0.2.9
@BrunoQuaresma
Still having issues when using your test repo on AKS against ghcr.io/coder/envbuilder:0.2.9
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# cd /tmp
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ls
coder.wTqTN7 vectorAdd
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
/usr/lib64# ll
total 176176
drwxr-xr-x 2 root root 4096 Jun 13 14:00 ./
drwxr-xr-x 14 root root 4096 Jun 13 13:49 ../
lrwxrwxrwx 1 root root 42 Jun 13 13:49 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2*
-rwxr-xr-x 1 root root 28392536 Jun 13 12:01 libcuda.so.550.54.15*
-rwxr-xr-x 1 root root 10524136 Jun 13 12:01 libcudadebugger.so.550.54.15*
-rwxr-xr-x 1 root root 168744 Jun 13 12:01 libnvidia-allocator.so.550.54.15*
-rwxr-xr-x 1 root root 398968 Jun 13 12:01 libnvidia-cfg.so.550.54.15*
lrwxrwxrwx 1 root root 36 Jun 13 14:00 libnvidia-ml.so -> /usr/lib64/libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 2078360 Jun 13 12:01 libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 86842616 Jun 13 12:01 libnvidia-nvvm.so.550.54.15*
-rwxr-xr-x 1 root root 23293568 Jun 13 12:01 libnvidia-opencl.so.550.54.15*
-rwxr-xr-x 1 root root 10168 Jun 13 12:01 libnvidia-pkcs11.so.550.54.15*
-rwxr-xr-x 1 root root 28670368 Jun 13 12:01 libnvidia-ptxjitcompiler.so.550.54.15*
Oh, I fixed it by running:
echo "/usr/lib64" > /etc/ld.so.conf.d/customized.conf
ldconfig
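As a quick sanity check after applying that fix (a sketch; run inside the envbuilder-built container), the driver library should now be on the linker path and nvidia-smi should find it:

```bash
# Confirm ldconfig now resolves the driver libraries from /usr/lib64.
ldconfig -p | grep libnvidia-ml
nvidia-smi
```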
@marrotte does the @nikawang fix work for you?
@BrunoQuaresma I don't think I can test that as my envbuilder fails to build. I believe @nikawang either applied that fix to a envbuilder running container or the running container built by envbuilder. I did try applying @nikawang's fix to the AKS/K8s GPU node as if that might be where it was applied and that had no effect.
@marrotte could you please share step-by-step instructions for setting up a similar k8s cluster, or a Terraform file I can just run?
I tried to create a Kubernetes GPU cluster on Azure following this tutorial, but without success. During the process, I managed to get the cluster up and running and register the required features and services through step five of the tutorial.
bruno [ ~ ]$ az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
{
"id": "/subscriptions/05e8b285-4ce1-46a3-b4c9-f51ba67d6acc/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/GPUDedicatedVHDPreview",
"name": "Microsoft.ContainerService/GPUDedicatedVHDPreview",
"properties": {
"state": "Registered"
},
"type": "Microsoft.Features/providers/features"
}
However, when I began adding the node pool, I started encountering errors.
az aks nodepool add \
--resource-group bruno \
--cluster-name bruno-gpu \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC6s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
(OperationNotAllowed) .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
Code: OperationNotAllowed
Message: .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
I tried searching for the error on Google to find a solution or any information related to properties.nodeProvisioningProfile.mode, but I didn't find anything helpful. I realized that it might be better to ask if you could share a Terraform file or a more straightforward tutorial for us to reproduce your environment.
So ENVBUILDER_IGNORE_PATHS can be set to /dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run, but we hit the known unlinkat/device or resource busy error.
The easiest way to get the right environment for reproducing this is likely the gpu-operator for Kubernetes or the NVIDIA Container Toolkit for Docker.
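For a local reproduction with Docker and the NVIDIA Container Toolkit, something along these lines should exercise the same path (a sketch: the ignore list and test repository are the ones quoted above; the ENVBUILDER_-prefixed variable names assume a recent envbuilder release, while older releases used the unprefixed GIT_URL/INIT_SCRIPT form seen earlier in the thread):

```bash
# Sketch: run envbuilder under Docker with the GPU exposed via the NVIDIA Container Toolkit.
docker run -it --rm --gpus all \
  -e ENVBUILDER_GIT_URL=https://github.com/BrunoQuaresma/envbuilder-gpu-test \
  -e ENVBUILDER_IGNORE_PATHS=/dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run \
  -e ENVBUILDER_INIT_SCRIPT=/tmp/vectorAdd \
  ghcr.io/coder/envbuilder
```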
I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.
~Currently only read-only mounts are taken care of, but the NVIDIA container runtime mounts devtmpfs filesystems at /var/run/nvidia-container-devices/GPU-<uuid> (the actual mountpoint can be /run since often /var/run is a symlink to it), so the logic would need to be extended to cover those (I have successfully done that).~ No special handling needed, I had probably forgotten to add /var/run back to the ignored paths.
The runtime mounts libraries with symlinks:
libcuda.so -> libcuda.so.1
libcuda.so.1 -> libcuda.so.<driver-version>
libcuda.so.<driver-version>
libcudadebugger.so.1 -> libcudadebugger.so.<driver-version>
libcudadebugger.so.<driver-version>
libnvidia-allocator.so.1 -> libnvidia-allocator.so.<driver-version>
libnvidia-allocator.so.<driver-version>
libnvidia-cfg.so.1 -> libnvidia-cfg.so.<driver-version>
libnvidia-cfg.so.<driver-version>
libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
libnvidia-ml.so.<driver-version>
libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.<driver-version>
libnvidia-nvvm.so.<driver-version>
libnvidia-opencl.so.1 -> libnvidia-opencl.so.<driver-version>
libnvidia-opencl.so.<driver-version>
libnvidia-pkcs11-openssl3.so.<driver-version>
libnvidia-pkcs11.so.<driver-version>
libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.<driver-version>
libnvidia-ptxjitcompiler.so.<driver-version>
The symlinks must also be preserved. The location in the envbuilder image is /usr/lib64, but it differs between distros (for example in Debian, it should be /usr/lib/x86_64-linux-gnu), so the remount process must discover the appropriate location in the new filesystem hierarchy.
I am using this quick-and-dirty script afterward to get things working:
remount_and_resymlink.sh
#!/usr/bin/env bash
set -euo pipefail

# Target directory where the dynamic linker expects libraries in the new rootfs
# (the Debian/Ubuntu multiarch path; adjust for other distros).
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name,
# e.g. /lib/firmware/nvidia/550.54.15 -> 550.54.15.
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

# For every driver library the NVIDIA runtime bind-mounted under /usr/lib64,
# move the mount into ${TARGET} and recreate the expected symlinks.
mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"

  # Pick the SONAME suffix: most libraries use .so.1, libnvidia-nvvm uses .so.4,
  # and the pkcs11 libraries get no versioned symlink at all.
  case "${lib}" in
    libnvidia-pkcs11.so.*) n="" ;;
    libnvidia-pkcs11-openssl3.so.*) n="" ;;
    libnvidia-nvvm.so.*) n=4 ;;
    *) n=1 ;;
  esac

  if [[ -n "${n}" ]]; then
    # e.g. libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
    ln -s "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi
  if [[ "${lib}" == "libcuda.so."* ]]; then
    # e.g. libcuda.so -> libcuda.so.1
    ln -s "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done
This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188
And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
Once that is all in place, the nvidia-smi command should work and show the GPU(s) as well as the CUDA version.
In an image with pytorch (e.g. nvcr.io/nvidia/pytorch:24.05-py3), python -c 'import torch; print(torch.cuda.is_available())' should return True.
~One thing I have not figured out yet is why the container gets all GPUs when only 1 is requested (this works properly for a regular container)~ That's because the pod is running with privileges.
The manifest I am using at the moment:
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder
spec:
  containers:
    - name: envbuilder
      image: ghcr.io/coder/envbuilder-preview
      env:
        - name: FALLBACK_IMAGE
          value: debian
        - name: INIT_SCRIPT
          value: sh -c 'while :; do sleep 86400; done'
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think; I am not testing there). ~I need to investigate GCP's ContainerOS too since things are wired a little differently.~ On GCP's ContainerOS the only mount is /usr/local/nvidia, so this path can either be ignored or remounted. No care is given to the PATH or ldconfig search path by default; that has to be handled by the user's image (e.g. LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /usr/local/nvidia/bin/nvidia-smi should work).
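For completeness, a sketch of what the user's image could do on ContainerOS nodes (paths are the ones mentioned above; making the change permanent, e.g. via an ld.so.conf.d entry as in the earlier /usr/lib64 fix, is up to the image):

```bash
# Sketch (GKE ContainerOS): driver files live under /usr/local/nvidia, which is not
# on the default search paths, so point PATH and LD_LIBRARY_PATH at it.
export PATH="${PATH}:/usr/local/nvidia/bin"
export LD_LIBRARY_PATH="/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
nvidia-smi

# Alternatively, mirroring the earlier /usr/lib64 fix:
#   echo /usr/local/nvidia/lib64 > /etc/ld.so.conf.d/nvidia.conf && ldconfig
```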