[EKS] [request]: Provide Nvidia Driver installer on general AMI to replace GPU AMI
Tell us about your request What do you want us to build? An NVIDIA driver installer for the EKS-Optimized Linux AMI.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.
EKS maintains the Amazon EKS-Optimized Linux AMI and the Amazon EKS-Optimized AMI with GPU Support. The GPU AMI adds nvidia-docker and the NVIDIA driver on top of the Linux AMI.
Technically, we can install the driver and CUDA (if needed) separately on every Kubernetes node using a DaemonSet (see the sketch after this list). There's no need to build a separate AMI, for a few reasons:
- Reduce the number of AMIs
- Decouple NVIDIA driver upgrades from AMI releases
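For illustration, a minimal sketch of what the DaemonSet-based flow could look like from the operator's side; the manifest file names are hypothetical placeholders, not existing EKS artifacts:
# Install the driver via a driver-installer DaemonSet we would maintain (hypothetical manifest).
kubectl apply -f nvidia-driver-installer.yaml
# The NVIDIA device plugin then advertises nvidia.com/gpu to the scheduler.
kubectl apply -f nvidia-device-plugin.yaml
# Verify GPUs are advertised once the installer pods finish.
kubectl describe nodes | grep -i nvidia.com/gpu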
Are you currently working around this issue? How are you currently solving this problem?
Additional context NVIDIA has also released the GPU Operator: https://devblogs.nvidia.com/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/. It's even better, since the user doesn't need to install the NVIDIA driver at all. It's a very young project, and we may go this way eventually; more investigation is needed.
Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)
/assign @Jeffwan
Having gone through this with CentOS 7 for another project, you wouldn't even need to go the DaemonSet route. You could just install all the drivers outright and then, on startup, use lspci | grep -ci nvidia (or match on the PCI vendor code) to look for an NVIDIA card and insert "default-runtime": "nvidia" into the Docker daemon config if one is detected; otherwise leave it alone. There may be a better way, but looking around, it seemed Kubernetes didn't yet pass the runtime to the docker command (see the sketch below).
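A rough sketch of that startup check, assuming the nvidia-docker2 package has already registered the nvidia runtime in /etc/docker/daemon.json and that jq is available (both are assumptions, not part of my actual setup):
# If an NVIDIA card is present, make nvidia the default Docker runtime.
if [ "$(lspci | grep -ci nvidia)" -gt 0 ]; then
  sudo jq '. + {"default-runtime": "nvidia"}' /etc/docker/daemon.json > /tmp/daemon.json
  sudo mv /tmp/daemon.json /etc/docker/daemon.json
  sudo systemctl restart docker
fi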
GPU Operator development is very slow-moving, there is no official EKS support, and it currently crashes on install (as of a test install the other day using just the stock helm commands they list; a rough sketch of those commands follows).
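For reference, this is roughly what those stock commands look like, from memory of NVIDIA's docs at the time; treat the repo URL and flags as a sketch and check their current instructions:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name nvidia/gpu-operator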
FWIW, here's my install that worked, at least as of ~30 days ago, on a stock CentOS 7:
sudo yum -y update
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
# Add docker repo
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# Add nvidia-docker repo.
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Add kubernetes repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# NVIDIA drivers - this is just a local disk repo - we still get prompted about key trust. The RPMs are here:
# /var/nvidia-driver-local-repo-440.64.00
sudo yum install -y http://us.download.nvidia.com/tesla/440.64.00/nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
# We want to make sure we have the most current kernel in place before we attempt to install drivers so the correct ones
# compile.
sudo reboot
# Mass destruction. akmods is used to automatically compile kernel mods, as dkms wasn't working before.
sudo yum install -y docker-ce docker-ce-cli containerd.io kubectl akmods
sudo yum install -y nvidia-drivers nvidia-docker2
sudo systemctl enable docker
sudo reboot
# Test that it works
nvidia-smi
# If nvidia-smi passed, then we roll onto the docker test to make sure that works.
sudo docker run --gpus all nvidia/cuda:latest nvidia-smi
GKE supports the pattern of installing a DaemonSet to enable GPU support, which makes it fairly easy to set up. Ideally we would also have the option of easily changing the NVIDIA driver version to support different CUDA versions.
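For comparison, the GKE flow is a single kubectl apply of their driver-installer DaemonSet (URL as documented by GKE around that time; verify against their current docs). An EKS equivalent would ideally expose the driver version as a parameter:
# GKE's documented driver installer for COS nodes (example; check GKE docs for the current manifest).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml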
@josegonzalez Thanks for the feedback. If users run containers on EKS GPU nodes, then we would suggest they install CUDA inside the container; the AMI should stay clean and come only with the latest driver.
@Jeffwan Forgive me if this is incorrect, but wouldn't that disallow the ability to reserve cluster resources? If the driver is not installed, then I don't understand how we could otherwise do resource reservations to ensure a GPU core is not being utilized by multiple container processes, and thus avoid memory corruption.
@josegonzalez I think the best practice here is to apply a taint to GPU nodes so that only containers with a matching toleration can be scheduled on those nodes. EKS GPU nodes should come with the NVIDIA driver, but not CUDA. The driver is enough to advertise GPU resources to the API server via the device plugin (sketch below).
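A quick sketch of that setup from the cluster side; the taint key and value here are illustrative choices, not an EKS default:
# Keep non-GPU workloads off the GPU nodes.
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
# Only pods with a matching toleration (and a nvidia.com/gpu resource request) will be scheduled there.
# Confirm the device plugin is advertising the GPU resource:
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu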
I understand that. What I'm asking for is that - whenever DaemonSets are implemented as a way to install NVIDIA drivers - the driver version be something that we can easily modify. The driver version is linked to the version of CUDA that containers can use, so if we can't easily switch to a newer NVIDIA driver than the default, then we can't easily use a newer CUDA version.
And apologies, I misread your comment and thought you meant for us to install the driver within the docker container :facepalm:.
@josegonzalez The driver has some backward compatibility with CUDA, but not across the board. I see your point: you want to decouple the AMI and driver version to support more CUDA versions. This makes sense; however, we don't see many requests for this, and most users migrate to a higher CUDA toolkit instead, which makes support easier. Feel free to give more concrete examples and let's see if other users have a similar problem. Then we can consider this in future versions.
The point of my comment was to add an additional request - variable nvidia driver version - to the work to provide a general nvidia driver installer in the same vein as GKE. I guess if/when that work is completed, I'll come back and comment to add the request.
In terms of a concrete problem, the company I work at ships a PaaS that enables GPU workloads. Some of our customers would want to use specific CUDA versions Because Of Reasons™ (I don't ask, its enterprise sales, customers come in all sizes lol), and their orgs may/may not validate particular Nvidia versions during a release cycle, so being able to support them and say "hey when you are testing your upgrade, run this command in this way to get the new/old version as supported by your org" would be ideal.
To clarify the CUDA version comments (cross-posting from https://github.com/aws/containers-roadmap/issues/955#issuecomment-2350422060):
libcuda.so (see figure 1 from here) is installed on the EKS-optimized GPU AMI as part of the NVIDIA driver installation. The version of CUDA that users are typically interested in is the one inside their container image, which is used by their application. Some application frameworks, like PyTorch, will provide the CUDA libraries they require either when installing via pip or when using the PyTorch-supplied container images (ref 1, ref 2).
So, in summary: the version of CUDA that you wish to use is up to you and your application, provided it falls within the CUDA compatibility range of the driver version supplied on the host/AMI (see the sketch below).
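A quick way to check both sides of that compatibility question; the cu118 PyTorch build below is just an example, pick whichever CUDA build your host driver supports:
# On the node: the driver version supplied by the AMI.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Inside the application image: install a framework build with the CUDA runtime you want, then confirm it.
pip install torch --index-url https://download.pytorch.org/whl/cu118
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"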