eksctl
[Feature] Allow managed nodegroup creation with Ubuntu AMIs and GPU instances
What feature/behavior/change do you want?
Allow creation of managed nodegroups with Ubuntu AMIs when selecting a GPU instance.
Example cluster.yaml
# cluster.yaml
# A cluster with a managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-cluster
  region: us-west-2
managedNodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
Currently, eksctl emits the expected warning that Ubuntu2004 does not ship with NVIDIA GPU drivers installed, but the warning is followed by an error and cluster creation is aborted.
$ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml
2023-03-20 20:50:00 [!] Ubuntu2004 does not ship with NVIDIA GPU drivers installed, hence won't support running GPU-accelerated workloads out of the box
2023-03-20 20:50:00 [ℹ] eksctl version 0.134.0
2023-03-20 20:50:00 [ℹ] using region us-west-2
2023-03-20 20:50:00 [ℹ] skipping us-west-2d from selection because it doesn't support the following instance type(s): g4dn.xlarge
2023-03-20 20:50:00 [ℹ] setting availability zones to [us-west-2a us-west-2c us-west-2b]
2023-03-20 20:50:00 [✖] image family Ubuntu2004 doesn't support GPU image class
Why do you want this feature?
A managed Ubuntu 20.04 nodegroup with no GPU drivers installed works well with the NVIDIA GPU Operator, which is installed via Helm and includes the NVIDIA GPU device plugin as well as a GPU driver container. This would provide a quick and easy way to create a managed GPU nodegroup with up-to-date GPU drivers.
The GPU drivers included in the default Amazon Linux 2 AMI are typically out of date; for example, the GPU drivers in the current AMI release are version 470.161.03. Making it easier to use the GPU Operator on EKS would give users a straightforward way to create EKS clusters with the latest recommended drivers, which are currently version 525.85.12.
For example, this currently works with eksctl if you make the nodegroup unmanaged, provide an overrideBootstrapCommand section, and supply the correct Ubuntu EKS AMI ID from here: https://cloud-images.ubuntu.com/docs/aws/eks/.
# cluster.yaml
# A cluster with a self-managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-cluster
  region: us-west-2
nodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    # grab the AMI ID for the Ubuntu EKS AMI here: https://cloud-images.ubuntu.com/aws-eks/
    # using the AMI ID for the us-west-2 region: ami-06cd6fdaf5a24b728
    ami: ami-06cd6fdaf5a24b728
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"
$ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml
$ aws eks --region us-west-2 update-kubeconfig --name test-cluster
Install the NVIDIA GPU Operator via its Helm chart.
$ helm install --repo https://helm.ngc.nvidia.com/nvidia --wait --generate-name -n gpu-operator \
--create-namespace gpu-operator
NAME: gpu-operator-1670843572
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait until all pods are deployed (roughly 6-7 minutes). This adds the GPU drivers and the GPU device plugin.
$ watch -n 5 kubectl get pods -n gpu-operator
# Completed after about 7 minutes in testing
Every 5.0s: kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-sk42n 1/1 Running 0 6m34s
gpu-operator-1679256184-node-feature-discovery-master-5cfdc2bx9 1/1 Running 0 7m2s
gpu-operator-1679256184-node-feature-discovery-worker-n8k9v 1/1 Running 0 7m2s
gpu-operator-79f94979f9-trnlp 1/1 Running 0 7m2s
nvidia-container-toolkit-daemonset-zp8wb 1/1 Running 0 6m34s
nvidia-device-plugin-daemonset-djjqf 1/1 Running 0 6m34s
nvidia-driver-daemonset-nw7h7 1/1 Running 0 6m43s
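As an optional sanity check (not part of the original write-up), you can confirm that the device plugin has advertised the standard nvidia.com/gpu allocatable resource on the node before running the test pod below:
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
The GPU column should show 1 for the g4dn.xlarge node once the driver and device plugin daemonsets are ready.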
Filename: nvidia-smi.yaml
# nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: ubuntu:22.04
      args:
        - "nvidia-smi"
      resources:
        limits:
          nvidia.com/gpu: 1
$ kubectl apply -f nvidia-smi.yaml
$ kubectl logs nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 33C P8 8W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
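Once the log shows the expected driver version, the test pod can be removed (an extra cleanup step, not in the original write-up):
$ kubectl delete -f nvidia-smi.yaml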
It would be nice to streamline this, enable managed nodegroups, and save users from having to look up and hard-code an AMI ID.
Hello JamesMaki :wave: Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.
We need to investigate how to best support this request. Spike: 1 day
It would be good to add this feature.
Re-opened this issue as a bug since it works with CPU instances but not GPU instances.
https://github.com/weaveworks/eksctl/issues/6499#issuecomment-1539952882
Thank you @Himangini!
It would be good to add this feature, as one of our use cases requires using Ubuntu AMIs. Thank you.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please remove the stale label and keep this request open. Thank you!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I would like to keep this issue open please. Commenting so GitHub Actions will remove the stale label. Thank you.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hi, I wanted to point out that this already works today for CPU instances but is hard-coded not to allow creation for GPU instances. I think the relevant bit of code is here: https://github.com/eksctl-io/eksctl/blob/main/pkg/ami/auto_resolver.go#L81
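To make the shape of that restriction concrete, here is a hypothetical Go sketch of such a hard-coded family check; the identifiers (gpuCapableFamilies, resolveImageClass) are invented for illustration and are not eksctl's actual code:
package main

import (
    "fmt"
    "log"
)

// gpuCapableFamilies stands in for the hard-coded allow-list: only these AMI
// families are permitted to resolve a GPU image class. (Illustrative only.)
var gpuCapableFamilies = map[string]bool{
    "AmazonLinux2": true,
    "Bottlerocket": true,
    // "Ubuntu2004" is absent, so GPU instance types are rejected for it even
    // though the Ubuntu EKS AMI works once drivers are installed separately
    // (e.g. by the NVIDIA GPU Operator).
}

// resolveImageClass shows where an error like the one above would come from.
func resolveImageClass(family string, wantGPU bool) (string, error) {
    if !wantGPU {
        return "general", nil
    }
    if !gpuCapableFamilies[family] {
        return "", fmt.Errorf("image family %s doesn't support GPU image class", family)
    }
    return "gpu", nil
}

func main() {
    if _, err := resolveImageClass("Ubuntu2004", true); err != nil {
        log.Println(err) // prints: image family Ubuntu2004 doesn't support GPU image class
    }
}
Removing Ubuntu families from that kind of deny path (or skipping GPU AMI resolution when drivers are expected to come from the GPU Operator) is essentially what this request asks for.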
Is this feature request done?
This would be very nice indeed!