[Feature] Allow managed nodegroup creation with Ubuntu AMIs and GPU instances

Open JamesMaki opened this issue 2 years ago • 15 comments

What feature/behavior/change do you want?

Allow creation of managed nodegroups with Ubuntu AMIs when selecting a GPU instance.

Example cluster.yaml

# cluster.yaml
# A cluster with a managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

managedNodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1

Currently, eksctl prints the expected warning that Ubuntu2004 does not ship with NVIDIA GPU drivers installed, but the warning is followed by an error and cluster creation is terminated.

 $ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml 
2023-03-20 20:50:00 [!]  Ubuntu2004 does not ship with NVIDIA GPU drivers installed, hence won't support running GPU-accelerated workloads out of the box
2023-03-20 20:50:00 [ℹ]  eksctl version 0.134.0
2023-03-20 20:50:00 [ℹ]  using region us-west-2
2023-03-20 20:50:00 [ℹ]  skipping us-west-2d from selection because it doesn't support the following instance type(s): g4dn.xlarge
2023-03-20 20:50:00 [ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
2023-03-20 20:50:00 [✖]  image family Ubuntu2004 doesn't support GPU image class

Why do you want this feature?

A managed Ubuntu 20.04 nodegroup with no GPU drivers installed works well with the NVIDIA GPU Operator, which is installed via Helm and includes both the NVIDIA GPU device plugin and a GPU driver container. Supporting this would provide a quick and easy way to create a managed GPU nodegroup with up-to-date GPU drivers.

The GPU drivers included in the default Amazon Linux 2 AMI are typically out of date; for example, the drivers in the current AMI release are version 470.161.03, while the latest recommended drivers are version 525.85.12. Making it easier to use the GPU Operator on EKS would give users a straightforward way to create clusters with the recommended drivers.

For example, this already works with eksctl today if you make the nodegroup self-managed, provide an overrideBootstrapCommand, and supply the correct Ubuntu EKS AMI ID from https://cloud-images.ubuntu.com/docs/aws/eks/.

# cluster.yaml
# A cluster with a self-managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

nodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    # grab AMI ID for Ubuntu EKS AMI here: https://cloud-images.ubuntu.com/aws-eks/
    # using AMI ID for us-west-2 region: ami-06cd6fdaf5a24b728
    ami: ami-06cd6fdaf5a24b728
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"

$ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml
$ aws eks --region us-west-2 update-kubeconfig --name test-cluster
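
As an optional sanity check before installing the GPU Operator (not part of the original steps), confirm that the Ubuntu node has joined the cluster and is Ready:

$ kubectl get nodes -o wide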

Install the NVIDIA GPU Operator via its Helm chart.

$ helm install --repo https://helm.ngc.nvidia.com/nvidia --wait --generate-name -n gpu-operator \
      --create-namespace gpu-operator
NAME: gpu-operator-1670843572
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait until all pods are deployed (roughly 6-7 minutes). This installs the GPU drivers and the GPU device plugin.

$ watch -n 5 kubectl get pods -n gpu-operator

# Completed after about 7 minutes in testing
Every 5.0s: kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-sk42n                                       1/1     Running     0          6m34s
gpu-operator-1679256184-node-feature-discovery-master-5cfdc2bx9   1/1     Running     0          7m2s
gpu-operator-1679256184-node-feature-discovery-worker-n8k9v       1/1     Running     0          7m2s
gpu-operator-79f94979f9-trnlp                                     1/1     Running     0          7m2s
nvidia-container-toolkit-daemonset-zp8wb                          1/1     Running     0          6m34s
nvidia-device-plugin-daemonset-djjqf                              1/1     Running     0          6m34s
nvidia-driver-daemonset-nw7h7                                     1/1     Running     0          6m43s
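
As another optional check (not in the original write-up), confirm that the device plugin now advertises the GPU resource on the node; the exact line will vary with the instance type:

$ kubectl describe nodes | grep nvidia.com/gpu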

Create a test pod that runs nvidia-smi to verify GPU access. Filename: nvidia-smi.yaml

# nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: ubuntu:22.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

$ kubectl apply -f nvidia-smi.yaml
$ kubectl logs nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     8W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
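
When testing is finished, the example resources can be cleaned up (not part of the original steps; the cluster name and region match the example config above):

$ kubectl delete -f nvidia-smi.yaml
$ eksctl delete cluster --name test-cluster --region us-west-2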

It would be nice to streamline this by enabling managed nodegroups, so users don't have to look up and hard-code an AMI ID.

JamesMaki avatar Mar 20 '23 21:03 JamesMaki

Hello JamesMaki :wave: Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.

github-actions[bot] avatar Mar 20 '23 21:03 github-actions[bot]

We need to investigate how to best support this request. Spike: 1 day

Himangini avatar Mar 29 '23 12:03 Himangini

It would be good to add this feature.

angudadevops avatar Apr 05 '23 23:04 angudadevops

Re-opened this issue as a bug since it works with CPU instances but not GPU instances.

JamesMaki avatar Apr 06 '23 20:04 JamesMaki

https://github.com/weaveworks/eksctl/issues/6499#issuecomment-1539952882

Himangini avatar May 09 '23 10:05 Himangini

Thank you @Himangini!

JamesMaki avatar May 09 '23 15:05 JamesMaki

It would be good to add this feature, as one of our use cases requires using Ubuntu AMIs. Thank you.

tanmatth avatar May 12 '23 17:05 tanmatth

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 26 '23 02:06 github-actions[bot]

Please remove the stale label and keep this request open. Thank you!

JamesMaki avatar Jun 26 '23 02:06 JamesMaki

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Oct 28 '23 01:10 github-actions[bot]

I would like to keep this issue open please. Commenting so GitHub Actions will remove the stale label. Thank you.

JamesMaki avatar Oct 28 '23 01:10 JamesMaki

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Nov 28 '23 01:11 github-actions[bot]

Hi, I wanted to point out that this already works today for CPU instances but is hard-coded not to allow creation for GPU instances. I think the relevant bit of code is here: https://github.com/eksctl-io/eksctl/blob/main/pkg/ami/auto_resolver.go#L81
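
For example, the equivalent managed nodegroup with a non-GPU instance type such as m5.large is accepted today, along these lines (flags may differ slightly between eksctl versions):

$ eksctl create nodegroup --cluster test-cluster --region us-west-2 \
      --name ubuntu-cpu --managed --node-ami-family Ubuntu2004 \
      --node-type m5.large --nodes 1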

JamesMaki avatar Dec 12 '23 21:12 JamesMaki

Is this feature request done?

xyfleet avatar Jan 29 '24 21:01 xyfleet

This would be very nice indeed!

montanaflynn avatar Jun 17 '24 17:06 montanaflynn