
Node doesn't expose GPU resource on g4dn.[n]xlarge

andrescaroc opened this issue 7 months ago · 9 comments

Image I'm using:

System Info:

  • Kernel Version: 5.15.160
  • OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
  • Operating System: linux
  • Architecture: amd64
  • Container Runtime Version: containerd://1.6.31+bottlerocket
  • Kubelet Version: v1.26.14-eks-b063426
  • Kube-Proxy Version: v1.26.14-eks-b063426

What I expected to happen: Every time I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) node (ami-09469fd78070eaac6) on a g4dn.[n]xlarge instance type in EKS, it should expose the GPU count to pods:

Capacity:
  ...
  nvidia.com/gpu:              1
  ...
Allocatable:
  ...
  nvidia.com/gpu:              1
  ...

What actually happened: In roughly 5% of cases, a node started from the same AMI on a g4dn.[n]xlarge instance type did not expose the GPU count to pods, so pods requesting nvidia.com/gpu: 1 could not be scheduled and stayed in Pending, waiting for a node:

Capacity:
  cpu:                8
  ephemeral-storage:  61904460Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32366612Ki
  pods:               29
Allocatable:
  cpu:                7910m
  ephemeral-storage:  55977408418
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31676436Ki
  pods:               29
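A quick way to spot affected nodes is to list GPU-provisioned nodes alongside their advertised nvidia.com/gpu capacity; nodes hit by this bug show an empty count. A minimal sketch (the karpenter.k8s.aws/instance-gpu-count label selector assumes Karpenter-provisioned nodes, and the live query of course requires kubectl access; the filter itself is demonstrated against sample output):

```shell
# missing_gpu: reads "node<TAB>gpu-count" lines and prints nodes whose
# advertised nvidia.com/gpu capacity is empty (i.e. not exposed).
missing_gpu() {
  awk -F'\t' '$2 == "" { print $1 }'
}

# Live query (requires kubectl; the label is set by Karpenter on GPU nodes):
#   kubectl get nodes -l karpenter.k8s.aws/instance-gpu-count=1 \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}' \
#     | missing_gpu

# Illustration against sample output:
printf 'ip-a\t1\nip-b\t\n' | missing_gpu   # prints: ip-b
```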

How to reproduce the problem: Note: this issue has existed for more than a year; see the Slack thread here

Current settings:

  • EKS K8s v1.26
  • Karpenter Autoscaler v0.33.5
  • Karpenter Nodepool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: random-name
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  limits:
    cpu: 1000
  template:
    metadata:
      labels:
        company.ai/node: random-name
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: random-name
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - g4dn.xlarge
        - g4dn.2xlarge
        - g4dn.4xlarge
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-central-1a
        - eu-central-1b
        - eu-central-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      taints:
      - effect: NoSchedule
        key: nvidia.com/gpu
status:
  resources:
    cpu: "8"
    ephemeral-storage: 61904460Ki
    memory: 32366612Ki
    nvidia.com/gpu: "1"
    pods: "29"
    vpc.amazonaws.com/pod-eni: "39"

  • Karpenter EC2NodeClass:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: random-name
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      volumeSize: 4Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      iops: 3000
      snapshotID: snap-d4758cc7f5f11
      throughput: 500
      volumeSize: 60Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  subnetSelectorTerms:
  - tags:
      Name: '*Private*'
      karpenter.sh/discovery: prod
  tags:
    nodepool: random-name
    purpose: prod
    vendor: random-name
status:
  amis:
  - id: ami-0a3eb13c0c420309b
    name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-0a3eb13c0c420309b
    name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-0e68f27f62340664d
    name: bottlerocket-aws-k8s-1.26-aarch64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  - id: ami-09469fd78070eaac6
    name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: Exists
  - id: ami-09469fd78070eaac6
    name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: Exists
  - id: ami-096d4acd33c9e9449
    name: bottlerocket-aws-k8s-1.26-x86_64-v1.20.3-5d9ac849
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  instanceProfile: prod_29098703412147
  securityGroups:
  - id: sg-71ed7b6a7c7
    name: eks-cluster-sg-prod-684080
  subnets:
  - id: subnet-aa204f6e57f07
    zone: eu-central-1a
  - id: subnet-579874f746c1b
    zone: eu-central-1c
  - id: subnet-07ce8a8349377
    zone: eu-central-1b
  • Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-gpu
spec:
  template:
    spec:
      containers:
      - ...
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
      nodeSelector:
        company.ai/node: random-name
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
	  ...
  • Karpenter managed process:

    • The Deployment is deployed in the cluster
    • There are no nodes that can satisfy the pod's resource requests, node labels, and tolerations
    • Karpenter detects that it can satisfy them with the NodePool and EC2NodeClass described above
    • Karpenter creates a new node in the pool karpenter.sh/nodepool=random-name
    • The node starts healthy, but the scheduler cannot place the pod on it because the node is not exposing the GPU:
    '0/16 nodes are available: 1 Insufficient nvidia.com/gpu, 10 node(s)
        didn''t match Pod''s node affinity/selector, 5 node(s) had untolerated taint
        {deepc-cpu: }. preemption: 0/16 nodes are available: 1 No preemption victims
        found for incoming pod, 15 Preemption is not helpful for scheduling..'
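Failures like the one above surface as FailedScheduling events on the pending pods. A small sketch for pulling out just the GPU-capacity failures (the field selector is standard kubectl event filtering; the filter itself is shown against sample text):

```shell
# gpu_failures: keeps only scheduling failures caused by missing GPU capacity.
gpu_failures() {
  grep 'Insufficient nvidia.com/gpu'
}

# Live usage (requires kubectl):
#   kubectl get events -A --field-selector reason=FailedScheduling | gpu_failures

# Illustration with the message from this report:
printf '0/16 nodes are available: 1 Insufficient nvidia.com/gpu\nsome other event\n' | gpu_failures
```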
    
  • Node created:

kubectl describe no ip-192-168-164-242.eu-central-1.compute.internal 
Name:               ip-192-168-164-242.eu-central-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.2xlarge
                    beta.kubernetes.io/os=linux
                    company.ai/node=random-name
                    failure-domain.beta.kubernetes.io/region=eu-central-1
                    failure-domain.beta.kubernetes.io/zone=eu-central-1b
                    k8s.io/cloud-provider-aws=e7b76f679c563363cec5c6d5c3
                    karpenter.k8s.aws/instance-category=g
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=true
                    karpenter.k8s.aws/instance-family=g4dn
                    karpenter.k8s.aws/instance-generation=4
                    karpenter.k8s.aws/instance-gpu-count=1
                    karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
                    karpenter.k8s.aws/instance-gpu-memory=16384
                    karpenter.k8s.aws/instance-gpu-name=t4
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=225
                    karpenter.k8s.aws/instance-memory=32768
                    karpenter.k8s.aws/instance-network-bandwidth=10000
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/nodepool=random-name
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-164-242.eu-central-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=g4dn.2xlarge
                    topology.ebs.csi.aws.com/zone=eu-central-1b
                    topology.kubernetes.io/region=eu-central-1
                    topology.kubernetes.io/zone=eu-central-1b
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.164.242
                    csi.volume.kubernetes.io/nodeid:
                      {"csi.tigera.io":"ip-192-168-164-242.eu-central-1.compute.internal","ebs.csi.aws.com":"i-0fd5f7a7969d63c9d"}
                    karpenter.k8s.aws/ec2nodeclass-hash: 15616957348189460630
                    karpenter.k8s.aws/ec2nodeclass-hash-version: v1
                    karpenter.sh/nodepool-hash: 14407783392627717656
                    karpenter.sh/nodepool-hash-version: v1
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 08 Jul 2024 08:57:11 -0500
Taints:             nvidia.com/gpu:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-164-242.eu-central-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 12 Jul 2024 13:43:42 -0500
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 12 Jul 2024 13:39:08 -0500   Mon, 08 Jul 2024 08:57:11 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 12 Jul 2024 13:39:08 -0500   Mon, 08 Jul 2024 08:57:11 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 12 Jul 2024 13:39:08 -0500   Mon, 08 Jul 2024 08:57:11 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 12 Jul 2024 13:39:08 -0500   Mon, 08 Jul 2024 08:57:19 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.164.242
  InternalDNS:  ip-192-168-164-242.eu-central-1.compute.internal
  Hostname:     ip-192-168-164-242.eu-central-1.compute.internal
Capacity:
  cpu:                8
  ephemeral-storage:  61904460Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32366612Ki
  pods:               29
Allocatable:
  cpu:                7910m
  ephemeral-storage:  55977408418
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31676436Ki
  pods:               29
System Info:
  Machine ID:                 ec2a5800025dbe346ecc517c4de3
  System UUID:                ec2a58-0002-5dbe-346e-cc517c4de3
  Boot ID:                    c25bd0fd-67f0-4681-946a-64f8c5c57878
  Kernel Version:             5.15.160
  OS Image:                   Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.31+bottlerocket
  Kubelet Version:            v1.26.14-eks-b063426
  Kube-Proxy Version:         v1.26.14-eks-b063426
ProviderID:                   aws:///eu-central-1b/i-d5f7a7969d63c
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                         ------------  ----------  ---------------  -------------  ---
  calico-system               calico-node-rlbvk                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d4h
  calico-system               csi-node-driver-x5jn7                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d4h
  kube-system                 aws-node-txk8m                               50m (0%)      0 (0%)      0 (0%)           0 (0%)         4d4h
  kube-system                 ebs-csi-node-b8ln5                           30m (0%)      0 (0%)      120Mi (0%)       768Mi (2%)     4d4h
  kube-system                 kube-proxy-zkn9m                             100m (1%)     0 (0%)      0 (0%)           0 (0%)         4d4h
  loki                        loki-promtail-6644l                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d4h
  monitoring                  monitoring-prometheus-node-exporter-fx749    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d4h
  tigera-operator             tigera-operator-7b594b484b-rkn5g             0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d21h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                180m (2%)   0 (0%)
  memory             120Mi (0%)  768Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>
  • As you can see, the node that was created fulfills the node-label and toleration requirements, but not the resource (GPU) requirement

  • Inspecting the node:

Using Session Manager -> admin container -> sheltie:

bash-5.1# lsmod | grep nvidia
nvidia_uvm           1454080  0
nvidia_modeset       1265664  0
nvidia              56004608  2 nvidia_uvm,nvidia_modeset
drm                   626688  1 nvidia
backlight              24576  2 drm,nvidia_modeset
i2c_core              102400  2 nvidia,drm

bash-5.1# systemctl list-unit-files | grep nvidia
nvidia-fabricmanager.service                                                          enabled         enabled
nvidia-k8s-device-plugin.service                                                      enabled         enabled

bash-5.1# journalctl -b -u nvidia-k8s-device-plugin
-- No entries --
bash-5.1# journalctl -b -u nvidia-fabricmanager.service
-- No entries --

bash-5.1# journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY
  0 c25bd0fd67f04681946a64f8c5c57878 Tue 2024-07-09 22:06:03 UTC Fri 2024-07-12 22:02:13 UTC
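The empty journal for nvidia-k8s-device-plugin is suspicious: on a healthy node the plugin should start and register a socket with the kubelet. One more thing worth checking from sheltie is the kubelet's device-plugin directory. A sketch, with the presence check factored out so it can be demonstrated against sample output (the /var/lib/kubelet/device-plugins path is the upstream Kubernetes default and is an assumption here; Bottlerocket may place it elsewhere):

```shell
# plugin_present: reads a directory listing and succeeds if an nvidia socket
# is present, i.e. the device plugin registered with the kubelet.
plugin_present() {
  grep -q 'nvidia'
}

# On the node (via sheltie; path is the upstream k8s default, unverified
# for Bottlerocket):
#   ls /var/lib/kubelet/device-plugins/ | plugin_present \
#     && echo "device plugin registered" || echo "NOT registered"

# Illustration:
printf 'kubelet.sock\nnvidia-gpu.sock\n' | plugin_present && echo registered
```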

From the Slack thread, someone suggested this:

Grasping at straws, but I wonder if this is some sort of initialization race condition where the kubelet service starts before the NVIDIA device is ready.
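If it is a startup-ordering race, the unit start times should show it. A sketch for comparing when the involved units actually started in the current boot (unit names come from the systemctl listing above; ExecMainStartTimestamp is a standard systemd property; this probes the hypothesis, it is not a confirmed diagnosis):

```shell
# started_before A B: succeeds if epoch-seconds A is strictly earlier than B.
started_before() {
  [ "$1" -lt "$2" ]
}

# On the node (via sheltie), using the unit names listed above:
#   k=$(date -d "$(systemctl show -p ExecMainStartTimestamp --value kubelet.service)" +%s)
#   d=$(date -d "$(systemctl show -p ExecMainStartTimestamp --value nvidia-k8s-device-plugin.service)" +%s)
#   started_before "$k" "$d" && echo "kubelet started before the device plugin"

# Illustration:
started_before 100 200 && echo "first arg earlier"
```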

andrescaroc avatar Jul 12 '24 22:07 andrescaroc