aws-eks: neuron device plugin manifest better reference

Open freschri opened this issue 1 year ago • 3 comments

Describe the bug

The Neuron device plugin addon in the CDK uses a custom manifest (see https://github.com/aws/aws-cdk/blob/f3d74bb78189ec6b76cfa85c97d993c1b26c1cac/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L1979) that does NOT point to the official Neuron image (public.ecr.aws/neuron/neuron-device-plugin) and does not include RBAC. As a result, the plugin pods go into CrashLoopBackOff and metrics are not exposed.

Expected Behavior

The addon uses the official Neuron device plugin manifest together with its RBAC manifest.

Current Behavior

The device plugin pods go into CrashLoopBackOff when deployed on inf2.xlarge.

Reproduction Steps

Deploy an EKS cluster with inf2 instances (for example inf2.xlarge).

Possible Solution

The Neuron device plugin addon in the CDK uses a custom manifest (see https://github.com/aws/aws-cdk/blob/f3d74bb78189ec6b76cfa85c97d993c1b26c1cac/packages/aws-cdk-lib/aws-eks/lib/cluster.ts#L1979), while a better reference already exists from the Neuron project. It is described here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html

The YAML to use is https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml. RBAC, which is missing from the current implementation, also needs to be applied: https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml
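
As a rough illustration of the two gaps described above (image source and missing RBAC), here is a small TypeScript sketch that inspects already-parsed manifest documents. It is not part of the CDK: `ManifestDoc`, `checkNeuronManifest`, and the list of expected RBAC kinds are hypothetical names and assumptions made for this example, and the YAML is assumed to have been parsed into plain objects elsewhere.

```typescript
// Minimal shape of a Kubernetes manifest document, limited to the fields inspected here.
interface ManifestDoc {
  kind?: string;
  spec?: {
    template?: {
      spec?: {
        serviceAccountName?: string;
        containers?: { name?: string; image?: string }[];
      };
    };
  };
}

// Image repository named in this issue as the official one.
const OFFICIAL_IMAGE_REPO = 'public.ecr.aws/neuron/neuron-device-plugin';

// Assumption: these are the kinds an RBAC manifest for the plugin is expected to provide.
const EXPECTED_RBAC_KINDS = ['ServiceAccount', 'ClusterRole', 'ClusterRoleBinding'];

// Returns a list of human-readable problems; an empty list means no problem was detected.
function checkNeuronManifest(docs: ManifestDoc[]): string[] {
  const problems: string[] = [];

  const daemonSets = docs.filter((d) => d.kind === 'DaemonSet');
  if (daemonSets.length === 0) {
    problems.push('no DaemonSet found');
  }

  for (const ds of daemonSets) {
    const podSpec = ds.spec?.template?.spec;

    // Every container image should come from the official repository (tag or digest allowed).
    for (const c of podSpec?.containers ?? []) {
      const image = c.image ?? '';
      const official =
        image === OFFICIAL_IMAGE_REPO ||
        image.startsWith(OFFICIAL_IMAGE_REPO + ':') ||
        image.startsWith(OFFICIAL_IMAGE_REPO + '@');
      if (!official) {
        problems.push(`container ${c.name ?? '<unnamed>'} uses non-official image: ${image}`);
      }
    }

    // Without serviceAccountName the pod runs as the namespace's default service account.
    if (!podSpec?.serviceAccountName) {
      problems.push('DaemonSet pod spec sets no serviceAccountName');
    }
  }

  // Check that the RBAC objects are present alongside the DaemonSet.
  const kinds = new Set(docs.map((d) => d.kind));
  for (const kind of EXPECTED_RBAC_KINDS) {
    if (!kinds.has(kind)) {
      problems.push(`missing ${kind}`);
    }
  }

  return problems;
}
```

A check like this could run in a unit test against whatever manifest the addon bundles, so that a drift from the upstream files is caught before deployment.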

Additional Information/Context

No response

CDK CLI Version

2.130.0

Framework Version

No response

Node.js Version

v20.4.0

OS

sonoma 14.3

Language

TypeScript

Language Version

No response

Other information

No response

freschri avatar Feb 26 '24 16:02 freschri

Thank you for the report. I guess we probably need to update this file. https://github.com/aws/aws-cdk/blob/f3d74bb78189ec6b76cfa85c97d993c1b26c1cac/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml

Would you be interested in submitting a PR for that?

pahud avatar Feb 27 '24 00:02 pahud

It is reproducible. I'm working on it.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

export class CdkIssueStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'VPC', {
      maxAzs: 3
    });

    const cluster = new eks.Cluster(this, 'EKSCluster', {
      vpc,
      version: eks.KubernetesVersion.V1_29,
      defaultCapacity: 0,
      mastersRole: iam.Role.fromRoleArn(this, 'Admin', "xxx", {
        mutable: false,
      })
    });

    cluster.addNodegroupCapacity('Inf2NodeGroup', {
      instanceTypes: [new ec2.InstanceType('inf2.xlarge')],
      minSize: 2,
    });
  }
}
$ kubectl describe daemonset neuron-device-plugin-daemonset -n kube-system
Name:           neuron-device-plugin-daemonset
Selector:       name=neuron-device-plugin-ds
Node-Selector:  <none>
Labels:         aws.cdk.eks/prune-xxx
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=neuron-device-plugin-ds
  Annotations:  scheduler.alpha.kubernetes.io/critical-pod: 
  Containers:
   k8s-neuron-device-plugin-ctr:
    Image:        790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  37m   daemonset-controller  Created pod: neuron-device-plugin-daemonset-f578d
  Normal  SuccessfulCreate  37m   daemonset-controller  Created pod: neuron-device-plugin-daemonset-d4ksr
$ kubectl get pods -n kube-system
NAME                                   READY   STATUS             RESTARTS         AGE
aws-node-ghjqh                         2/2     Running            0                41m
aws-node-vjq99                         2/2     Running            0                42m
coredns-68bd859788-flbr4               1/1     Running            0                45m
coredns-68bd859788-wxtfv               1/1     Running            0                45m
kube-proxy-54klc                       1/1     Running            0                41m
kube-proxy-kx9rm                       1/1     Running            0                42m
neuron-device-plugin-daemonset-d4ksr   0/1     CrashLoopBackOff   12 (2m37s ago)   39m
neuron-device-plugin-daemonset-f578d   0/1     CrashLoopBackOff   12 (2m22s ago)   39m
$ kubectl describe pod neuron-device-plugin-daemonset-d4ksr -n kube-system
Name:                 neuron-device-plugin-daemonset-d4ksr
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 ip-10-0-240-116.eu-west-1.compute.internal/10.0.240.116
Start Time:           Fri, 29 Mar 2024 08:55:24 +0000
Labels:               controller-revision-hash=67496f5558
                      name=neuron-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.0.201.70
IPs:
  IP:           10.0.201.70
Controlled By:  DaemonSet/neuron-device-plugin-daemonset
Containers:
  k8s-neuron-device-plugin-ctr:
    Container ID:   containerd://6e5f8d1ebdc2591edd37ccfe20c79169dc1564d2e163e0d704cbef02d957dda9
    Image:          790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
    Image ID:       790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:6a0df1d6446c96b752f7abbdc9478873e2f3da05989dcaf17667076db8339728
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 29 Mar 2024 09:31:51 +0000
      Finished:     Fri, 29 Mar 2024 09:31:51 +0000
    Ready:          False
    Restart Count:  12
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-65qsg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-65qsg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             aws.amazon.com/neuron:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  39m                    default-scheduler  Successfully assigned kube-system/neuron-device-plugin-daemonset-d4ksr to ip-10-0-240-116.eu-west-1.compute.internal
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 8.068s (8.068s including waiting)
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 683ms (683ms including waiting)
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 672ms (672ms including waiting)
  Normal   Started    38m (x4 over 39m)      kubelet            Started container k8s-neuron-device-plugin-ctr
  Normal   Pulled     38m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 680ms (680ms including waiting)
  Normal   Pulling    37m (x5 over 39m)      kubelet            Pulling image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0"
  Normal   Created    37m (x5 over 39m)      kubelet            Created container k8s-neuron-device-plugin-ctr
  Normal   Pulled     37m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 678ms (678ms including waiting)
  Warning  BackOff    4m19s (x163 over 39m)  kubelet            Back-off restarting failed container k8s-neuron-device-plugin-ctr in pod neuron-device-plugin-daemonset-d4ksr_kube-system(5b998b0a-c411-4aa0-916a-4b08433213f6)
$ kubectl logs neuron-device-plugin-daemonset-d4ksr -n kube-system
neuron-device-plugin: 2024/03/29 09:31:51 Fetching devices.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get IB device: open /run/infa-map.json: no such file or directory
neuron-device-plugin: 2024/03/29 09:31:51 No devices found.
neuron-device-plugin: 2024/03/29 09:31:51 Device list: []
neuron-device-plugin: 2024/03/29 09:31:51 Starting FS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Starting OS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get devices: open /run/infa-map.json: no such file or directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x85bb96]

goroutine 1 [running]:
main.(*DevicePlugin).cleanup(0x0, 0x1, 0x1)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:203 +0x26
main.(*DevicePlugin).Start(0x0, 0xc000120048, 0x10)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:75 +0x2f
main.(*DevicePlugin).Serve(0x0, 0x9700e4, 0x15, 0xc0000665a0, 0x0)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:229 +0x35
main.main()
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/main.go:64 +0x3a8

wafuwafu13 avatar Mar 29 '24 09:03 wafuwafu13

@wafuwafu13 @pahud please note my suggestion in "Possible Solution": the YAML to use is https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml. RBAC, which is missing from the current implementation, also needs to be applied: https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml

freschri avatar Apr 04 '24 08:04 freschri