
Karpenter fails to launch a spot replacement for on-demand nodes because of NodeClaim validation

Open · myaser opened this issue 9 months ago · 8 comments

Description

Observed Behavior: For a NodePool with mixed capacity types (on-demand and spot), Karpenter tries to decommission an on-demand instance and replace it with a spot instance. It then fails to do so because the generated NodeClaim contains a requirement whose label key is in the restricted domain "karpenter.sh".

See the controller logs:

karpenter-779ff45f5c-nmn5w controller {"level":"INFO","time":"2024-04-25T10:22:27.988Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 nodes (25 pods) ip-10-149-88-228.eu-central-1.compute.internal/m5.xlarge/on-demand and replacing with spot node from types m5.xlarge","commit":"6b868db-dirty","command-id":"e31d43f1-3b17-4be9-acdb-658ba38f5b95"}

karpenter-779ff45f5c-nmn5w controller {"level":"ERROR","time":"2024-04-25T10:22:28.058Z","logger":"controller.disruption","message":"disrupting via \"consolidation\", disrupting candidates, launching replacement nodeclaim (command-id: e31d43f1-3b17-4be9-acdb-658ba38f5b95), creating node claim, NodeClaim.karpenter.sh \"karpenter-default-wx8rz\" is invalid: spec.requirements[9].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db-dirty"}

Expected Behavior: Creating spot replacements for on-demand nodes should not be blocked.

Reproduction Steps (Please include YAML): node pool

apiVersion: v1
items:
- apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    annotations:
      karpenter.sh/nodepool-hash: "3243005398540344161"
      karpenter.sh/nodepool-hash-version: v2
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1beta1","kind":"NodePool","metadata":{"annotations":{},"name":"karpenter-default"},"spec":{"disruption":{"consolidationPolicy":"WhenUnderutilized","expireAfter":"Never"},"template":{"metadata":{"labels":{"cluster-lifecycle-controller.zalan.do/replacement-strategy":"none","lifecycle-status":"ready","node.kubernetes.io/node-pool":"karpenter-default","node.kubernetes.io/profile":"worker-karpenter","node.kubernetes.io/role":"worker"}},"spec":{"kubelet":{"clusterDNS":["10.0.1.100"],"cpuCFSQuota":false,"kubeReserved":{"cpu":"100m","memory":"282Mi"},"maxPods":32,"systemReserved":{"cpu":"100m","memory":"164Mi"}},"nodeClassRef":{"name":"karpenter-default"},"requirements":[{"key":"node.kubernetes.io/instance-type","operator":"In","values":["m5.8xlarge","m5.xlarge"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot","on-demand"]},{"key":"kubernetes.io/arch","operator":"In","values":["arm64","amd64"]},{"key":"topology.kubernetes.io/zone","operator":"In","values":["eu-central-1a","eu-central-1b","eu-central-1c"]}],"startupTaints":[{"effect":"NoSchedule","key":"zalando.org/node-not-ready"}]}},"weight":1}}
    creationTimestamp: "2024-04-25T09:09:18Z"
    generation: 1
    name: karpenter-default
    resourceVersion: "1942211133"
    uid: 0d6de200-cac7-4ea3-a12d-a254b60b29f9
  spec:
    disruption:
      budgets:
      - nodes: 10%
      consolidationPolicy: WhenUnderutilized
      expireAfter: Never
    template:
      metadata:
        labels:
          cluster-lifecycle-controller.zalan.do/replacement-strategy: none
          lifecycle-status: ready
          node.kubernetes.io/node-pool: karpenter-default
          node.kubernetes.io/profile: worker-karpenter
          node.kubernetes.io/role: worker
      spec:
        kubelet:
          clusterDNS:
          - 10.0.1.100
          cpuCFSQuota: false
          kubeReserved:
            cpu: 100m
            memory: 282Mi
          maxPods: 32
          systemReserved:
            cpu: 100m
            memory: 164Mi
        nodeClassRef:
          name: karpenter-default
        requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - m5.8xlarge
          - m5.xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - spot
          - on-demand
        - key: kubernetes.io/arch
          operator: In
          values:
          - arm64
          - amd64
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - eu-central-1a
          - eu-central-1b
          - eu-central-1c
        startupTaints:
        - effect: NoSchedule
          key: zalando.org/node-not-ready
    weight: 1
  status:
    resources:
      cpu: "8"
      ephemeral-storage: 202861920Ki
      memory: 32315584Ki
      pods: "220"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Versions:

  • karpenter Version: v0.36.0
  • Kubernetes Version (kubectl version): Server Version: v1.28.8

myaser commented Apr 25 '24 11:04

If the ip-10-149-88-228.eu-central-1.compute.internal node is around, can you supply the node object and node claim YAML?

tzneal commented Apr 29 '24 13:04

I found another instance (on a different cluster) where it failed to replace a spot node with another spot node. I captured the node object and NodeClaim YAMLs below.

node object

apiVersion: v1
kind: Node
metadata:
  annotations:
    alpha.kubernetes.io/provided-node-ip: 172.31.5.136
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-059805a98b7e75171"}'
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"d6:b1:3a:ae:4a:bd"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: "true"
    flannel.alpha.coreos.com/public-ip: 172.31.5.136
    karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: {}
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2024-05-02T12:39:31Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    aws.amazon.com/spot: "true"
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.large
    beta.kubernetes.io/os: linux
    cluster-lifecycle-controller.zalan.do/replacement-strategy: none
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1a
    karpenter.k8s.aws/instance-category: m
    karpenter.k8s.aws/instance-cpu: "2"
    karpenter.k8s.aws/instance-cpu-manufacturer: intel
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: m5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "8192"
    karpenter.k8s.aws/instance-network-bandwidth: "750"
    karpenter.k8s.aws/instance-size: large
    karpenter.sh/capacity-type: spot
    karpenter.sh/initialized: "true"
    karpenter.sh/nodepool: default-karpenter
    karpenter.sh/registered: "true"
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-172-31-5-136.eu-central-1.compute.internal
    kubernetes.io/os: linux
    kubernetes.io/role: worker
    lifecycle-status: ready
    node.kubernetes.io/distro: ubuntu
    node.kubernetes.io/instance-type: m5.large
    node.kubernetes.io/node-pool: default-karpenter
    node.kubernetes.io/profile: worker-karpenter
    node.kubernetes.io/role: worker
    topology.ebs.csi.aws.com/zone: eu-central-1a
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: ip-172-31-5-136.eu-central-1.compute.internal
  ownerReferences:
  - apiVersion: karpenter.sh/v1beta1
    blockOwnerDeletion: true
    kind: NodeClaim
    name: default-karpenter-hcj5f
    uid: 1c95cfac-270d-4bbf-b1c6-b8d1af38ef6f
  resourceVersion: "2533516828"
  uid: e7845ae8-042f-4e16-b31e-55ecd40ee6ac
spec:
  podCIDR: 10.2.248.0/24
  podCIDRs:
  - 10.2.248.0/24
  providerID: aws:///eu-central-1a/i-059805a98b7e75171
status: {}

nodeClaim

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: {}
  creationTimestamp: "2024-05-02T12:38:47Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-karpenter-
  generation: 1
  labels:
    cluster-lifecycle-controller.zalan.do/replacement-strategy: none
    karpenter.k8s.aws/instance-category: m
    karpenter.k8s.aws/instance-cpu: "2"
    karpenter.k8s.aws/instance-cpu-manufacturer: intel
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: m5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "8192"
    karpenter.k8s.aws/instance-network-bandwidth: "750"
    karpenter.k8s.aws/instance-size: large
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: default-karpenter
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    lifecycle-status: ready
    node.kubernetes.io/instance-type: m5.large
    node.kubernetes.io/node-pool: default-karpenter
    node.kubernetes.io/profile: worker-karpenter
    node.kubernetes.io/role: worker
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: default-karpenter-hcj5f
  ownerReferences:
  - apiVersion: karpenter.sh/v1beta1
    blockOwnerDeletion: true
    kind: NodePool
    name: default-karpenter
    uid: 2536b136-fc71-40a9-a233-f51b81120e97
  resourceVersion: "2533447771"
  uid: 1c95cfac-270d-4bbf-b1c6-b8d1af38ef6f
spec:
  kubelet:
    clusterDNS:
    - 10.0.1.100
    cpuCFSQuota: false
    kubeReserved:
      cpu: 100m
      memory: 282Mi
    maxPods: 32
    systemReserved:
      cpu: 100m
      memory: 164Mi
  nodeClassRef:
    name: default-karpenter
  requirements:
  - key: topology.kubernetes.io/region
    operator: In
    values:
    - eu-central-1
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eu-central-1a
  - key: node.kubernetes.io/node-pool
    operator: In
    values:
    - default-karpenter
  - key: node.kubernetes.io/profile
    operator: In
    values:
    - worker-karpenter
  - key: node.kubernetes.io/role
    operator: In
    values:
    - worker
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default-karpenter
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.xlarge
    - c5d.xlarge
    - c6i.xlarge
    - c6id.xlarge
    - c6in.xlarge
    - m5.large
    - m5.xlarge
    - m5d.large
    - m5d.xlarge
    - m5n.large
    - m5n.xlarge
    - m6i.large
    - m6i.xlarge
    - m6id.large
    - m6in.large
    - r5.large
    - r5d.large
    - r5n.large
    - r6i.large
    - r6i.xlarge
    - r6id.large
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - c5
    - c5d
    - c5n
    - c6i
    - c6id
    - c6in
    - m5
    - m5d
    - m5n
    - m6i
    - m6id
    - m6in
    - r5
    - r5d
    - r5n
    - r6i
    - r6id
    - r6in
  - key: cluster-lifecycle-controller.zalan.do/replacement-strategy
    operator: In
    values:
    - none
  - key: lifecycle-status
    operator: In
    values:
    - ready
  resources:
    requests:
      cpu: 1517m
      ephemeral-storage: 2816Mi
      memory: 5060Mi
      pods: "14"
  startupTaints:
  - effect: NoSchedule
    key: zalando.org/node-not-ready
status:
  allocatable:
    cpu: 1800m
    ephemeral-storage: 89Gi
    memory: 7031Mi
    pods: "32"
    vpc.amazonaws.com/pod-eni: "9"
  capacity:
    cpu: "2"
    ephemeral-storage: 100Gi
    memory: 7577Mi
    pods: "32"
    vpc.amazonaws.com/pod-eni: "9"
  conditions:
  - lastTransitionTime: "2024-05-02T12:40:21Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-05-02T12:38:49Z"
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-05-02T12:40:21Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-05-02T12:39:31Z"
    status: "True"
    type: Registered
  imageID: ******
  nodeName: ip-172-31-5-136.eu-central-1.compute.internal
  providerID: aws:///eu-central-1a/i-059805a98b7e75171

nodepool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.sh/v1beta1","kind":"NodePool","metadata":{"annotations":{},"name":"default-karpenter"},"spec":{"disruption":{"consolidationPolicy":"WhenUnderutilized","expireAfter":"Never"},"template":{"metadata":{"labels":{"cluster-lifecycle-controller.zalan.do/replacement-strategy":"none","lifecycle-status":"ready","node.kubernetes.io/node-pool":"default-karpenter","node.kubernetes.io/profile":"worker-karpenter","node.kubernetes.io/role":"worker"}},"spec":{"kubelet":{"clusterDNS":["10.0.1.100"],"cpuCFSQuota":false,"kubeReserved":{"cpu":"100m","memory":"282Mi"},"maxPods":32,"systemReserved":{"cpu":"100m","memory":"164Mi"}},"nodeClassRef":{"name":"default-karpenter"},"requirements":[{"key":"karpenter.k8s.aws/instance-family","operator":"In","values":["c5","m5","r5","c5d","m5d","r5d","c5n","m5n","r5n","c6i","m6i","r6i","c6id","m6id","r6id","c6in","m6in","r6in"]},{"key":"karpenter.k8s.aws/instance-size","operator":"NotIn","values":["metal"]},{"key":"node.kubernetes.io/instance-type","operator":"NotIn","values":["c5d.large"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot","on-demand"]},{"key":"kubernetes.io/arch","operator":"In","values":["arm64","amd64"]},{"key":"topology.kubernetes.io/zone","operator":"In","values":["eu-central-1a","eu-central-1b","eu-central-1c"]}],"startupTaints":[{"effect":"NoSchedule","key":"zalando.org/node-not-ready"}]}}}}
  creationTimestamp: "2024-02-08T15:16:14Z"
  generation: 2
  name: default-karpenter
  resourceVersion: "2534926162"
  uid: 2536b136-fc71-40a9-a233-f51b81120e97
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  template:
    metadata:
      labels:
        cluster-lifecycle-controller.zalan.do/replacement-strategy: none
        lifecycle-status: ready
        node.kubernetes.io/node-pool: default-karpenter
        node.kubernetes.io/profile: worker-karpenter
        node.kubernetes.io/role: worker
    spec:
      kubelet:
        clusterDNS:
        - 10.0.1.100
        cpuCFSQuota: false
        kubeReserved:
          cpu: 100m
          memory: 282Mi
        maxPods: 32
        systemReserved:
          cpu: 100m
          memory: 164Mi
      nodeClassRef:
        name: default-karpenter
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - c5
        - m5
        - r5
        - c5d
        - m5d
        - r5d
        - c5n
        - m5n
        - r5n
        - c6i
        - m6i
        - r6i
        - c6id
        - m6id
        - r6id
        - c6in
        - m6in
        - r6in
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values:
        - metal
      - key: node.kubernetes.io/instance-type
        operator: NotIn
        values:
        - c5d.large
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-central-1a
        - eu-central-1b
        - eu-central-1c
      startupTaints:
      - effect: NoSchedule
        key: zalando.org/node-not-ready
status:
  resources:
    cpu: "102"
    ephemeral-storage: 2713582276Ki
    memory: 482793344Ki
    pods: "1980"

ec2NodeClass

apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1beta1
  kind: EC2NodeClass
  metadata:
    annotations:
      karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
      karpenter.k8s.aws/ec2nodeclass-hash-version: v2
      kubectl.kubernetes.io/last-applied-configuration: {}
    creationTimestamp: "2024-02-08T15:16:14Z"
    finalizers:
    - karpenter.k8s.aws/termination
    generation: 7
    name: default-karpenter
    resourceVersion: "2516183961"
    uid: 2d7763a9-397e-4ffb-865f-92dfdaa1179e
  spec:
    amiFamily: Custom
    amiSelectorTerms:
    - id: ami-*****
    - id: ami-*****
    associatePublicIPAddress: true
    blockDeviceMappings:
    - deviceName: /dev/sda1
      ebs:
        deleteOnTermination: true
        volumeSize: 100Gi
        volumeType: gp3
    detailedMonitoring: false
    instanceProfile: .******
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 2
      httpTokens: optional
    securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: WorkerNodeSecurityGroup
    subnetSelectorTerms:
    - tags:
        kubernetes.io/role/karpenter: enabled
    tags:
      InfrastructureComponent: "true"
      Name: default-karpenter
      application: kubernetes
      component: shared-resource
      environment: test
      node.kubernetes.io/node-pool: default-karpenter
      node.kubernetes.io/role: worker
      zalando.de/cluster-local-id/kube-1: owned
      zalando.org/pod-max-pids: "4096"
    userData: {.....}
  status: {}
kind: List
metadata:
  resourceVersion: ""

myaser commented May 03 '24 08:05

/assign @engedaam

billrayburn commented May 08 '24 20:05

@myaser Apologies for the late response on this one. Are you still seeing this issue?

jonathan-innis commented May 14 '24 04:05

> @myaser Apologies for the late response on this one. Are you still seeing this issue?

Yes, it is still happening on some of our clusters.

myaser commented May 14 '24 08:05

@myaser I'm in the process of attempting to reproduce this issue; I will update once we have more to share.

engedaam commented May 14 '24 18:05

I now have a better understanding of this issue, and here is how to reproduce it.

We found a pod that uses an invalid node affinity. The affinity was preferredDuringSchedulingIgnoredDuringExecution, so it was ignored/relaxed by Karpenter during initial scheduling. Later on, when the node nominated for the pod got consolidated, Karpenter logged this error message. It eventually managed to replace the node, but it took much longer; it seems it did not relax/ignore the preferred affinity, and the error message was strange/misleading.

example pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: testing-nginx
    owner: mgaballah
  name: testing-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testing-nginx
  template:
    metadata:
      labels:
        app: testing-nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                - key: karpenter.sh/provisioner-name
                  operator: DoesNotExist
      containers:
      - image: nginx
        name: nginx
        resources: 
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 200m
            memory: 50Mi

After it gets scheduled, try to consolidate the node by (for example) deleting the node object. We fixed the pod's affinity, and the issue disappeared for us (a sketch of a corrected preference is shown after the list below). With this understanding, I think this is less of a bug, but I would still be interested in understanding a few things:

  1. Why did Karpenter not relax the nodeAffinity constraints?
  2. Why was the error message so misleading?
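
For reference, a corrected preference could look something like the sketch below. This is only an illustration: it uses the well-known karpenter.sh/nodepool label (which Karpenter sets on its nodes and already uses in NodeClaim requirements) instead of an arbitrary key in the restricted karpenter.sh domain.

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        # well-known Karpenter label; not rejected by the restricted-domain validation
        - key: karpenter.sh/nodepool
          operator: DoesNotExist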

myaser commented May 22 '24 08:05

/triage accepted

engedaam commented May 22 '24 17:05

Just encountered a similar problem. We have an EKS cluster deployed by Terraform with a NodeGroup of 1 node, in which Karpenter v0.36 is installed and had been working properly.

We recently added a soft nodeAffinity on a few pods to create a preference for the node managed by TF. As Karpenter nodes already contain a few labels, we used a DoesNotExist operator on the karpenter.sh/nodepool-hash key and got errors similar to what the OP had.

Initial affinity we used:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: karpenter.sh/nodepool-hash
            operator: DoesNotExist
        weight: 50

Associated Karpenter controller logs:

{"level":"INFO","time":"2024-06-03T13:20:40.941Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-03T13:20:40.941Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"58d486d2-012b-4f3e-ad32-f46f6e82d449"}
{"level":"ERROR","time":"2024-06-03T13:20:41.170Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: 58d486d2-012b-4f3e-ad32-f46f6e82d449), creating node claim, NodeClaim.karpenter.sh \"default-5g9xb\" is invalid: spec.requirements[4].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db"}
{"level":"INFO","time":"2024-06-03T13:21:14.244Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-03T13:21:14.244Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"7827779a-3b0c-4c65-83b0-8d427de328be"}
{"level":"ERROR","time":"2024-06-03T13:21:14.462Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: 7827779a-3b0c-4c65-83b0-8d427de328be), creating node claim, NodeClaim.karpenter.sh \"default-785zb\" is invalid: spec.requirements[1].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db"}

N.B.: sensitive information removed

Later on I questioned the key we used, realising that karpenter.sh/nodepool-hash is an annotation key and not a label key. So I switched to karpenter.sh/nodepool, and it seemed to solve the problem.
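
The patched affinity looked roughly like this (same preference as before, with only the key swapped to the well-known karpenter.sh/nodepool node label):

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          # karpenter.sh/nodepool is an actual node label, unlike karpenter.sh/nodepool-hash, which is an annotation
          - key: karpenter.sh/nodepool
            operator: DoesNotExist
        weight: 50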

Karpenter controller's last log before applying the patched nodeAffinity:

{"level":"ERROR","time":"2024-06-04T10:29:23.741Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: cf21d12e-5e6c-417f-92d2-482bf9c78042), creating node claim, NodeClaim.karpenter.sh \"default-xznbg\" is invalid: spec.requirements[2].key: Invalid value: \"string\": label domain \"karpenter.k8s.aws\" is restricted","commit":"6b868db"}

Post-apply logs

{"level":"INFO","time":"2024-06-04T10:46:04.246Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-04T10:46:04.248Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"7c2ae915-8210-4df1-80a6-3462a95c16c8"}
{"level":"INFO","time":"2024-06-04T10:46:04.482Z","logger":"controller.disruption","message":"created nodeclaim","commit":"6b868db","nodepool":"default","nodeclaim":"default-pc2xq","requests":{"cpu":"1220m","memory":"690Mi","pods":"6"},"instance-types":"c5.large, c5.xlarge, c5a.large, c5a.xlarge, c5d.large and 34 other(s)"}
{"level":"INFO","time":"2024-06-04T10:46:07.114Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","instance-type":"t4g.small","zone":"xxxxxxxxxx","capacity-type":"on-demand","allocatable":{"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1359Mi","pods":"32"}}
{"level":"INFO","time":"2024-06-04T10:46:15.644Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"196.715971ms"}
{"level":"INFO","time":"2024-06-04T10:46:25.643Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"194.918258ms"}
{"level":"INFO","time":"2024-06-04T10:46:29.844Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:35.742Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"293.341699ms"}
{"level":"INFO","time":"2024-06-04T10:46:39.474Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal","allocatable":{"cpu":"1930m","ephemeral-storage":"18233774458","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1408504Ki","pods":"32"}}
{"level":"INFO","time":"2024-06-04T10:46:41.472Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"6b868db","command-id":"7c2ae915-8210-4df1-80a6-3462a95c16c8"}
{"level":"INFO","time":"2024-06-04T10:46:41.567Z","logger":"controller.node.termination","message":"tainted node","commit":"6b868db","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:43.242Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"590.049476ms"}
{"level":"INFO","time":"2024-06-04T10:46:49.190Z","logger":"controller.node.termination","message":"deleted node","commit":"6b868db","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:49.685Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"6b868db","nodeclaim":"default-7f7lf","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal","provider-id":"aws:///xxxxxxxxxx/i-0a09fedd4e0233a92"}

N.B.: sensitive information removed

I would also agree that the error log can be misleading. Hope this helps someone in need :)

dimitri-fert commented Jun 04 '24 11:06