
Daemonset-driven consolidation

Open wdonne opened this issue 2 years ago • 37 comments

Version

Karpenter Version: v0.22.1

Kubernetes Version: v1.24.8

Hi,

I have set up Karpenter with the following cluster configuration:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: klss
  region: eu-central-1
  version: "1.24"
  tags:
    karpenter.sh/discovery: klss
managedNodeGroups:
  - instanceType: t3.small
    amiFamily: AmazonLinux2
    name: karpenter
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
iam:
  withOIDC: true

This is the provisioner:

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - "on-demand"
        - "spot"
    - key: "kubernetes.io/arch"
      operator: In
      values:
        - "arm64"
        - "amd64"
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
        - "eu-central-1a"
        - "eu-central-1b"
        - "eu-central-1c"
  limits:
    resources:
      cpu: 32
      memory: 64Gi
  providerRef:
    name: default
  consolidation:
    enabled: true
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: klss
  securityGroupSelector:
    karpenter.sh/discovery: klss

Karpenter has currently provisioned three spot instances. When installing Prometheus with Helm chart version 19.3.1, two of the five node exporters can't be scheduled. The message is: "0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.". The Karpenter controllers didn't output any log entries.

This is the values file for the chart:

prometheus:
  serviceAccounts:
    server:
      create: false
      name: "amp-iamproxy-ingest-service-account"
  server:
    remoteWrite:
      - url: https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/xxxxxxxxxxxxxxxxxxxxx/api/v1/query
        sigv4:
          region: eu-central-1
        queue_config:
          max_samples_per_send: 1000
          max_shards: 200
          capacity: 2500
    persistentVolume:
      enabled: false

This is the live manifest of the DaemonSet of the Prometheus node exporter:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '1'
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"},"name":"prometheus-prometheus-node-exporter","namespace":"prometheus"},"spec":{"selector":{"matchLabels":{"app.kubernetes.io/instance":"prometheus","app.kubernetes.io/name":"prometheus-node-exporter"}},"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"}},"spec":{"automountServiceAccountToken":false,"containers":[{"args":["--path.procfs=/host/proc","--path.sysfs=/host/sys","--path.rootfs=/host/root","--web.listen-address=[$(HOST_IP)]:9100"],"env":[{"name":"HOST_IP","value":"0.0.0.0"}],"image":"quay.io/prometheus/node-exporter:v1.5.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"name":"node-exporter","ports":[{"containerPort":9100,"name":"metrics","protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"securityContext":{"allowPrivilegeEscalation":false},"volumeMounts":[{"mountPath":"/host/proc","name":"proc","readOnly":true},{"mountPath":"/host/sys","name":"sys","readOnly":true},{"mountPath":"/host/root","mountPropagation":"HostToContainer","name":"root","readOnly":true}]}],"hostNetwork":true,"hostPID":true,"securityContext":{"fsGroup":65534,"runAsGroup":65534,"runAsNonRoot":true,"runAsUser":65534},"serviceAccountName":"prometheus-prometheus-node-exporter","tolerations":[{"effect":"NoSchedule","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/proc"},"name":"proc"},{"hostPath":{"path":"/sys"},"name":"sys"},{"hostPath":{"path":"/"},"name":"root"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":1},"type":"RollingUpdate"}}}
  creationTimestamp: '2023-01-23T19:32:19Z'
  generation: 1
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    helm.sh/chart: prometheus-node-exporter-4.8.1
  name: prometheus-prometheus-node-exporter
  namespace: prometheus
  resourceVersion: '1156021'
  uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus
      app.kubernetes.io/name: prometheus-node-exporter
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: prometheus-node-exporter
        app.kubernetes.io/part-of: prometheus-node-exporter
        app.kubernetes.io/version: 1.5.0
        helm.sh/chart: prometheus-node-exporter-4.8.1
    spec:
      automountServiceAccountToken: false
      containers:
        - args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            - '--web.listen-address=[$(HOST_IP)]:9100'
          env:
            - name: HOST_IP
              value: 0.0.0.0
          image: 'quay.io/prometheus/node-exporter:v1.5.0'
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /host/proc
              name: proc
              readOnly: true
            - mountPath: /host/sys
              name: sys
              readOnly: true
            - mountPath: /host/root
              mountPropagation: HostToContainer
              name: root
              readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-prometheus-node-exporter
      serviceAccountName: prometheus-prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - hostPath:
            path: /proc
            type: ''
          name: proc
        - hostPath:
            path: /sys
            type: ''
          name: sys
        - hostPath:
            path: /
            type: ''
          name: root
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 5
  desiredNumberScheduled: 5
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  numberUnavailable: 2
  observedGeneration: 1
  updatedNumberScheduled: 5

This is the live manifest of one of the pods that can't be scheduled:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    kubernetes.io/psp: eks.privileged
  creationTimestamp: '2023-01-23T19:32:19Z'
  generateName: prometheus-prometheus-node-exporter-
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    controller-revision-hash: 7b4cd87594
    helm.sh/chart: prometheus-node-exporter-4.8.1
    pod-template-generation: '1'
  name: prometheus-prometheus-node-exporter-9c5s5
  namespace: prometheus
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: DaemonSet
      name: prometheus-prometheus-node-exporter
      uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
  resourceVersion: '1155915'
  uid: 98b0cea4-68fe-47ea-83f1-231d5b5809ca
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - ip-192-168-14-194.eu-central-1.compute.internal
  automountServiceAccountToken: false
  containers:
    - args:
        - '--path.procfs=/host/proc'
        - '--path.sysfs=/host/sys'
        - '--path.rootfs=/host/root'
        - '--web.listen-address=[$(HOST_IP)]:9100'
      env:
        - name: HOST_IP
          value: 0.0.0.0
      image: 'quay.io/prometheus/node-exporter:v1.5.0'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: node-exporter
      ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources: {}
      securityContext:
        allowPrivilegeEscalation: false
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
  serviceAccount: prometheus-prometheus-node-exporter
  serviceAccountName: prometheus-prometheus-node-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/pid-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/network-unavailable
      operator: Exists
  volumes:
    - hostPath:
        path: /proc
        type: ''
      name: proc
    - hostPath:
        path: /sys
        type: ''
      name: sys
    - hostPath:
        path: /
        type: ''
      name: root
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2023-01-23T19:32:19Z'
      message: >-
        0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are
        available: 5 No preemption victims found for incoming pod.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
  phase: Pending
  qosClass: BestEffort

I also did a test with the node selector "karpenter.sh/capacity-type: on-demand". One of the spot instances was then deleted, but no new instance was created, and the DaemonSet didn't create any new pods either.

PR aws/karpenter#1155 should have fixed the issue of DaemonSets not being part of the scaling decision, but perhaps this is a special case? The node exporter wants a pod on each node because it collects node-level telemetry.

Best regards,

Werner.

Expected Behavior

An extra node to be provisioned.

Actual Behavior

No extra node is provisioned while two DaemonSet pods can't be scheduled.

Steps to Reproduce the Problem

I did this when there were already three Karpenter-provisioned nodes, but I think installing Prometheus on its own should reproduce it, because the nodes are not full.

Resource Specs and Logs

karpenter-6d57cdbbd6-dqgcj.log karpenter-6d57cdbbd6-lsv9f.log

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

wdonne avatar Jan 23 '23 20:01 wdonne

nodes.zip

wdonne avatar Jan 23 '23 20:01 wdonne

This is the current expected behavior. Karpenter provisions enough capacity for the pods and daemonsets that exist at the time of scheduling. For this to work properly when you add a new daemonset, Karpenter would need to replace any existing nodes that the daemonset won't fit on.

tzneal avatar Jan 23 '23 20:01 tzneal

We typically recommend that you set a high priority for daemonsets to cover this use case. When a daemonset scales up, it will then trigger eviction of existing pods, which feeds back into Karpenter's provisioning algorithm.

https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#how-to-use-priority-and-preemption
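
For reference, a minimal sketch of such a PriorityClass; the name and value here are illustrative, not something prescribed in this thread:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-priority  # illustrative name
value: 1000000  # above the default of 0, well below system-cluster-critical (2000000000)
globalDefault: false
description: Lets DaemonSet pods preempt ordinary pods so the evictions feed back into provisioning.

A DaemonSet would opt in by setting spec.template.spec.priorityClassName: daemonset-priority.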

ellistarn avatar Jan 23 '23 21:01 ellistarn

I'll update the FAQ to cover this.

tzneal avatar Jan 23 '23 21:01 tzneal

I forgot to mention that I tried setting the priorityClassName field of the DaemonSet to system-node-critical and also once to system-cluster-critical. In both cases all pods were scheduled, but both Karpenter controllers were evicted. I will try to avoid this by changing the pod disruption budget in the values file of Karpenter's Helm chart.

wdonne avatar Jan 23 '23 22:01 wdonne

I could avoid the eviction of the Karpenter controllers, which have priority system-cluster-critical, by giving the Prometheus DaemonSet a PriorityClass with half the priority of system-cluster-critical. So, it works. Thanks!
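
For the record, system-cluster-critical has value 2000000000, so half of that is 1000000000. A sketch of such a PriorityClass (the name is illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-below-cluster-critical  # illustrative name
value: 1000000000  # half of system-cluster-critical (2000000000)
globalDefault: false
description: High enough to preempt ordinary pods, low enough not to displace cluster-critical pods.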

wdonne avatar Jan 24 '23 10:01 wdonne

This trick doesn't always work. I removed Prometheus and installed the AWS CloudWatch agent as a DaemonSet. It also has a priority of 1000000000. One of the four pods can't be scheduled, but no node is added.

wdonne avatar Jan 24 '23 15:01 wdonne

Here are some files that reflect the new situation. files.zip

According to the generated nodeAffinity, the node onto which the pod is supposed to be scheduled doesn't have enough memory. Shouldn't this evict other pods? There are several with priority 0. Without eviction, Karpenter has no reason to provision a new node.

wdonne avatar Jan 25 '23 09:01 wdonne

This feature is very necessary; Karpenter should auto-adjust when a new daemonset is introduced. We should NOT have to set priority classes on every single resource in the cluster. The correct solution here would be: if a node can't fit a newly installed daemonset pod due to CPU/RAM, a bigger node should be ordered automatically, one that can hold all the pods that were housed on the old node as well as the daemonset pod.

ospiegel91 avatar Jan 30 '23 07:01 ospiegel91

@tzneal Added https://github.com/kubernetes/website/pull/40851

billrayburn avatar May 31 '23 18:05 billrayburn

I have been using the following Kyverno policy to make sure DaemonSets have the right priority class:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-priority-class
  annotations:
    policies.kyverno.io/title: Add priority class for DaemonSets to help Karpenter.
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/minversion: 1.6.0
    policies.kyverno.io/description: Add priority class for DaemonSets to help Karpenter.
spec:
  rules:
    - name: add-priority-class-context
      match:
        any:
          - resources:
              kinds:
                - DaemonSet
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                priorityClassName: system-node-critical

wdonne avatar Jun 01 '23 07:06 wdonne

Nice @wdonne, Kyverno might be interested in taking that upstream. I think it would be useful for any autoscaler.

See https://kyverno.io/policies/?policytypes=Karpenter and https://github.com/kyverno/policies/tree/main/karpenter

tzneal avatar Jun 01 '23 12:06 tzneal

Hi @tzneal , thanks for the tip. I have created a pull request: https://github.com/kyverno/policies/pull/631.

If it gets merged, I will create another one called "set-karpenter-non-cpu-limits". It relates to a best practice when using consolidation mode.

I have a third one that sets the node selector kubernetes.io/arch: arm64 if it is not already set. This way arm64 becomes the default, and you would only have to set the selector for images that are not multi-architecture. This is not really a best practice. Would it fit in the same folder in the Kyverno policies project?
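
A sketch of what that third policy could look like, assuming it uses Kyverno's add anchor, +( ), in a patchStrategicMerge so the selector is only added when no nodeSelector is present (the policy name is illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-arch-arm64  # illustrative name
spec:
  rules:
    - name: default-arch-arm64
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            # the +( ) anchor means "add only if this field is absent"
            +(nodeSelector):
              kubernetes.io/arch: arm64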

wdonne avatar Jun 02 '23 14:06 wdonne

> I have been using the following Kyverno policy to make sure DaemonSets have the right priority class: […]

This is NOT a solution. It will not solve 99% of cases. It simply prioritizes daemonsets over non-system-critical items. But what if a node ordered by Karpenter can't even fit all the system-critical pods? Then the issue is not solved at all. So again, the priority class is "cute", but it is NOT a solution.

Karpenter team, I for one consider your product incomplete as is. Please pay attention to this comment: https://github.com/aws/karpenter-core/issues/731 This is not a feature request. This is a bug.

Community, please upvote so AWS understands that it charges money for an incomplete product. Let's not give them the idea that this ticket is somehow optional.

ospiegel91 avatar Jun 19 '23 11:06 ospiegel91

> This is NOT a solution. […] So again, the priority class is "cute", but it is NOT a solution. […] This is not a feature request. This is a bug.

Couldn't have said it better myself. We're facing the same issue too.

Can't believe this bug has been open for nearly a year :(

galarbel avatar Nov 15 '23 23:11 galarbel

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 14 '24 00:02 k8s-triage-robot

/remove-lifecycle stale

Bryce-Soghigian avatar Feb 14 '24 00:02 Bryce-Soghigian

> Can't believe this bug has been open for nearly a year :(

As this is an open source project: code contributions are welcome. If nobody writes the code, it doesn't get merged.

sftim avatar Mar 02 '24 14:03 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 31 '24 15:05 k8s-triage-robot

/remove-lifecycle stale

tinhhn avatar Jun 08 '24 01:06 tinhhn

/remove-lifecycle stale

kpiroto avatar Jun 28 '24 13:06 kpiroto

Any update?

Idan-Lazar avatar Aug 16 '24 11:08 Idan-Lazar

Requesting that AWS release a formal update in which Karpenter truly lives up to "Just-in-time Nodes for Any Kubernetes Cluster", even for this case.

Depending on a plethora of workarounds that don't cover all cases calls into question the operational, production readiness of what is now considered a stable product and marketed as the new default.

xer0devit avatar Oct 08 '24 09:10 xer0devit

We are now many versions on and this is still an issue. Any updates?

evaleah avatar Oct 17 '24 12:10 evaleah

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 15 '25 12:01 k8s-triage-robot

/remove-lifecycle stale

pdf avatar Jan 15 '25 13:01 pdf

One way this could be solved (or at least mitigated) is by adding a NodePool field for the minimum number of pods per node, based on the instance type.

For example, c7a.medium nodes can only hold 8 pods per node, which after the standard aws-node, ebs-csi-node, and kube-proxy daemonsets leaves only 5 pods available. Add some monitoring daemonsets, such as promtail, loki-canary, and the prometheus node exporter, and we're down to 2 pods, which makes for a pretty useless node.

If we could say "only make nodes of types that can hold at least 12 pods", then we could plan in advance for daemonsets that we know will be added later.

Edit: This apparently used to exist, but was removed. https://github.com/aws/karpenter-provider-aws/issues/5055

tculp avatar Apr 07 '25 19:04 tculp

> Edit: This apparently used to exist, but was removed. aws/karpenter-provider-aws#5055

If you read https://github.com/aws/karpenter-provider-aws/issues/5055#issuecomment-1806120226 you'll see why that metric is not particularly useful.

pdf avatar Apr 07 '25 22:04 pdf

> If you read aws/karpenter-provider-aws#5055 (comment) you'll see why that metric is not particularly useful.

I disagree with the argument against that metric. I would rather say "I want at least 12 pods to be able to run on a node" than "I guess any AWS instance type that has 2 cores will probably be able to hold more than 8 pods, right?"

My workaround was to use instance-size NotIn ["medium"], but that feels like a hack rather than the proper way.
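
For illustration, that workaround expressed as a NodePool requirement, using the AWS provider's well-known karpenter.k8s.aws/instance-size label (abridged: nodeClassRef and other required fields are omitted, and the set of excluded sizes is an assumption):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # exclude instance sizes too small to hold the expected daemonsets plus workloads
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "medium"]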

tculp avatar Apr 08 '25 14:04 tculp

"I want at least 12 pods to be able to run on a node"

Number of pods doesn't make sense as a metric because not all pods are created equal; some pods require more resources than others.

pdf avatar Apr 08 '25 21:04 pdf