
Karpenter "Underutilised" disruption causing excessive node churn

Open headj-origami opened this issue 1 year ago • 30 comments

Description

Context We run Karpenter in production for an O(15) node EKS cluster, using four (mutually exclusive) NodePools for different classes of application.

Our primary NodePool is workload, which provisions capacity for the majority of our pods.

Observed Behavior: Approximately daily, we experience a period of high (karpenter) workload node volatility caused by consolidation disruptions (reason: Underutilised).

This usually means that a large proportion of workload nodes get disrupted and replaced in a short period of time. We usually see the newly-created nodes run for about 5-10 minutes, before they too are disrupted as Underutilised.

This disruption period usually occurs for 2-3 generations of replacement nodes, before stopping abruptly. The resulting nodes then typically run without disruption for many hours.

Notably, these events typically occur outside of office hours where changes to the running pods are very unlikely (e.g. rolling upgrades) and traffic is usually very low.

The end result node topology is usually comparable to the starting topology, if not more complex, which doesn't suggest there was any significant resource underutilisation. However, in extreme cases, the pods hosted on these nodes may have been restarted up to four times in rapid succession, which is not desirable.

For example: On 27th September at 22:45 (local time) we had 14 running workload nodeclaims:

  • 7x m6a.large (or equivalent)
  • 4x m7i-flex.xlarge (or equivalent)
  • 3x m7i-flex.2xlarge

Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent) - a total of 15 "Underutilised" disruptions.

At the end of this process we were running 17 workload nodeclaims:

  • 11x m7i-flex.large (or equivalent)
  • 3x m6a.xlarge (or equivalent)
  • 3x m7i-flex.2xlarge

So the net effect was replacing one xlarge node with 4 large nodes, and shuffling the instance generations slightly.

Pictorially: image (each green bar represents a nodeclaim, with time along the x axis)

Expected Behavior:

  • Consolidation disruption due to underutilisation occurs as a single operation, such that pods hosted on these nodes only experience one restart.
  • Nodeclaims created due to "Underutilised" consolidation should not be provisioned in an Underutilised state, necessitating further disruption.

Reproduction Steps (Please include YAML): Our workload NodePool config is:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
  name: workload
spec:
  disruption:
    budgets:
      - nodes: 50%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '256'
    memory: 1Ti
  template:
    metadata:
      labels:
        app: karpenter
        environment: prod
        name: karpenter
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - r
            - m
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '3'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - '17'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - '0'
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - '131073'
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - '2047'
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-2a
            - eu-west-2b
            - eu-west-2c
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - effect: NoSchedule
          key: karpenter.sh

The corresponding EC2NodeClass is:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app: karpenter
    app.kubernetes.io/managed-by: Helm
    name: karpenter
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.29-*
      owner: amazon
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          <masked>
        volumeSize: 100Gi
        volumeType: gp3
  instanceProfile: KarpenterNodeInstanceProfile-prod-eu-west-2-eks
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks

We also have the spot-to-spot consolidation feature flag enabled.
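(For context, this feature gate is enabled through the Karpenter Helm chart's settings; a sketch of the relevant values, assuming the upstream chart layout:)

```yaml
# Karpenter Helm chart values excerpt (sketch): enables the
# SpotToSpotConsolidation feature gate referenced above.
settings:
  featureGates:
    spotToSpotConsolidation: true
```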

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version): 1.29

However, we have observed this behaviour as far back as chart v0.36.2 using the v1beta1 CRDs. We've also seen this on v1.28 and earlier versions of Kubernetes.

Additional Questions:

  • Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
  • We see 50% of nodes affected during the consolidation disruption window (matching our disruption budget), but why do we not see similar disruption before and after this ~30-minute window?

headj-origami avatar Oct 02 '24 15:10 headj-origami

For me, it's flapping nodes like crazy. How does it determine that a node is underutilized? Can we lower the threshold?

{"level":"INFO","time":"2024-10-14T19:21:24.227Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (21 pods) ip-10-60-213-231.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"c6a54af5-967b-4783-b3fc-f5844afe5342","command-id":"c41ae5f3-5b0b-4d9f-9fd4-ad7abcb05cff","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:23:11.702Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (19 pods) ip-10-60-149-172.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"166e47a5-24fe-468c-b979-f8a8a8a31e90","command-id":"7df82325-f24a-4a6e-8759-83fc6267c234","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:25:50.448Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (24 pods) ip-10-60-150-105.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"5cab5197-7eaa-47b9-ba5b-c31918b99733","command-id":"d03d50e8-22fe-406b-b299-d6363573051e","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:27:27.762Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (20 pods) ip-10-60-29-142.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"2d125665-5702-4351-9fdc-199c30e67439","command-id":"806ed776-e8be-4eeb-95d2-63e12b68df28","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:29:59.176Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (28 pods) ip-10-60-26-48.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"32612bb0-5066-4dc3-adae-5779c6557fe8","command-id":"eaeac7f6-ab26-41f4-99ba-daa5d0b8eb43","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:32:24.048Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (28 pods) ip-10-60-145-213.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"9658aa9f-c84b-4ceb-973e-f4ff304d2bb3","command-id":"fc06784c-b80e-4668-ae79-7696f9cad551","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:37:33.396Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (36 pods) ip-10-60-146-14.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"74a7d084-82ed-48b6-996c-9667dcd4ee39","command-id":"d336a720-a745-4b36-971a-fd30c5e51ecd","reason":"underutilized"}
{"level":"INFO","time":"2024-10-14T19:40:48.090Z","logger":"controller","caller":"disruption/controller.go:176","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (24 pods) ip-10-60-86-19.ec2.internal/t3.2xlarge/on-demand","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"339a1581-48c9-46da-a825-84c87103c507","command-id":"f56bdaac-dca0-4388-b7f2-9fdec4a66942","reason":"underutilized"}

The most annoying thing is that it terminates nodes with CPU requests around 95%. For example, these two from the log: image image

But in general it terminates nodes and creates new ones (the only gap is the restart, because I have consolidateAfter: 30m). image

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: eks-dev-linux-amd64
spec:
  disruption:
    budgets:
    - nodes: "3"
      reasons:
      - Empty
    - nodes: "1"
      reasons:
      - Underutilized
      - Drifted
    consolidateAfter: 30m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 1k
    memory: 3600Gi
  template:
    metadata: {}
    spec:
      expireAfter: 360h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: eks-dev-private-linux-amd64-483662afb67fa01941078b5fb6059056e47
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3.2xlarge
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
  weight: 50

sergii-auctane avatar Oct 14 '24 19:10 sergii-auctane

We are seeing the same issue, lots of nodes flapping, seemingly because Karpenter believes everything is underutilized, eg.

karpenter-<redacted>-wrrqr controller {"level":"INFO","time":"2024-10-15T03:51:16.165Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (6 pods) ip-<redacted>on-demand","commit":"6174c75","controller":"disruption","namespace":"","name":"","reconcileID":"fa42cbd2-273c-4970-8246-d18594d2f78f","command-id":"8e30b622-a60d-4eab-b1d9-b998186c54b0","reason":"underutilized"}

It would be good to know how to determine what Karpenter is using to base this utilisation check on, and whether we can fix it ourselves.

cemery93 avatar Oct 15 '24 04:10 cemery93

Has anyone tried PDBs? Source: https://catalog.workshops.aws/karpenter/en-US/scheduling-constraints/pod-disruption#4.-deploy-pdb-and-application

PDBs might reduce interruption frequency for a specific group of pods but that wouldn't change any of Karpenter's desire to reclaim a node in a way that doesn't actually improve things or reduce costs.

Having the ability to influence this by requiring a minimum cost savings when replacing a node or tuning the threshold of what's deemed underutilized would be useful.
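For reference, a PDB along these lines (names hypothetical) caps concurrent evictions for one app, though as noted it doesn't change Karpenter's decision to reclaim the node:

```yaml
# Hypothetical PDB: at most one pod of this app may be evicted at a time,
# which slows a consolidation drain but does not prevent it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app        # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app     # hypothetical label
```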

jukie avatar Oct 28 '24 17:10 jukie

I also see this when running 1.0.6 on a large production cluster; the same setup worked fine in 0.37.5. (We have some topologySpreadConstraints in place.)

ChrisV78 avatar Oct 29 '24 11:10 ChrisV78

It seems that disabling spot-to-spot consolidation makes Karpenter behave like it should again. After disabling it, the excessive node flapping/swapping is reduced, though not completely back to how things were before.

ChrisV78 avatar Oct 29 '24 18:10 ChrisV78

@ChrisV78 Are you including a mix of Spot and On-Demand node types in your NodePool config?

Is it possible that you're not seeing the excessive node churn because disabling the spot-to-spot consolidation feature flag has effectively disabled consolidation altogether?

headj-origami avatar Oct 31 '24 13:10 headj-origami

We only have a NodePool with spot, and I still see plenty of consolidation on spots with "underutilised" (and "empty"), but not in the excessive amounts seen when spot-to-spot consolidation was still set to true.

ChrisV78 avatar Oct 31 '24 13:10 ChrisV78

I only have On-Demand instances and still see constant node cycling that leads to the same/similar number of nodes and types.

jukie avatar Oct 31 '24 19:10 jukie

Same here: I have a single NodePool with only on-demand instances, and with consolidationPolicy: WhenUnderutilized it keeps disrupting nodes continuously. I am not sure whether limiting CPU also worsens the situation; I have a limits value of cpu: 1000. Could you please give us more advice on how to address this situation?

Shirueopseo avatar Nov 07 '24 09:11 Shirueopseo

Currently seeing something very similar in our fleet. Relatively static workloads in a nodepool of just on-demand instances sometimes lead to a situation with constant node churn.

edit: worth noting that in our case, the nodepools where we see this happen most are assigned a single instance type

wendtek avatar Nov 07 '24 21:11 wendtek

@ChrisV78 will disabling spot-to-spot consolidation help for on-demand instances? I have all NodePools on on-demand instances, with consolidationPolicy set to WhenUnderutilized. I am facing a similar issue here: https://github.com/aws/karpenter-provider-aws/issues/7344. Nodes are continuously disrupted even though SpotToSpotConsolidation is already set to false.

sushama-kothawale avatar Nov 08 '24 04:11 sushama-kothawale

Same as everyone here. Has anybody got a workaround without disabling SpotToSpotConsolidation?

Disabling SpotToSpotConsolidation didn't solve the problem for me. Has anybody got a version that works, or does anyone know how to configure the underutilization percentage?

It seems it takes the minimum across resource types, which can cause underutilization.

image

raanand-dig avatar Nov 17 '24 06:11 raanand-dig

@ChrisV78 will disabling spot-to-spot consolidation help for on-demand instances? I have all NodePools on on-demand instances, with consolidationPolicy set to WhenUnderutilized. I am facing a similar issue here: #7344. Nodes are continuously disrupted even though SpotToSpotConsolidation is already set to false.

That's not what I posted, see also https://github.com/aws/karpenter-provider-aws/issues/7146#issuecomment-2449864999

ChrisV78 avatar Nov 18 '24 08:11 ChrisV78

I'm seeing similar behavior on v1.0.6 with only on-demand instances.

  • Karpenter will terminate nodes it thinks are underutilized, but then replace them with the exact same instance type, m5a.large. image

Here's an example of the makeup of the nodes in a cluster where I see the issue. Are the utilizations of some nodes so low that it triggers this behavior? image

k24dizzle avatar Dec 15 '24 08:12 k24dizzle

You can easily have the following scenario, and I have provoked these scenarios many times;

  • ~95% CPU and memory utilization across the nodepool (after some time, randomness and luck, you can see Karpenter reach a very nice and stable state like this)
  • all Karpenter chosen nodes being m family.

Now you drain a node. Because of how Karpenter happens to batch the pending pods, you see the immediate replacement is a c node, followed by another c node. A few minutes later Karpenter decides to restructure other nodes, even though they are also very well utilized, because it decides it can rebalance things better… and since the newer nodes have very recent activity, the consolidateAfter: NN setting prevents the actually BETTER candidate nodes from being replaced.

If you wait half an hour you may end up in the exact same ending scenario as before you decided to “provoke” the cluster by draining a node; except that Karpenter passed workloads through several different r nodes, more c nodes and some even of varying sizes.

My workaround for now is to try as much as possible to get Karpenter to be Cluster Autoscaler by looking at the average resource distribution of the workloads and configure node pools to have one single instance family so Karpenter doesn’t take “rash” decisions (ie choose the wrong cpu:ram ratio instance just because that particular isolated pending pod batch happened to match a certain instance type).

I have now done the following, all to avoid some of the random havoc from how Karpenter is designed and default deployed:

  • single instance family
  • Very limited instance size choice
  • Bigger pod batch windows to give more data for better decisions (and higher chance of favouring the slightly bigger node candidate)
  • expireAfter: Never, because otherwise a node's uptime influences Karpenter's eviction decisions when consolidating (i.e. it has a higher chance of choosing a long-uptime, well-binpacked node because that node is closer to expiration and thus its pod eviction costs are considered lower)

Some of the basis for these decisions can only be found in the source code.
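As a sketch only (names and values are mine, not the exact setup above), a NodePool implementing the single-family, limited-size, no-expiry points might look like:

```yaml
# Hypothetical NodePool sketch of the mitigations above:
# one instance family, few sizes, no node expiry.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stable-workload          # hypothetical
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1h         # illustrative value
  template:
    spec:
      expireAfter: Never         # uptime no longer biases eviction choice
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6a"]        # single instance family
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge"]  # very limited size choice
```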

frimik avatar Dec 15 '24 16:12 frimik

@frimik so do you have multiple "generic" NodePools for each instance family or are different workloads targeting m vs r? If the latter, I would've expected Karpenter to still end up consolidating nodes between the two pools.

I couldn't find a workaround that actually stopped this behavior, so I instead opted for a setup that aims to reduce it. For a ~500-node cluster the following was a good enough balance, where >60% of nodes tend to live until expiration:

  • Longer consolidateAfter duration (9-12hr) to limit the constant node thrash
  • Run StatefulSets with do-not-disrupt to further limit the scope of nodes for consolidation consideration. We happen to have a good balance of StatefulSets vs other workloads where it makes sense but YMMV.
  • Require a higher baseline of instance cpu/memory in the NodePools both to limit instance-type choice and to limit the number of overall nodes which makes the thrash a bit less frequent.
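The do-not-disrupt point above is the standard pod-level annotation; a minimal sketch (workload names and image are hypothetical):

```yaml
# Hypothetical StatefulSet excerpt: pods carrying this annotation block
# voluntary (consolidation) disruption of the nodes they run on.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                    # hypothetical
spec:
  serviceName: my-db
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: db
          image: example/db:latest   # hypothetical image
```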

jukie avatar Dec 21 '24 05:12 jukie

Hi everyone,

I have enabled the Underutilized condition with a schedule:

consolidationPolicy: WhenEmptyOrUnderutilized
budgets:
  - nodes: "20%"
    schedule: "30 18 * * 6,0"
    duration: 7h
    reasons:
      - "Underutilized"

With this configured, consolidation should only happen at night from 12am, right? But disruption also happens during the daytime. Why? I have tried many configurations to make the underutilisation disruption happen only at night, but all of them have failed.

Please help me understand why this is happening.

VLK1802 avatar Jan 20 '25 11:01 VLK1802

Hi everyone,

I have enabled the Underutilized condition with a schedule:

consolidationPolicy: WhenEmptyOrUnderutilized
budgets:
  - nodes: "20%"
    schedule: "30 18 * * 6,0"
    duration: 7h
    reasons:
      - "Underutilized"

With this configured, consolidation should only happen at night from 12am, right? But disruption also happens during the daytime. Why? I have tried many configurations to make the underutilisation disruption happen only at night, but all of them have failed.

Please help me understand why this is happening.

30 18 * * 6,0 means “At 18:30 on Saturday and Sunday.”

ChrisV78 avatar Jan 20 '25 12:01 ChrisV78

Off topic... but eh

I have enabled the Underutilized condition with a schedule:

consolidationPolicy: WhenEmptyOrUnderutilized
budgets:
  - nodes: "20%"
    schedule: "30 18 * * 6,0"
    duration: 7h
    reasons:
      - "Underutilized"

What are you trying to achieve exactly? I'm guessing you want to turn it around to blocking disruption during office hours, following example:

budgets:
  - nodes: 20% # First budget is cycle max 20%
  - duration: 18h # duration added after the schedule
    nodes: 0 # Actually setting budget to 0 nodes, meaning no changes
    schedule: "0 4 * * mon-fri" # 4AM monday to friday, start of block window. Add duration to determine end of window (here 22:00)

Note this only covers voluntary disruption, but it does include all reasons (your example only includes Underutilized, not Empty or Drifted).

jortkoopmans avatar Jan 20 '25 17:01 jortkoopmans

30 18 * * 6,0 means “At 18:30 on Saturday and Sunday.”

My intention is not that. Even after I enabled this condition, termination due to Underutilized still happens in business hours, Monday to Friday.

VLK1802 avatar Jan 21 '25 06:01 VLK1802

Off topic... but eh What are you trying to achieve exactly? I'm guessing you want to turn it around to blocking disruption during office hours, following example:

budgets:
  - nodes: 20% # First budget is cycle max 20%
  - duration: 18h # duration added after the schedule
    nodes: 0 # Actually setting budget to 0 nodes, meaning no changes
    schedule: "0 4 * * mon-fri" # 4AM monday to friday, start of block window. Add duration to determine end of window (here 22:00)

Note this only covers voluntary disruption, but it does include all reasons (your example only includes Underutilized, not Empty or Drifted).

I have set up this condition ("30 18 * * 6,0") so that termination of nodes due to Underutilized happens only on Saturdays and Sundays, but the termination is also happening during business hours (Monday to Friday). Why is this condition not working here? That is my issue.

For the other reasons, WhenEmpty and Drifted, I have set other conditions and those are working as expected.

VLK1802 avatar Jan 21 '25 06:01 VLK1802

Without your complete budget overview, it is hard to troubleshoot. Most likely the budget you're pinpointing is not accurately defining a ceiling for the other budgets. In simple terms, saying that 20% of your nodes can cycle on Saturday and Sunday, does not impact what happens during all the other days. That is why typically you want to flip these conditions around to explicitly block changes on the defined interval (i.e. workdays).

jortkoopmans avatar Jan 21 '25 11:01 jortkoopmans

Without your complete budget overview, it is hard to troubleshoot. Most likely the budget you're pinpointing is not accurately defining a ceiling for the other budgets. In simple terms, saying that 20% of your nodes can cycle on Saturday and Sunday, does not impact what happens during all the other days. That is why typically you want to flip these conditions around to explicitly block changes on the defined interval (i.e. workdays).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
spec:
  maxUnavailable: 1

the above is my PDB

VLK1802 avatar Jan 21 '25 12:01 VLK1802

Without your complete budget overview, it is hard to troubleshoot. Most likely the budget you're pinpointing is not accurately defining a ceiling for the other budgets. In simple terms, saying that 20% of your nodes can cycle on Saturday and Sunday, does not impact what happens during all the other days. That is why typically you want to flip these conditions around to explicitly block changes on the defined interval (i.e. workdays).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: karpenter
spec:
  maxUnavailable: 1

the above is my PDB

He is talking about node disruption budgets in Karpenter NodePools, not PDBs for the Karpenter pods themselves. :)

Did you already read https://karpenter.sh/docs/concepts/disruption/ ? Lots of info there.

ChrisV78 avatar Jan 21 '25 12:01 ChrisV78

We have an on-demand node pool with expireAfter: Never (comment). Karpenter: 1.3.2.

Testing Scenario:

  • Scale up an application Deployment to trigger a new Karpenter NodeClaim and wait for the node to become ready.
  • The new node has only the newly scaled pods and basic DaemonSet pods, such as aws-node, kube-proxy.
  • Scale down the application to remove all pods from the new node.
  • The new node is now empty.
  • Karpenter moves all pods (20+) from an already utilized node to the newly created one.
  • Karpenter deletes the old node.

Is it possible to disable this behavior in Karpenter? In this scenario, Cluster Autoscaler would simply remove the new node instead.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: test-node
spec:
  template:
    metadata:
      labels:
        test-node: "true"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["c6a.xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      expireAfter: Never
  limits:
    cpu: 56
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

andrii29 avatar Mar 12 '25 16:03 andrii29

Tuning these parameters significantly reduced node churn in our setup (docs). However, Karpenter still removes nodes with higher CPU and memory requests, even when there are nodes with much lower requests.

settings:
  batchMaxDuration: 30s
  batchIdleDuration: 15s

andrii29 avatar Mar 17 '25 07:03 andrii29

I think this is related to this section of code, where add() is called before relaxing the scheduling preferences; as a result, the preferences are initially treated as strictly required.

func (s *Scheduler) trySchedule(ctx context.Context, p *corev1.Pod) error {
	for {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		err := s.add(ctx, p)
		if err == nil {
			return nil
		}
		// We should only relax the pod's requirements when the error is not a reserved offering error because the pod may be
		// able to schedule later without relaxing constraints. This could occur in this scheduling run, if other NodeClaims
		// release the required reservations when constrained, or in subsequent runs. For an example, reference the following
		// test: "shouldn't relax preferences when a pod fails to schedule due to a reserved offering error".
		if IsReservedOfferingError(err) {
			return err
		}
		// Eventually we won't be able to relax anymore and this while loop will exit
		if relaxed := s.preferences.Relax(ctx, p); !relaxed {
			return err
		}
		if e := s.topology.Update(ctx, p); e != nil && !errors.Is(e, context.DeadlineExceeded) {
			log.FromContext(ctx).Error(e, "failed updating topology")
		}
		// Update the cached podData since the pod was relaxed, and it could have changed its requirement set
		s.updateCachedPodData(p)
	}
}

jukie avatar Apr 21 '25 15:04 jukie

This is most likely related to https://github.com/kubernetes-sigs/karpenter/issues/666 as well and should be helped by https://github.com/kubernetes-sigs/karpenter/pull/2122.

jukie avatar Apr 21 '25 16:04 jukie

Adding on to what @frimik said:

You can easily have the following scenario, and I have provoked these scenarios many times;

  • 95%ish cpu and memory utilization across the nodepool, (after some time, randomness and luck you can see Karpenter reach a very nice and stable state like this)

  • all Karpenter chosen nodes being m family.

Now you Drain a node, because of how Karpenter happens to batch these pending pods you see the immediate replacement is a c node, followed by another c node. Now a few minutes later Karpenter decides to restructure other nodes even though these are also very well utilized because it decides it can rebalance things better… and the newer nodes now had very recent activity so the consolidateAfter: NN prevents the actually BETTER candidate nodes to be replaced.

It's easy to imagine another situation where Karpenter decides to consolidate old nodes very frequently, caused by pod churn on a small subset of new nodes. ReplicaSets will choose to delete their newest pods first, so if a ReplicaSet scale-up caused Karpenter to create nodes, a corresponding scale-down may cause only those new nodes to be underutilized (or even empty). So what happens in a stabilized cluster where all nodes are "consolidatable", when a ReplicaSet like that scales down for a bit?

This is a simplified diagram of 1 app on 3 nodes. A black node is consolidatable, and blue is unconsolidatable. Total app requests on the node are represented by the red area.

Image

Also, the default k8s scheduling behavior is to schedule pods to the least-utilized nodes, so with a relatively high-churn workload, it means the least-utilized nodes are the most likely to be unconsolidatable in Karpenter due to the consolidateAfter timer constantly resetting on them. In turn, this means any highly-utilized, "old" node which has passed its consolidateAfter time is immediately considered as a candidate for consolidating to the other less-utilized unconsolidatable nodes, creating a consistent cycle of node churn.

In cluster-autoscaler, this would normally be prevented by setting a node consolidation threshold, where only nodes <50% utilized (or whatever you set) are considered for consolidation. That approximates the effect of workloads on less-utilized nodes consolidating onto the more-utilized nodes. A node also has to be consolidatable for a long-enough time in cluster-autoscaler before it finally gets consolidated, preventing overly-frequent scaling.
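For comparison, the cluster-autoscaler behaviour described above is driven by flags like these (shown as container args; the flags are real, the values illustrative):

```yaml
# cluster-autoscaler container args (example values):
args:
  - --scale-down-utilization-threshold=0.5  # only nodes below 50% utilization are scale-down candidates
  - --scale-down-unneeded-time=10m          # a node must stay unneeded this long before removal
```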

Adding a threshold to Karpenter above which a node is not considered consolidatable would help resolve this. And to keep cost-related consolidation working on all nodes, I think an exception would have to be added for nodes above that allocation threshold where pods on that node are considered only for consolidation to a single cheaper node, not to any other node on the cluster.

dqsully avatar Apr 22 '25 00:04 dqsully