karpenter-blueprints

Add an example of different budgets for different disruption reasons

Open InsomniaCoder opened this issue 1 year ago • 3 comments

Issue #, if available:

Description of changes:

  • Adding an example of different disruption budgets for different disruption reasons.

I came across this after upgrading to v1, and it's quite useful: we needed to keep disruption strict to limit the blast radius of events like an AMI update or EKS upgrade, and we noticed that this also restricted consolidation activity.

With this, we can now consolidate more efficiently while keeping the strict policy for updates.

Let me know if it makes sense. If it's not that useful, feel free to close it.

Thank you

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

InsomniaCoder avatar Sep 23 '24 22:09 InsomniaCoder

Thank you for submitting this contribution! It's great to hear how these blueprints and Karpenter v1 can help solve real-world challenges.

For users to fully benefit from your example, could you share how this was tested and how we could replicate it? Although we are adding this as an example to the README.md, it would be great for users' confidence to know how to validate the configuration. Thanks again.

jakeskyaws avatar Sep 24 '24 07:09 jakeskyaws

I recently applied this new NodePool configuration in production (after finishing the v1 upgrade):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multiple-consolidations
spec:
  disruption:
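    budgets:
    # Drift: at most 1 node disrupting at a time.
    - nodes: "1"
      reasons:
      - Drifted
    # Block drift for the first 14 minutes of every 15-minute window,
    # leaving a 1-minute window open. When several budgets are active,
    # Karpenter applies the most restrictive one.
    - duration: 14m0s
      nodes: "0"
      reasons:
      - Drifted
      schedule: '*/15 * * * *'
    # Consolidation (empty or underutilized nodes): at most 3 nodes at a time.
    - nodes: "3"
      reasons:
      - Empty
      - Underutilized
    # Block consolidation for the first 9 minutes of every 10-minute window.
    - duration: 9m0s
      nodes: "0"
      reasons:
      - Empty
      - Underutilized
      schedule: '*/10 * * * *'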
    consolidateAfter: 5m0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "16000"
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - r
        - m
        - c
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "4"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
        - arm64
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-west-1a
        - eu-west-1b
        - eu-west-1c
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
  weight: 1

(Some details, such as taints and selectors, are redacted.)

These metrics show that it behaves as expected. The metric used is karpenter_nodepools_allowed_disruptions.

[image: graph of karpenter_nodepools_allowed_disruptions per disruption reason]

The green line is the Drifted reason, which allows 1 node every 15 minutes (the window opens at, for example, minutes 44 and 59).

The red and blue lines represent Underutilized and Empty, which allow 3 nodes every 10 minutes (the window opens at, for example, minutes 39, 49, and 59).
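
For anyone who wants to sanity-check those minute marks without a cluster, here is a minimal Python sketch of the four budgets above. It is not Karpenter code: the `Budget` class and the helper names are hypothetical, it only models schedules of the form `*/N * * * *` with whole-minute durations, and it assumes (per my reading of the Karpenter docs) that the most restrictive active budget wins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Budget:
    """Hypothetical model of one entry under spec.disruption.budgets."""
    nodes: int                      # max nodes disrupting while the budget is active
    reasons: frozenset              # disruption reasons the budget applies to
    every: Optional[int] = None     # N from "*/N * * * *"; None means always active
    duration: Optional[int] = None  # minutes the budget stays active per window

    def active(self, minute: int) -> bool:
        # A scheduled budget is active for `duration` minutes from each cron tick.
        if self.every is None:
            return True
        return minute % self.every < self.duration


# The four budgets from the NodePool above, in the same order.
BUDGETS = [
    Budget(1, frozenset({"Drifted"})),
    Budget(0, frozenset({"Drifted"}), every=15, duration=14),
    Budget(3, frozenset({"Empty", "Underutilized"})),
    Budget(0, frozenset({"Empty", "Underutilized"}), every=10, duration=9),
]


def allowed(reason: str, minute: int) -> int:
    """Most restrictive active budget for `reason` at this minute of the hour.

    Every reason in this example has an always-active budget, so the list
    below is never empty for the reasons used here.
    """
    limits = [b.nodes for b in BUDGETS if reason in b.reasons and b.active(minute)]
    return min(limits)


def open_minutes(reason: str) -> List[int]:
    """Minutes of the hour in which at least one disruption is allowed."""
    return [m for m in range(60) if allowed(reason, m) > 0]


if __name__ == "__main__":
    for reason in ("Drifted", "Empty", "Underutilized"):
        print(reason, open_minutes(reason))
```

Under these assumptions, Drifted opens at minutes 14, 29, 44, and 59, and Empty/Underutilized open at minutes 9, 19, 29, 39, 49, and 59, which lines up with the minute marks read off the graph.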

Let me know if you'd like me to include these in the doc somehow, or if anything else is needed.

Thank you

InsomniaCoder avatar Sep 24 '24 18:09 InsomniaCoder

@InsomniaCoder first of all, THANK YOU so much for not just letting us know these blueprints have been useful for you, but also for making contributions as well, you rock! I have a few recommendations about this:

  • Can you please make this part of the "Multiple Budgets" section, and move the "Multiple Budgets" section after the "Reasons" section? That way we keep a consistent order and go deeper each time.
  • As Jake suggested (and you already answered), it would be really helpful if you could incorporate what you described here, especially to show the results others will see with this configuration in place.
  • Can you please break down each budget? You're already doing it partially, but it was a bit hard for me to follow along. Maybe you can explain the four scenarios, then show the NodePool config, and then the results.
  • Can you also either add a note or directly make it explicit that the budget config means "in a given time frame, at most x nodes can be disrupting at a given moment"?
  • Let's see how long this blueprint ends up being; maybe it will be worth having a dedicated, tested blueprint for this (following Jake's recommendation).
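
To illustrate the "at most x nodes can be disrupting at a given moment" point, here is a small Python sketch. It is an assumption-laden illustration, not Karpenter's implementation: the function name and parameters are hypothetical, and the formula (round the percentage up, then subtract nodes already being disrupted or not ready) follows my understanding of the Karpenter disruption docs.

```python
def allowed_disruptions(total_nodes: int, budget: str,
                        disrupting: int = 0, not_ready: int = 0) -> int:
    """How many more nodes may start disrupting right now (hypothetical helper).

    `budget` is the `nodes` value of the active budget: an absolute count
    such as "3" or a percentage such as "20%".
    """
    if budget.endswith("%"):
        pct = int(budget[:-1])
        # Integer ceiling of total_nodes * pct / 100 (avoids float rounding).
        limit = (total_nodes * pct + 99) // 100
    else:
        limit = int(budget)
    # Nodes already disrupting or not ready consume the budget; never go below 0.
    return max(0, limit - disrupting - not_ready)


if __name__ == "__main__":
    print(allowed_disruptions(10, "3"))                # budget untouched
    print(allowed_disruptions(10, "3", disrupting=2))  # two nodes already in flight
    print(allowed_disruptions(10, "0"))                # a blocking budget
```

The point of the sketch is that a budget is a cap on concurrent disruptions, not a quota per window: with `nodes: "3"` and two nodes already disrupting, only one more can start until one of them finishes.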

We think that with these contributions the blueprint will end up being even more awesome :)

chrismld avatar Sep 26 '24 15:09 chrismld

@InsomniaCoder hello Tanat, I was just wondering if this is still on your radar?

chrismld avatar Jun 02 '25 16:06 chrismld

Hi @chrismld, I need to apologize: this has been out of my context for some time. I'll go ahead and close it and look into it again later when I regain the context.

Thank you so much!

InsomniaCoder avatar Jun 05 '25 08:06 InsomniaCoder

thank you! and no problem, looking forward to hearing back from you soon :)

chrismld avatar Jun 05 '25 13:06 chrismld