karpenter-provider-aws

Spot instances are not being consolidated.

Open · githubeto opened this issue 1 year ago · 3 comments

Description

There are 2 On-Demand instances for placing Karpenter's Pods and 3 Spot instance nodes for scheduling workload Pods. I believe the utilization is low, but why aren't they being replaced with smaller instance types?

[image: node overview from the original issue]

Spot1.

kubectl describe node ip-10-219-212-23.ap-northeast-1.compute.internal | grep -A 30 "Events:"

Events:
  Type     Reason                   Age                From                   Message
  ----     ------                   ----               ----                   -------
  Normal   Starting                 28m                kube-proxy             
  Normal   NodeAllocatableEnforced  28m                kubelet                Updated Node Allocatable limit across pods
  Normal   Starting                 28m                kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      28m                kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  28m (x2 over 28m)  kubelet                Node ip-10-219-212-23.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    28m (x2 over 28m)  kubelet                Node ip-10-219-212-23.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     28m (x2 over 28m)  kubelet                Node ip-10-219-212-23.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal   Synced                   28m                cloud-node-controller  Node synced successfully
  Normal   RegisteredNode           28m                node-controller        Node ip-10-219-212-23.ap-northeast-1.compute.internal event: Registered Node ip-10-219-212-23.ap-northeast-1.compute.internal in Controller
  Normal   NodeReady                28m                kubelet                Node ip-10-219-212-23.ap-northeast-1.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        27m                karpenter              Cannot disrupt Node: Nominated for a pending pod
  Normal   DisruptionBlocked        23m                karpenter              Cannot disrupt Node: PDB "cattle-gatekeeper-system/gatekeeper-controller-manager" prevents pod evictions
  Normal   DisruptionBlocked        21m (x2 over 25m)  karpenter              Cannot disrupt Node: PDB "tempo/tempo-distributed-distributor" prevents pod evictions
  Normal   DisruptionBlocked        19m                karpenter              Cannot disrupt Node: PDB "kube-system/coredns" prevents pod evictions
  Normal   Unconsolidatable         53s (x2 over 16m)  karpenter              Can't replace with a cheaper node

Question : What does "Can't replace with a cheaper node" mean? I don't understand the specific reasons why consolidation is not possible.

Spot2.

kubectl describe node ip-10-219-208-22.ap-northeast-1.compute.internal | grep -A 30 "Events:"

Events:
  Type     Reason                       Age                  From                   Message
  ----     ------                       ----                 ----                   -------
  Normal   Starting                     25m                  kube-proxy             
  Normal   Starting                     26m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity          26m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientPID         26m (x2 over 26m)    kubelet                Node ip-10-219-208-22.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasSufficientMemory      26m (x2 over 26m)    kubelet                Node ip-10-219-208-22.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure        26m (x2 over 26m)    kubelet                Node ip-10-219-208-22.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeAllocatableEnforced      26m                  kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode               25m                  node-controller        Node ip-10-219-208-22.ap-northeast-1.compute.internal event: Registered Node ip-10-219-208-22.ap-northeast-1.compute.internal in Controller
  Normal   Synced                       25m                  cloud-node-controller  Node synced successfully
  Normal   NodeReady                    25m                  kubelet                Node ip-10-219-208-22.ap-northeast-1.compute.internal status is now: NodeReady
  Normal   SpotRebalanceRecommendation  22m                  karpenter              Spot rebalance recommendation was triggered
  Normal   DisruptionBlocked            19m (x4 over 25m)    karpenter              Cannot disrupt Node: Nominated for a pending pod
  Normal   Unconsolidatable             2m20s (x2 over 18m)  karpenter              Can't remove without creating 2 candidates

Question : What does "Can't remove without creating 2 candidates" mean? I don't understand the specific reasons why consolidation is not possible.

Spot3.

kubectl describe node ip-10-219-210-8.ap-northeast-1.compute.internal | grep -A 30 "Events:"

Events:
  Type     Reason                   Age                  From                   Message
  ----     ------                   ----                 ----                   -------
  Normal   Starting                 24m                  kube-proxy             
  Normal   NodeHasSufficientPID     24m (x2 over 24m)    kubelet                Node ip-10-219-210-8.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal   Starting                 24m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      24m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  24m (x2 over 24m)    kubelet                Node ip-10-219-210-8.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    24m (x2 over 24m)    kubelet                Node ip-10-219-210-8.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeAllocatableEnforced  24m                  kubelet                Updated Node Allocatable limit across pods
  Normal   Synced                   24m                  cloud-node-controller  Node synced successfully
  Normal   RegisteredNode           24m                  node-controller        Node ip-10-219-210-8.ap-northeast-1.compute.internal event: Registered Node ip-10-219-210-8.ap-northeast-1.compute.internal in Controller
  Normal   NodeReady                23m                  kubelet                Node ip-10-219-210-8.ap-northeast-1.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        23m                  karpenter              Cannot disrupt Node: Nominated for a pending pod
  Normal   DisruptionBlocked        21m                  karpenter              Cannot disrupt Node: PDB "istio-system/istiod" prevents pod evictions
  Normal   Unconsolidatable         2m57s (x2 over 22m)  karpenter              SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 5

Question : I understand that there are only five cheaper Spot instances available, but since we are specifying at least large and xlarge instance sizes, I believe there should be more than five candidates. Is there an issue with the way we are specifying the NodePool?

Versions:

  • Chart Version: 0.37
  • Kubernetes Version (kubectl version): 1.28

values.yaml


replicas: 2 # default
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxx:role/aws-stg-karpenter
settings:
  clusterName: "aws-stg"
  interruptionQueue: "aws-stg"
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  featureGates:
    spotToSpotConsolidation: true
logLevel: debug
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists

install commands

helm upgrade --install karpenter -f values.yaml oci://public.ecr.aws/karpenter/karpenter --version "0.37.0" 

nodepool.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        karpenter-nodepool: default 
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["m6i", "m6a", "m6id", "m6in", "m6idn", "m7a", "m7i", "c6a", "c6i", "c6id", "c6in", "c7a", "c7i", "r6i", "r6a", "r6id", "r6in", "r6idn", "r7a", "r7i"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values: ["4"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      kubelet:
        maxPods: 110
  limits:
    cpu: "160"
    memory: 640Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-aws-stg"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: aws-stg
  securityGroupSelectorTerms:
    - tags:
        aws:eks:cluster-name: aws-stg
  tags:
    eks:nodegroup-name: "karpenter-spot-instances"
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: optional
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        deleteOnTermination: true
  detailedMonitoring: true

githubeto · Aug 02 '24 11:08

What does "Can't remove without creating 2 candidates" mean?

Karpenter will not consolidate one node into two nodes. This is meant as a safeguard against node launch failures when replacing nodes, ensuring that we only consolidate when we know it's safe to. It also generally saves us from doing too many consolidation actions, and it prioritizes larger nodes.

Normal Unconsolidatable 2m57s (x2 over 22m) karpenter SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 5

This is a requirement of spot-to-spot consolidation. Since spot instances trade availability for cost, if you always accept a consolidation from a spot instance to a cheaper one, you'll see continual consolidations until you end up with the cheapest and smallest node per pod; we call this the race to the bottom. If you have more questions on how this works, feel free to read the design: https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md
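In practice this means the NodePool's requirements have to leave a wide pool of instance types eligible, since every requirement is ANDed with the others. A minimal sketch of a requirements block flexible enough for this, reusing the keys already present in the NodePool above (the values here are illustrative, not a specific recommendation):

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: ["large", "xlarge", "2xlarge", "4xlarge"]
  # Each additional requirement (family, generation, instance-cpu, ...)
  # further intersects this set, so stacking tight constraints can leave
  # too few cheaper spot options for consolidation to proceed.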

njtran · Aug 05 '24 18:08

@njtran

What does "Can't remove without creating 2 candidates" mean?

Karpenter will not consolidate one node into two nodes. This is meant as a safeguard against node launch failures when replacing nodes, ensuring that we only consolidate when we know it's safe to. It also generally saves us from doing too many consolidation actions, and it prioritizes larger nodes.

Normal Unconsolidatable 2m57s (x2 over 22m) karpenter SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 5

This is a requirement of spot-to-spot consolidation. Since spot instances trade availability for cost, if you always accept a consolidation from a spot instance to a cheaper one, you'll see continual consolidations until you end up with the cheapest and smallest node per pod; we call this the race to the bottom. If you have more questions on how this works, feel free to read the design: https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md

Thank you for your response!

I'm sorry, but I think I might not fully understand. What does "Karpenter will not consolidate one node into two nodes" mean?

Normally, if there is a lot of free capacity, I would expect the node to be replaced with a cheaper one. I understand that a replacement can fail, but I don't understand why it is blocked as a safety measure only in this case.

Does this message mean that, instead of shutting down a node, Karpenter would have to start two new nodes and spread the workload's Pods across them? In other words, if the instance type were made smaller, the Pods currently on this node would no longer fit on a single node, so consolidation would require two replacement nodes, and that action is suppressed because it is risky? (In short, does it mean the node simply cannot be scaled in any further?)

Normal Unconsolidatable 2m57s (x2 over 22m) karpenter SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 5

It turned out that the issue was a misconfiguration of my NodePool. Because all of the requirements are ANDed together, they unnecessarily narrowed down the eligible instance types.
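Concretely, in the NodePool above the karpenter.k8s.aws/instance-cpu requirement with Gt: ["4"] only matches instance types with more than 4 vCPUs, while large and xlarge in the c/m/r families have 2 and 4 vCPUs, so those two sizes were effectively excluded and only 2xlarge and larger remained. A minimal sketch of relaxed requirements (illustrative, assuming large and xlarge should stay eligible):

        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]
        # Either drop the instance-cpu requirement entirely, or lower the bound
        # so the smallest desired size (large = 2 vCPUs) still matches:
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values: ["1"]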

What does "Can't replace with a cheaper node" mean?

Do you have any comments on this?

githubeto · Aug 06 '24 07:08

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

github-actions[bot] · Aug 20 '24 12:08