Daemonset-driven consolidation
Version
Karpenter Version: v0.22.1
Kubernetes Version: v1.24.8
Hi,
I have set up Karpenter with the following cluster configuration:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: klss
  region: eu-central-1
  version: "1.24"
  tags:
    karpenter.sh/discovery: klss
managedNodeGroups:
  - instanceType: t3.small
    amiFamily: AmazonLinux2
    name: karpenter
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
iam:
  withOIDC: true
This is the Provisioner, together with the AWSNodeTemplate it references:
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - "on-demand"
        - "spot"
    - key: "kubernetes.io/arch"
      operator: In
      values:
        - "arm64"
        - "amd64"
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
        - "eu-central-1a"
        - "eu-central-1b"
        - "eu-central-1c"
  limits:
    resources:
      cpu: 32
      memory: 64Gi
  providerRef:
    name: default
  consolidation:
    enabled: true
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: klss
  securityGroupSelector:
    karpenter.sh/discovery: klss
Karpenter has currently provisioned three spot instances. When I install Prometheus with Helm chart version 19.3.1, two of the five node-exporter pods can't be scheduled. The message is: "0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.". The Karpenter controllers didn't output any log entries.
This is the values file for the chart:
prometheus:
  serviceAccounts:
    server:
      create: false
      name: "amp-iamproxy-ingest-service-account"
  server:
    remoteWrite:
      - url: https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/xxxxxxxxxxxxxxxxxxxxx/api/v1/query
        sigv4:
          region: eu-central-1
        queue_config:
          max_samples_per_send: 1000
          max_shards: 200
          capacity: 2500
    persistentVolume:
      enabled: false
This is the live manifest of the DaemonSet of the Prometheus node exporter:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '1'
    kubectl.kubernetes.io/last-applied-configuration: >
{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"},"name":"prometheus-prometheus-node-exporter","namespace":"prometheus"},"spec":{"selector":{"matchLabels":{"app.kubernetes.io/instance":"prometheus","app.kubernetes.io/name":"prometheus-node-exporter"}},"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"}},"spec":{"automountServiceAccountToken":false,"containers":[{"args":["--path.procfs=/host/proc","--path.sysfs=/host/sys","--path.rootfs=/host/root","--web.listen-address=[$(HOST_IP)]:9100"],"env":[{"name":"HOST_IP","value":"0.0.0.0"}],"image":"quay.io/prometheus/node-exporter:v1.5.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"name":"node-exporter","ports":[{"containerPort":9100,"name":"metrics","protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"securityContext":{"allowPrivilegeEscalation":false},"volumeMounts":[{"mountPath":"/host/proc","name":"proc","readOnly":true},{"mountPath":"/host/sys","name":"sys","readOnly":true},{"mountPath":"/host/root","mountPropagation":"HostToContainer","name":"root","readOnly":true}]}],"hostNetwork":true,"hostPID":true,"securityContext":{"fsGroup":65534,"runAsGroup":65534,"runAsNonRoot":true,"runAsUser":65534},"serviceAccountName":"prometheus-prometheus-node-exporter","tolerations":[{"effect":"NoSchedule","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/proc"},"name":"proc"},{"hostPath":{"path":"/sys"},"name":"sys"},{"hostPath":{"path":"/"},"name":"root"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":1},"type":"RollingUpdate"}}}
  creationTimestamp: '2023-01-23T19:32:19Z'
  generation: 1
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    helm.sh/chart: prometheus-node-exporter-4.8.1
  name: prometheus-prometheus-node-exporter
  namespace: prometheus
  resourceVersion: '1156021'
  uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus
      app.kubernetes.io/name: prometheus-node-exporter
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: prometheus-node-exporter
        app.kubernetes.io/part-of: prometheus-node-exporter
        app.kubernetes.io/version: 1.5.0
        helm.sh/chart: prometheus-node-exporter-4.8.1
    spec:
      automountServiceAccountToken: false
      containers:
        - args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            - '--web.listen-address=[$(HOST_IP)]:9100'
          env:
            - name: HOST_IP
              value: 0.0.0.0
          image: 'quay.io/prometheus/node-exporter:v1.5.0'
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /host/proc
              name: proc
              readOnly: true
            - mountPath: /host/sys
              name: sys
              readOnly: true
            - mountPath: /host/root
              mountPropagation: HostToContainer
              name: root
              readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-prometheus-node-exporter
      serviceAccountName: prometheus-prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - hostPath:
            path: /proc
          name: proc
        - hostPath:
            path: /sys
          name: sys
        - hostPath:
            path: /
          name: root
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 5
  desiredNumberScheduled: 5
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  numberUnavailable: 2
  observedGeneration: 1
  updatedNumberScheduled: 5
This is the live manifest of one of the pods that can't be scheduled:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    kubernetes.io/psp: eks.privileged
  creationTimestamp: '2023-01-23T19:32:19Z'
  generateName: prometheus-prometheus-node-exporter-
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    controller-revision-hash: 7b4cd87594
    helm.sh/chart: prometheus-node-exporter-4.8.1
    pod-template-generation: '1'
  name: prometheus-prometheus-node-exporter-9c5s5
  namespace: prometheus
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: DaemonSet
      name: prometheus-prometheus-node-exporter
      uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
  resourceVersion: '1155915'
  uid: 98b0cea4-68fe-47ea-83f1-231d5b5809ca
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - ip-192-168-14-194.eu-central-1.compute.internal
  automountServiceAccountToken: false
  containers:
    - args:
        - '--path.procfs=/host/proc'
        - '--path.sysfs=/host/sys'
        - '--path.rootfs=/host/root'
        - '--web.listen-address=[$(HOST_IP)]:9100'
      env:
        - name: HOST_IP
          value: 0.0.0.0
      image: 'quay.io/prometheus/node-exporter:v1.5.0'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: node-exporter
      ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources: {}
      securityContext:
        allowPrivilegeEscalation: false
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
  serviceAccount: prometheus-prometheus-node-exporter
  serviceAccountName: prometheus-prometheus-node-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/pid-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/network-unavailable
      operator: Exists
  volumes:
    - hostPath:
        path: /proc
      name: proc
    - hostPath:
        path: /sys
      name: sys
    - hostPath:
        path: /
      name: root
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2023-01-23T19:32:19Z'
      message: >-
        0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are
        available: 5 No preemption victims found for incoming pod.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
  phase: Pending
  qosClass: BestEffort
I also did a test with the node selector "karpenter.sh/capacity-type: on-demand". Then one of the spot instances is deleted, but no new instance is created. The DaemonSet also doesn't create any pods.
PR aws/karpenter#1155 should have fixed the issue of DaemonSets not being part of the scaling decision, but perhaps this is a special case? The node exporter needs a pod on each node because it collects telemetry from every node.
Best regards,
Werner.
Expected Behavior
An extra node to be provisioned.
Actual Behavior
No extra node is provisioned while two DaemonSet pods can't be scheduled.
Steps to Reproduce the Problem
I did this when there were already three Karpenter-provisioned nodes, but I think installing Prometheus alone should reproduce it, because the nodes are not full.
Resource Specs and Logs
karpenter-6d57cdbbd6-dqgcj.log, karpenter-6d57cdbbd6-lsv9f.log, [nodes.zip](https://github.com/aws/karpenter/files/10483452/nodes.zip)
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
This is the current expected behavior. Karpenter provisions enough capacity for the pods and daemonsets that exist at the time of scheduling. When you add a new daemonset, for this to work properly, Karpenter would need to replace any existing nodes that the daemonset won't fit on.
We typically recommend that you set a high priority for daemonsets to cover this use case. When such a daemonset scales up, it preempts existing pods, and the evicted pods feed back into Karpenter's provisioning algorithm.
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#how-to-use-priority-and-preemption
I'll update the FAQ to cover this.
I forgot to mention that I tried setting the priorityClassName field of the DaemonSet to system-node-critical and also once to system-cluster-critical. In both cases all pods were scheduled, but both Karpenter controllers were evicted. I will try to avoid this by changing the pod disruption budget in the values file of Karpenter's Helm chart.
I could avoid the eviction of the Karpenter controllers, which have the priority of system-cluster-critical, by giving the Prometheus DaemonSet a PriorityClass whose value is half that of system-cluster-critical. So, it works. Thanks!
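For reference, a minimal sketch of what such a PriorityClass could look like; the name is an assumption, and the value is half of system-cluster-critical's 2000000000 so that the Karpenter controllers keep a higher priority:
# Hypothetical PriorityClass; the name is illustrative, pick your own.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high-priority
value: 1000000000              # half of system-cluster-critical (2000000000)
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: High priority for DaemonSet pods without outranking system-cluster-critical components such as the Karpenter controllers.
The DaemonSet then references daemonset-high-priority in its pod template (for the node exporter, this could go through the chart's priorityClassName value, if your chart version exposes one).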
This trick doesn't always work. I removed Prometheus and installed the AWS CloudWatch agent as a DaemonSet. Its pods also have a priority of 1000000000. One of the four pods can't be scheduled, but no node is added.
Here are some files that reflect the new situation. files.zip
According to the generated nodeAffinity, the node the pod is supposed to be scheduled on doesn't have enough memory. Shouldn't this evict other pods? There are several with priority 0. Without an eviction, Karpenter has no reason to provision a new node.
This feature is very necessary; Karpenter should adjust automatically when a new daemonset is introduced. We should NOT have to set priority classes on every single resource in the cluster. The correct solution here would be that if a node can't fit a newly installed daemonset pod due to CPU/RAM, a bigger node is ordered automatically, one that can hold all the pods that were housed on the old node as well as the daemonset pod.
@tzneal Added https://github.com/kubernetes/website/pull/40851
I have been using the following Kyverno policy to make sure DaemonSets have the right priority class:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-priority-class
  annotations:
    policies.kyverno.io/title: Add priority class for DaemonSets to help Karpenter.
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/minversion: 1.6.0
    policies.kyverno.io/description: Add priority class for DaemonSets to help Karpenter.
spec:
  rules:
    - name: add-priority-class-context
      match:
        any:
          - resources:
              kinds:
                - DaemonSet
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                priorityClassName: system-node-critical
Nice @wdonne, Kyverno might be interested in taking that upstream. I think it would be useful for any autoscaler.
See https://kyverno.io/policies/?policytypes=Karpenter and https://github.com/kyverno/policies/tree/main/karpenter
Hi @tzneal , thanks for the tip. I have created a pull request: https://github.com/kyverno/policies/pull/631.
If it gets merged I will create another one called "set-karpenter-non-cpu-limits". It relates to a best practice when using consolidation mode.
I have a third one that sets the annotation kubernetes.io/arch: arm64 if there is no such annotation. This way arm64 becomes the default. You would then only have to set the annotation for images that are not multi-architecture. This is not really a best practice. Would it fit in the same folder in the Kyverno policies project?
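As an illustration only, a rough sketch of what such a defaulting policy might look like, assuming it is implemented with Kyverno's add-if-not-present anchor on the pod template's nodeSelector; the policy name and the matched kinds are assumptions, not the author's actual policy:
# Hypothetical policy: default workloads to arm64 unless an architecture is already set.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-arch-arm64
spec:
  rules:
    - name: default-arch-arm64
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                # the +() anchor only adds nodeSelector when it is not already present
                +(nodeSelector):
                  kubernetes.io/arch: arm64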
I have been using the following Kyverno policy to make sure DaemonSets have the right priority class: [the add-priority-class ClusterPolicy quoted above]
This is NOT a solution. It will not solve 99% of cases; it simply prioritizes daemonsets over non-system-critical items. But what if a node ordered by Karpenter can't even fit all the system-critical pods? Then the issue is not solved at all. So again, the priority class is "cute", but it is NOT a solution.
Karpenter team, I for one consider your product incomplete as is. Please pay attention to this comment: https://github.com/aws/karpenter-core/issues/731. This is not a feature request. This is a bug.
Community, please upvote so AWS understands that it charges money for an incomplete product. Let's not give them the idea that this ticket is somehow optional.
couldn't've said it better myself. we're facing the same issue too.
can't believe this bug is open for nearly a year :(
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
can't believe this bug is open for nearly a year :(
As this is an open source project: code contributions are welcome. If nobody writes the code, it doesn't get merged.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
any update?
Requesting AWS to release a formal update in which Karpenter truly lives up to "Just-in-time Nodes for Any Kubernetes Cluster", even for this case.
Depending on a plethora of workarounds that don't cover all cases calls into question the operational and production readiness of what is now considered a stable product and marketed as the new default.
We are now many versions on and this is still an issue. Any updates?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
One way this could be solved (or at least mitigated) is by adding a NodePool field for the minimum number of pods per node, based on the instance class.
For example, c7a.medium nodes can only hold 8 pods per node, which after the standard aws-node, ebs-csi-node, and kube-proxy daemonsets only leaves 5 pods available. Add some monitoring daemonsets, such as promtail, loki-canary, and the prometheus node exporter, and we're down to 2 pods, which makes for a pretty useless node.
If we could say "only pick node types that can hold at least 12 pods", then we could plan in advance for daemonsets that we know will get added later.
Edit: This apparently used to exist, but was removed. https://github.com/aws/karpenter-provider-aws/issues/5055
If you read https://github.com/aws/karpenter-provider-aws/issues/5055#issuecomment-1806120226 you'll see why that metric is not particularly useful.
I disagree with the argument against that metric. I would rather say "I want at least 12 pods to be able to run on a node" than "I guess any AWS instance type that has 2 cores will probably be able to have more than 8 pods right?"
My workaround was to use instance-size NotIn ["medium"], but that feels like a hacky workaround rather than the proper way
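For context, a sketch of how that exclusion can be expressed as a requirement, shown here on the v1alpha5 Provisioner from the top of this issue (newer NodePool APIs accept the same requirement under spec.template.spec.requirements):
# Workaround, not a fix: exclude the smallest instance sizes so more pods fit per node.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - "medium"   # extend with "nano", "micro", "small" as needed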
"I want at least 12 pods to be able to run on a node"
Number of pods doesn't make sense as a metric because not all pods are created equal - some pods require more resources than others.