Better Scheduling Default Behavior
Been musing on the idea of building a scheduling template that enables users to set default scheduling rules for pods. We've seen this asked at the provisioner scope, but it doesn't make much sense because many provisioners are capable of scheduling the same pod. Alternatively, users may use a policy agent like https://kyverno.io/policies/karpenter/add-karpenter-nodeselector/add-karpenter-nodeselector/, but this requires another dependency.
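For reference, the policy-agent approach looks roughly like the sketch below — a hedged illustration of the linked Kyverno policy; the rule name, selector key, and provisioner value are illustrative, see the linked page for the canonical version.

```yaml
# Rough sketch of a Kyverno mutate policy that injects a default nodeSelector
# into pods (illustrative only; see the linked kyverno.io policy for the
# canonical rule).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-nodeselector
spec:
  rules:
    - name: add-default-nodeselector
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              karpenter.sh/provisioner-name: default # illustrative provisioner name
```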
A SchedulingTemplate would configure DefaultingWebhookConfigurations for kind Pod, and then the karpenter controller would reconcile these, handle admission requests, and inject the fields.
A couple of open design questions:
- Namespaced vs global
- Fail open or closed
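For concreteness, a hedged sketch of the kind of webhook registration such a controller might reconcile is below; the webhook name, service, and path are illustrative (not an existing Karpenter API), and the failurePolicy field is where the fail-open vs. fail-closed question above would be decided. The proposed SchedulingTemplate itself could look like the example that follows.

```yaml
# Hedged sketch of a defaulting webhook registration the controller might
# reconcile for Pods (names, namespace, and path are illustrative).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: defaulting.scheduling.karpenter.sh
webhooks:
  - name: defaulting.scheduling.karpenter.sh
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore # fail open; Fail would make pod creation fail closed
    clientConfig:
      service:
        name: karpenter      # illustrative service name
        namespace: karpenter # illustrative namespace
        path: /default/pods  # illustrative path
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
```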
```yaml
kind: SchedulingTemplate
spec:
  selector:
    my: app # could be global, or namespacable?
  template: # this is a subset of pod template spec
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true" # Use case 1: any pods that match this selector cannot be evicted
    spec:
      topologySpreadConstraints: # Use case 2: Default to zonal spread
        - maxSkew: 1
          topologyKey: 'topology.kubernetes.io/zone'
          whenUnsatisfiable: ScheduleAnyway
```
Why not simply best-effort spread pods across AZs by default?
This is a deceptively nuanced question:
- What are we spreading? Pods? Nodes? Deployments? Selectors?
- Should pods from different deployments be spread from each other?
- What happens if capacity is constrained (e.g. spot) and spreading would result in 2x the cost of a node?
- How would users disable this behavior? Would it be disabled per provisioner, per deployment, or globally for Karpenter? A user running batch workloads might care more about optimizing price and not need the HA benefits of zonal spread.
- What if one workload has node affinity? Should that influence future pods to weigh toward a different zone? And what if that caused another deployment to be overly weighted to a specific zone, reducing its availability? (We see this with ASGs and the default kube-scheduler all the time.)
In my mind, an easier way to reason about this is that Kubernetes has already provided a way to specify scheduling constraints in a PodTemplateSpec. If no constraints are defined, then the pod is unconstrained. There are many mechanisms to set these constraints, including policy agents, custom webhooks, and developer best practices. The remaining question is whether or not Karpenter should support a built-in mechanism for this.
I think it's best for Karpenter to spread nodes across zones by default. kube-scheduler's defaults appear to do the same thing for workloads based on replicaset membership.
https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1beta3/#kubescheduler-config-k8s-io-v1beta3-PodTopologySpreadArgs
| Field | Description |
| --- | --- |
| defaultConstraints | DefaultConstraints defines topology spread constraints to be applied to Pods that don't define any in pod.spec.topologySpreadConstraints. .defaultConstraints[*].labelSelectors must be empty, as they are deduced from the Pod's membership to Services, ReplicationControllers, ReplicaSets or StatefulSets. When not empty, .defaultingType must be "List". |
| defaultingType | DefaultingType determines how .defaultConstraints are deduced. Can be one of "System" or "List". "System": Use kubernetes defined constraints that spread Pods among Nodes and Zones. "List": Use constraints defined in .defaultConstraints. Defaults to "System". |
Looks like pods owned by replicasets, replicationcontrollers, and statefulsets all get zonal/hostname topology spread injected by default. This seems reasonable and helps with the batch computing case as well.
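For illustration, the "System" defaults described above correspond roughly to the following constraints being injected into such pods (values per the kube-scheduler documentation; the labelSelector is deduced from the pod's Service/ReplicationController/ReplicaSet/StatefulSet membership and is omitted here):

```yaml
# Approximate effect of the "System" defaulting type on a pod owned by a
# ReplicaSet; the labelSelector is deduced from the pod's owner membership,
# so it is not shown.
topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```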
How does this impact consolidation, @tzneal?
I'm still not sure about doing this by default the way that kube-scheduler does. For kube-scheduler, it makes sense as spreading pods doesn't cost any more. For us, we will launch additional nodes.
E.g. assume you have a provisioner limited to 2xlarge nodes. If you launch three pods, we launch three nodes even though they would all fit on the same node.
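As a sketch of that scenario (assuming a provisioner pinned to a single 2xlarge instance type; the instance type and names below are illustrative):

```yaml
# Hypothetical provisioner limited to a single 2xlarge instance type. If a
# zonal spread were injected by default, three replicas would prefer three
# zones, so Karpenter would launch three of these nodes even though all three
# pods fit on one.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c5.2xlarge"] # illustrative 2xlarge type
  providerRef:
    name: default
```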
A slight variation that I think makes more sense: if we need to launch multiple nodes, we can put them in different zones.
> A slight variation that I think makes more sense: if we need to launch multiple nodes, we can put them in different zones.

Harder to do now without capacity-optimized-prioritized, but it would be nice if we could somehow communicate to EC2 to weigh toward a particular zone when it has no impact on cost.
Why not make the decision local to the providerRef?
AWSNodeTemplate defines a playing field into which the provisioners create instances for the Pods. As they (provisioners) do so, they can apply heuristics to these nodes:
```yaml
providerRef:
  name: default
topologyPolicy:
  ...
```
Not sure I fully comprehend the notion that the application (i.e. the Pod spec) should decide this. Topology spread merely defines how to distribute the Pods across an existing topology; it doesn't define how the topology is created. This answers some of the questions @ellistarn raised above -- specifically:
- we are spreading both: nodes across subnets (which are defined by the node template) and Pods across those nodes
- as long as Pods from different deployments are targeted by the same provisioner they will benefit from the spread of nodes across subnets, but not from Pod spread unless explicitly defined by the Pod spec via topology spread
- capacity may be constrained, and that can be a concern, but in most instances AWS recommends expanding the number of Spot pools, which includes spreading across AZs; you can decide which is more important -- sometimes getting a pricier instance, but having higher availability
Now, Karpenter does treat these Pod-level requirements as something to create nodes by, which is fine; this just gives you additional control over how the topology is created.
That control exists even today: if I were to provide a single subnet, all Pods would land there (excluding hard NoSchedule ones), regardless of their topology spread.
Labeled for closure due to inactivity in 10 days.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, lifecycle/stale is applied
> - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
> - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
> - Reopen this issue with /reopen
> - Mark this issue as fresh with /remove-lifecycle rotten
> - Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@jonathan-innis: Reopened this issue.
In response to this:
> /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle rotten
/triage accepted