Better Scheduling Default Behavior
Been musing on the idea of building a scheduling template that enables users to set default scheduling rules for pods. We've seen this asked at the provisioner scope, but it doesn't make much sense because many provisioners are capable of scheduling the same pod. Alternatively, users may use a policy agent like https://kyverno.io/policies/karpenter/add-karpenter-nodeselector/add-karpenter-nodeselector/, but this requires another dependency.
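For reference, the policy-agent approach looks roughly like the sketch below — a hedged illustration of the linked Kyverno policy; the rule name, selector key, and provisioner value are illustrative, see the linked page for the canonical version.

```yaml
# Rough sketch of a Kyverno mutate policy that injects a default nodeSelector
# into pods (illustrative only; see the linked kyverno.io policy for the
# canonical rule).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-nodeselector
spec:
  rules:
    - name: add-default-nodeselector
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              karpenter.sh/provisioner-name: default # illustrative provisioner name
```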
A SchedulingTemplate would configure DefaultingWebhookConfigurations for kind Pod, and then the karpenter controller would reconcile these, handle admission requests, and inject the fields.
A couple of open design questions:
- Namespaced vs global
- Fail open or closed
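For concreteness, a hedged sketch of the kind of webhook registration such a controller might reconcile is below; the webhook name, service, and path are illustrative (not an existing Karpenter API), and the failurePolicy field is where the fail-open vs. fail-closed question above would be decided. The proposed SchedulingTemplate itself could look like the example that follows.

```yaml
# Hedged sketch of a defaulting webhook registration the controller might
# reconcile for Pods (names, namespace, and path are illustrative).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: defaulting.scheduling.karpenter.sh
webhooks:
  - name: defaulting.scheduling.karpenter.sh
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore # fail open; Fail would make pod creation fail closed
    clientConfig:
      service:
        name: karpenter      # illustrative service name
        namespace: karpenter # illustrative namespace
        path: /default/pods  # illustrative path
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
```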
```yaml
kind: SchedulingTemplate
spec:
  selector:
    my: app # could be global, or namespacable?
  template: # this is a subset of pod template spec
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true" # Use case 1: any pods that match this selector cannot be evicted
    spec:
      topologySpreadConstraints: # Use case 2: Default to zonal spread
        - maxSkew: 1
          topologyKey: 'topology.kubernetes.io/zone'
          whenUnsatisfiable: ScheduleAnyway
```
Why not simply best-effort spread pods across AZs by default?
This is a deceptively nuanced question:
- What are we spreading? Pods? Nodes? Deployments? Selectors?
- Should pods from different deployments be spread from each other?
- What happens if capacity is constrained (e.g. spot) and spreading would result in 2x the cost of a node?
- How would users disable this behavior? Would it be disabled per provisioner, per deployment, or globally for Karpenter? A user running batch workloads might care more about optimizing price and not need the HA benefits of zonal spread.
- What if one workload has node affinity? Should that influence future pods to weigh toward a different zone? And what if that caused another deployment to be overly weighted to a specific zone, reducing its availability? (We see this with ASGs and the default kube-scheduler all the time.)
In my mind, an easier way to reason about this is that Kubernetes has already provided a way to specify scheduling constraints in a PodTemplateSpec. If no constraints are defined, then the pod is unconstrained. There are many mechanisms to set these constraints, including policy agents, custom webhooks, and developer best practices. The remaining question is whether or not Karpenter should support a built-in mechanism for this.
I think it's best for Karpenter to spread nodes across zones by default. kube-scheduler's defaults appear to do the same thing for workloads based on replicaset membership.
https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1beta3/#kubescheduler-config-k8s-io-v1beta3-PodTopologySpreadArgs
| Field | Description |
| --- | --- |
| defaultConstraints | DefaultConstraints defines topology spread constraints to be applied to Pods that don't define any in pod.spec.topologySpreadConstraints. .defaultConstraints[*].labelSelectors must be empty, as they are deduced from the Pod's membership to Services, ReplicationControllers, ReplicaSets or StatefulSets. When not empty, .defaultingType must be "List". |
| defaultingType | DefaultingType determines how .defaultConstraints are deduced. Can be one of "System" or "List". "System": Use kubernetes defined constraints that spread Pods among Nodes and Zones. "List": Use constraints defined in .defaultConstraints. Defaults to "System". |
Looks like pods owned by replicasets, replicationcontrollers, and statefulsets all get zonal/hostname topology spread injected by default. This seems reasonable and helps with the batch computing case as well.
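For illustration, the "System" defaults described above correspond roughly to the following constraints being injected into such pods (values per the kube-scheduler documentation; the labelSelector is deduced from the pod's Service/ReplicationController/ReplicaSet/StatefulSet membership and is omitted here):

```yaml
# Approximate effect of the "System" defaulting type on a pod owned by a
# ReplicaSet; the labelSelector is deduced from the pod's owner membership,
# so it is not shown.
topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```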
How does this impact consolidation, @tzneal?
I'm still not sure about doing this by default the way that kube-scheduler does. For kube-scheduler, it makes sense as spreading pods doesn't cost any more. For us, we will launch additional nodes.
E.g. assume you have a provisioner limited to 2xlarge nodes. If you launch three pods, we launch three nodes even though they would all fit on the same node.
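As a sketch of that scenario (assuming a provisioner pinned to a single 2xlarge instance type; the instance type and names below are illustrative):

```yaml
# Hypothetical provisioner limited to a single 2xlarge instance type. If a
# zonal spread were injected by default, three replicas would prefer three
# zones, so Karpenter would launch three of these nodes even though all three
# pods fit on one.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c5.2xlarge"] # illustrative 2xlarge type
  providerRef:
    name: default
```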
A slight variation that I think makes more sense: if we need to launch multiple nodes, we can put them in different zones.
> A slight variation that I think makes more sense: if we need to launch multiple nodes, we can put them in different zones.

Harder to do now without capacity-optimized-prioritized, but it would be nice if we could somehow communicate to EC2 to weigh toward a particular zone when it has no impact on cost.
Why not make the decision local to the providerRef?
AWSNodeTemplate defines a playing field into which the provisioners create instances for the Pods. As they (provisioners) do so, they can apply heuristics to these nodes:
```yaml
providerRef:
  name: default
topologyPolicy:
  ...
```
Not sure I fully comprehend the notion that the application (i.e. the Pod spec) should decide this. Topology spread merely defines how to distribute the Pods across an existing topology; it doesn't define how the topology is created. This answers some of the questions @ellistarn raised above -- specifically:
- we are spreading both: nodes across subnets (which are defined by the node template) and Pods across those nodes
- as long as Pods from different deployments are targeted by the same provisioner they will benefit from the spread of nodes across subnets, but not from Pod spread unless explicitly defined by the Pod spec via topology spread
- capacity may be constrained, and that can be a concern, but in most instances AWS recommends expanding the number of Spot pools, which includes spreading across AZs; you can decide which is more important -- sometimes getting a pricier instance, but having higher availability
Now, Karpenter does treat these Pod-level requirements as something to create nodes by, which is fine; this just gives you additional control over how the topology is created.
That control exists even today: if I were to provide a single subnet, all Pods would land there (excluding hard NoSchedule ones), regardless of their topology spread.
Labeled for closure due to inactivity in 10 days.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, lifecycle/stale is applied
> - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
> - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
> - Reopen this issue with /reopen
> - Mark this issue as fresh with /remove-lifecycle rotten
> - Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@jonathan-innis: Reopened this issue.
In response to this:
> /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle rotten
/triage accepted