
Capacity Type Distribution

Open bwagner5 opened this issue 4 years ago • 23 comments

Some application deployments would like the cost benefits of Spot capacity, but also want some stability guarantee for the application. I propose a capacity-type percentage distribution configured on the k8s Deployment resource. Since capacity-type is likely to be implemented at the cloud-provider level, this feature would also need to live in the cloud-provider layer.

For example:

apiVersion: apps/v1
kind: Deployment
metadata: 
  name: inflate
spec:
  replicas: 10
  template:
    metadata:
      labels:
        app: inflate
        node.k8s.aws/capacity-type-distribution/spot-percentage: "90"
    spec:
      containers:
      - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
        name: inflate
        resources:
          requests:
            cpu: "100m"

The above deployment spec would result in the deployment controller creating Pods for the 10 replicas. Karpenter would register a mutating admission webhook that checks whether the pod's deployment spec carries this label, then inspects the current pods belonging to the deployment to decide which capacity-type nodeSelector to apply. After the admission webhook, the pod resource would look like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: inflate
    pod-template-hash: 8567cd588
  name: inflate-8567cd588-bjqzf
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: inflate-8567cd588
spec:
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
    name: inflate
    resources:
      requests:
        cpu: "100m"
  schedulerName: default-scheduler
  nodeSelector:
    node.k8s.aws/capacity-type: spot

^^ duplicated 8 more times (9 spot pods, i.e. 90% of the 10 replicas), and then:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: inflate
    pod-template-hash: 4567dc765
  name: inflate-4567dc765-asdf
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: inflate-4567dc765
spec:
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
    name: inflate
    resources:
      requests:
        cpu: "100m"
  schedulerName: default-scheduler
  nodeSelector:
    node.k8s.aws/capacity-type: on-demand
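
To make the "mutating admission webhook" part concrete, below is a minimal sketch of how such a webhook could be registered. The webhook name, Service name, namespace, and path are hypothetical placeholders, not existing Karpenter components; the mutation endpoint behind the Service would count the deployment's existing pods and patch each new pod's nodeSelector to keep the spot/on-demand ratio near the requested percentage.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: capacity-type-distribution
webhooks:
- name: capacity-type-distribution.karpenter.sh   # hypothetical name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore        # do not block pod creation if the webhook is unavailable
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      name: karpenter-webhook  # placeholder Service
      namespace: karpenter     # placeholder namespace
      path: /mutate-capacity-type
    # caBundle / TLS setup omitted for brevity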

bwagner5 avatar Apr 01 '21 17:04 bwagner5

My first thought is that this webhook should be decoupled from Karpenter's core controller. Maybe something plugged into aws cloud provider once we break it apart?

ellistarn avatar Apr 01 '21 18:04 ellistarn

> My first thought is that this webhook should be decoupled from Karpenter's core controller. Maybe something plugged into aws cloud provider once we break it apart?

Yeah, I think that makes the most sense.

bwagner5 avatar Apr 01 '21 18:04 bwagner5

We have one application that has been testing out ASGs with mixed purchasing options (on-demand and spot), following this: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-mixed-instances-groups.html

In the future it would be great to have a mechanism to migrate this ASG to Karpenter.

vinayan3 avatar Dec 16 '21 03:12 vinayan3

Two (maybe random) questions.

Should node.k8s.aws/capacity-type-distribution/spot-percentage be an annotation instead of a label?

Now that we'd have two separate deployments, how would that work with HPA? How would it know which deployment to scale while keeping the total application deployment balanced?

rothgar avatar Jan 26 '22 17:01 rothgar

+1

aeciopires avatar Jan 31 '22 18:01 aeciopires

We may also want to think about how to leverage the current topology spread constraints instead of new annotations.

rverma-dev avatar Feb 02 '22 03:02 rverma-dev

We've discussed expanding the topologySpreadConstraints concept to include percent-based spread. I think this is a perfect fit.
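
For context, today's topologySpreadConstraints only express an absolute maxSkew, not a percentage, so spreading directly on the capacity-type label can only approximate a 50/50 split. A sketch of that (a pod spec fragment, using the label key from the original proposal):

spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # counts pods; ratios like 90/10 are not expressible this way
    topologyKey: node.k8s.aws/capacity-type    # spread across spot vs. on-demand nodes
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inflate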

ellistarn avatar Feb 02 '22 04:02 ellistarn

+1

Rokeguilherme avatar Feb 04 '22 18:02 Rokeguilherme

Any updates, please?

rverma-dev avatar Mar 06 '22 07:03 rverma-dev

+1

himanshurajput32 avatar Jun 13 '22 09:06 himanshurajput32

Hey folks, just a reminder to 👍 the original issue, rather than +1 in the comments, since it's easier for us to sort issues by most upvoted.

ellistarn avatar Jun 13 '22 22:06 ellistarn

Any update, please?

himanshurajput32 avatar Jun 20 '22 07:06 himanshurajput32

I've documented another method for achieving something similar at https://karpenter.sh/preview/tasks/scheduling/#on-demandspot-ratio-split that may work for some.
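
Paraphrasing the linked method as a rough sketch (written against the older Provisioner API; exact field names and versions vary across Karpenter releases): define a custom capacity-spread label whose values are backed unevenly by spot and on-demand provisioners, then spread pods across that label.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: capacity-spread           # custom label; four values backed by spot capacity
    operator: In
    values: ["2", "3", "4", "5"]
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: on-demand
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  - key: capacity-spread           # one value backed by on-demand capacity
    operator: In
    values: ["1"]

Workloads then add a topologySpreadConstraint with topologyKey: capacity-spread (similar to the fragment shown earlier in this thread), so replicas spread evenly across the five values and land roughly 4:1 spot to on-demand.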

tzneal avatar Aug 05 '22 18:08 tzneal

👍

yum-dev avatar Aug 09 '22 13:08 yum-dev

+1

andredeo avatar Sep 28 '22 13:09 andredeo

👍

leeloo87 avatar Oct 25 '22 08:10 leeloo87

@tzneal This is the correct link for the on-demand/spot ratio split: https://karpenter.sh/preview/concepts/scheduling/#on-demandspot-ratio-split

gucarreira avatar Mar 05 '23 18:03 gucarreira

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '24 19:01 k8s-triage-robot

/remove-lifecycle stale

James-Quigley avatar Jan 31 '24 21:01 James-Quigley

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 30 '24 22:04 k8s-triage-robot

/remove-lifecycle stale

James-Quigley avatar May 06 '24 13:05 James-Quigley

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '24 14:08 k8s-triage-robot

/remove-lifecycle stale

sidewinder12s avatar Aug 05 '24 17:08 sidewinder12s

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 04 '24 17:09 k8s-triage-robot

/remove-lifecycle rotten

sidewinder12s avatar Sep 04 '24 20:09 sidewinder12s