
Mega Issue: Karpenter doesn't support custom resource requests/limits

Open prateekkhera opened this issue 2 years ago • 38 comments

Version

Karpenter: v0.10.1

Kubernetes: v1.20.15

Expected Behavior

Karpenter should be able to trigger a scale-up for the pending pod

Actual Behavior

Karpenter isn't able to trigger a scale-up

Steps to Reproduce the Problem

We're using Karpenter on EKS. We have pods that have a custom resource request/limit in their spec definition - smarter-devices/fuse: 1. Karpenter does not seem to respect this resource and fails to scale up, so the pod remains in a Pending state.

Resource Specs and Logs

Provisioner spec

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"
  provider:
    launchTemplate: xxxxx
    subnetSelector:
      xxxxx: xxxxx
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.large
    - m5.2xlarge
    - m5.4xlarge
    - m5.8xlarge
    - m5.12xlarge
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
status:
  resources:
    cpu: "32"
    memory: 128830948Ki

pod spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fuse-test
  labels:
    app: fuse-test
spec:
  replicas: 1
  selector:
    matchLabels:
      name: fuse-test
  template:
    metadata:
      labels:
        name: fuse-test
    spec:
      containers:
      - name: fuse-test
        image: ubuntu:latest
        ports:
          - containerPort: 8080
            name: web
            protocol: TCP
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
        resources:
          limits:
            cpu: 32
            memory: 4Gi
            smarter-devices/fuse: 1  # Custom resource
          requests:
            cpu: 32
            memory: 2Gi
            smarter-devices/fuse: 1  # Custom resource
        env:
        - name: S3_BUCKET
          value: test-s3
        - name: S3_REGION
          value: eu-west-1

karpenter controller logs:

controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];

prateekkhera avatar Jun 06 '22 16:06 prateekkhera
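
For context, smarter-devices/fuse is an extended resource that the smarter-device-manager device plugin advertises only after it is running on a node, which is why Karpenter cannot anticipate it for a node that does not exist yet. A minimal illustration of what the plugin adds to an existing node's status (values are illustrative, not taken from this cluster):

# Illustrative excerpt of a node's status once the device plugin has registered
status:
  allocatable:
    cpu: "32"
    memory: 128830948Ki
    smarter-devices/fuse: "1"   # extended resource, unknown to Karpenter before the node exists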

Looks like you're running purely into the CPU resources here. I added the feature label as it looks like you're requesting to be able to add custom resources into the ProvisionerSpec.Limits?

njtran avatar Jun 06 '22 19:06 njtran

@njtran , this is the bit:

smarter-devices/fuse: 1 # Custom resource

ellistarn avatar Jun 06 '22 20:06 ellistarn

As discussed on Slack:

@Todd Neal and I were recently discussing a mechanism to allow users to define extended resources that karpenter isn't aware of. Right now, we are aware of the extended resources on specific EC2 instance types, which is how we binpack them. One option would be to enable users to define a configmap of [{instancetype, provisioner, extendedresource}] that karpenter could use for binpacking.

ellistarn avatar Jun 06 '22 20:06 ellistarn
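
A rough sketch of the configmap shape described above; the object name and keys here are hypothetical, purely to illustrate the [{instancetype, provisioner, extendedresource}] mapping, and are not an existing Karpenter API:

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-extended-resources   # hypothetical name, not an existing Karpenter object
  namespace: karpenter
data:
  extended-resources: |
    # one entry per {instancetype, provisioner, extendedresource} combination used for binpacking
    - instanceType: m5.large
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "1"
    - instanceType: m5.2xlarge
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "1"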

Thanks @ellistarn - the proposed solution looks good. Sorry for asking, but is there any ETA on this? We're unable to use Karpenter because of it.

prateekkhera avatar Jun 07 '22 04:06 prateekkhera

I'm having the same issue with vGPU.

CodeBooster97 avatar Jun 07 '22 07:06 CodeBooster97

@ellistarn Hope you are doing well! I encountered the same issue while working with Karpenter, so I wanted to know whether this has been implemented in any existing PR?

parmeet-kumar avatar Jun 29 '22 08:06 parmeet-kumar

This isn't currently being worked on -- we're prioritizing consolidation and test/release infrastructure at the moment. If you're interested in picking up this work, check out https://karpenter.sh/v0.13.1/contributing/

ellistarn avatar Jul 03 '22 00:07 ellistarn

For us this is a blocking issue with Karpenter. Our use case is fuse and snd devices that are created as custom device resources by smarter-device-manager.

As a simpler workaround, @ellistarn @tzneal, why not just ignore resources that Karpenter is unaware of? Instead of having to create a ConfigMap as a whitelist, Karpenter could just filter down to well-known resources and act upon those, and ignore resources it has no idea about. It can't do anything good about those anyway...

Taking this error message:

Failed to provision new node, incompatible with provisioner "default", no instance type satisfied resources {....smarter-devices/fuse":"2"} ...

it looks like Karpenter already has all the information needed to tell "manageable" resources apart from those that are not?

universam1 avatar Jul 19 '22 10:07 universam1

I'm having the same issue with hugepages

ghost avatar Feb 09 '23 07:02 ghost

https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3717

universam1 avatar Mar 22 '23 15:03 universam1

We also need this, for nitro enclaves.

james-callahan avatar Apr 16 '23 13:04 james-callahan

We also need this when using the "fuse" device plugin resource; here is what we ran into and how we are currently working around this issue. #308

lzjqsdd avatar May 04 '23 01:05 lzjqsdd

If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?

Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective

bryantbiggs avatar May 12 '23 20:05 bryantbiggs

If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?

Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective

I think so? In that with some effort all custom resources could be re-written as dynamic resource allocations.

This is probably a good fit for nitro enclaves; but probably a bad fit for e.g. hugepages.

Likely Karpenter will need to gain support for both.

james-callahan avatar May 16 '23 03:05 james-callahan

We tried enabling hugepages on all nodes with the sysctl "vm.nr_hugepages" = "2048" and transparent_hugepage = ["always"].

After this, Karpenter went crazy spinning up 50 new worker nodes for one of the existing pods. That pod does not have anything related to hugepages, just a RAM request of 3 GB. Karpenter spins up a new node with 8 GB of RAM, then the scheduler is not able to run the pod on the new node (because part of the RAM is reserved for hugepages). After that, Karpenter spins up another node (again with 8 GB of RAM), and once again the scheduler can't run the pod.

It looks like the Linux hugepages option messes up Karpenter's ability to calculate the memory resources properly.

Even without any custom resource requests/limits set, the mere existence of some custom resources can be enough to introduce problems with Karpenter's behavior.

project-administrator avatar Nov 10 '23 07:11 project-administrator
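
That behaviour is consistent with how the kubelet accounts for pre-allocated huge pages: with vm.nr_hugepages = 2048 and a 2Mi page size, 4Gi is carved out of the node's memory and reported as a separate hugepages-2Mi resource, so a node with 8 GB of RAM no longer advertises anywhere near 8 GB of schedulable memory. Roughly (illustrative values):

# Illustrative allocatable on an ~8 GB node booted with vm.nr_hugepages=2048 (2Mi pages)
status:
  allocatable:
    hugepages-2Mi: 4Gi    # 2048 pages x 2Mi, reserved at boot
    memory: 3500Mi        # roughly what is left after huge pages and system/kubelet reservations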

We're facing the same issue with KubeVirt. Given that it's been ongoing for a while, it might be good to consider both a short-term solution to unblock users and a long-term solution?

I noticed PR https://github.com/kubernetes-sigs/karpenter/pull/603 mentioning the deprecated Karpenter config map and a Slack conversation started here. As an alternative, I created a fork using the same approach but sourcing configuration from options (arg or environment variable). Would this be an interesting direction to explore? Or is the current state of this issue more "not a priority, maintain your forks until we have a better design / long-term approach for it"?

chomatdam avatar Dec 24 '23 23:12 chomatdam

Bringing in some context on huge pages, which I think are more problematic than just "defining custom allocatable". Huge pages are essentially user-configurable, based on a mix of instance type and the user's needs. That means you could have a different huge page allocatable even within the same instance type, depending on what the node is used for. To add to this problem, hugepages are pre-allocated at boot time at the Linux level, so at best they can be set at the NodeClass level and must be passed through via the node's startup script, BUT because NodeClasses can be used by different instance types, the NodeClass itself cannot be relied upon to know ahead of time how much hugepage resource will be provided.

What does this all mean? Implementation would be difficult because, for Karpenter to work, it would need to know ahead of time a mapping of every permutation of instance type + possible hugepages. This means users must input a mapping from instance types to huge page resources in the node pool, in addition to a mapping from instance type to huge pages in the NodeClass to specify how instances come up.

I am likely missing a few pieces of this puzzle, but this is what I think needs to be solved for hugepages.

garvinp-stripe avatar Feb 28 '24 18:02 garvinp-stripe

I am likely missing a few pieces of this puzzle but this what I think needs to be solved for hugepages

I think there are a couple simplifications that we could do here to support hugepages if we wanted to:

  1. Consider the entire available memory to be used for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule
  2. Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, this would most likely be a setting on the NodeClass and then we would just pass it down through the GetInstanceTypes() call from the CloudProvider.

jonathan-innis avatar Feb 29 '24 00:02 jonathan-innis
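
For reference, option 1 would be summing up pod-level huge page requests like the ones below, which use only upstream Kubernetes APIs (huge pages are requested as a hugepages-<size> resource, with requests equal to limits, and mounted via an emptyDir with medium: HugePages):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: ubuntu:latest
    command: ["sleep", "infinity"]
    resources:
      requests:
        memory: 1Gi
        hugepages-2Mi: 512Mi   # must equal the limit for huge pages
      limits:
        memory: 1Gi
        hugepages-2Mi: 512Mi
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages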

Consider the entire available memory to be used for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule

One issue here is that huge pages are carved out of memory. For us this doesn't matter because we actually do want to move to all huge pages, but most users likely have a mix of huge pages and normal memory. If you advertise everything as huge pages, then your nodes technically don't have any regular memory.

Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, this would most likely be a setting on the NodeClass and then we would just pass it down through the GetInstanceTypes() call from the CloudProvider.

That feels reasonable and removes the need to map instance types to huge pages. Once again, that works for us, but I am unsure whether other users have more unique configurations.

garvinp-stripe avatar Feb 29 '24 00:02 garvinp-stripe

all huge pages but most users likely have a mix of huge pages and normal memory

Sure, when you are calculating the total of all of your huge-pages, you would just have to also subtract that away from the memory requests because you intuitively know that one takes away from the other.

Once again that works for us but I am unsure if other users have more unique configuration

Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.

jonathan-innis avatar Feb 29 '24 01:02 jonathan-innis

Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.

Agreed. I think this general approach should work; I need some time to let it bake in my head to see whether we would encounter any issues.

garvinp-stripe avatar Feb 29 '24 02:02 garvinp-stripe

We also need this feature. Our use case involves a controller which adds extended resources to nodes immediately after a new node is created. Karpenter will not create a node for pods using such extended resources, because it doesn't understand them.

In our case, using node affinity and node selectors together with existing node labels is sufficient to direct Karpenter to pick a good node. The only thing we need is for Karpenter to ignore a list of extended resources when finding the correct instance type. Having said that, I do have a forked workaround, but forked workarounds are not acceptable where I work, for good reason.

Having ignorable extended resources wouldn't be new in Kubernetes; they also exist in the scheduler.

uniemimu avatar Mar 04 '24 15:03 uniemimu
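
The scheduler-side precedent mentioned above is the NodeResourcesFit plugin's ignoredResources argument in the kube-scheduler configuration; a minimal sketch of that upstream option (this is a scheduler setting, not a Karpenter one):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      ignoredResources:
      - smarter-devices/fuse   # fit checks skip this extended resource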

How much appetite is there to simply have an override config map with per-instance-type overrides of resource capacity, just for Karpenter's simulation (support for huge pages and possibly other extended resources)? https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/instancetype/types.go#L179

A config map that maps instance types to any resource overrides; if a particular resource isn't overridden, take what is provided by the cloud provider.

Pin the configmap per NodeClass via a new NodeClass setting, instanceTypeResourceOverride. Note that changes to the configmap won't be reflected on current nodes; we would use drift to reconcile the changes.

This pushes the onus onto users to ensure that their overrides are correct. We won't provide any sophisticated pattern matching, and users can build their own generator for producing this map.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-resource-override-config
  namespace: karpenter
data:
  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }


garvinp-stripe avatar Mar 13 '24 23:03 garvinp-stripe

Hopefully users wouldn't need to maintain their own list of acceptable instance types in order to handle the "fuse" use case, as fuse doesn't depend on particular instance types.

It's a bit frustrating that the fuse use case is being held up by hugepages. The fuse use case is probably common enough to justify being handled out of the box.

johngmyers avatar Mar 13 '24 23:03 johngmyers

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

GnatorX avatar Mar 14 '24 00:03 GnatorX

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

In its current form DRA does not work with cluster autoscalers. Some future versions of DRA might work with cluster autoscalers, but such a version isn't available yet.

The current DRA relies on a node-level entity, namely the resource driver kubelet plugin daemonset, which will not deploy before the node is created. Since cluster autoscalers don't know anything about DRA, they will not create a node for a pending pod that requires DRA resource claims. DRA users are in the same limbo as are the extended resource users. The cluster autoscaler can't know whether the new resources will pop up in the node as a result of some controller or daemonset. Maybe they will, maybe they won't.

I'm all for giving the users the possibility to configure the resources for Karpenter in a form of a configmap or CRD or similar. A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case.

uniemimu avatar Mar 14 '24 08:03 uniemimu

A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case.

That feels fine also. Let me try to bring this up during working group meeting

GnatorX avatar Mar 14 '24 17:03 GnatorX

Curious if this could be a configuration on the NodePool; we're able to add custom Requirements to allow Karpenter to schedule when hard affinities or tolerations are defined. Would having an entry to define capacity, hinting to Karpenter 'this node pool will satisfy requests/limits for [custom capacity]', be an option?

My use case is smarter-devices/kvm - which can be filtered on a NodePool as metal instances. I could imagine the same for hugePages or similar - we know which instances have these, so we can filter them using custom NodePools.

By using weighting we can define these after the main NodePools - so in my example, I would have Spot for everything at weight 100, on-demand for everything at weight 90, and then our KVM pool with capacity hints at weight 80.

In the meantime, I'm using an overprovisioner marked with hard affinity for metal instances to ensure these pods can be scheduled; it's a trade-off of extra cost for the ability to use Karpenter exclusively.

Bourne-ID avatar Mar 23 '24 00:03 Bourne-ID
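
A rough sketch of what that could look like: the weight field and the requirements below are existing NodePool fields, while the commented-out capacityHints block is a hypothetical field added only to illustrate the capacity-hint idea (it does not exist in the NodePool API):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: kvm-metal
spec:
  weight: 80                     # evaluated after the weight-100 spot and weight-90 on-demand pools
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.metal"]     # metal instances that expose /dev/kvm
  # Hypothetical: advertise extended resources that nodes from this pool will provide
  # capacityHints:
  #   smarter-devices/kvm: "1"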

I wonder if this is something that might be useful to configure both at the node pool level and at the instance type level. Ultimately, we were leaning away from an InstanceTypeOverride CRD due to the level of effort to configure it, but perhaps with support for both, it provides an escape hatch as well as the ability to define a simple blanket policy.

We could choose any/all of the following:

  1. CloudProvider automatically knows extended resource values (e.g. GPU)
  2. NodePool (or class) lets you specify a flat resource value per node pool
  3. NodePool (or class) lets you specify scalar resource values (e.g. hugePageMemoryPercent)
  4. InstanceType CRD (or config map) lets you define per-instance-type resource overrides.

ellistarn avatar Mar 25 '24 05:03 ellistarn
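
To make options 2 and 3 a bit more concrete, a hypothetical sketch; neither commented-out field exists today, and the names are placeholders only:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
  # Option 2 (hypothetical): a flat extended-resource value applied to every node in this pool
  # extendedResources:
  #   smarter-devices/fuse: "1"
  # Option 3 (hypothetical, possibly on the NodeClass instead): a scalar knob such as
  # hugePageMemoryPercent: 25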

/cc

fmuyassarov avatar Mar 25 '24 14:03 fmuyassarov