Mega Issue: Karpenter doesn't support custom resource requests/limits
Version
Karpenter: v0.10.1
Kubernetes: v1.20.15
Expected Behavior
Karpenter should be able to trigger a scale-up for the pending pod
Actual Behavior
Karpenter isn't able to trigger a scale-up
Steps to Reproduce the Problem
We're using Karpenter on EKS. We have pods that have a custom resource request/limit in their spec definition (smarter-devices/fuse: 1). Karpenter does not seem to respect this resource and fails to scale up, so the pod remains in a Pending state.
Resource Specs and Logs
Provisioner spec
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"
  provider:
    launchTemplate: xxxxx
    subnetSelector:
      xxxxx: xxxxx
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m5.large
        - m5.2xlarge
        - m5.4xlarge
        - m5.8xlarge
        - m5.12xlarge
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
  ttlSecondsAfterEmpty: 30
status:
  resources:
    cpu: "32"
    memory: 128830948Ki
```
Deployment spec
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fuse-test
  labels:
    app: fuse-test
spec:
  replicas: 1
  selector:
    matchLabels:
      name: fuse-test
  template:
    metadata:
      labels:
        name: fuse-test
    spec:
      containers:
        - name: fuse-test
          image: ubuntu:latest
          ports:
            - containerPort: 8080
              name: web
              protocol: TCP
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          resources:
            limits:
              cpu: 32
              memory: 4Gi
              smarter-devices/fuse: 1 # Custom resource
            requests:
              cpu: 32
              memory: 2Gi
              smarter-devices/fuse: 1 # Custom resource
          env:
            - name: S3_BUCKET
              value: test-s3
            - name: S3_REGION
              value: eu-west-1
```
Karpenter controller logs:
```
controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];
```
Looks like you're running purely into the CPU resources here. I added the feature label as it looks like you're requesting to be able to add custom resources into the ProvisionerSpec.Limits?
@njtran, this is the bit:
```yaml
smarter-devices/fuse: 1 # Custom resource
```
As discussed on slack:
> @Todd Neal and I were recently discussing a mechanism to allow users to define extended resources that karpenter isn't aware of. Right now, we are aware of the extended resources on specific EC2 instance types, which is how we binpack them. One option would be to enable users to define a configmap of [{instancetype, provisioner, extendedresource}] that karpenter could use for binpacking.
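For illustration, a configmap along those lines might look something like the following; the name, key format, and schema are hypothetical, not an implemented API:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-extended-resources  # hypothetical name
  namespace: karpenter
data:
  # Hypothetical key format: <provisioner>.<instance-type>, mapping to the
  # extended resources Karpenter should assume those nodes will advertise.
  default.m5.large: |
    smarter-devices/fuse: "1"
  default.m5.2xlarge: |
    smarter-devices/fuse: "1"
```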
Thanks @ellistarn - the proposed solution looks good. Sorry for asking, but is there any ETA on this? We're unable to use Karpenter because of it.
I'm having the same issue with vGPU.
@ellistarn Hope you are doing well! I encountered the same issue while working with Karpenter, so I wanted to know whether this has been implemented in any existing PR?
This isn't currently being worked on -- we're prioritizing consolidation and test/release infrastructure at the moment. If you're interested in picking up this work, check out https://karpenter.sh/v0.13.1/contributing/
For us this is a blocking issue with Karpenter. Our use case is fuse and snd devices that are created as custom device resources by smarter-device-manager.
As a simpler workaround, @ellistarn @tzneal, why not just ignore resources that Karpenter is unaware of? Instead of having to create a configMap as a whitelist, Karpenter could just filter down to well-known resources and act upon those, but ignore other resources it has no idea of. It can't do anything useful with those anyway...
Taking this error message:
> Failed to provision new node, incompatible with provisioner "default", no instance type satisfied resources {....smarter-devices/fuse":"2"} ...
it looks like Karpenter already has all the information about which resources it can manage and which it can't?
I'm having the same issue with hugepages
https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3717
We also need this, for nitro enclaves.
We also need this when using the "fuse" device plugin resource; here is what we ran into and how we're currently working around this issue: #308
If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?
Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective
> If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?
> Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective
I think so? In that, with some effort, all custom resources could be rewritten as dynamic resource allocations.
This is probably a good fit for nitro enclaves, but probably a bad fit for e.g. hugepages.
Karpenter will likely need to gain support for both.
We tried enabling hugepages on all nodes with sysctl vm.nr_hugepages = "2048" and transparent_hugepage = ["always"].
After this, Karpenter went crazy, spinning up 50 new worker nodes for one of the existing pods. That pod does not request anything related to hugepages, just 3Gi of RAM. Karpenter spins up a new node with 8Gi of RAM, then the scheduler is unable to run the pod on the new node (because part of the RAM is reserved for hugepages). So Karpenter spins up another node (again with 8Gi of RAM), and once again the scheduler can't place the pod.
It looks like the Linux hugepages option breaks Karpenter's ability to calculate memory resources properly.
Even without any custom resource requests/limits set, the mere existence of some custom resources might be enough to cause problems with Karpenter's behavior.
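To make the arithmetic concrete (assuming a node with 8Gi of RAM and the default 2Mi huge page size): vm.nr_hugepages = 2048 means 2048 × 2Mi = 4Gi is carved out of main memory at boot, so the node advertises only roughly 8Gi − 4Gi − system-reserved ≈ 3Gi or less of allocatable memory. The 3Gi request plus overhead can never fit, and Karpenter, unaware of the carve-out, keeps launching 8Gi nodes.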
We're facing the same issue with KubeVirt. Given that this has been ongoing for a while, it might be good to consider both a short-term solution to unblock users and a long-term solution?
I noticed PR https://github.com/kubernetes-sigs/karpenter/pull/603 mentioning the deprecated Karpenter config map, and a Slack conversation started here. As an alternative, I created a fork using the same approach but sourcing configuration from options (arg or environment variable). Would this be an interesting direction to explore? Or is the current state of this issue more "not a priority, maintain your forks until we have a better design / long-term approach"?
Bringing in some context on huge pages, which I think are more problematic than just "defining custom allocatable". Huge pages are essentially user-configurable, based on a mix of instance type and the user's needs. That means you could have different huge page allocatable even within the same instance type, depending on what the node is used for. To add to the problem, huge pages are pre-allocated at boot time at the Linux level, so at best they can be set at the NodeClass level and must be passed through via the node's startup script. BUT because a NodeClass can be used by different instance types, the NodeClass itself cannot be relied upon to know ahead of time how much hugepage resource can be provided.
What does this all mean? Implementation would be difficult: for Karpenter to work, it would need to know ahead of time a mapping of every permutation of instance type + possible hugepages. This means the user must input a mapping from instance type to huge page resource in the node pool, in addition to the instance type to huge page mapping in the NodeClass that specifies how instances come up.
I am likely missing a few pieces of this puzzle, but this is what I think needs to be solved for hugepages.
> I am likely missing a few pieces of this puzzle, but this is what I think needs to be solved for hugepages.
I think there are a couple of simplifications we could make here to support hugepages if we wanted to:
- Consider the entire available memory to be usable for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule.
- Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, it would most likely be a setting on the NodeClass that we pass down through the GetInstanceTypes() call from the CloudProvider (see the sketch after this list).
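A rough sketch of what the second option might look like on an EC2NodeClass; the hugePagesMemoryPercent field is hypothetical, not an existing API:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: hugepages-example
spec:
  amiFamily: AL2
  # Hypothetical field: reserve 25% of each instance's memory as 2Mi huge
  # pages. During GetInstanceTypes(), Karpenter would subtract this from the
  # advertised memory, advertise the equivalent hugepages-2Mi capacity, and
  # set vm.nr_hugepages accordingly in the generated startup script.
  hugePagesMemoryPercent: 25
```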
> Consider the entire available memory to be usable for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule.
One issue here is that huge pages are carved out of memory. For us this doesn't matter, because we actually do want to move to all huge pages, but most users likely have a mix of huge pages and normal memory. If you advertise all huge pages, then your nodes technically don't have memory.
> Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, it would most likely be a setting on the NodeClass that we pass down through the GetInstanceTypes() call from the CloudProvider.
That feels reasonable and removes the need to map instance types to huge pages. Once again, that works for us, but I am unsure whether other users have more unique configurations.
> all huge pages but most users likely have a mix of huge pages and normal memory
Sure, when you are calculating the total of all of your huge pages, you would just have to also subtract that away from the memory requests, because you intuitively know that one takes away from the other.
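For example (hypothetical numbers): if the pods on a prospective node request 6Gi of hugepages-2Mi and 2Gi of regular memory, Karpenter would binpack them against an instance with at least 8Gi of raw memory and boot it with vm.nr_hugepages = 3072 (3072 × 2Mi = 6Gi), leaving the remaining memory for ordinary allocation.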
> Once again, that works for us, but I am unsure whether other users have more unique configurations.
Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.
> Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.
Agreed. I think this general approach should work; I need some time to let it bake in my head to see whether we would encounter any issues.
We also need this feature. Our use case involves a controller that adds extended resources to nodes immediately when a new node is created. Karpenter will not create a node for pods using such extended resources, because it doesn't understand them.
In our case, using node affinity and node selectors together with existing node labels is sufficient to direct Karpenter to pick a good node. The only thing we need is for Karpenter to ignore a list of extended resources when finding the correct instance type. Having said that, I do have a forked workaround, but forked workarounds are not acceptable where I work, for good reason.
Having ignorable extended resources wouldn't be new in Kubernetes; they exist in the scheduler as well.
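For reference, this is the kube-scheduler precedent: the NodeResourcesFit plugin accepts an ignoredResources list, so the scheduler can fit pods without requiring nodes to advertise those resources. A minimal sketch:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeResourcesFit
        args:
          # Resources the scheduler ignores when checking node fit.
          ignoredResources:
            - smarter-devices/fuse
```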
How much appetite is there to simply have an override config map with per-instance-type overrides on resource capacity, used only for Karpenter's scheduling simulation (supporting huge pages and possibly other extended resources)? https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/instancetype/types.go#L179
A config map that maps instance types to any resource overrides; if a particular resource isn't overridden, take what the cloud provider reports.
Pin the configmap per NodeClass via a new NodeClass setting, instanceTypeResourceOverride. Note that changes to the configmap won't be reflected on current nodes; we would use drift to reconcile the changes.
This pushes the onus onto users to ensure that their overrides are correct. We won't provide any sophisticated pattern matching, and users can build their own generator for producing this map.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-resource-override-config
  namespace: karpenter
data:
  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }
```
Hopefully users wouldn't need to maintain their own list of acceptable instance types in order to handle the "fuse" use case, as fuse doesn't depend on particular instance types.
It's a bit frustrating that the fuse use case is being held up by hugepages. The fuse use case is probably common enough to justify being handled out of the box.
I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?
> I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?
In its current form DRA does not work with cluster autoscalers. Some future version of DRA might, but such a version isn't available yet.
The current DRA relies on a node-level entity, namely the resource driver's kubelet plugin daemonset, which will not deploy before the node is created. Since cluster autoscalers don't know anything about DRA, they will not create a node for a pending pod that requires DRA resource claims. DRA users are in the same limbo as extended resource users: the cluster autoscaler can't know whether the new resources will pop up on the node as a result of some controller or daemonset. Maybe they will, maybe they won't.
I'm all for giving users the possibility to configure the resources for Karpenter in the form of a configmap, CRD, or similar. A nice bonus would be if one could also define extended resources that are applied to all instance types, covering the fuse case in a simple fashion.
> A nice bonus would be if one could also define extended resources that are applied to all instance types, covering the fuse case in a simple fashion.
That feels fine also. Let me try to bring this up during the working group meeting.
Curious if this could be a configuration on the NodePool; we're able to add custom requirements to allow Karpenter to schedule when hard affinities or tolerations are defined. Would having an entry that hints to Karpenter "this node pool will satisfy requests/limits for [custom capacity]" be an option?
My use case is smarter-devices/kvm, which can be filtered to metal instances on a NodePool. I could imagine the same for huge pages or similar: we know which instances have these, so we can filter them using custom NodePools.
By using weighting we can define these after the main NodePools - so in my example, I would have spot-for-all at weight 100, on-demand-for-all at weight 90, and then our KVM pool with capacity hints at weight 80 (see the sketch below).
In the meantime, I'm using an overprovisioner pinned with hard affinity to metal instances to ensure these pods can be scheduled; it's a tradeoff of extra cost for the ability to use Karpenter exclusively.
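A rough sketch of that capacity-hints idea on a v1beta1 NodePool; the capacityHints field is hypothetical, while weight and the instance-size requirement are existing Karpenter features:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: kvm-metal
spec:
  weight: 80  # evaluated after the spot (100) and on-demand (90) pools
  template:
    spec:
      requirements:
        # Restrict this pool to metal instances, which expose /dev/kvm.
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["metal"]
      # Hypothetical field: tell Karpenter that nodes from this pool will
      # advertise these extended resources once the device plugin starts.
      capacityHints:
        smarter-devices/kvm: "1"
```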
I wonder if this is something that might be useful to configure both at the node pool level and at the instance type level. Ultimately, we were leaning away from an InstanceTypeOverride CRD due to the level of effort to configure it, but perhaps with support for both, it provides an escape hatch as well as the ability to define a simple blanket policy.
We could choose any/all of the following:
- Cloudprovider automatically knows extended resource values (e.g. GPU)
- NodePool (or class) lets you specify a flat resource value per node pool
- NodePool (or class) lets you specify scalar resource values (e.g. hugePageMemoryPercent)
- InstanceType CRD (or config map) lets you define per-instance-type resource overrides.
/cc