RFC: NodeOverlay
Fixes:
- https://github.com/aws/karpenter-provider-aws/issues/3860
- https://github.com/aws/karpenter-provider-aws/pull/4697
- https://github.com/kubernetes-sigs/karpenter/issues/751
- https://github.com/kubernetes-sigs/karpenter/issues/729
- https://github.com/aws/karpenter-provider-aws/issues/5161
Description
```yaml
# Reduce on-demand prices to 90% of list price
# https://github.com/aws/karpenter-provider-aws/issues/3860
# https://github.com/aws/karpenter-provider-aws/pull/4697
kind: NodeOverlay
metadata:
  name: discount
spec:
  selector:
    matchLabels:
      karpenter.sh/capacity-type: on-demand
  pricePercent: 90
---
# Support for extended resource types (e.g. smarter-devices/fuse)
# https://github.com/kubernetes-sigs/karpenter/issues/751
# https://github.com/kubernetes-sigs/karpenter/issues/729
kind: NodeOverlay
metadata:
  name: discount
spec:
  selector:
    matchLabels:
      karpenter.sh/capacity-type: on-demand
    matchExpressions:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - m5.large
      - m5.2xlarge
      - m5.4xlarge
      - m5.8xlarge
      - m5.12xlarge
  capacity:
    smarter-devices/fuse: 1
---
# Add memory overhead of 10Mi to all instances with 2Gi memory
# https://github.com/aws/karpenter-provider-aws/issues/5161
kind: NodeOverlay
metadata:
  name: discount
spec:
  selector:
    matchLabels:
      karpenter.k8s.aws/instance-memory: "2048"
  overhead:
    memory: 10Mi
```
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ellistarn
(nit) In the PR description, the NodeOverlays all have the same .metadata.name
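For illustration, each example could carry a distinct, descriptive name (the name below is made up):

```yaml
kind: NodeOverlay
metadata:
  name: on-demand-discount   # hypothetical: one distinct name per overlay
```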
Is this an overlay for Nodes, or an overlay for NodePools (maybe more than one NodePool, if we support matching via a selector)?
In a real scenario with a savings plan, there is a commitment to a consistent amount of usage. Once the committed usage is exhausted, the price reverts to the on-demand rate. Does this PR take that into consideration?
What I expect is:
- When there is remaining committed usage, Karpenter should continue to select on-demand instances (instance types) under the savings plan.
- When the committed usage is exhausted, Karpenter should choose spot instances.
@ellistarn We redefined the value of topology.kubernetes.io/zone. However, the current zone value is provided by the CloudProvider. Is it possible to support custom topology.kubernetes.io/zone and other WellKnownLabel values?
It sounds like you don't agree with the zone naming used in an existing integration. You can:
- use a different label
  - this is the easy and recommended option
- replace the cloud provider integration code with your own (Karpenter, cloud controller manager, etc.)
I don't think this PR would need to change to accommodate the ask.
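For the first option, a minimal sketch of stamping a custom label via a NodePool (the label key and value are hypothetical):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: custom-zone
spec:
  template:
    metadata:
      labels:
        # hypothetical custom label, instead of redefining the well-known
        # topology.kubernetes.io/zone value that the CloudProvider populates
        example.com/zone: zone-a
    spec:
      nodeClassRef:
        name: default   # assumes an existing NodeClass named "default"
```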
Thanks for trying to tackle this!
I wonder if there is a cleaner way to represent the overlays to reduce sprawl.
Thinking more generally, each dimension can be an addition (a new custom resource) and/or an adjustment (a price change or resource modification), assuming we aren't supporting deleting properties at this time.
Price is a slight outlier from resources, so we can separate it out. Would it make sense to have price and resources at the highest level, each with subsections for adjustments and additions? Adjustments would support +/- percentages or values, and additions would just be a dict of new custom resources and values.
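To make that concrete, a rough sketch (every field name below is hypothetical, not part of this PR's API):

```yaml
kind: NodeOverlay
metadata:
  name: example
spec:
  selector:
    matchLabels:
      karpenter.sh/capacity-type: on-demand
  price:
    adjustment: "-10%"          # hypothetical: +/- percent or absolute value
  resources:
    additions:                  # hypothetical: dict of new custom resources
      smarter-devices/fuse: 1
```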
@sftim I mean NodeOverlay can provide more overlay options, not just resources. For example, Labels.
Hi folks. Please comment inline in the docs w/ threads, so I can respond directly.
For those watching. I just pushed a rev with an RFC document.
Pull Request Test Coverage Report for Build 9605413160
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- 2 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.4%) to 80.062%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/node/termination/controller.go | 2 | 59.81% |
| Total: | 2 | |
| Totals | |
|---|---|
| Change from base Build 9589529985: | -0.4% |
| Covered Lines: | 8324 |
| Relevant Lines: | 10397 |
Coveralls
Pull Request Test Coverage Report for Build 9649497449
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- 9 unchanged lines in 2 files lost coverage.
- Overall coverage decreased (-0.6%) to 79.994%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/node/termination/controller.go | 2 | 59.81% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.67% |
| Total: | 9 | |
| Totals | |
|---|---|
| Change from base Build 9639333499: | -0.6% |
| Covered Lines: | 8317 |
| Relevant Lines: | 10397 |
Coveralls
Pull Request Test Coverage Report for Build 9666238327
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- 4 unchanged lines in 2 files lost coverage.
- Overall coverage decreased (-0.4%) to 80.042%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/provisioning/scheduling/nodeclaim.go | 2 | 89.13% |
| pkg/controllers/node/termination/controller.go | 2 | 59.81% |
| Total: | 4 | |
| Totals | |
|---|---|
| Change from base Build 9653637355: | -0.4% |
| Covered Lines: | 8322 |
| Relevant Lines: | 10397 |
Coveralls
Pull Request Test Coverage Report for Build 9667685247
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- 2 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.4%) to 80.067%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/scheduling/requirements.go | 2 | 98.01% |
| Total: | 2 | |
| Totals | |
|---|---|
| Change from base Build 9666563133: | -0.4% |
| Covered Lines: | 8327 |
| Relevant Lines: | 10400 |
Coveralls
Pull Request Test Coverage Report for Build 9719413107
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- 2 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.4%) to 78.371%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/node/termination/terminator/eviction.go | 2 | 89.09% |
| Total: | 2 | |
| Totals | |
|---|---|
| Change from base Build 9719097361: | -0.4% |
| Covered Lines: | 8595 |
| Relevant Lines: | 10967 |
Coveralls
Hi all! Do we have an estimate of when this feature will be publicly available? We have been waiting for it for over a year now.
Our use case: we run CircleCI runner jobs on an EKS cluster, and buildah needs this limit to work:
```yaml
resources:
  limits:
    github.com/fuse: 2
```
But we got the following error:
```json
{
  "level": "ERROR",
  "logger": "controller.provisioner",
  "message": "Could not schedule pod, incompatible with nodepool \"default\", daemonset overhead={\"cpu\":\"180m\",\"memory\":\"120Mi\",\"pods\":\"8\"}, no instance type satisfied resources {\"cpu\":\"2180m\",\"github.com/fuse\":\"1\",\"memory\":\"8312Mi\",\"pods\":\"9\"} and requirements karpenter.k8s.aws/instance-category In [m t], karpenter.k8s.aws/instance-generation Exists >2, karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [default], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux] (no instance type has enough resources); incompatible with nodepool \"arm64\", daemonset overhead={\"cpu\":\"180m\",\"memory\":\"120Mi\",\"pods\":\"8\"}, no instance type satisfied resources {\"cpu\":\"2180m\",\"github.com/fuse\":\"1\",\"memory\":\"8312Mi\",\"pods\":\"9\"} and requirements karpenter.k8s.aws/instance-category In [c m t], karpenter.k8s.aws/instance-generation Exists >2, karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [arm64], kubernetes.io/arch In [arm64], kubernetes.io/os In [linux] (no instance type has enough resources)"
}
```
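If this RFC lands, an overlay along these lines could presumably advertise the extended resource (a sketch against the API proposed in this PR, untested):

```yaml
kind: NodeOverlay
metadata:
  name: fuse
spec:
  selector:
    matchLabels:
      karpenter.sh/capacity-type: on-demand
  capacity:
    github.com/fuse: 2
```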
Pull Request Test Coverage Report for Build 9882748974
Details
- 0 of 60 (0.0%) changed or added relevant lines in 1 file are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.4%) to 77.764%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/apis/v1alpha1/zz_generated.deepcopy.go | 0 | 60 | 0.0% |
| Total: | 0 | 60 | 0.0% |
| Totals | |
|---|---|
| Change from base Build 9880691959: | -0.4% |
| Covered Lines: | 8750 |
| Relevant Lines: | 11252 |
Coveralls
Hi All
I'm a bit confused about the relationship between kubernetes-sigs/karpenter and aws/karpenter-provider-aws.
The lack of support for custom resource requests/limits is blocking me from autoscaling with Xilinx devices.
I want to understand how to deploy this branch to EKS.
Any help or understanding would be appreciated.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
I want to understand how to deploy this branch to EKS.
This branch adds a "request for comments". @ellistarn should the RFC and implementation land as one commit? Seems unusual.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
I think this shouldn't be closed.
/lifecycle frozen
@njtran: The lifecycle/frozen label cannot be applied to Pull Requests.
In response to this:
/lifecycle frozen
What is the progress now? Why does it seem not to be updated?
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Commenting to make it not stale. This is a very important feature that will take Karpenter to the next level.
Is there any update on this?
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Is this still planned for a future Karpenter release? I'm working on deploying Karpenter in some clusters where I use hugepages and am bumping up against this issue; just trying to get a feel for whether I'll need a short-term workaround or if this is a long-term limitation.
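For what it's worth, the API sketched in this RFC looks like it could cover the hugepages case with something like the following (a sketch only; the instance type and amount are placeholders):

```yaml
kind: NodeOverlay
metadata:
  name: hugepages
spec:
  selector:
    matchExpressions:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - m5.large   # placeholder: instance types preconfigured with hugepages
  capacity:
    hugepages-2Mi: 1Gi
```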