Surface health conditions of Karpenter resources as status conditions
Description
What problem are you trying to solve?
Karpenter uses logs and events to alert users about both fleeting and unresolvable errors in their configurations. Logs can be spammy because of rapid re-queues, and events can be slow to reflect when a health issue is actually resolved (waiting until the de-dupe interval is complete).
On top of this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine if they are all in a healthy state and fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.
At a high level, I think the following status conditions would be nice to have (see the sketch after this list):
- [ ] NodePools
  - [ ] NodeClassReady - Rolled-up status condition from the nodeClassRef which checks the generic readiness of that object to determine if the NodeClass is usable
  - [ ] InstanceTypesExist - Validates whether there are provisionable instance types with available offerings to use
  - [ ] NodesDisrupting - Describes whether there are any nodes on the cluster that are actively being disrupted through replacement or deletion. This condition would not have an effect on the NodePool readiness but would be purely informational for alerting and tracking
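To make the proposal concrete, here is a rough Go sketch of how these conditions could be written using the upstream `metav1.Condition` type and the `meta.SetStatusCondition` helper. The condition types, reasons, and messages mirror the list above but are purely illustrative, not a committed API.

```go
package nodepool

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setProposedConditions is an illustrative sketch only; the condition type
// names mirror the proposal above and are not an agreed-upon API.
func setProposedConditions(conditions *[]metav1.Condition, generation int64) {
	// Rolled-up readiness of the referenced NodeClass.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "NodeClassReady",
		Status:             metav1.ConditionTrue,
		Reason:             "NodeClassReady",
		Message:            "referenced NodeClass reports Ready=True",
		ObservedGeneration: generation,
	})
	// Whether any provisionable instance types with available offerings remain.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "InstanceTypesExist",
		Status:             metav1.ConditionTrue,
		Reason:             "OfferingsAvailable",
		Message:            "at least one instance type has an available offering",
		ObservedGeneration: generation,
	})
	// Purely informational: does not gate NodePool readiness.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "NodesDisrupting",
		Status:             metav1.ConditionFalse,
		Reason:             "NoActiveDisruption",
		Message:            "no nodes owned by this NodePool are being replaced or deleted",
		ObservedGeneration: generation,
	})
}
```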
We already have logic in the code that ignores certain NodePools for scheduling and logic that handles cases where the NodeClass isn't ready. That logic could rely strictly on these new status conditions to ensure that we only ever attempt to schedule and provision against NodePools that are in a ready state.
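For example, the scheduling path could collapse into a readiness filter like the sketch below. The trimmed `NodePool` stub, the helper name, and the assumption of a single rolled-up `Ready` condition type are all illustrative.

```go
package scheduling

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodePool is a trimmed stand-in for the real API type, included only so the
// sketch is self-contained.
type NodePool struct {
	Status struct {
		Conditions []metav1.Condition
	}
}

// readyNodePools keeps only the NodePools whose rolled-up Ready condition is
// true, so scheduling only considers pools that report themselves usable.
func readyNodePools(pools []NodePool) []NodePool {
	var ready []NodePool
	for _, np := range pools {
		if meta.IsStatusConditionTrue(np.Status.Conditions, "Ready") {
			ready = append(ready, np)
		}
	}
	return ready
}
```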
Additional Info
It's possible that we could then export metrics from these status conditions so that users can alarm on and track them.
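A minimal sketch of what that metric export could look like with `client_golang`; the metric name, labels, and helper below are hypothetical, not existing Karpenter metrics.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nodePoolCondition mirrors NodePool status conditions into a gauge so users
// can alert on them; the metric name and labels are illustrative only.
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition",
		Help: "1 if the condition is True on the NodePool, 0 otherwise.",
	},
	[]string{"nodepool", "type"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

// recordCondition writes the current state of one condition, e.g. so an alert
// can fire when Ready stays at 0 for several minutes.
func recordCondition(nodePool, conditionType string, isTrue bool) {
	value := 0.0
	if isTrue {
		value = 1.0
	}
	nodePoolCondition.WithLabelValues(nodePool, conditionType).Set(value)
}
```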
How important is this feature to you?
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Is the scope of this proposal limited to the `Status` (type `NodePoolStatus`) of the `NodePool` struct?
Or are there other CRD resources that we want to add/augment aggregated status data for?
> Or are there other CRD resources that we want to add/augment aggregated status data for?
We'd probably also want to scope this proposal to the `NodeClass` and `NodeClaim` as well. We have some status conditions that currently sit on the `NodeClaim`, but I could see us adding more detail here (I know that there was a proposal that the NodeClaim readiness should accurately reflect the Node readiness through its status conditions as well).
For the `NodeClass`, it has been difficult to talk about this in a cloud-neutral context, but I could see us using tangible examples from AWS and Azure to come to some agreed-upon tenets for what the readiness of these objects might look like, so that we can rely on that readiness condition inside of the NodePool.
`NodeClass` seems to just be a redirect to a `NodeClaim`'s `NodeClassRef`. What concrete type definition does a `NodeClassRef` refer to?
It's the CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of that contract, so that we can check the readiness of any arbitrary NodeClass. For instance, right now, the AWS-specific NodeClass is the `EC2NodeClass`.
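One way to express that contract in core is a small interface that any cloud provider's NodeClass satisfies. The interface, method, and condition type below are only a sketch, not the actual Karpenter API.

```go
package nodeclass

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ConditionedObject is a hypothetical contract: any NodeClass implementation
// (EC2NodeClass, an Azure equivalent, etc.) exposes its status conditions so
// core Karpenter can check readiness without knowing the concrete type.
type ConditionedObject interface {
	GetConditions() []metav1.Condition
}

// IsReady reports whether an arbitrary NodeClass advertises Ready=True.
func IsReady(obj ConditionedObject) bool {
	return meta.IsStatusConditionTrue(obj.GetConditions(), "Ready")
}
```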
The `InstanceTypesExist` part is covered by #617, where we display the resolved instance types themselves.
Splitting this down into separate PRs.
@sadath-12 If you are going to take a stab at this, can you make sure to do a design write-up for this one? I think we need to carefully think through all the statusConditions that we want to surface and make sure that we are covering all of the use-cases that we expect from them.
Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely off of the status condition state transitions.
Ya sure @jonathan-innis
Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use https://github.com/aws/karpenter-core/pull/786 as a stepping stone.
/assign
- I'd definitely prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not.
- When a NodePool is close to its capacity limit and all plausible offerings would take the NodePool over that limit, it'd be nice to have that reflected in a condition, ideally also noting which limit(s) have been reached, even if there are no pending Pods to schedule and no demand for new nodes (a sketch follows this list).
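A minimal sketch of what that limit-related condition could look like, assuming a hypothetical `LimitExceeded` condition type and using a single CPU limit as the example:

```go
package nodepool

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setLimitCondition is a sketch: if current usage plus the smallest plausible
// offering would exceed the NodePool's CPU limit, surface that as a condition.
// The "LimitExceeded" type and reasons are illustrative, not an existing API.
func setLimitCondition(conditions *[]metav1.Condition, used, limit, smallestOffering resource.Quantity, generation int64) {
	remaining := limit.DeepCopy()
	remaining.Sub(used)

	cond := metav1.Condition{
		Type:               "LimitExceeded",
		Status:             metav1.ConditionFalse,
		Reason:             "WithinLimits",
		Message:            "NodePool has headroom for at least one more offering",
		ObservedGeneration: generation,
	}
	if remaining.Cmp(smallestOffering) < 0 {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "CPULimitReached"
		cond.Message = "all plausible offerings would exceed the cpu limit"
	}
	meta.SetStatusCondition(conditions, cond)
}
```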
> prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not
Is this different from our current `Registered` condition, or are you thinking more of a live condition that constantly reflects the existence of the Node?
> a live condition that constantly reflects the existence of the Node?
Yes, that. If I force-delete an associated Node, I'd expect the condition to change moments before, or simultaneously with, the moment the NodeClaim starts to get finalized.
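A sketch of how such a live condition could be maintained from a controller-runtime reconciler; the `NodeExists` condition type and the function itself are assumptions, not existing Karpenter code.

```go
package nodeclaim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileNodeExists checks on every reconcile whether the backing Node still
// exists and mirrors that into a hypothetical "NodeExists" condition, so a
// force-deleted Node flips the condition as the NodeClaim begins finalizing.
func reconcileNodeExists(ctx context.Context, c client.Client, nodeName string, conditions *[]metav1.Condition) error {
	node := &corev1.Node{}
	err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node)
	switch {
	case apierrors.IsNotFound(err):
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "NodeExists",
			Status:  metav1.ConditionFalse,
			Reason:  "NodeNotFound",
			Message: "no Node found for this NodeClaim",
		})
		return nil
	case err != nil:
		return err
	default:
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "NodeExists",
			Status:  metav1.ConditionTrue,
			Reason:  "NodeFound",
			Message: "Node " + node.Name + " exists",
		})
		return nil
	}
}
```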
Showing a summary of the associated NodeClaims and whether they are at the current state or currently drifted, as something like 5/6, would give an idea of whether something is happening in the cluster right now.
In order to debug whether a NodeClaim will get removed, is currently starting up, or has expired, it would be great to show that information as part of `-o wide` output when debugging a cluster.
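One way to surface that without a new command is additional printer columns on the CRD. Below is a sketch using kubebuilder markers; the column names and condition types are illustrative, and `priority=1` columns only appear with `-o wide`.

```go
package v1beta1

// NodeClaim here is only a stand-in to hang the markers on; on the real API
// type these would live alongside the existing printcolumn markers.
// +kubebuilder:printcolumn:name="Drifted",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Drifted\")].status"
// +kubebuilder:printcolumn:name="Expired",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Expired\")].status"
// +kubebuilder:printcolumn:name="Initialized",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Initialized\")].status"
type NodeClaim struct {
	// fields elided for brevity
}
```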
Fixed as part of https://github.com/kubernetes-sigs/karpenter/pull/1385 and https://github.com/kubernetes-sigs/karpenter/pull/1401