
Surface health conditions of Karpenter resources as status conditions

Open njtran opened this issue 1 year ago • 16 comments

Description

What problem are you trying to solve? Karpenter uses logs and events to alert users to both transient and unresolvable errors in their configurations. Logs can be spammy due to the fast pace of re-queues, and events can be slow to report when a health issue is actually resolved (waiting until the de-dupe interval completes).

On top of this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine if they are all in a healthy state and fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.

At a high level, we think the following status conditions would be nice to have (a sketch of how they could roll up follows the list):

  • [ ] NodePools
    • [ ] NodeClassReady - A rolled-up status condition from the nodeClassRef that checks the generic readiness of that object to determine whether the NodeClass is usable
    • [ ] InstanceTypesExist - Validates whether there are provisionable instance types with available offerings to use
    • [ ] NodesDisrupting - Describes whether any nodes on the cluster are actively being disrupted through replacement or deletion. This condition would not have an effect on NodePool readiness but would be purely informational for alerting and tracking
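A minimal sketch of the roll-up using the standard metav1.Condition type; the condition type names come from the list above, while the Ready aggregation logic and the nodePoolReady helper are assumptions, not existing Karpenter code:

```go
package status

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Proposed NodePool condition types from the list above.
const (
	ConditionNodeClassReady     = "NodeClassReady"
	ConditionInstanceTypesExist = "InstanceTypesExist"
	ConditionNodesDisrupting    = "NodesDisrupting" // informational only, excluded from readiness
)

// nodePoolReady is a hypothetical roll-up: the NodePool is Ready only when
// every readiness-affecting condition is True.
func nodePoolReady(conditions []metav1.Condition) metav1.Condition {
	ready := metav1.Condition{Type: "Ready", Status: metav1.ConditionTrue, Reason: "Ready"}
	for _, c := range conditions {
		if c.Type == ConditionNodesDisrupting {
			continue // does not gate readiness
		}
		if c.Status != metav1.ConditionTrue {
			ready.Status = metav1.ConditionFalse
			ready.Reason = c.Type + "NotReady"
			ready.Message = c.Message
			break
		}
	}
	return ready
}
```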

We already have logic in the code that ignores certain NodePools for scheduling, and logic for handling cases where the NodeClass isn't ready. That logic could rely strictly on these new status conditions to ensure that we only attempt to schedule and provision through NodePools and NodeClasses that are in a ready state.
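A rough sketch of that readiness filter, assuming the NodePool status exposes standard metav1 conditions with a top-level Ready condition; the nodePool stand-in type and the helper are hypothetical:

```go
package scheduling

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodePool is a stand-in for the real NodePool type; only the fields needed
// for this sketch are included.
type nodePool struct {
	Name       string
	Conditions []metav1.Condition
}

// schedulableNodePools drops any NodePool whose Ready condition is not True,
// so scheduling and provisioning only ever consider ready NodePools.
func schedulableNodePools(pools []nodePool) []nodePool {
	ready := make([]nodePool, 0, len(pools))
	for _, np := range pools {
		if meta.IsStatusConditionTrue(np.Conditions, "Ready") {
			ready = append(ready, np)
		}
	}
	return ready
}
```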

Additional Info: It's possible that we could then export metrics from these status conditions so that users can alarm on and track them.
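For illustration, a hedged sketch of exporting such a metric with client_golang; the metric name, labels, and helper are assumptions rather than an existing Karpenter metric:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodePoolCondition reports 1 when a condition holds the labeled status and 0
// otherwise, so users can alert on e.g. type="Ready",status="False".
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // assumed metric name
		Help: "NodePool status conditions by condition type and status.",
	},
	[]string{"nodepool", "type", "status"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

func recordConditions(nodePoolName string, conditions []metav1.Condition) {
	statuses := []metav1.ConditionStatus{metav1.ConditionTrue, metav1.ConditionFalse, metav1.ConditionUnknown}
	for _, c := range conditions {
		for _, s := range statuses {
			value := 0.0
			if c.Status == s {
				value = 1.0
			}
			nodePoolCondition.WithLabelValues(nodePoolName, c.Type, string(s)).Set(value)
		}
	}
}
```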

How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

njtran avatar Aug 29 '23 20:08 njtran

Is the scope of this proposal limited to the Status (type NodePoolStatus) of the type NodePool struct?

Or are there other CRD resources that we want to add/augment aggregated status data for?

jackfrancis avatar Oct 13 '23 16:10 jackfrancis

Or are there other CRD resources that we want to add/augment aggregated status data for

We'd probably want to scope this proposal to the NodeClass and NodeClaim as well. We have some status conditions that currently sit on the NodeClaim, but I could see us adding more detail there (I know there was a proposal that the NodeClaim readiness should accurately reflect the Node readiness through its status conditions as well).

For the NodeClass, it was difficult to talk about this in a cloud-neutral context, but I could see us using tangible examples from AWS and Azure to come to some agreed-upon tenets for what the readiness of these objects might look like, so that we can rely on that readiness condition inside of the NodePool.

jonathan-innis avatar Oct 13 '23 18:10 jonathan-innis

NodeClass seems to just be a redirect to a NodeClaim's NodeClassRef. What integral type definition is a NodeClassRef referring to?

jackfrancis avatar Oct 13 '23 19:10 jackfrancis

It's the CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of the contract for a NodeClaim, and then we can check the readiness of any arbitrary NodeClass.
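A sketch of what checking that readiness contract generically could look like, reading the conditions of an arbitrary NodeClass through unstructured; the helper and the assumption of a top-level Ready condition are mine, not an agreed-upon API:

```go
package nodeclass

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// nodeClassReady checks the generic Ready condition of an arbitrary NodeClass
// object (e.g. an EC2NodeClass) without knowing its concrete type. A real
// implementation would handle errors rather than collapsing them to "not ready".
func nodeClassReady(obj *unstructured.Unstructured) bool {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false
	}
	for _, c := range conditions {
		condition, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if condition["type"] == "Ready" {
			return condition["status"] == "True"
		}
	}
	return false
}
```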

jonathan-innis avatar Oct 17 '23 22:10 jonathan-innis

For instance, right now, the AWS-specific NodeClass is the EC2NodeClass

jonathan-innis avatar Oct 17 '23 22:10 jonathan-innis

The InstanceTypesExist part is covered by #617, where we display the resolved instance types themselves.

sadath-12 avatar Oct 24 '23 05:10 sadath-12

Splitting this into separate PRs.

sadath-12 avatar Oct 24 '23 08:10 sadath-12

@sadath-12 If you are going to take a stab at this, can you make sure to do a design write-up for this one? I think we need to carefully think through all the status conditions that we want to surface and make sure that we are covering all of the use cases that we expect from them.

Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely on status condition state transitions.
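As a sketch, eventing driven purely off of condition transitions could look like the following; the helper and its status-change check are assumptions, not existing Karpenter code:

```go
package events

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitOnTransition compares the previous and current condition of the same type
// and only emits an event when the status actually changed, so eventing (and,
// analogously, alerting and logging) is driven purely by state transitions.
func emitOnTransition(recorder record.EventRecorder, obj runtime.Object, previous, current metav1.Condition) {
	if previous.Status == current.Status {
		return // no transition, nothing to emit
	}
	eventType := corev1.EventTypeNormal
	if current.Status == metav1.ConditionFalse {
		eventType = corev1.EventTypeWarning
	}
	recorder.Event(obj, eventType, current.Reason, current.Message)
}
```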

jonathan-innis avatar Oct 24 '23 21:10 jonathan-innis

Ya sure @jonathan-innis

sadath-12 avatar Oct 25 '23 08:10 sadath-12

Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use https://github.com/aws/karpenter-core/pull/786 as a stepping stone.

tallaxes avatar Nov 15 '23 22:11 tallaxes

/assign

sadath-12 avatar Nov 23 '23 16:11 sadath-12

  • I'd definitely prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not.
  • When a NodePool is close to its capacity limit and all plausible offerings would take the NodePool over that limit, it'd be nice to have that reflected in a condition (sketched below), ideally also noting which limit(s) have been reached, even if there are no pending Pods to schedule and no demand for new nodes.
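A rough sketch of the limit check behind such a condition, assuming corev1.ResourceLists for current usage, the smallest plausible offering, and the NodePool limits; the condition type and reason names are made up:

```go
package limits

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// limitReached reports which limits would be exceeded if the smallest plausible
// offering were added on top of current usage, even when there are no pending
// pods driving new capacity.
func limitReached(used, smallestOffering, limits corev1.ResourceList) metav1.Condition {
	exceeded := []string{}
	for name, limit := range limits {
		total := used[name].DeepCopy()
		total.Add(smallestOffering[name])
		if total.Cmp(limit) > 0 {
			exceeded = append(exceeded, string(name))
		}
	}
	if len(exceeded) == 0 {
		return metav1.Condition{Type: "LimitReached", Status: metav1.ConditionFalse, Reason: "WithinLimits"}
	}
	return metav1.Condition{
		Type:    "LimitReached",
		Status:  metav1.ConditionTrue,
		Reason:  "LimitExceeded",
		Message: "limits would be exceeded for: " + strings.Join(exceeded, ", "),
	}
}
```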

sftim avatar Jan 09 '24 23:01 sftim

prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not

Is this different than our current Registered condition...or are you thinking more of a live condition that constantly reflects the existence of the Node?

jonathan-innis avatar Jan 10 '24 06:01 jonathan-innis

a live condition that constantly reflects the existence of the Node?

Yes, that. If I force delete the associated Node, I'd expect the condition to change moments before, or simultaneously with, the moment the NodeClaim starts to get finalized.
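A sketch of how a reconciler could keep such a live condition up to date using controller-runtime; the NodeExists condition name and the helper are assumptions:

```go
package nodeclaim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeExistsCondition reflects, on every reconcile, whether the Node linked to
// a NodeClaim still exists, so a force-deleted Node flips the condition before
// or alongside finalization of the NodeClaim.
func nodeExistsCondition(ctx context.Context, c client.Client, nodeName string) (metav1.Condition, error) {
	node := &corev1.Node{}
	err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node)
	switch {
	case apierrors.IsNotFound(err):
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionFalse, Reason: "NodeNotFound"}, nil
	case err != nil:
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionUnknown, Reason: "GetFailed", Message: err.Error()}, err
	default:
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionTrue, Reason: "NodeFound"}, nil
	}
}
```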

sftim avatar Jan 10 '24 09:01 sftim

Showing a summary of the associated NodeClaims and whether they are at the current state or in a drifted state, as something like 5/6, would give an idea of whether something is currently happening in the cluster.

tvonhacht-apple avatar Apr 14 '24 05:04 tvonhacht-apple

To debug whether a NodeClaim will get removed, is currently starting up, or has expired, it would be great to show that information as part of -o wide output when debugging a cluster.
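One way this could surface, sketched with kubebuilder printer-column markers; the column names and JSONPaths assume the conditions discussed above exist on the NodeClaim status (priority=1 limits a column to -o wide output):

```go
package v1beta1

// NodeClaim is trimmed down to the printer-column markers relevant here; the
// actual fields are elided. The Drifted and Expired columns are assumptions.
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=`.status.conditions[?(@.type=="Ready")].status`
// +kubebuilder:printcolumn:name="Drifted",type="string",JSONPath=`.status.conditions[?(@.type=="Drifted")].status`,priority=1
// +kubebuilder:printcolumn:name="Expired",type="string",JSONPath=`.status.conditions[?(@.type=="Expired")].status`,priority=1
type NodeClaim struct {
	// fields elided for brevity
}
```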

tvonhacht-apple avatar Apr 14 '24 06:04 tvonhacht-apple

Fixed as part of https://github.com/kubernetes-sigs/karpenter/pull/1385 and https://github.com/kubernetes-sigs/karpenter/pull/1401

njtran avatar Aug 12 '24 22:08 njtran