
Surface health conditions of Karpenter resources as status conditions

Open njtran opened this issue 1 year ago • 16 comments

Description

What problem are you trying to solve? Karpenter uses logs and events to alert users to both transient and unresolvable errors in their configurations. Logs can be spammy due to the fast pace of re-queues, and events can be slow to report when a health issue is actually resolved (waiting until the de-dupe interval completes).

On top of this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine if they are all in a healthy state and fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.

At a high level, we think the following status conditions would be nice to have (a sketch of how they could roll up follows the list):

  • [ ] NodePools
    • [ ] NodeClassReady - A rolled-up status condition from the nodeClassRef that checks the generic readiness of that object to determine whether the NodeClass is usable
    • [ ] InstanceTypesExist - Validates whether there are provisionable instance types with available offerings to use
    • [ ] NodesDisrupting - Describes whether any nodes on the cluster are actively being disrupted through replacement or deletion. This condition would not have an effect on NodePool readiness but would be purely informational for alerting and tracking
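A minimal sketch of the roll-up using the standard metav1.Condition type; the condition type names come from the list above, while the Ready aggregation logic and the nodePoolReady helper are assumptions, not existing Karpenter code:

```go
package status

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Proposed NodePool condition types from the list above.
const (
	ConditionNodeClassReady     = "NodeClassReady"
	ConditionInstanceTypesExist = "InstanceTypesExist"
	ConditionNodesDisrupting    = "NodesDisrupting" // informational only, excluded from readiness
)

// nodePoolReady is a hypothetical roll-up: the NodePool is Ready only when
// every readiness-affecting condition is True.
func nodePoolReady(conditions []metav1.Condition) metav1.Condition {
	ready := metav1.Condition{Type: "Ready", Status: metav1.ConditionTrue, Reason: "Ready"}
	for _, c := range conditions {
		if c.Type == ConditionNodesDisrupting {
			continue // does not gate readiness
		}
		if c.Status != metav1.ConditionTrue {
			ready.Status = metav1.ConditionFalse
			ready.Reason = c.Type + "NotReady"
			ready.Message = c.Message
			break
		}
	}
	return ready
}
```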

We already have logic in the code that ignores certain NodePools for scheduling, and logic for handling cases where the NodeClass isn't ready. That logic could rely strictly on these new status conditions to ensure that we only attempt to schedule and provision through NodePools and NodeClasses that are in a ready state.
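A rough sketch of that readiness filter, assuming the NodePool status exposes standard metav1 conditions with a top-level Ready condition; the nodePool stand-in type and the helper are hypothetical:

```go
package scheduling

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodePool is a stand-in for the real NodePool type; only the fields needed
// for this sketch are included.
type nodePool struct {
	Name       string
	Conditions []metav1.Condition
}

// schedulableNodePools drops any NodePool whose Ready condition is not True,
// so scheduling and provisioning only ever consider ready NodePools.
func schedulableNodePools(pools []nodePool) []nodePool {
	ready := make([]nodePool, 0, len(pools))
	for _, np := range pools {
		if meta.IsStatusConditionTrue(np.Conditions, "Ready") {
			ready = append(ready, np)
		}
	}
	return ready
}
```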

Additional Info: It's possible that we could then export metrics from these status conditions so that users can alarm on and track them.
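For illustration, a hedged sketch of exporting such a metric with client_golang; the metric name, labels, and helper are assumptions rather than an existing Karpenter metric:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodePoolCondition reports 1 when a condition holds the labeled status and 0
// otherwise, so users can alert on e.g. type="Ready",status="False".
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // assumed metric name
		Help: "NodePool status conditions by condition type and status.",
	},
	[]string{"nodepool", "type", "status"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

func recordConditions(nodePoolName string, conditions []metav1.Condition) {
	statuses := []metav1.ConditionStatus{metav1.ConditionTrue, metav1.ConditionFalse, metav1.ConditionUnknown}
	for _, c := range conditions {
		for _, s := range statuses {
			value := 0.0
			if c.Status == s {
				value = 1.0
			}
			nodePoolCondition.WithLabelValues(nodePoolName, c.Type, string(s)).Set(value)
		}
	}
}
```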

How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

njtran avatar Aug 29 '23 20:08 njtran

Is the scope of this proposal limited to the Status (type NodePoolStatus) of the type NodePool struct?

Or are there other CRD resources that we want to add/augment aggregated status data for?

jackfrancis avatar Oct 13 '23 16:10 jackfrancis

Or are there other CRD resources that we want to add/augment aggregated status data for

We'd probably want to scope this proposal to the NodeClass and NodeClaim as well. We have some status conditions that currently sit on the NodeClaim, but I could see us adding more detail there (I know there was a proposal that the NodeClaim readiness should accurately reflect the Node readiness through its status conditions as well).

For the NodeClass, it was difficult to talk about this in a cloud-neutral context, but I could see us using tangible examples from AWS and Azure to come to some agreed-upon tenets for what the readiness of these objects might look like, so that we can rely on that readiness condition inside of the NodePool.

jonathan-innis avatar Oct 13 '23 18:10 jonathan-innis

NodeClass seems to just be a redirect to a NodeClaim's NodeClassRef. What integral type definition is a NodeClassRef referring to?

jackfrancis avatar Oct 13 '23 19:10 jackfrancis

It's the CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of the contract for a NodeClaim, and then we can check the readiness of any arbitrary NodeClass.
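A sketch of what checking that readiness contract generically could look like, reading the conditions of an arbitrary NodeClass through unstructured; the helper and the assumption of a top-level Ready condition are mine, not an agreed-upon API:

```go
package nodeclass

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// nodeClassReady checks the generic Ready condition of an arbitrary NodeClass
// object (e.g. an EC2NodeClass) without knowing its concrete type. A real
// implementation would handle errors rather than collapsing them to "not ready".
func nodeClassReady(obj *unstructured.Unstructured) bool {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false
	}
	for _, c := range conditions {
		condition, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if condition["type"] == "Ready" {
			return condition["status"] == "True"
		}
	}
	return false
}
```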

jonathan-innis avatar Oct 17 '23 22:10 jonathan-innis

For instance, right now, the AWS-specific NodeClass is the EC2NodeClass

jonathan-innis avatar Oct 17 '23 22:10 jonathan-innis

The InstanceTypesExist part is covered by #617, where we display the resolved instance types themselves.

sadath-12 avatar Oct 24 '23 05:10 sadath-12

Splitting this into separate PRs.

sadath-12 avatar Oct 24 '23 08:10 sadath-12

@sadath-12 If you are going to take a stab at this, can you make sure to do a design write-up for this one? I think we need to carefully think through all the status conditions that we want to surface and make sure that we are covering all of the use cases that we expect from them.

Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely on status condition state transitions.
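As a sketch, eventing driven purely off of condition transitions could look like the following; the helper and its status-change check are assumptions, not existing Karpenter code:

```go
package events

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitOnTransition compares the previous and current condition of the same type
// and only emits an event when the status actually changed, so eventing (and,
// analogously, alerting and logging) is driven purely by state transitions.
func emitOnTransition(recorder record.EventRecorder, obj runtime.Object, previous, current metav1.Condition) {
	if previous.Status == current.Status {
		return // no transition, nothing to emit
	}
	eventType := corev1.EventTypeNormal
	if current.Status == metav1.ConditionFalse {
		eventType = corev1.EventTypeWarning
	}
	recorder.Event(obj, eventType, current.Reason, current.Message)
}
```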

jonathan-innis avatar Oct 24 '23 21:10 jonathan-innis

Ya sure @jonathan-innis

sadath-12 avatar Oct 25 '23 08:10 sadath-12

Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use https://github.com/aws/karpenter-core/pull/786 as a stepping stone.

tallaxes avatar Nov 15 '23 22:11 tallaxes

/assign

sadath-12 avatar Nov 23 '23 16:11 sadath-12

  • I'd definitely prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not.
  • When a NodePool is close to its capacity limit and all plausible offerings would take the NodePool over that limit, it'd be nice to have that reflected in a condition (sketched below), ideally also noting which limit(s) have been reached, even if there are no pending Pods to schedule and no demand for new nodes.
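A rough sketch of the limit check behind such a condition, assuming corev1.ResourceLists for current usage, the smallest plausible offering, and the NodePool limits; the condition type and reason names are made up:

```go
package limits

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// limitReached reports which limits would be exceeded if the smallest plausible
// offering were added on top of current usage, even when there are no pending
// pods driving new capacity.
func limitReached(used, smallestOffering, limits corev1.ResourceList) metav1.Condition {
	exceeded := []string{}
	for name, limit := range limits {
		total := used[name].DeepCopy()
		total.Add(smallestOffering[name])
		if total.Cmp(limit) > 0 {
			exceeded = append(exceeded, string(name))
		}
	}
	if len(exceeded) == 0 {
		return metav1.Condition{Type: "LimitReached", Status: metav1.ConditionFalse, Reason: "WithinLimits"}
	}
	return metav1.Condition{
		Type:    "LimitReached",
		Status:  metav1.ConditionTrue,
		Reason:  "LimitExceeded",
		Message: "limits would be exceeded for: " + strings.Join(exceeded, ", "),
	}
}
```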

sftim avatar Jan 09 '24 23:01 sftim

prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not

Is this different than our current Registered condition...or are you thinking more of a live condition that constantly reflects the existence of the Node?

jonathan-innis avatar Jan 10 '24 06:01 jonathan-innis

a live condition that constantly reflects the existence of the Node?

Yes, that. If I force delete the associated Node, I'd expect the condition to change moments before, or simultaneously with, the moment the NodeClaim starts to get finalized.
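A sketch of how a reconciler could keep such a live condition up to date using controller-runtime; the NodeExists condition name and the helper are assumptions:

```go
package nodeclaim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeExistsCondition reflects, on every reconcile, whether the Node linked to
// a NodeClaim still exists, so a force-deleted Node flips the condition before
// or alongside finalization of the NodeClaim.
func nodeExistsCondition(ctx context.Context, c client.Client, nodeName string) (metav1.Condition, error) {
	node := &corev1.Node{}
	err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node)
	switch {
	case apierrors.IsNotFound(err):
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionFalse, Reason: "NodeNotFound"}, nil
	case err != nil:
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionUnknown, Reason: "GetFailed", Message: err.Error()}, err
	default:
		return metav1.Condition{Type: "NodeExists", Status: metav1.ConditionTrue, Reason: "NodeFound"}, nil
	}
}
```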

sftim avatar Jan 10 '24 09:01 sftim

Showing a summary of the associated NodeClaims and whether they are at the current state or in a drifted state, as something like 5/6, would give an idea of whether something is currently happening in the cluster.

tvonhacht-apple avatar Apr 14 '24 05:04 tvonhacht-apple

To debug whether a NodeClaim will get removed, is currently starting up, or has expired, it would be great to show that information as part of -o wide output when debugging a cluster.
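One way this could surface, sketched with kubebuilder printer-column markers; the column names and JSONPaths assume the conditions discussed above exist on the NodeClaim status (priority=1 limits a column to -o wide output):

```go
package v1beta1

// NodeClaim is trimmed down to the printer-column markers relevant here; the
// actual fields are elided. The Drifted and Expired columns are assumptions.
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=`.status.conditions[?(@.type=="Ready")].status`
// +kubebuilder:printcolumn:name="Drifted",type="string",JSONPath=`.status.conditions[?(@.type=="Drifted")].status`,priority=1
// +kubebuilder:printcolumn:name="Expired",type="string",JSONPath=`.status.conditions[?(@.type=="Expired")].status`,priority=1
type NodeClaim struct {
	// fields elided for brevity
}
```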

tvonhacht-apple avatar Apr 14 '24 06:04 tvonhacht-apple

Fixed as part of https://github.com/kubernetes-sigs/karpenter/pull/1385 and https://github.com/kubernetes-sigs/karpenter/pull/1401

njtran avatar Aug 12 '24 22:08 njtran