
Health checks can generate unhelpful notifications

Open kingdonb opened this issue 7 months ago • 7 comments

I'm not sure whether this issue is specific to Crossplane, but I've noticed that when one resource is not passing its health check, I will often get a really unhelpful error message that essentially says, "I was waiting for all of these resources, and I timed out," without pointing at the one specific resource that fails the health check:

[Image: screenshot of the notification message]

The only resource that is failing health checks is the one highlighted.

But you can barely pick it out of the notification message, in all that noise. Is this a known issue, and could it be fixed? I usually split larger bundles of resources into separate Kustomizations so that one error won't block the whole Kustomization, but in this case I didn't. I'm not sure I've seen this issue before; I think it might be something specific to my environment.

I am pretty sure this is not an issue with Crossplane, and the health checks do actually work. I thought I would need custom CEL health checks because of the Synced and Ready conditions, but resources that aren't Synced are also not Ready; the health check error just doesn't report as clearly as I'd like.
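For reference, this is roughly what I expected to need - a minimal sketch assuming kustomize-controller v1.5's `healthCheckExprs` field and provider-kubernetes `Object` resources; the names, API group/version, and expressions here are illustrative, not pulled from my actual configs:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: crossplane-objects   # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  timeout: 1m
  prune: true
  wait: true
  path: ./objects
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthCheckExprs:
    # Treat an Object as ready only when both Crossplane conditions are True,
    # and as failed as soon as Synced goes False.
    - apiVersion: kubernetes.crossplane.io/v1alpha2   # adjust to your provider's version
      kind: Object
      current: >-
        status.conditions.exists(c, c.type == 'Ready' && c.status == 'True') &&
        status.conditions.exists(c, c.type == 'Synced' && c.status == 'True')
      failed: >-
        status.conditions.exists(c, c.type == 'Synced' && c.status == 'False')
```

In my case the built-in kstatus checks turned out to be enough, since `Ready: False` already keeps the Object from being treated as healthy.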

kingdonb avatar Apr 24 '25 16:04 kingdonb

Hmm, we added a fix for this a long time ago, but it seems that it doesn't work in your case: https://github.com/fluxcd/pkg/blob/877b123dd56df041167b7161b10d127152a45fb0/ssa/manager_wait.go#L96-L101

Somehow the error type no longer matches, even though it's clearly a DeadlineExceeded error.

stefanprodan avatar Apr 24 '25 16:04 stefanprodan

OK, thanks! That context is very helpful. I will see if I can reproduce this issue in a public repo; I can't publish these configs where they are, and I'm sure they're also very far from a minimal reproduction.

kingdonb avatar Apr 24 '25 17:04 kingdonb

I've got kustomize-controller running in a debugger now, with a breakpoint set in fluxcd/pkg/ssa - I should have some answers very shortly; if it's a type mismatch, we'll know what type, I hope...

Edit: Well, I've got the breakpoint stopping in that place, but in my tests the issue did not reproduce. All of my environments are on kustomize-controller v1.5.1, and my test clone is now pinned to that version (with the local pkg/ssa for the breakpoint), but I'm only seeing `health check failed after 1m0.770045209s: timeout waiting for: [Object/kflow-mpi-jobs-gpu-nodepool status: 'InProgress']`, which is exactly how I expected it to look.

When everyone has gone home for the day and I'm not impacting anyone's work, I will try to attach my debugger in the prod environment, bring back the bad config where this issue was readily reproducing earlier today, and see what I can find out.

kingdonb avatar Apr 24 '25 18:04 kingdonb

More information: that strange error coincided with the shutdown of the node, due to a Karpenter autoscaling event. While the health check was failing for unremarkable reasons (the resource was a valid Object, but it contained an error that caused it to fail syncing, in Crossplane terms), we did not see the noisy failure message until the node recycle event.

So, I think I can reproduce the issue, but maybe not so much in a debugger or a controlled setting. It happens either before or after kustomize-controller is forced to terminate and start up again on the next node.

I'm guessing that the 1m timeout and the timing of the kustomize-controller restart have somehow lined up to prevent all of those resources from reporting their health checks on time, and the message is accurate. I have seen this kind of error around node restarts a dozen times before and always thought it looked strange, but disregarded it because it didn't persist.

I never took note of these circumstances until my boss's boss pointed it out today 😅 It just so happened he was debugging an object when the Karpenter node restart came in, and he didn't have his blinders on like I usually do. I've conditioned myself to ignore any errors around a node restart if they go away. It might be better if this error were a bit more informative, though!

kingdonb avatar Apr 24 '25 18:04 kingdonb

> error coincided with the shutdown of the node

Ok, now it makes sense: the shutdown signal is propagated to the goroutine, and the context is canceled for all open watchers. The error message is in fact accurate.

stefanprodan avatar Apr 24 '25 19:04 stefanprodan

I went to a bunch of trouble a few weekends ago to create something I've called flux-event-relay, so NodeClaim events can be broadcast like Flux events, through the fluxcd/pkg/events package and its EventRecorder interface. This provides context for noisy alerts. Maybe I can publish it in fluxcd-community; I need to get permission first.

Anyway, I think this mystery is solved, unless you want to keep this issue open. The only thing I can think of is that maybe we could capture the termination event and emit a different notification - but we know that notifications emitted during the termination cycle are incredibly misleading, which is why I created this tool to expose the context of those noisy alerts. If we know that nodes are recycling, then we can ignore the noisy alerts, and we should be able to observe the recovery when the retryInterval elapses.
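As a blunter alternative to a relay, notification-controller's Alert has an `exclusionList` of regex patterns, so the noisiest form of this message could simply be filtered out. A sketch only, with an illustrative pattern and a hypothetical provider name, not something I'm actually running:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: slack-alerts   # hypothetical name
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSources:
    - kind: Kustomization
      name: '*'
  # Filter out the bulk "timed out waiting for everything" messages
  exclusionList:
    - ".*health check failed after.*timeout waiting for.*"
```

The obvious downside is that this also hides legitimate health check timeouts, which is part of why I'd rather surface the node-recycle context instead.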

Thanks for helping me debug this question!

kingdonb avatar Apr 25 '25 13:04 kingdonb

It's possible I found a simpler repro this morning: the only problem is an ImagePullBackOff, and only on one Deployment, yet I still got the whole message:

```
health check failed after 5m0.013613371s: timeout waiting for: [ServiceAccount/flux-system/flux-event-relay status: 'Unknown': client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, ClusterRole/flux-event-relay status: 'Unknown': client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, ClusterRoleBinding/flux-event-relay status: 'Unknown': client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, Deployment/flux-system/flux-event-relay status: 'Unknown': client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
```

That was without any hint of node recycles or other extraordinary conditions. I will be testing to reproduce this on my home lab later.
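One possible mitigation for the rate-limiter part of that message - not confirmed to be the root cause here - is raising kustomize-controller's client-side limits via its --kube-api-qps and --kube-api-burst flags. A sketch of a flux-system kustomization patch, with illustrative values:

```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: kustomize-controller
    patch: |
      # Append the flags to the controller container's args
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-qps=100
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-burst=300
```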

But I'm going to reopen for now, because this condition was just observed without any of the explanations covered above applying. I'll get back here when I'm sure it reproduces in a way that I can explain and document.

kingdonb avatar Apr 28 '25 11:04 kingdonb