
Freight promotion stalls in Unknown health state for 3 minutes

henriropp opened this issue 2 months ago • 5 comments

Checklist

  • [x] I've searched the issue queue to verify this is not a duplicate bug report.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of kargo version.
  • [ ] I've pasted logs, if applicable.

Description

A Freight is being auto-promoted to a Stage after a build. The health status shows "Unknown" for what seems like 3 minutes before it goes "Healthy". An AnalysisTemplate is configured for the Stage. The Stage indicates verification is running, but it is not; it only starts after the health status changes to "Healthy".

However, opening the Stages menu and selecting "Refresh" (after 5-10 seconds) changes the health status to "Healthy", and the verification starts.

This happens on all three Stages in all our projects/repos.

Screenshots

(screenshot attached to the original issue)

Steps to Reproduce

  1. (Auto-)promote a Freight
  2. Watch the Stage health stall at "Unknown"
  3. Wait 5-10 seconds
  4. Open the Stages menu and select "Refresh"

Version

1.8.2

Logs

henriropp avatar Nov 05 '25 12:11 henriropp

The health status shows "Unknown" for what seems like 3 minutes before it goes "Healthy".

App health (when there is an App involved, that is) is one factor in Stage health. Because of some peculiarities of Argo CD, we are unable to trust an App's self-reported health for some time after a sync operation has completed; we allot 10s as the buffer. Any time a Stage's health checks execute and produce an unknown result (for any reason), and we're also able to determine that health should have been knowable (it isn't always), the Stage goes back on the queue with a progressive backoff so the health check will be retried.
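
Roughly, the idea is something like this (a made-up sketch with illustrative names, not our actual code):

package stagehealth

import (
	"errors"
	"time"
)

// Illustrative only: if the App's last sync operation finished fewer than
// 10 seconds ago, its self-reported health can't be trusted, so Stage health
// evaluates to Unknown and the reconciler returns an error, which puts the
// Stage back on the queue with the controller's progressive backoff.
const appHealthTrustBuffer = 10 * time.Second

var errHealthUnknown = errors.New("Stage health evaluated to Unknown")

func evaluateStageHealth(lastOpFinished time.Time, appReportsHealthy bool) error {
	if time.Since(lastOpFinished) < appHealthTrustBuffer {
		// Too soon after the last operation to trust the App's own status.
		return errHealthUnknown // triggers a rate-limited requeue
	}
	if !appReportsHealthy {
		return errHealthUnknown
	}
	return nil // Healthy
}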

One of two things is happening:

  1. You're getting really unlucky: the backoff grows quickly (or a lot) before the 10s is up, and then you end up waiting a while before the next check. And/or you have a lot of Stages, the work queue is pretty deep, and that's slowing things down.

  2. Other operations are happening against the App (auto-sync? something else?) and Kargo is, as described, not trusting an App's self-reported health status if its last operation completed fewer than 10s prior.

Or it could be a combination of these two things. What's happening to you isn't unheard of. I've seen it, but there's usually a reason it's happening.

The Stage indicates verification is running, but it is not; it only starts after the health status changes to "Healthy".

This is normal, actually. Transitioning to a healthy state following a Promotion is implicitly part of the verification process. The Stage needs to be healthy before the user-defined portion of the verification kicks off. I think we could consider introducing a new state to make these two phases of the verification a bit clearer.
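
Very roughly, the flow is something like this (made-up names, not our actual types):

package verification

import "errors"

// Illustrative sketch of the two implicit phases described above: the Stage
// has to become Healthy first; only then does the user-defined part (the
// AnalysisRun produced from the AnalysisTemplate) actually start. The UI
// currently labels both phases as a single "Verifying" state.
type health string

const (
	healthy health = "Healthy"
	unknown health = "Unknown"
)

var errAwaitingHealth = errors.New("waiting for Stage to become Healthy")

func runVerification(stageHealth health, startAnalysisRun func() error) error {
	if stageHealth != healthy {
		// Phase 1: implicit; requeue until the Stage is Healthy.
		return errAwaitingHealth
	}
	// Phase 2: kick off the user-defined verification.
	return startAnalysisRun()
}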

krancour avatar Nov 05 '25 13:11 krancour

Thanks for the quick reply. There should not be that many operations going on at the same time. The delay also happens consistently on every build, in all projects, not just one.

I'll post more updates if I find something relevant.

henriropp avatar Nov 06 '25 17:11 henriropp

Is auto-sync enabled on your Apps?

krancour avatar Nov 09 '25 16:11 krancour

Yes, auto-sync is enabled.

henriropp avatar Nov 09 '25 16:11 henriropp

Yes, auto-sync is enabled.

That probably has something to do with it. App health isn't trusted for 10s after any operation. If Argo CD is itself performing operations on the App on a regular basis, it will regularly be creating conditions where the App health isn't trusted, and that can impede verification.

It would be easy to put this to the test by, at least temporarily, disabling auto-sync on the App.

krancour avatar Nov 14 '25 17:11 krancour

I am facing the same issue. I have realised that I always get 16 "Reconciler error" entries saying "Stage health evaluated to Unknown".

Looked into the default backoff strategy of the controller-runtime rate limiter. It starts at 5 milliseconds and doubles every time after that, capped at 1000s. 5ms * 2^15 = 163,840ms ≈ 163.84s, or a bit under 3 minutes.
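
To sanity-check that math with the client-go rate limiter itself (a standalone snippet, not Kargo code; the "my-stage" key is arbitrary):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Prints the per-failure requeue delay of the per-item exponential failure
// rate limiter (5ms base, 1000s cap): each failure doubles the delay, and
// failure 16 lands at ~163.84s.
func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 1000*time.Second,
	)
	for i := 1; i <= 16; i++ {
		fmt.Printf("failure %2d -> requeue after %v\n", i, rl.When("my-stage"))
	}
}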

So I am guessing this might be the problem. What I still struggle to understand is why the stage seems to be requeued for reconciliation so many times before the promotion is done. I would expect the stage reconciliation attempts to become gradually more spaced out in time. Instead, what I see in the logs is 16 attempts while the promotion is ongoing, and then silence for 2-3 minutes.

I suspect this is happening because stage reconciliation is also enqueued by other events, events that typically happen during the promotion.
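
If that's right, I think the workqueue semantics would explain the shape of the logs: events enqueue the stage immediately, but every failed reconciliation still bumps the per-item failure counter, and only a successful reconciliation resets it. A sketch of that (assuming controller-runtime's standard behavior, not Kargo code):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Sketch of the hypothesis: while the promotion generates watch events, the
// Stage keeps getting re-enqueued immediately, so the growing per-failure
// delays are never actually waited out. The failure counter keeps climbing
// anyway, because only Forget() (called after a successful reconcile) resets
// it. Once the events stop, the next retry waits the full accumulated backoff.
func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 1000*time.Second,
	)
	const stage = "my-stage"
	for i := 1; i < 16; i++ {
		_ = rl.When(stage) // failures 1-15: delays scheduled but never felt
	}
	fmt.Printf("failure 16 -> next retry in %v\n", rl.When(stage)) // ~2m43.84s
	rl.Forget(stage) // a successful reconcile resets the counter
	fmt.Printf("after success -> next retry in %v\n", rl.When(stage)) // back to 5ms
}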

FrancoisPoinsot avatar Dec 03 '25 19:12 FrancoisPoinsot

To test my theory, I made this change in the controller configuration:

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func CommonOptions(maxConcurrentReconciles int) controller.Options {
	return controller.Options{
[...]
		// Cap the per-item exponential backoff at 10s instead of the default 1000s.
		RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 10*time.Second),
	}
}

This caps the backoff of the rate limiter at 10s instead of 1000s. After doing that, the verification only stalls for 10s or less.
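
For comparison, if I'm reading client-go correctly, the default rate limiter used when Options.RateLimiter is not set is built roughly like this, which is where the 5ms base and 1000s cap come from:

package example

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Approximation of client-go's default controller rate limiter: a per-item
// exponential backoff (5ms base, 1000s cap) combined with an overall token
// bucket (10 qps, burst 100). The per-item part is what produces the
// minutes-long stalls once the failure count gets high enough.
func defaultRateLimiter() workqueue.TypedRateLimiter[reconcile.Request] {
	return workqueue.NewTypedMaxOfRateLimiter(
		workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](
			5*time.Millisecond, 1000*time.Second,
		),
		&workqueue.TypedBucketRateLimiter[reconcile.Request]{
			Limiter: rate.NewLimiter(rate.Limit(10), 100),
		},
	)
}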

I can't promise I am seeing the exact same problem as the original poster, and my change above is not a good solution; it's just to demonstrate the problem. Probably errors should not be thrown to begin with when the controller is doing its job as expected?

FrancoisPoinsot avatar Dec 05 '25 15:12 FrancoisPoinsot

To add to this, here are the logs for one promotion on one shard controller. The way the errors are packed early is what does not make sense to me. It does not follow an exponential curve. Well, it does for the first ~10, and then it follows some other kind of logic.

2025-12-05T15:07:27Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:27Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:27Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:28Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:28Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:28Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:28Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:28Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:29Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:31Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:33Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:39Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:44Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:48Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:50Z	ERROR	controller/controller.go:421	Reconciler error	
2025-12-05T15:07:50Z	ERROR	controller/controller.go:421	Reconciler error	

(full logs attached)

FrancoisPoinsot avatar Dec 05 '25 15:12 FrancoisPoinsot