
Add update throttling to prevent rapid deployment churn

Open toelke opened this pull request 1 month ago • 13 comments

Implements a minimum interval between updates (default: 10s, configurable) to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.

Key features:

  • Fixed minimum interval between updates (default: 10s)
  • Configurable via --min-update-interval flag
  • Configurable via Helm chart (minUpdateInterval value)
  • State tracked in-memory within the operator
  • Thread-safe implementation with mutex protection
  • Applies to ALL updates, even when config hashes change

Fixes #182

🤖 Generated with Claude Code

toelke avatar Dec 04 '25 12:12 toelke
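For illustration, here is a minimal sketch of the throttling approach the PR description outlines: a per-workload map of last-update timestamps guarded by a mutex, refusing updates that arrive within the minimum interval. The type and method names are hypothetical and not taken from the actual PR code.

```go
package throttle

import (
	"sync"
	"time"
)

// UpdateThrottler enforces a minimum interval between updates per workload.
// Hypothetical illustration of the approach described in the PR; the real
// implementation in Wave may differ.
type UpdateThrottler struct {
	mu          sync.Mutex
	minInterval time.Duration        // e.g. 10s, set via --min-update-interval
	lastUpdate  map[string]time.Time // keyed by "namespace/name"
}

func NewUpdateThrottler(minInterval time.Duration) *UpdateThrottler {
	return &UpdateThrottler{
		minInterval: minInterval,
		lastUpdate:  map[string]time.Time{},
	}
}

// ShouldUpdate reports whether an update for the given workload is allowed
// now, and records the update time if it is.
func (t *UpdateThrottler) ShouldUpdate(key string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastUpdate[key]; ok && now.Sub(last) < t.minInterval {
		return false
	}
	t.lastUpdate[key] = now
	return true
}
```

A reconciler could consult such a helper right before patching a Deployment and, when refused, requeue the request (for example with RequeueAfter) so the change is still applied once the interval has passed.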

I did not want to close https://github.com/wave-k8s/wave/pull/183, I just wanted to rename the branch :-(

toelke avatar Dec 04 '25 12:12 toelke

I decided that the semantics of an increasing backoff period are not easy: When to decrease the backoff-period again?

So this now uses a static backoff period. I think this is enough for all workloads.

toelke avatar Dec 04 '25 12:12 toelke

So this now uses a static backoff period. I think this is enough for all workloads.

I guess that should be good enough for most cases. It would have delayed the issue in our case. Not sure if it would have prevented the issue completely though (given enough time passed). In case it happens again I would try to add more backoff.

I decided that the semantics of an increasing backoff period are not easy: When to decrease the backoff-period again?

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case, we could reset the backoff. That would not guarantee that the Deployment got ready, only that updates did not come in too fast.

jabdoa2 avatar Dec 04 '25 12:12 jabdoa2

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case, we could reset the backoff. That would not guarantee that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not throttled and is more than currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

toelke avatar Dec 04 '25 12:12 toelke
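For reference, a rough sketch of the adaptive variant discussed here: grow the backoff while updates keep getting throttled, and reset it once an update arrives more than the current backoff (plus some slack) after the previous one. All names are hypothetical, the "reduce for n steps" idea is simplified to a full reset, and this is explicitly not what the PR implements.

```go
package throttle

import "time"

// adaptiveBackoff sketches the idea discussed above. Hypothetical; the PR
// deliberately uses a fixed interval instead.
type adaptiveBackoff struct {
	min, max, current time.Duration
	lastUpdate        time.Time
}

// allow reports whether an update may happen now and adjusts the backoff.
func (b *adaptiveBackoff) allow(now time.Time, slack time.Duration) bool {
	since := now.Sub(b.lastUpdate)
	switch {
	case since < b.current:
		// Too fast: refuse this update and back off harder next time.
		b.current *= 2
		if b.current > b.max {
			b.current = b.max
		}
		return false
	case since > b.current+slack:
		// A quiet period passed: reset to the minimum backoff.
		b.current = b.min
	}
	b.lastUpdate = now
	return true
}
```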

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case, we could reset the backoff. That would not guarantee that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not throttled and is more than currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

For Deployments, that depends on how fast your pods get ready: Kubernetes will only remove the old ReplicaSets once the new pods are ready. If your pods are slow to start (e.g. Prometheus), this might still create a lot of ReplicaSets.

It should not explode too fast though, as it would be limited to 360 ReplicaSets per hour per Deployment (one every 10 seconds). However, if this happens to a lot of different Deployments at once, it might still hurt. Maybe we need a global max updates per minute?

jabdoa2 avatar Dec 04 '25 12:12 jabdoa2

A global cool-down could break the basic expectation that there is always a restart for the "last" change of a ConfigMap.

toelke avatar Dec 04 '25 13:12 toelke

A global cool-down could break the basic expectation that there is always a restart for the "last" change of a ConfigMap.

In general, yes. However, even the Kubernetes API has a per-client QPS limit (with burst) to prevent DDoS. I imagine something like that.

In our larger clusters, when we update the trust chain for all namespaces (using trust manager), this currently causes Wave to update a few thousand pods at once (as almost all of them mount ca-certs from a ConfigMap). Initially trust manager had bugs (which changed the ConfigMap a few times for each cert), and that actually caused our API server to crash as well. We attributed that to trust manager, but now that I think of it, Wave might have amplified that as well.

jabdoa2 avatar Dec 04 '25 13:12 jabdoa2
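For context on the per-client limit mentioned above: client-go throttles each client via the QPS and Burst fields on rest.Config (5 and 10 are the long-standing defaults when they are left unset). A global limit in Wave would play a similar role for the updates it sends. A minimal, illustrative snippet, not taken from Wave:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// client-go throttles requests on the client side; 5 QPS with a burst
	// of 10 are the traditional defaults when these fields are left unset.
	cfg.QPS = 5
	cfg.Burst = 10
	return kubernetes.NewForConfig(cfg)
}
```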

In that case you require all Deployments to be restarted, you just want to do it slowly, right?

toelke avatar Dec 04 '25 13:12 toelke

In that case you require all Deployments to be restarted, you just want to do it slowly, right?

Exactly, the Kubernetes API gets fragile as soon as clusters grow. Guess this would be another issue; I can write something up later.

jabdoa2 avatar Dec 04 '25 14:12 jabdoa2

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

toelke avatar Dec 04 '25 14:12 toelke

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

Basically, we already have a queue for the reconciler; as soon as we block the reconciler we get that behavior. Naively, we could just add a custom rate limiter during controller setup. However, that would not be 100% accurate. It would probably be better to manually call Wait on a rate.Limiter before an update.

jabdoa2 avatar Dec 04 '25 15:12 jabdoa2

Like this?

toelke avatar Dec 05 '25 14:12 toelke

Like this?

Yeah, that is great. Simple code, and it should be very effective. I guess I would set burst to 100 by default. That way Wave will act instantly until you do a lot of updates at once.

jabdoa2 avatar Dec 06 '25 07:12 jabdoa2
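To make the rate.Limiter suggestion concrete: golang.org/x/time/rate implements exactly this token-bucket behaviour, and Wait blocks until the next token is available. The sketch below only illustrates the pattern with the burst of 100 suggested above; the rate of one update per second and the function name are assumptions, not code from this PR.

```go
package throttle

import (
	"context"

	"golang.org/x/time/rate"
)

// Hypothetical global limiter for Wave's updates: on average at most one
// update per second, with a burst of 100 so normal operation is unthrottled.
var updateLimiter = rate.NewLimiter(rate.Limit(1), 100)

// waitForUpdateSlot blocks until the limiter allows another update (or the
// context is cancelled); it would be called just before Wave patches a
// Deployment's pod template.
func waitForUpdateSlot(ctx context.Context) error {
	return updateLimiter.Wait(ctx)
}
```

With a burst of 100, the first hundred updates go through immediately and only sustained storms are slowed down, which matches the behaviour described above.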