
Configurability of install retries

Open seaneagan opened this issue 4 years ago • 2 comments

Describe the feature

Currently, upgrades support configurable retries through the HelmRelease's rollback.retry and rollback.maxRetries fields. (Side question: is rollback.retry necessary, since rollback.maxRetries could just default to 0?) However, installs are retried indefinitely, with no configurability. This can make debugging a failed install harder, and can cause unnecessary churn in e.g. metrics/alerts.

To enable configurability for install retries, there could be a root-level maxRetries field which applies to installs and provides a default value for rollback.maxRetries (since this really just specifies how many times to retry an upgrade, with rollbacks instead of uninstalls in between). This field could default to:

  • -1, to match the current install behavior (infinite retries)
  • 5, to match the existing rollback.maxRetries default
  • 6, to match the Kubernetes Job backoffLimit default

There could also be an ability to override the default at the operator level, as needed. One could specify e.g. 0 to fail as fast as possible, or just reduce the default to e.g. 1.
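To make the proposed precedence concrete, here is a minimal Go sketch of how the limit could be resolved and checked. This is not the operator's actual code: the RetryConfig type, the function names, and the choice to treat any negative value as "infinite" are all assumptions made for illustration.

```go
package main

// RetryConfig is a hypothetical model of the fields discussed above.
// Pointers distinguish "unset" from an explicit 0.
type RetryConfig struct {
	MaxRetries         *int // proposed root-level maxRetries
	RollbackMaxRetries *int // existing rollback.maxRetries
}

// effectiveMaxRetries resolves the retry limit for an install or an upgrade.
// Assumed precedence: rollback.maxRetries (upgrades only), then the
// root-level maxRetries, then the operator-level default.
func effectiveMaxRetries(cfg RetryConfig, operatorDefault int, upgrade bool) int {
	if upgrade && cfg.RollbackMaxRetries != nil {
		return *cfg.RollbackMaxRetries
	}
	if cfg.MaxRetries != nil {
		return *cfg.MaxRetries
	}
	return operatorDefault
}

// shouldRetry reports whether another attempt is allowed after `retries`
// retries have already been consumed. A negative limit (e.g. -1) is
// treated as infinite retries; this is an assumption of the sketch.
func shouldRetry(maxRetries, retries int) bool {
	if maxRetries < 0 {
		return true
	}
	return retries < maxRetries
}
```

With this shape, maxRetries: 0 means the first failure is final, and -1 keeps the current infinite install behavior.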

Statusing

There could be a status.installCount or a status.retryCount which also deprecates status.rollbackCount, and gets updated when retrying installs or upgrades (when rollbacks are enabled), since those can't happen at the same time (there's just a different strategy to reset the release between retries, either uninstalls or rollbacks). Similar to status.rollbackCount, this count would reset when a new change is made to the HelmRelease that needs to be reconciled.

There should also be a new Uninstalling value for status.phase, which occurs during any release uninstall triggered by the operator:

  1. a HelmRelease CR is deleted (only affects the prometheus metrics, since the CR would no longer exist on which to expose status.phase)
  2. post-install failure when there are more install retries to consume
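
The status behavior described above could look roughly like the following Go sketch. The Status type, its field names, the handler functions, and the Failed phase constant are hypothetical; only the Uninstalling phase and the idea of a retry count that resets on a new change come from the proposal itself.

```go
package main

// Phase mirrors status.phase. Uninstalling is the proposed new value;
// Failed is used here only as an illustrative terminal phase.
type Phase string

const (
	PhaseUninstalling Phase = "Uninstalling"
	PhaseFailed       Phase = "Failed"
)

// Status is a hypothetical subset of the HelmRelease status.
type Status struct {
	ObservedGeneration int64
	RetryCount         int // proposed status.retryCount
	Phase              Phase
}

// onInstallFailure handles a failed install. If retries remain, it consumes
// one and enters Uninstalling (the release is removed before the next
// attempt); otherwise it marks the release as failed.
// A negative maxRetries is assumed to mean infinite retries.
func onInstallFailure(s *Status, maxRetries int) {
	if maxRetries < 0 || s.RetryCount < maxRetries {
		s.RetryCount++
		s.Phase = PhaseUninstalling
		return
	}
	s.Phase = PhaseFailed
}

// onSpecChange resets the retry count when a new change to the HelmRelease
// needs to be reconciled, similar to how status.rollbackCount resets.
func onSpecChange(s *Status, generation int64) {
	if generation != s.ObservedGeneration {
		s.ObservedGeneration = generation
		s.RetryCount = 0
	}
}
```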

seaneagan avatar Apr 21 '20 18:04 seaneagan

@seaneagan I will pick this one up if you are not already working on it?

stefansedich avatar Apr 21 '20 20:04 stefansedich

First of all, thank you for your detailed feature request! :sunflower:

I have talked this through with @stefansedich who is going to pick it up. We agreed on a max default of 5 to match the existing rollback default, mainly so that we do not have confusing or conflicting default values.

I think your request for an Uninstalling phase is also valid and would be an interesting metric addition for e.g. use cases where you are rapidly bootstrapping namespaces all day long.

hiddeco avatar Apr 21 '20 20:04 hiddeco

Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode, we recommend everybody upgrades to Flux v2 and Helm Controller.

A new release of Helm Operator is out this week, 1.4.4.

We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.

Please be aware that Flux v2 has a vibrant and active developer community who are actively working through minor releases and delivering new features on the way to General Availability for Flux v2.

In the meantime, this repo will still be monitored, but support is basically limited to migration issues only. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but given the staleness of all issues at this point, a new report is more likely to be in order. If you have unresolved problems that prevent your migration, please open another issue in the appropriate Flux v2 repo.

Helm Operator releases will continue as possible for a limited time, as a courtesy for those who cannot migrate yet, but these are strongly discouraged for ongoing production use: our strict adherence to semver backward-compatibility guarantees limits many dependencies, and we can only upgrade them so far without breaking compatibility. So there are likely known CVEs that cannot be resolved.

We recommend upgrading to Flux v2 which is actively maintained ASAP.

I am going to go ahead and close every issue at once today. Thanks for participating in Helm Operator and Flux! 💚 💙

kingdonb avatar Sep 02 '22 19:09 kingdonb