source-controller
HelmCharts set to not ready on index download or source-controller restart
We're facing the following situation:
- we have several helmreleases installed successfully in our cluster
- the helmrepositories and helmcharts are also created and ready
Once the source-controller is restarted, it first fetches the repositories' indexes, and in the meantime sets the helmrepositories' state to Unknown. This causes the state of all helmcharts to be set to "Not Ready".
Here is an event message for a helmchart:
chart pull error: could not load repository index for remote chart reference: stat /data/helmrepository/default/myhelmrepo/index-<long id>.yaml: no such file or directory
Once the individual charts are downloaded again, the helmcharts' states transition to Unknown and then to Ready. The state in which all helmcharts are not ready lasts only a few seconds, but if a helmrelease is reconciled during this time (e.g. because the helm-controller is restarted), it is set to Not Ready as well. All dependent resources watching the helmreleases also see the Not Ready state.
Can this intermediate state of the helmcharts be changed to Unknown as for the helmrepositories? This way the helm-controller would potentially be able to set the helmrelease state to processing instead of failed.
The problem might be related to issue https://github.com/fluxcd/source-controller/issues/431, but I'm not sure how the helm-controller could be changed as mentioned there.
I also started a discussion explaining the problem here: https://github.com/fluxcd/flux2/discussions/3152. Maybe it can be fixed for certain use cases, such as the parallel installation/update of helm-controller and source-controller.
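The request above can be illustrated with a minimal sketch. It assumes a consumer that looks only at the chart's Ready condition and maps the three possible condition values to an outcome for the release. The type, the constants, and the `releaseOutcome` helper are hypothetical names chosen for illustration; this is not the helm-controller's actual code.

```go
package main

import "fmt"

// ConditionStatus mirrors the three values a Kubernetes condition can take.
type ConditionStatus string

const (
	StatusTrue    ConditionStatus = "True"
	StatusFalse   ConditionStatus = "False"
	StatusUnknown ConditionStatus = "Unknown"
)

// releaseOutcome is a hypothetical helper (not part of Flux) that shows why
// the distinction matters: an Unknown chart can be treated as "still in
// progress", while a False (Not Ready) chart is treated as a failure.
func releaseOutcome(chartReady ConditionStatus) string {
	switch chartReady {
	case StatusTrue:
		return "reconcile"
	case StatusUnknown:
		return "processing"
	default:
		return "failed"
	}
}

func main() {
	for _, s := range []ConditionStatus{StatusTrue, StatusUnknown, StatusFalse} {
		fmt.Printf("chart Ready=%s -> release %s\n", s, releaseOutcome(s))
	}
}
```

Under this assumption, a chart reporting Unknown during the index download would leave the release in a processing state rather than a failed one.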
It sounds like we need to requeue sooner when this type of failure happens (a source is not ready before the HelmRelease is reconciled).
We're looking at this in Bug Scrub. It seems that when the helm-controller doesn't find the chart ready, it requeues immediately, but in certain cases it requeues after spec.interval instead. We have a separate requeue interval for dependencies that defaults to 30s, and you can set it shorter. But if you've waited 30s and the problem hasn't resolved itself, you end up waiting until spec.interval arrives. In that case we should be using the dependency requeue ("immediately") instead of spec.interval.
There are some changes we're considering. If there is an issue here, we need to think quite carefully about how to solve it, because we do not want to create a thundering herd, or the opposite problem where resources are reconciled too often.
Thanks for reporting this issue (we also found https://github.com/fluxcd/flux2/discussions/3152, which is related).