helm-controller

Drift detection in warn mode - having status field or metrics to have alerting

Open sebglon opened this issue 6 months ago • 12 comments

Hi, we need to monitor and be alerted when helm-controller detects a drift. Can we have a status field or a metric to implement alerting?

sebglon avatar Jun 18 '25 07:06 sebglon

We have .spec.driftDetection.mode warn: https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection
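For illustration, a minimal sketch of a HelmRelease with drift detection in warn mode. The release name, namespace, chart, and repository below are hypothetical placeholders; only `spec.driftDetection.mode` comes from this discussion.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo          # hypothetical release name
  namespace: apps        # hypothetical namespace
spec:
  interval: 10m
  chart:
    spec:
      chart: podinfo     # hypothetical chart
      sourceRef:
        kind: HelmRepository
        name: podinfo    # hypothetical repository
  driftDetection:
    mode: warn           # report drift without correcting it
```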

We send an event to notification-controller when a drift is detected, so you can choose any of these notification providers to send the alert to:

https://fluxcd.io/flux/components/notification/providers/#type

matheuscscp avatar Jun 18 '25 08:06 matheuscscp

We use unified monitoring and alerting with Prometheus. For that, we need metrics to monitor drift and its resolution.

sebglon avatar Jun 18 '25 11:06 sebglon

> We use unified monitoring and alerting with Prometheus.

So you can use the Prometheus Alertmanager integration: https://fluxcd.io/flux/components/notification/providers/#prometheus-alertmanager
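A rough sketch of what that could look like, assuming an Alertmanager service reachable at the address below. The address, object names, and namespace are assumptions, not values from this thread.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager     # hypothetical name
  namespace: apps        # hypothetical namespace
spec:
  type: alertmanager
  # Assumed in-cluster Alertmanager address; adjust to your setup.
  address: http://alertmanager.monitoring.svc:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: helm-drift       # hypothetical name
  namespace: apps
spec:
  providerRef:
    name: alertmanager
  eventSeverity: info    # forward info and error events
  eventSources:
    - kind: HelmRelease
      name: '*'          # all HelmReleases in this Alert's namespace
```

This forwards every HelmRelease event from the namespace, not only drift events, so some filtering on the Alertmanager side would likely still be needed.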

stefanprodan avatar Jun 18 '25 11:06 stefanprodan

The prometheus-alertmanager integration does not generate metrics in Prometheus. It only triggers an alert, which does not allow tracking drift over time in a graph.

sebglon avatar Jun 19 '25 07:06 sebglon

You can plot alert graphs in Grafana. Have you tried setting up Alertmanager as a data source?

https://grafana.com/docs/grafana/latest/datasources/alertmanager/

matheuscscp avatar Jun 19 '25 07:06 matheuscscp

If we had a status.driftDetected field on the HelmRelease object with the reason, we could easily generate metrics and take action to fix the drift. This could work the same way for both the warn and enabled modes. Today we have to identify the HelmRelease reconciliation loops in which drift was detected, and we can't enable debug logs in production because of the volume of data they generate.

With a status field, we could easily use the kube-state-metrics exporter to generate metrics.
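To make the idea concrete, here is a sketch of a kube-state-metrics custom resource state configuration. It assumes a boolean `status.driftDetected` field, which is only the proposal above and does not exist on HelmRelease today; the metric name is also hypothetical.

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: helm.toolkit.fluxcd.io
        version: v2
        kind: HelmRelease
      metricNamePrefix: gotk
      metrics:
        - name: helmrelease_drift_detected   # hypothetical metric name
          help: "1 if the proposed status.driftDetected field is true, else 0"
          each:
            type: Gauge
            gauge:
              # Proposed field, not part of the current HelmRelease API.
              path: [status, driftDetected]
              nilIsZero: true                # report 0 when the field is unset
          labelsFromPath:
            name: [metadata, name]
            exported_namespace: [metadata, namespace]
```

With such a gauge, a single Prometheus rule could cover every namespace without one Alert object per namespace.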

sebglon avatar Jun 19 '25 08:06 sebglon

The Alert approach does not work for us. We need to monitor all HelmReleases in all our namespaces, but the Alert requires every namespace to be specified explicitly (wildcards are not supported).

This was already discussed here

sebglon avatar Jun 27 '25 09:06 sebglon

You can have one Provider+Alert per namespace

matheuscscp avatar Jun 27 '25 10:06 matheuscscp

No, we can't have one Provider+Alert per namespace, because we manage a very large infrastructure with multiple teams and products.

sebglon avatar Jun 27 '25 12:06 sebglon

Have you checked the Flux Operator ResourceSet API? It helps reduce Flux boilerplate a lot, and many people are happy with it:

https://fluxcd.control-plane.io/operator/resourcesets/introduction/

matheuscscp avatar Jun 27 '25 12:06 matheuscscp

Check also a reference architecture here: https://fluxcd.control-plane.io/guides/d2-architecture-reference/

matheuscscp avatar Jun 27 '25 12:06 matheuscscp

The Flux Operator could be an option, but it does not solve the need for metrics, or for a clear view of the drift without using debug logs.

We need a HelmRelease status condition that carries the drift details. This would help each team and tenant debug and monitor drift without using the Flux debug logs.

sebglon avatar Jun 27 '25 13:06 sebglon