Drift detection in warn mode: expose a status field or metrics for alerting
Hi, we need to monitor and be alerted when the helm-controller detects drift. Can we have a status field or a metric to implement alerting?
We have .spec.driftDetection.mode set to warn: https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection
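For context, a minimal sketch of a HelmRelease with drift detection in warn mode. The release name, namespace, chart, and source below are hypothetical placeholders; only the `driftDetection` block is the setting discussed here.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo            # hypothetical release name
  namespace: apps          # hypothetical namespace
spec:
  interval: 10m
  chart:
    spec:
      chart: podinfo       # hypothetical chart
      sourceRef:
        kind: HelmRepository
        name: podinfo      # hypothetical source
  driftDetection:
    mode: warn             # drift is reported but not corrected
```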
We send an event to notification-controller when a drift is detected, so you can choose any of these notification providers to send the alert to:
https://fluxcd.io/flux/components/notification/providers/#type
We use unified monitoring and alerting with Prometheus. For that, we need metrics to monitor drift and its resolution.
We use unified monitoring and alerting with Prometheus.
So you can use the Prometheus Alertmanager integration: https://fluxcd.io/flux/components/notification/providers/#prometheus-alertmanager
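A rough sketch of what that integration could look like for a single namespace. The object names, the `apps` namespace, and the Alertmanager address are assumptions for illustration; adjust them to your cluster.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager       # hypothetical name
  namespace: apps          # hypothetical namespace
spec:
  type: alertmanager
  # Assumed in-cluster Alertmanager address
  address: http://alertmanager.monitoring.svc:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: helmrelease-events # hypothetical name
  namespace: apps
spec:
  providerRef:
    name: alertmanager
  eventSeverity: info      # forwards info and error events
  eventSources:
    - kind: HelmRelease
      name: '*'            # all HelmReleases in this Alert's namespace
```

Note that this forwards all HelmRelease events from that namespace, not only drift events.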
The prometheus-alertmanager integration does not generate metrics in Prometheus; it only triggers an alert. That does not allow us to track drift on a graph.
You can plot alert graphs in Grafana. Have you tried setting up Alertmanager as a data source?
https://grafana.com/docs/grafana/latest/datasources/alertmanager/
If we had a status.driftDetected field on the HelmRelease object, along with the reason, we could easily generate metrics and take action to fix the drift.
This could work the same way for both warn and enabled modes.
Today we have to identify HelmRelease reconciliation loops with driftDetection from the logs.
And we can't enable debug logs in production because of the volume of data they generate.
With a status field, we could easily use the kube-state-metrics exporter to generate metrics.
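To illustrate the idea, here is a sketch of a kube-state-metrics custom resource state configuration that would turn such a field into a metric. This is an assumption-heavy example: `status.driftDetected` is the field proposed in this issue and does not exist today, and the metric name is made up.

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: helm.toolkit.fluxcd.io
        version: v2
        kind: HelmRelease
      metricNamePrefix: gotk
      metrics:
        - name: "helmrelease_drift_info"   # hypothetical metric name
          help: "Drift state of a Flux HelmRelease (relies on a proposed status field)."
          each:
            type: Info
            info:
              labelsFromPath:
                name: [metadata, name]
          labelsFromPath:
            exported_namespace: [metadata, namespace]
            # HYPOTHETICAL: this status field is the one requested in this issue
            drift_detected: [status, driftDetected]
```

With something like this in place, a single Prometheus rule on the `drift_detected` label could cover every namespace without any per-namespace Flux objects.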
The Alert approach is not usable for us. We need to monitor all HelmReleases in all our namespaces.
But the Alert requires every namespace to be specified explicitly (wildcards are not supported).
Already discussed here
You can have one Provider+Alert per namespace
No, we can't have one Provider+Alert per namespace, because we manage a very large infrastructure with multiple teams and products.
Have you checked Flux Operator ResourceSet API? It helps you reduce Flux boilerplate a lot, many people are happy with that:
https://fluxcd.control-plane.io/operator/resourcesets/introduction/
Check also a reference architecture here: https://fluxcd.control-plane.io/guides/d2-architecture-reference/
Why not the Flux Operator, but it does not solve the metrics problem, nor does it give a clear view of the drift without using debug logs.
We need a HelmRelease status condition with the drift details. This would help every team and tenant debug and monitor drift without using the Flux debug logs...
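To make the request concrete, the status could look roughly like the sketch below. This is purely hypothetical: the `DriftDetected` condition type, its reason, and its message are illustrations of the proposal, not anything helm-controller emits today. Only the field layout follows the standard Kubernetes condition shape.

```yaml
# HYPOTHETICAL HelmRelease status, illustrating the proposal only
status:
  conditions:
    - type: DriftDetected            # proposed condition type, does not exist today
      status: "True"
      reason: DriftDetected          # proposed reason
      message: "<summary of the drifted resources would go here>"
      lastTransitionTime: "<timestamp>"
      observedGeneration: 1
```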