grafana-operator
feat: initial alerting support
This PR adds initial alerting support as described in the Alerting Support Proposal.
To keep changes small, this PR only adds the AlertRuleGroup
custom resource.
This is also the first resource implemented using the new auto generated client (relevant for #1357)
To modernize the code & API a bit, I've also performed the following actions for the new CR:
- Use Kubernetes conditions instead of `lastMessage` and `NoMatchingInstances`
- Use finalizers instead of relying on instance status fields for cleanup
If there are reasons why these paradigms were not adopted in the first place, please let me know and I'll bring this in line with the other resources.
Waiting for https://github.com/grafana/grafana-openapi-client-go/pull/75 to be merged before removing the draft status from the PR
@theSuess I'm currently looking through everything, but it's not a small PR, so it takes some time ;).
RBAC
Here our CI is a bit lacking, but there is no magic to get new RBAC rules into kustomize or helm. Sadly, you will have to copy that YAML manually.
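For reference, a rough sketch of the rules that would need to be copied for the new CR; the exact verbs and grouping are my assumption based on typical kubebuilder output, not taken from this PR:

```yaml
# Assumed shape of the generated RBAC rules for the new CR; these would be copied
# into the kustomize/helm ClusterRole alongside the existing grafana.integreatly.org rules.
- apiGroups: ["grafana.integreatly.org"]
  resources: ["grafanaalertrulegroups"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["grafana.integreatly.org"]
  resources: ["grafanaalertrulegroups/status"]
  verbs: ["get", "update", "patch"]
- apiGroups: ["grafana.integreatly.org"]
  resources: ["grafanaalertrulegroups/finalizers"]
  verbs: ["update"]
```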
Tests
As you have seen in the code, we haven't written many tests, so I can definitely live with merging this PR without a bunch of them, even though it would of course be ideal to add some. But it would be nice to at least add an e2e test for alerts, so we have some basic confidence that it's working. Take a look at https://github.com/grafana/grafana-operator/tree/master/tests/e2e/example-test for inspiration on how to set it up.
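As a starting point, a minimal sketch of what such an e2e assertion could look like, assuming the suite uses kuttl-style apply/assert manifests (the step layout is my assumption, not the existing test structure):

```yaml
# Hypothetical assert step: after applying the sample GrafanaAlertRuleGroup,
# wait for the operator to report the synchronized condition as True.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: grafanaalertrulegroup-sample
status:
  conditions:
    - type: AlertGroupSynchronized
      status: "True"
```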
Comment
About conditions, I think we have some old issue about looking into it, but we just never got that far. So I'm happy to see them used.
About finalizers, let's talk about it during our next meeting :)
Will it be able to provision both Grafana-managed alerts and datasource-managed alerts?
I started to play around with the alert rule group and managed to create the following issue.
To recreate, I ran `sh hack/kind/start-kind.sh` to create my kind cluster. Then I applied `config/samples/grafana_v1beta1_grafanaalertrulegroup.yaml`.
```
2024-02-19T15:28:43+01:00 ERROR updating status {"controller": "grafanaalertrulegroup", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaAlertRuleGroup", "GrafanaAlertRuleGroup": {"name":"grafanaalertrulegroup-sample","namespace":"default"}, "namespace": "default", "name": "grafanaalertrulegroup-sample", "reconcileID": "e60b64c1-74f8-4c9d-ac22-335839f2b149", "error": "Operation cannot be fulfilled on grafanaalertrulegroups.grafana.integreatly.org \"grafanaalertrulegroup-sample\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile.func1
	/home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:102
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile
	/home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
```
Initially I thought this was related to the error I saw in Grafana about the folder missing, which isn't great; we probably need some kind of check for that as well.
```
logger=ngalert t=2024-02-19T14:27:00.00159844Z level=error msg="failed to fetch alert rule namespace" err="folder not found" uid=4843de5c-4f8a-4af0-9509-23526a04faf8 org=1 namespace_uid=f9b0a98d-2ed3-45a6-9521-18679c74d4f1
```
I don't have any more time to look into why this error happens, but it could also be something around datasources: it seems like alerting actually wants to reach the datasource, instead of just accepting that it exists the way a dashboard can.
Another thing I noticed: we should probably bump our default Grafana instance to version 10 before releasing this feature. If you don't have version 10, there is no export function for alerts, which makes this feature more or less impossible to use.
Another issue I faced was that I couldn't create a folder with a specific UID, which caused problems when I wanted to create the alert using the operator.
If anyone else wants to play around with folders, I recommend these two curl commands.
```sh
# List all folders
curl -X GET -H Accept:application/json -H Content-Type:application/json \
  -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders

# Create a folder with a specific UID
curl -X POST -H Accept:application/json -H Content-Type:application/json \
  -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders \
  -d '{"uid":"f9b0a98d-2ed3-45a6-9521-18679c74d4f1", "title":"foo"}'
```
We might need to solve folder creation similarly to how we did it for dashboards, to make sure we don't get any strange errors. Or we need to update GrafanaFolder to take a UID as an option so we can set it there.
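To make the second option concrete, a hypothetical sketch of what a GrafanaFolder with a user-supplied UID could look like; the `uid` field is the proposal here, not an existing API, and the rest of the spec layout is assumed:

```yaml
# Hypothetical: pin the folder UID so an alert rule group can reference it reliably.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: alert-folder
spec:
  uid: f9b0a98d-2ed3-45a6-9521-18679c74d4f1  # proposed field, not yet in the CRD
  title: foo
  instanceSelector:
    matchLabels:
      dashboards: grafana
```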
> the object has been modified; please apply your changes to the latest version and try again
This happens when we try to update the status on a resource that has been changed by another client. It might be caused by adding the finalizer. I'll build a fix for this.
Regarding folders: what's your take on adding a `folderSelector` field to the alert rule group? This would allow users to select a folder CR instead of having to find the UID themselves.
Adding a UID to the folder spec would work as well. I don't think it makes sense to dynamically create folders for alert rule groups, as multiple alert rule groups can be stored in the same folder, which would require further matching logic.
Alright, most issues ironed out.
We now support a `folderSelector` as well as a `folderUID` (which overrides the selector).
The only issue is that there is no way to enforce that at least one of the fields is set without an admission hook, so currently the CRD just requires the `folderSelector` to be set.
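A minimal sketch of the two options, assuming the spec layout below (the label-selector shape of `folderSelector` is my assumption):

```yaml
# Targeting a folder by UID takes precedence over the selector when both are set.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: grafanaalertrulegroup-sample
spec:
  folderUID: f9b0a98d-2ed3-45a6-9521-18679c74d4f1
  # alternatively, select a GrafanaFolder CR by its labels:
  # folderSelector:
  #   matchLabels:
  #     app: alerting
```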
I've also added an e2e test & fixed the `instrumentedroundtripper`.
@theSuess could an option be to use validation rules instead of an admission webhook? They were in beta in 1.25 and GA in 1.29, so I think it's reasonable that we can use them now. https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules
oooh that's interesting - will see if I can get this working with kubebuilder
Validation rules seem to work! Be careful with them though: if you get them wrong while existing resources are present, it can be hard to get rid of them.
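For context, a minimal sketch of such a rule as a CEL expression on the spec schema, using the field names at this point in the discussion (`folderUID`/`folderSelector`, later renamed); the exact rule in the PR may differ:

```yaml
# x-kubernetes-validations on the spec: require at least one way to target a folder.
x-kubernetes-validations:
  - rule: "has(self.folderUID) || has(self.folderSelector)"
    message: "either folderUID or folderSelector must be set"
```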
> Will it be able to provision both Grafana-managed alerts and datasource-managed alerts?
@lsoica this will only support Grafana-managed alerts. Datasource-managed alerts are much more complex and would require a direct connection to the datasource, which is out of scope for the operator.
Regarding the folder selector: maybe we can simplify this to just be a `folderRef`? Selectors only make sense if it's possible to match multiple resources, which does not make sense for targeting a single folder.
Thoughts?
Agree, it sounds very reasonable.
Switched over to `folderRef` in tests & examples now. All open topics should thus be resolved :crossed_fingers:
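For anyone following along, a rough sketch of how the final shape reads; everything beyond `folderRef` itself is assumed, and the real example lives in config/samples/grafana_v1beta1_grafanaalertrulegroup.yaml:

```yaml
# Reference a GrafanaFolder CR by name instead of hunting down its UID.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: grafanaalertrulegroup-sample
spec:
  folderRef: alert-folder  # name of a GrafanaFolder CR (assumed semantics)
  instanceSelector:
    matchLabels:
      dashboards: grafana
  # rules omitted; see the sample manifest for a full alert rule definition
```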
I took the liberty of changing a few labels and running some make commands to be more consistent with the current setup. Added a few minor comments, but I think this looks great. I can't wait for the feature myself. Great job @theSuess!
Currently testing this using the provided example. I saw that the alerting rule group got applied:
```yaml
status:
  conditions:
    - lastTransitionTime: "2024-03-08T10:16:02Z"
      message: Alert Rule Group was successfully applied to 1 instances
      observedGeneration: 3
      reason: ApplySuccesfull
      status: "True"
      type: AlertGroupSynchronized
```
However, in Grafana I don't see it.
Am I missing something?
@pb82, I think the problem is that you actually need a Prometheus datasource that is talking to something; without it, the alerting UI/API becomes a bit weird. That of course makes the whole thing harder to set up as an example.
@NissesSenap Thanks, I'll spin up a Prometheus instance and try with that.
I've tried to implement a safeguard for alerting on older Grafana versions but was blocked by not being able to infer the version of the instance. Tracking this in #1451