grafana-operator icon indicating copy to clipboard operation
grafana-operator copied to clipboard

feat: initial alerting support

Open theSuess opened this issue 1 year ago • 9 comments

This PR adds initial alerting support as described in the Alerting Support Proposal.

To keep changes small, this PR only adds the AlertRuleGroup custom resource.

This is also the first resource implemented using the new auto generated client (relevant for #1357)

theSuess avatar Feb 15 '24 12:02 theSuess

To modernize the code & API a bit, I've also performed the following actions for the new CR:

  • Use kubernetes conditions instead of lastMessage and NoMatchingInstances
  • Use finalizers instead of relying on instance status fields for cleanup

If there are reasons why these paradigms were not adopted in the first place, please let me know and I'll get this in line with other resources

theSuess avatar Feb 15 '24 12:02 theSuess

Waiting for https://github.com/grafana/grafana-openapi-client-go/pull/75 to be merged before removing the draft status from the PR

theSuess avatar Feb 15 '24 13:02 theSuess

@theSuess I'm currently looking through everything, but it's not a small PR, so it takes some time ;).

RBAC

Here our CI is a bit lacking, but there is no magic to get new RBAC rules in to kustomize or helm. Sadly, you will have to copy that yaml manually.

Tests

As you have seen in the code we haven't written many tests, so I can definitely live with merging this PR without having a bunch of them, even though it of course would be ideal to add some. But it would be nice to at least add some e2e test for alerts. This way we have some kind of basic knowledge that it's working. Take a look in https://github.com/grafana/grafana-operator/tree/master/tests/e2e/example-test for inspiration on how to set it up.

Comment

About conditions, I think we have some old issue about looking in to it, but I think we just never got that far. So I'm happy to see it used.

About finalizers, lets talk about it during our next meeting :)

NissesSenap avatar Feb 19 '24 09:02 NissesSenap

Will it be able to provision both grafana managed alerts and datasource managed alers ?

lsoica avatar Feb 19 '24 13:02 lsoica

I started to play around with the alertgroup and I managed to create the following issue. To recreate, I ran sh hack/kind/start-kind.sh to create my kind cluster. Then I added applied config/samples/grafana_v1beta1_grafanaalertrulegroup.yaml.

2024-02-19T15:28:43+01:00       ERROR   updating status {"controller": "grafanaalertrulegroup", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaAlertRuleGroup", "GrafanaAlertRuleGroup": {"name":"grafanaalertrulegroup-sample","namespace":"default"}, "namespace": "default", "name": "grafanaalertrulegroup-sample", "reconcileID": "e60b64c1-74f8-4c9d-ac22-335839f2b149", "error": "Operation cannot be fulfilled on grafanaalertrulegroups.grafana.integreatly.org \"grafanaalertrulegroup-sample\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile.func1
        /home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:102
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile
        /home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/edvin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

Initially I thought this was related to the error I saw In grafana about the folder missing, which isn't great and we probably need some kind of check for that as well.

logger=ngalert t=2024-02-19T14:27:00.00159844Z level=error msg="failed to fetch alert rule namespace" err="folder not found" uid=4843de5c-4f8a-4af0-9509-23526a04faf8 org=1 namespace_uid=f9b0a98d-2ed3-45a6-9521-18679c74d4f1

I don't have any more time to look into why this error happens but it could also be something around datasources, it seems like Alert actually want to try to reach the datasource instead of just saying that it's there like a dashboard can live with.

Another thing I noticed was that we should probably bump our default grafana instance to 10 before releasing this feature. If you don't have version 10, there is no export function for alerts, which makes this feature more or less impossible to use.

Another issue I faced was that I couldn't create a folder with a specific UID. Which created issues when I wanted to create the alert using the operator.

If anyone else will be playing around with folders I recommend these two curl commands.

# List all folders
curl -X GET -H Accept:application/json -H Content-Type:application/json -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders
# Create a folder with a specific UID
curl -X POST -H Accept:application/json -H Content-Type:application/json -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders -d '{"uid":"f9b0a98d-2ed3-45a6-9521-18679c74d4f1", "title":"foo"}'

We might need to solve the folder creation similarly as we did in dashboards, to make sure we didn't get any strange errors. Or we need to update grafanaFolder to take UID as an option so we can set it there.

NissesSenap avatar Feb 19 '24 15:02 NissesSenap

the object has been modified; please apply your changes to the latest version and try again

This happens when we try to update the status on a resource which has been changed by another client. It might be caused by adding the finalizer. I'll build a fix for this

Regarding folders: what's your take on adding a folderSelector field in the alertrulegroup? This would allow users to select a folder CR instead of having to find the UID themselves

Adding a UID to the folder spec would work as well. I don't think it makes sense to dynamically create folders for alert rule groups as multiple alert rule groups can be stored in the same folder which would require further matching logic

theSuess avatar Feb 20 '24 10:02 theSuess

Alright, most issues ironed out.

We now support a folderSelector as well as folderUID (which overrides the selector)

The only issue is that there is no way to enforce at least one field must be set without an admission hook so currently the CRD just requires the folderSelector to be set.

I've also added an e2e test & fixed the instrumentedroundtripper

theSuess avatar Feb 21 '24 16:02 theSuess

@theSuess could an option be to use validation rules instead of an adminsion webhook? It was in beta in 1.25 and GA in 1.29, so I think it's reasonable that we can use it now. https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules

NissesSenap avatar Feb 22 '24 08:02 NissesSenap

@theSuess could an option be to use validation rules instead of an adminsion webhook? It was in beta in 1.25 and GA in 1.29, so I think it's reasonable that we can use it now. https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules

oooh that's interesting - will see if I can get this working with kubebuilder

theSuess avatar Feb 22 '24 08:02 theSuess

Validation rules seem to work! Be careful with them though, if you get them wrong while having existing resources, it can be hard to get rid of them

theSuess avatar Feb 23 '24 13:02 theSuess

Will it be able to provision both grafana managed alerts and datasource managed alers ?

@lsoica this will only support grafana managed alerts. Datasource managed alerts are much more complex and would require a direct connection to the datasource which is out of scope for the operator.

theSuess avatar Feb 27 '24 09:02 theSuess

Regarding the folder selector: maybe we can simplify this to just be a folderRef? Selectors only make sense if it's possible to match multiple resources, which does not make sense for targeting a single folder.

Thoughts?

theSuess avatar Feb 27 '24 09:02 theSuess

Regarding the folder selector: maybe we can simplify this to just be a folderRef? Selectors only make sense if it's possible to match multiple resources, which does not make sense for targeting a single folder.

Thoughts?

Agree, it sounds very resonable.

NissesSenap avatar Feb 28 '24 13:02 NissesSenap

Switched over to folderRef in tests & examples now. All open topics should thus be resolved :crossed_fingers:

theSuess avatar Feb 29 '24 09:02 theSuess

I took the freedom to change a few labels and run some make commands to be more consistent with the current setup. Added a few minor comments, but I think this looks great. I can't wait for the feature, my self. Great job @theSuess !

NissesSenap avatar Mar 05 '24 15:03 NissesSenap

Currently testing this using the provided example. I saw that the alerting rule group got applied:

  status:
    conditions:
    - lastTransitionTime: "2024-03-08T10:16:02Z"
      message: Alert Rule Group was successfully applied to 1 instances
      observedGeneration: 3
      reason: ApplySuccesfull
      status: "True"
      type: AlertGroupSynchronized

however, in Grafana I don't see it:

grafik

Am I missing something?

pb82 avatar Mar 08 '24 10:03 pb82

@pb82 , I think the problem is that you actually have to have a Prometheus datasource that is actually talking to something, and without it, the Alert UI/API becomes a bit weird. Which of course make the whole thing a bit harder to setup as an example.

NissesSenap avatar Mar 08 '24 10:03 NissesSenap

@NissesSenap Thanks, I'll spin up a Prometheus instance and try with that.

pb82 avatar Mar 08 '24 10:03 pb82

I've tried to implement a safeguard for alerting on older Grafana versions but was blocked by not being able to infer the version of the instance. Tracking this in #1451

theSuess avatar Mar 12 '24 14:03 theSuess