grafana-operator
grafana-operator copied to clipboard
Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences
Is your feature request related to a problem? Please describe. No existing problem.
Describe the solution you'd like Grafana Alerting is a great new feature, but the current grafana-operator does not support a CRD definition for the decoupled alerts, contact points, alert groups, notification policies and silences. The desired solution would be CRDs for each of these so that the new alerting tools are equally accessible via yaml.
kind: GrafanaAlertRule kind: GrafanaAlertGroup kind: GrafanaContactPoint kind: GrafanaNotificationPolicy kind: GrafanaSilence
Describe alternatives you've considered A workaround being considered is to manually mount the relevant config files into the running container via VolumeMount from a ConfigMap containing raw file contents. The only other alternative I am aware of is to manually enter these into the Grafana web console, which isn't the point.
This feature request is not new but has been reopened since the previous one was closed: https://github.com/grafana-operator/grafana-operator/issues/564
This could be interesting for v5 (https://github.com/grafana-operator/grafana-operator-experimental).
@R-Studio i've transferred your request to the new repository, I hope you don't mind!
Any news on this?
So I have looked through the API of Alerts.
In your request, you mention:
- kind: GrafanaAlertRule
- kind: GrafanaAlertGroup
- kind: GrafanaContactPoint
- kind: GrafanaNotificationPolicy
- kind: GrafanaSilence
Thoughts
So my understanding of GrafanaAlertGroup
is that it's just a visualization of groups that you define in your AlertRule, so there is no reason for us to provide it as a CRD.
kind: GrafanaSilence
I don't think silence is something that a normal user ever would do through a CRD. You just go in and silence the current alert, and you provide a reason and for how long in the UI.
If we added a CRD for GrafanaContactPoint
I think we would also need to create a separate CRD for GrafanaContactPointTemplate
(message template) or something like that.
When it comes to GrafanaNotificationPolicy
we once again have to add another functionality which is Mute timing
which is another API. Since they can be reused by multiple Notification policies it's probably best to have a seperate CRD for that one as well, something like GrafanaNotificationMuteTiming
.
But all of the above things are possible to solve and could still be "user friendly" as a CRD.
So we have two big issues left.
Big issues
Let's start with the core feature of GrafanaAlertRule
, so there is no way to export the rule through the UI (at least not what I can find).
So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.
The other option is that you write the alert through the UI and then export the alerts by doing API requests to Grafana. Sure, we could write a small tool to do this, but I don't think it's reasonable for a normal developer to do this as a part of writing an alert.
The second big issue is that Silence contains state, since it's not reasonable for a developer to create a CRD to mute an alert this state needs to be stored else your silence will disappear when your grafana instance is restarted. So this would more or less force everyone to start using postgres to run grafana.
In short, I think this is a really hard issue to solve, and I don't have any good solutions for it, especially the way of working around alert management.
Possible workaround
So if you are a prometheus-operator user I would recommend using the AlertManager, grafana supports importing external alert managers as a datasource and it got support to write back silences to the alertmanager. That way the alerts silance is stored in the alert manager which already got state. This is how I solved it at my last company.
Alerts management creation isn't amazing in prometheus-operator CRD ether, but if you are used to writing promql it's not a big problem.
The problem with this is of course, what about those of us that uses Grafana for none prometheus datasources like GCP, well to be honest we are kind of out of luck when it comes to the perfect GitOps alerts as code workflow that we all are dreaming of. Sure it might be possible to use GCP as a datasource for your alerts as well (I haven't tried) but you probably want to keep your developers in the grafana UI and not go over to GCP just to mute a alert.
Conclusion
If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts. If you have mutliple datasources like grafana you probably won't be able to have the perfect GitOps workflow with alerts as code. You will probably have to solve it by doing klick ops together with a postgres database together with grafana, else you are probably forced to manage mutliple UIs.
I don't think there is any reasonable way to solve this through the grafana-operator and creating new CRD
Please don't agree with my, I would love to hear an idea how these issues could be solved because I want this feature as well.
What are your thoughts on this @R-Studio and the other people that want this feature. Do you agree with my current assessment? If not, please share your thoughts on how we should solve these issues.
So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.
Could you say more about what this means? My naive assumption (keeping in mind that although I'm a user of this project and grafana itself I'm not an authority on the dev SDKs) is that there exists an API to perform CRUD ops on these rules, as that API must be what the grafana web console is interacting with. Thus the CRD would be a yaml-definable mapping to that same API interface.
As to the statefulness of one of more attributes.. I could see value in having a field be specifyable as "only_on_create" or whatever a good name is to indicate that the value of said field in yaml is to be set on original creation and then no longer managed by the CRD. This would allow as-code creation of alerts and then subsequent management of, for example, silencing to be done via the web console.
FWIW in my use cases the primary value of the CRD is that I have as-code definitions of my grafana objects for better gitops. I want to roll out newly developed alerts and dashboards through each environment with low friction.
So how it would look isn't an easy question to answer and It would take many hours to come up with a suggestion that would even start to being good enough to discuss. I don't have that time, but if anyone else does, please go for it. I'm happy to discuss it.
But I created a small app to try to get a feeling for the API.
Looking at the SDK it's very focused on folders and groups which would make this even harder. Apparently you add your alert to a folder.
My app will list all alerts that you have in a specific group called group
and a folder called alerts
.
package main
import (
"fmt"
"net/url"
gapi "github.com/grafana/grafana-api-golang-client"
)
func main() {
// Create gapi config
config := gapi.Config{
BasicAuth: url.UserPassword("root", "secret"),
}
// Create a new client
client, err := gapi.New("http://localhost:3000", config)
if err != nil {
panic(err)
}
folderList, err := client.Folders()
if err != nil {
panic(err)
}
// Look through the list of folders and get the uid
for _, folder := range folderList {
fmt.Println(folder.UID)
listRules, err := client.AlertRuleGroup(folder.UID, "group")
if err != nil {
panic(err)
}
fmt.Println(listRules)
}
}
Basic grafana instance
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
name: grafana
labels:
dashboards: "grafana"
spec:
config:
log:
mode: "console"
auth:
disable_login_form: "false"
security:
admin_user: root
admin_password: secret
I port-forward to my instance and I created an alert manually in my grafana instance and then ran my script. You will of course have to adapt your folder and group settings to match your alert definition.
It provides the following output.
{
group rDzwzO8Vz 60 [
{map[description:if this happens the world has ended
] A [
0xc000180230
] Alerting rDzwzO8Vz 1 map[hello:world
] NoData 1 group test xu9Qkd8Vz 2023-05-12 06: 14: 35 +0000 UTC 5m 0s false
}
]
}
So if you want to give the CRD definition I would start reading the specs and playing around a bit with the SDK because I personally don't think the API documentation is great... there are no examples in the output.
https://grafana.com/docs/grafana/latest/developers/http_api/alerting_provisioning/ https://editor.swagger.io/?url=https://raw.githubusercontent.com/grafana/grafana/main/pkg/services/ngalert/api/tooling/post.json
If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts.
Totally agree. I'm also using with prom-operator and it works fine.
Grafana alerting is a good feature in the UI but I think that APIs are not mature enough to work with third party software (like the operator).
Somehow I missed this issue when I searched for relevant ones but here's my two cents.
- Given the scope of this, I believe it may be necessary to build this iteratively. Regardless of any specific design, supporting this will create more CRDs than the project has at all already. It's no small task but it's not necessary to have everything at once IMO.
- I believe the Grafana APIs that are necessary are stable enough to begin work on this as only the "provisioning" API is required for what the grafana-operator will do. The "unstable" part are the proxied AlertManager/Prometheus APIs that aren't about defining rules/policies/etc.
- It is necessary that one pays attention to the complimentary nature of Grafana's unified alerting that extends (and somewhat replaces parts of) the Prometheus/AlertManager stack. For the grafana-operator this means ensuring that the functionality is designed with the existing prometheus-operator in mind. Specifically, I think in a first step one can completely ignore Grafana's AlertRules and defering that to
PrometheusRules
.
I'd be willing to draw up a proposal/design for this, however I can't assist with the implementation in the near future sadly.
@siegenthalerroger , creating a design proposal is a great start. We have been trying to do it through PRs since it's a simple way of providing feedback. You can view a design document that we have done earlier here: https://github.com/grafana-operator/grafana-operator/tree/master/docs/docs/proposals
How the document should look isn't set in stone, but if you take a look at, you will see what we are after.
Looking forward to seeing the design :)
@siegenthalerroger I'd be willing to help out here - I agree, that an iterative approach is good (enough) here. As for implementation - I hope to be able to help out here, but time is also an issue sadly.
@rammelmueller please create a design document as a first step. Looking forward to a PR around it
@rammelmueller I've created an initial proposal that's missing any/all of the specification for the new CRDs. Feel free to reach out if you want to assist.
Hi. I just wanted to ask if there is already more than just the proposal, like a decision it's going to happen or not?
@AlexEndris Hey! it is happening, @theSuess is planning on picking it up in the coming days I believe
That's great. I'd be willing to support, if I can, although I have 0 go experience. But I surely could play around with test versions.
We got an initial PR that can be viewed here: https://github.com/grafana/grafana-operator/pull/1420
We would love some feedback around https://github.com/grafana/grafana-operator/pull/1420, if you can try it out and see that it's working as you expect that would be great. We are currently going through the final reviews and hopefully we should merge the feature soon.
@NissesSenap I hope to find time to test it in the next few weeks.
Closing this issue as initial alerting support has been implemented in #1420.
Follow-up issues for Contact Points & Notification Policies are #1455 and #1454
There is no issue for implementing GrafanaAlertRule I could track, yet. Right? Or was this already implemented by #1420?
I believe alert rules are sub-resources of the GrafanaAlertGroup
CR, see https://github.com/grafana/grafana-operator/blob/ba4734b3d6d560352765327f96ae589d74825193/api/v1beta1/grafanaalertrulegroup_types.go#L58
Oh, then I must have missed that completely. Thank you! And Thx @theSuess for the implementation. Came just in time when I needed it! :)