Add policy for setting up Grafana alerts
This is a result of a discussion that happened here: https://github.com/dotnet/core-eng/issues/15725
All feedback is appreciated, I added people who were part of the discussion
@markwilkie @ilyas1974 @tkapin can you help me review this policy and merge it please?
Thanks for the review!
Do we want to start thinking of ways we're going to make sure the policy is being followed?
Maybe some sort of consistency check that makes sure all alerts have a corresponding guide that could be a separate leg in the PR builds for the repos that hold dashboards?
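As a sketch of what that separate PR-build leg might do (the `dashboards/` layout, the Grafana-export JSON shape with a legacy-style `alert` object per panel, and the expectation that each alert's `message` contains a guide link are all my assumptions, not anything decided in this thread):

```python
import json
import re
from pathlib import Path

# Hypothetical layout: Grafana dashboard exports live under dashboards/,
# and each alert's "message" field is expected to contain a guide link.
LINK_RE = re.compile(r"https?://\S+")

def alerts_missing_guides(root: str) -> list[str]:
    """Return 'file: alert name' for every alert whose message has no link."""
    missing = []
    for path in sorted(Path(root).glob("**/*.json")):
        data = json.loads(path.read_text())
        for panel in data.get("panels", []):
            alert = panel.get("alert")
            if alert is not None and not LINK_RE.search(alert.get("message", "")):
                missing.append(f"{path.name}: {alert.get('name', '<unnamed>')}")
    return missing

# A PR leg could then fail the build when
# alerts_missing_guides("dashboards") returns a non-empty list.
```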
My thinking was that @ilyas1974 could enforce it piece by piece as FR receives alerts and works through them. I think the person who just resolved an alert is a good candidate to describe how they did it and what resources they used along the way. It would also spread the work across the team in an amortized manner? :)
I think that makes sense for existing alerts, but I'd rather we had some way to ensure that new alerts are up to standards. We can say that it's up to PR reviewers to ensure that happens, but it's very easy for them to be missed.
Once we've made our existing alerts better, I'd like to avoid us being in a cycle of:
- Alert gets checked in without a corresponding guide for x or y reason
- FR has to scramble to solve the alert the first time it happens
- Now FR person who suffered through it has to also remediate the missing guides.
for every new alert.
Given that @ilyas1974 is diving into the "docs are a jungle" work, I'll let him take a look.
I love the fact that we'd have guidance. I also agree with Ricardo that it's important to include how we plan to get visibility into how we're doing against the policy.
We could have a warning in the helix-services pipeline that would verify the presence of a link in the message attribute in the JSON?
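That pipeline warning could look roughly like this; the JSON shape (panels with an embedded legacy-style alert object) is an assumption on my part, and `##vso[task.logissue type=warning]` is the standard Azure Pipelines logging command for surfacing a build warning:

```python
import json
import re
from pathlib import Path

LINK_RE = re.compile(r"https?://\S+")

def warn_missing_links(dashboard_file: str) -> int:
    """Print an Azure Pipelines warning for each alert message without a link."""
    count = 0
    data = json.loads(Path(dashboard_file).read_text())
    for panel in data.get("panels", []):
        alert = panel.get("alert")
        if alert and not LINK_RE.search(alert.get("message", "")):
            # The "##vso[task.logissue type=warning]" prefix makes the line
            # show up as a build warning in Azure Pipelines.
            print(f"##vso[task.logissue type=warning]Alert "
                  f"'{alert.get('name', '?')}' in {dashboard_file} has no "
                  f"link in its message attribute.")
            count += 1
    return count
```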
Maybe we put 👍 / 👎 feedback buttons on alerts? (Only kinda kidding.)
A warning would get ignored (our builds have had hundreds of warnings for ages... every now and then I go clean them up, but in general, no one looks at things that aren't failing).
I'm sort of fine with this being a code review/after the fact when an alert triggers situation. Honestly, for most alerts, until they trigger the first time, it's hard to write that documentation, because you don't necessarily know what investigation is helpful. And if it never actually triggers (we have a lot of really conservative alerts that basically boil down to "the sky is falling"), spending a bunch of time to write up a bunch of documentation isn't a huge win (in terms of time spent or the complexity of our documentation). Also, the mere presence of the stuff isn't really enough... it's easy to write useless comments/docs, so a code reviewer really should make sure that the documentation is understandable to someone other than the person who wrote it... so there is already a burden on reviewers.
> A warning would get ignored (our builds have had hundreds of warnings for ages... every now and then I go clean them up, but in general, no one looks at things that aren't failing).
Correct me if I'm wrong, but OSOB and helix-services have 0 warnings, and we tend to them because every warning always serves a function. At least in OSOB every warning does; I wasn't able to find one in helix-services? Or are they silenced there?
> Also, the mere presence of the stuff isn't really enough... it's easy to write useless comments/docs, so a code reviewer really should make sure that the documentation is understandable to someone other than the person who wrote it... so there is already a burden on reviewers.
I agree that a review for alert docs is important and that it's hard to get them right, but the mere presence of links leading to the right place (e.g. links to the correct AppInsights with logs) would go a long way from where we are now. The way we name things in the cloud is sometimes quite unfortunate, and since I personally don't work with our services often, to this day I still haven't memorized many of them, so each alert investigation starts with me trying to find the right resource.
Whatever we agree on, I would like to lower the bar for any kind of doc improvement as much as possible (that is what I am trying to achieve with this whole effort).
arcade-services and helix-services only don't have warnings because I rather regularly go through and clean them up (usually when I'm doing other sweeping changes, and they are making it hard for me to tell what I might have broken). Anything that doesn't fail a PR, especially in arcade services (where you likely won't even look at the build if it passes, because it will just be a green dot in GitHub) really doesn't affect behavior much, so generally isn't worth writing/maintaining.
I like the idea of using the wiki for most of the actual documentation, and leaving the alerts/panels pretty low weight. Editing the wiki is basically zero-barrier. Editing the panel/alert and then importing that into the dashboard, and then making a PR, just to add a little extra context is a HUGE burden, and likely why no one bothers. :-) I wish the AzDO wiki was really a wiki, and you could have dead links that lead to a "create this page" thing... then we could just automatically generate the links for every alert. BUT, AzDO wikis are, in fact, not wikis in that way, so we can't. (GitHub wikis were, so it was awesome when we had one of those.)
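If we ever did auto-generate a link for every alert, a deterministic mapping from alert name to wiki page could be as small as this sketch; the base URL and the `/Alerts/<name>` naming scheme here are made up for illustration, not the real dnceng wiki layout:

```python
from urllib.parse import quote

# Hypothetical wiki base URL and page-naming scheme; the real AzDO
# wiki path would differ.
WIKI_BASE = "https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng/?pagePath="

def alert_guide_url(alert_name: str) -> str:
    """Map an alert name deterministically to a wiki page URL."""
    page = "/Alerts/" + alert_name.strip().replace(" ", "-")
    return WIKI_BASE + quote(page, safe="")
```

Generating the link up front at least means every alert points somewhere consistent, even if the page still has to be created by hand.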
@garath @riarenas I feel like we're kind of following this policy even without it being checked in. However, when we define a new alert, it might be helpful to point people somewhere. Maybe we can just merge this and be done with it?
I'm fine with this. @ilyas1974 should review as well as he has been doing work in normalizing the format of FR-ish wiki things, including alerts.
Oops, sorry, I didn't notice that the second approval was from Ilya, not Matt.