
Alerting improvements

Open bohnjamin opened this issue 6 years ago • 1 comments

Some alerting ideas. I can break these out into separate suggestions if need be, but I felt the webhooks/email sections would mostly become irrelevant if building alert templates were a possibility.

Flap detection

Not on by default; opt-in only.

  • Allow the system to detect that an alert keeps firing and suppress notifications for x minutes after y alerts have triggered within z minutes (x, y and z all configurable), with configurable reset parameters
    • e.g. if the alert hasn't fired for 1 hour, reset the suppression detection
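
To make the x/y/z idea concrete, here is a rough Python sketch of how the suppression could work. The class name, parameter names, and the choice to still send the y-th notification before suppressing are all my assumptions, not anything Scout has today.

```python
from collections import deque


class FlapSuppressor:
    """Hypothetical sketch: after `max_alerts` firings within `window_s`
    seconds, suppress notifications for `suppress_s` seconds. If the alert
    has not fired at all for `reset_s` seconds, forget everything."""

    def __init__(self, max_alerts, window_s, suppress_s, reset_s):
        self.max_alerts = max_alerts    # y
        self.window_s = window_s        # z
        self.suppress_s = suppress_s    # x
        self.reset_s = reset_s          # quiet time that resets detection
        self.recent = deque()           # timestamps of notified firings
        self.suppressed_until = None
        self.last_fired = None

    def should_notify(self, now):
        """Record a firing at time `now` (seconds); return True to notify."""
        # Quiet for long enough: reset the suppression detection.
        if self.last_fired is not None and now - self.last_fired >= self.reset_s:
            self.recent.clear()
            self.suppressed_until = None
        self.last_fired = now

        # Still inside a suppression period: swallow the notification.
        if self.suppressed_until is not None and now < self.suppressed_until:
            return False

        # Drop firings that fell out of the sliding window, then count this one.
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()
        self.recent.append(now)

        # The y-th firing is still sent, then suppression starts.
        if len(self.recent) >= self.max_alerts:
            self.suppressed_until = now + self.suppress_s
            self.recent.clear()
        return True
```

With `FlapSuppressor(max_alerts=3, window_s=600, suppress_s=900, reset_s=3600)`, the first three firings inside 10 minutes would notify, and further firings would stay quiet for the next 15 minutes.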

Alert Templating

Ideally the alerts would all be configurable with template variable values.

An example Alert Open email could look like this:

Subject: Scout {{app_name}} alert: {{alert_name}} is {{alert_value}} for {{alert_duration}}

Body:

App: {{app_name}} / {{link}}
Current value: {{alert_value}} over {{alert_duration}}
Threshold: {{alert_threshold}} over {{alert_duration}}
Start time: {{start_time}}

{{image:30m:include_throughput}}

This would render out to be:

Subject: Scout My Example App alert: response_time_average is 127ms for 5m

Body:

App: My Example App / https://scoutapm.com/apps/99999999?time=e:2019-05-30T16:50:00.000Z,d:30-min&p=response_time
Current value: 127ms over 5m
Threshold: >= 100ms over 5m
Start time: 2019-05-30T16:45:00.000Z

<attached/embedded image of the graph you'd find here: 
https://scoutapm.com/apps/99999999?time=e:2019-05-30T16:50:00.000Z,d:30-min&p=response_time&s=throughput>
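
For illustration, the substitution step could be as small as the Python sketch below. The `render` function and its behaviour are hypothetical. It only fills simple `{{name}}` variables and leaves anything it does not recognise (such as the `{{image:...}}` directive, which would need separate handling) untouched.

```python
import re

# Matches simple variables like {{app_name}}; directives containing
# colons or digits (e.g. {{image:30m:include_throughput}}) do not match.
PLACEHOLDER = re.compile(r"\{\{\s*([a-z_]+)\s*\}\}")


def render(template, values):
    """Replace each {{name}} with values[name]; keep unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)


subject = "Scout {{app_name}} alert: {{alert_name}} is {{alert_value}} for {{alert_duration}}"
values = {
    "app_name": "My Example App",
    "alert_name": "response_time_average",
    "alert_value": "127ms",
    "alert_duration": "5m",
}
print(render(subject, values))
```

Leaving unknown placeholders in place rather than raising is a design choice made for this sketch; a real implementation might prefer to reject a template with unknown variables when it is saved.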

Webhooks

  • Allow webhook payloads to be built like the templates above, OR
  • Allow some basic integrations (Slack/PagerDuty) that don't require a third-party service
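
As a sketch of the first option, a webhook payload template could be a JSON-like structure whose string values use the same `{{name}}` variables. The helper names below are hypothetical, and the single `text` field is just an example shape (a simple chat-style message), not a format Scout defines.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{\s*([a-z_]+)\s*\}\}")


def fill(node, values):
    """Recursively substitute {{name}} variables in every string value."""
    if isinstance(node, str):
        return PLACEHOLDER.sub(
            lambda m: str(values.get(m.group(1), m.group(0))), node
        )
    if isinstance(node, dict):
        return {key: fill(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [fill(child, values) for child in node]
    return node  # numbers, booleans and None pass through unchanged


def build_payload(template, values):
    """Return the JSON body that would be POSTed to the webhook URL."""
    return json.dumps(fill(template, values))


payload_template = {
    "text": "Scout {{app_name}} alert: {{alert_name}} is {{alert_value}} for {{alert_duration}}"
}
alert = {
    "app_name": "My Example App",
    "alert_name": "response_time_average",
    "alert_value": "127ms",
    "alert_duration": "5m",
}
body = build_payload(payload_template, alert)
```

Filling the structure before serialising means values containing quotes are escaped by `json.dumps` and cannot break the payload.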

Emails

Assuming templating is a longer-term or unrealistic goal, some smaller improvements could still be made to the emails.

Alert open emails

  • What the current value is
    • e.g. if the alert is configured for >=100ms response time over a 5-minute period, what was the actual average response time over the 5 minutes that triggered the alert?
  • Some other useful data: What were the previous 30 minute/24 hour averages?
  • An image of the graph from the period that triggered the alert
  • The link should take you to the x minutes leading up to the start of the alert including the period that triggered the email

Alert closed emails

  • What was the low/average/peak value while we were over the threshold?
  • What's the value now?
  • An image of the graph over the period the alert was open (if it was more than x minutes just provide the first or last x minutes)
  • The link should take you to the period where the alert started/ended. If it was longer than x minutes, default to a 30 minute period starting 10 minutes prior to the alert.
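
A rough Python sketch of the closed-email numbers and the link window described above. The function names are hypothetical, and treating x as a 30-minute default is my assumption.

```python
from datetime import datetime, timedelta
from statistics import mean


def summarize(samples):
    """Low/average/peak of the metric values recorded while over threshold."""
    return {"low": min(samples), "average": mean(samples), "peak": max(samples)}


def link_window(opened_at, closed_at, max_minutes=30):
    """Time range the email link should open on.

    Short alerts: the exact open-to-close period.
    Long alerts: a 30-minute period starting 10 minutes before the alert opened.
    """
    if closed_at - opened_at <= timedelta(minutes=max_minutes):
        return opened_at, closed_at
    window_start = opened_at - timedelta(minutes=10)
    return window_start, window_start + timedelta(minutes=30)


opened = datetime(2019, 5, 30, 16, 45)
short = link_window(opened, opened + timedelta(minutes=5))
long_running = link_window(opened, opened + timedelta(hours=2))
stats = summarize([110, 130, 150])
```

For the two-hour alert in this example, the link would cover 16:35 to 17:05, so the reader sees the lead-up as well as the start of the alert.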

bohnjamin • May 30 '19 17:05

Now that Zapier's webhooks are part of a premium plan, I think you should put more effort into allowing integrations with services like Slack. The suggestions in this issue would be a great addition.

Relates to #13

lenart • Nov 26 '20 10:11