Alerting improvements
Some alerting ideas. I can break these out into separate suggestions if need be, but I felt the webhooks/email sections would mostly become irrelevant if building alert templates were a possibility.
Flap detection
Not enabled by default; opt-in only.
- Allow the system to detect that an alert keeps firing and suppress alerts for x minutes after y alerts in z minutes (all configurable) have triggered, with configurable reset parameters
- e.g. if the alert hasn't fired for 1 hour reset the suppression detection
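To make the x/y/z idea above concrete, here is a minimal sketch of how such a detector could behave. It is only an illustration, not how Scout works: the class name `FlapDetector`, its parameters, and the choice to clear the firing history once suppression starts are all assumptions of mine. Times are plain seconds so the logic is easy to follow.

```python
from collections import deque


class FlapDetector:
    """Hypothetical opt-in flap detector (all names are illustrative)."""

    def __init__(self, max_alerts=3, window_s=600, suppress_s=1800, reset_s=3600):
        self.max_alerts = max_alerts    # y: alerts allowed inside the window
        self.window_s = window_s        # z: length of the counting window
        self.suppress_s = suppress_s    # x: how long to suppress afterwards
        self.reset_s = reset_s          # quiet time that resets everything
        self._recent = deque()          # timestamps of delivered alerts
        self._suppressed_until = float("-inf")
        self._last_fired = float("-inf")

    def should_notify(self, now):
        """Return True if an alert firing at time `now` should be delivered."""
        # The alert has been quiet long enough: forget history and suppression.
        if now - self._last_fired >= self.reset_s:
            self._recent.clear()
            self._suppressed_until = float("-inf")
        self._last_fired = now

        # Still inside a suppression period: swallow the alert.
        if now < self._suppressed_until:
            return False

        # Drop deliveries that have fallen out of the counting window.
        while self._recent and now - self._recent[0] > self.window_s:
            self._recent.popleft()
        self._recent.append(now)

        # The y-th alert in the window is still delivered, then suppression starts.
        if len(self._recent) >= self.max_alerts:
            self._suppressed_until = now + self.suppress_s
            self._recent.clear()
        return True


detector = FlapDetector(max_alerts=3, window_s=600, suppress_s=1800, reset_s=3600)
decisions = [detector.should_notify(t) for t in (0, 60, 120, 180, 1000, 2000)]
```

With these settings the first three firings are delivered, the firings at 180s and 1000s are suppressed, and the firing at 2000s is delivered again because the 30 minute suppression has expired.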
Alert Templating
Ideally the alerts would all be configurable with template variable values.
An example Alert Open email could look like this:
Subject: Scout {{app_name}} alert: {{alert_name}} is {{alert_value}} for {{alert_duration}}
Body:
App: {{app_name}} / {{link}}
Current value: {{alert_value}} over {{alert_duration}}
Threshold: {{alert_threshold}} over {{alert_duration}}
Start time: {{start_time}}
{{image:30m:include_throughput}}
This would render out to be:
Subject: Scout My Example App alert: response_time_average is 127ms for 5m
Body:
App: My Example App / https://scoutapm.com/apps/99999999?time=e:2019-05-30T16:50:00.000Z,d:30-min&p=response_time
Current value: 127ms over 5m
Threshold: >= 100ms over 5m
Start time: 2019-05-30T16:45:00.000Z
<attached/embedded image of the graph you'd find here:
https://scoutapm.com/apps/99999999?time=e:2019-05-30T16:50:00.000Z,d:30-min&p=response_time&s=throughput>
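A rough sketch of the substitution step, assuming simple `{{name}}` placeholders. The `render` function and the `values` dict are hypothetical names for illustration; placeholders the renderer does not know (such as the `{{image:...}}` directive) are left untouched here, since embedding a graph would need separate handling.

```python
import re

# Matches {{ name }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def render(template, values):
    """Replace known {{placeholders}}; leave unknown ones as written."""
    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


# Values taken from the rendered example above.
values = {
    "app_name": "My Example App",
    "alert_name": "response_time_average",
    "alert_value": "127ms",
    "alert_duration": "5m",
    "alert_threshold": ">= 100ms",
    "start_time": "2019-05-30T16:45:00.000Z",
}

subject = render(
    "Scout {{app_name}} alert: {{alert_name}} is {{alert_value}} for {{alert_duration}}",
    values,
)
print(subject)
```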
Webhooks
- Allow webhooks and their payloads to be built like the templates above OR
- Allow some basic integrations (Slack/PagerDuty) that don't require a third-party service
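As an illustration of the first option, a templated webhook could produce something like the payload below. Slack incoming webhooks accept a JSON body with a `text` field; everything else here (the `build_slack_payload` name and the keys of the `alert` dict) is a hypothetical sketch, and the code only builds the body rather than sending it.

```python
import json


def build_slack_payload(alert):
    """Build the JSON body for a Slack incoming webhook (illustrative only)."""
    text = (
        f"Scout {alert['app_name']} alert: {alert['alert_name']} is "
        f"{alert['alert_value']} for {alert['alert_duration']}\n"
        f"{alert['link']}"
    )
    return json.dumps({"text": text})


# Hypothetical alert data, reusing the values from the email example.
alert = {
    "app_name": "My Example App",
    "alert_name": "response_time_average",
    "alert_value": "127ms",
    "alert_duration": "5m",
    "link": "https://scoutapm.com/apps/99999999",
}
payload = build_slack_payload(alert)
```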
Emails
Assuming templating is a longer-term or unrealistic goal, some other minor improvements could be made to emails.
Alert open emails
- What the current value is
- e.g. if the alert is configured for >=100ms response time over a 5 minute period, what's the actual response time average over the past 5 minutes that triggered the alert
- Some other useful data: What were the previous 30 minute/24 hour averages?
- An image of the graph from the period that triggered the alert
- The link should take you to the x minutes leading up to the start of the alert including the period that triggered the email
Alert closed emails
- What was the low/average/peak value while we were over the threshold?
- What's the value now?
- An image of the graph over the period the alert was open (if it was more than x minutes just provide the first or last x minutes)
- The link should take you to the period where the alert started/ended. If it was longer than x minutes, default to a 30 minute period starting 10 minutes prior to the alert.
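The closed-email details above could be computed along these lines. This is a sketch under my own assumptions: `summarize_open_period` and `link_window` are hypothetical helpers, the samples are the metric values recorded while the alert was open, and x defaults to 30 minutes.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def summarize_open_period(samples):
    """Low/average/peak of the values recorded while the alert was open."""
    return {"low": min(samples), "average": mean(samples), "peak": max(samples)}


def link_window(started, ended, max_minutes=30, lead_minutes=10):
    """Pick the time range the email link should open.

    Short alerts show the whole open period. Longer ones fall back to a
    fixed-length window starting `lead_minutes` before the alert opened.
    """
    if ended - started <= timedelta(minutes=max_minutes):
        return started, ended
    window_start = started - timedelta(minutes=lead_minutes)
    return window_start, window_start + timedelta(minutes=max_minutes)


# Hypothetical data: response times in ms during a two hour alert.
summary = summarize_open_period([120, 140, 130])
started = datetime(2019, 5, 30, 16, 45, tzinfo=timezone.utc)
ended = started + timedelta(hours=2)
window = link_window(started, ended)
```

Here `window` runs from 16:35 to 17:05 UTC, i.e. a 30 minute period starting 10 minutes before the alert opened.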
Now that Zapier's webhooks are part of a premium plan, I think you should put more effort into allowing integrations with services like Slack. The suggestions in this issue would be a great addition.
Relates to #13