arcade
Production - [Alerting] Helix API Average Response Time
:broken_heart: Metric state changed to alerting
Helix API Average Response Time is high!
- Server response time 5642.452180768239

@dotnet/dnceng, please investigate
Automation information below, do not change
Grafana-Automated-Alert-Id-24cae10d9eca44079e7cf3d47f148497
:broken_heart: Metric state changed to alerting
Helix API Average Response Time is high!
- Server response time 8168.334504762746

The alert auto-resolved almost immediately. Looking at the graph, I noticed that we regularly have a spike at 6 AM PST, which this issue captured both times. I remember @garath mentioning that it might be time to reevaluate the trigger for the alert.
Would it make sense for this to be a median and/or percentile alert rather than an average, maybe with multiple thresholds? Something like 90% of requests should be under 250 ms, and 99% must be under 30 seconds. I have a feeling that this "average" response time graph is getting skewed a lot by single long-running requests.
Yeah, I think a set of percentiles is a good solution.
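For illustration, here is a rough sketch of how a percentile check would behave compared to the average. The 250 ms and 30 second limits are the ones floated above; the sample data, the function names, and the nearest-rank percentile method are assumptions made up for this example, not what Grafana or the Helix API actually does.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_thresholds(samples_ms, p90_limit_ms=250, p99_limit_ms=30_000):
    """Hypothetical alert rule: fire if p90 or p99 exceeds its limit."""
    p90 = percentile(samples_ms, 90)
    p99 = percentile(samples_ms, 99)
    return {
        "mean": statistics.mean(samples_ms),
        "p90": p90,
        "p99": p99,
        "alerting": p90 > p90_limit_ms or p99 > p99_limit_ms,
    }


# Made-up data: 99 fast requests (100 ms) and one very long-running one.
samples = [100] * 99 + [500_000]
result = check_thresholds(samples)
print(result)
```

With this made-up data the single long request pulls the mean to roughly 5.1 seconds, while p90 and p99 both stay at 100 ms, so the percentile rule would not fire.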

