arcade Production - [Alerting] Helix API Average Response Time

:broken_heart: Metric state changed to alerting

Helix API Average Response Time is high!

Server response time 5642.452180768239

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-24cae10d9eca44079e7cf3d47f148497

Sep 23 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Helix API Average Response Time is high!

Metric Graph

Go to rule

Sep 23 '22 14:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

Helix API Average Response Time is high!

Server response time 8168.334504762746

Metric Graph

Go to rule

Sep 25 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Helix API Average Response Time is high!

Metric Graph

Go to rule

Sep 25 '22 14:09 dotnet-eng-status[bot]

The alert auto-resolved almost immediately. After looking at the graph, I noticed that we regularly have a spike at 6 AM PST, which this issue captured both times. I remember @garath mentioning that it might be time to reevaluate the trigger for the alert.

Sep 27 '22 20:09 ulisesh

Would it make sense for this to be a median and/or percentile alert rather than an average, maybe with multiple thresholds? Something like: 90% of requests should be under 250 ms, and 99% must be under 30 seconds. I have a feeling that this "average" response time graph is getting skewed a lot by single long-running requests.

Sep 27 '22 22:09 alexperovich

Yeah, I think a set of percentiles is a good solution.

Sep 27 '22 23:09 garath
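
For anyone picking this up, here is a minimal sketch of the percentile idea discussed in this thread. It is not how the Grafana rule is defined. The sample data is made up, the function names are hypothetical, and the two thresholds are simply the example numbers floated above (90% under 250 ms, 99% under 30 seconds). It only illustrates why a single long-running request can push the average past an alert line while the percentiles stay flat.

```python
import math

# Hypothetical thresholds from the suggestion in this thread:
# (percentile, limit in milliseconds). Not the real alert configuration.
THRESHOLDS_MS = [(90, 250), (99, 30_000)]


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Multiply before dividing so integer inputs do not pick up float error.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(samples_ms, thresholds=THRESHOLDS_MS):
    """Return a list of (pct, observed_ms, limit_ms) for every threshold
    whose observed percentile exceeds its limit."""
    violations = []
    for pct, limit in thresholds:
        observed = percentile(samples_ms, pct)
        if observed > limit:
            violations.append((pct, observed, limit))
    return violations


# Made-up data: 99 requests at 100 ms plus one request that took 10 minutes.
samples = [100] * 99 + [600_000]
average = sum(samples) / len(samples)

print("average ms:", average)
print("p90 ms:", percentile(samples, 90))
print("p99 ms:", percentile(samples, 99))
print("violations:", evaluate(samples))
```

With this made-up data the average lands in the thousands of milliseconds even though 99 of the 100 requests took 100 ms, so an average-based rule could fire while neither percentile threshold is violated. A real rule would compute the percentiles in the metrics backend rather than in application code.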