Production - [Alerting] Helix API Average Response Time

Open dotnet-eng-status[bot] opened this issue 3 years ago • 1 comments

:broken_heart: Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 5642.452180768239

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-24cae10d9eca44079e7cf3d47f148497

dotnet-eng-status[bot] avatar Sep 23 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Helix API Average Response Time is high!

Metric Graph

Go to rule

dotnet-eng-status[bot] avatar Sep 23 '22 14:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 8168.334504762746

Metric Graph

Go to rule

dotnet-eng-status[bot] avatar Sep 25 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Helix API Average Response Time is high!

Metric Graph

Go to rule

dotnet-eng-status[bot] avatar Sep 25 '22 14:09 dotnet-eng-status[bot]

The alert auto-resolved almost immediately. After looking at the graph, I noticed that we regularly have a spike at 6 AM PST, which we captured both times in this issue. I remember @garath mentioning that it might be time to reevaluate the trigger for the alert.

ulisesh avatar Sep 27 '22 20:09 ulisesh

Would it make sense for this to be a median and/or percentile alert rather than an average, maybe with multiple thresholds? Something like: 90% of requests should be under 250 ms, and 99% must be under 30 seconds. I have a feeling that this "average" response time graph is getting skewed a lot by single long-running requests.

alexperovich avatar Sep 27 '22 22:09 alexperovich
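
To make the skew concern concrete, here is a minimal sketch in Python. It is not the Helix alert rule or a Grafana query: the sample data is made up, the helper names (`percentile`, `breached`) are hypothetical, and the only values taken from the thread are the suggested thresholds (p90 under 250 ms, p99 under 30 seconds). It uses a simple nearest-rank percentile, which may differ from what the monitoring backend computes.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Thresholds in milliseconds, taken from the suggestion in the thread.
THRESHOLDS_MS = {90: 250, 99: 30_000}


def breached(samples_ms, thresholds=THRESHOLDS_MS):
    """Return {percentile: observed value} for every percentile above its threshold."""
    result = {}
    for pct, limit in thresholds.items():
        observed = percentile(samples_ms, pct)
        if observed > limit:
            result[pct] = observed
    return result


# Made-up window 1: 99 fast requests plus a single 10-minute request.
one_outlier = [100] * 99 + [600_000]
print(mean(one_outlier))      # the single slow request drags the average up
print(breached(one_outlier))  # p90 and p99 are both still 100 ms

# Made-up window 2: no extreme outlier, but 15% of requests are slow.
broad_slowdown = [100] * 85 + [400] * 15
print(mean(broad_slowdown))      # the average looks healthy
print(breached(broad_slowdown))  # p90 is 400 ms, over the 250 ms limit
```

In the first window the average is pushed into the thousands of milliseconds by one request while neither percentile moves. In the second window the average stays low even though more than 10% of requests are over the 250 ms target. That is the behavior a set of percentile thresholds is meant to capture.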

Yeah, I think a set of percentiles is a good solution.

garath avatar Sep 27 '22 23:09 garath