Staging - [Alerting] Queue Insights Failures alert

Open dotnet-eng-status-staging[bot] opened this issue 3 years ago • 9 comments

:broken_heart: Metric state changed to alerting

Queue Insights has thrown an unhandled exception and failed to generate its check. This could be caused by invalid data in the Matrix of Truth, or some other component failing.

Wiki Page: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki?wikiVersion=GBwikiMaster&pagePath=/FR%20Operations/Wiki%20for%20Grafana%20Alerts/%5BAlerts%5D%20Queue%20Insights&pageId=956&_a=edit

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-992309c92835448d815d22588ee67d0c

:green_heart: Metric state changed to ok

Queue Insights has thrown an unhandled exception and failed to generate its check. This could be caused by invalid data in the Matrix of Truth, or some other component failing.

Wiki Page: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki?wikiVersion=GBwikiMaster&pagePath=/FR%20Operations/Wiki%20for%20Grafana%20Alerts/%5BAlerts%5D%20Queue%20Insights&pageId=956&_a=edit

Go to rule

These seem like one-off Kusto failures. I think we'll need to just make the alert less sensitive.

melotic, Jul 29 '22 16:07
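
As an illustration of what "less sensitive" could mean here, the following is a minimal sketch that reports the alerting state only after several failed evaluations in a row, so a single sporadic failure does not flip the alert. The class name `ConsecutiveFailureGate` and the threshold of 3 are hypothetical choices for this example; they are not taken from the actual alert rule.

```python
class ConsecutiveFailureGate:
    """Report 'alerting' only after `threshold` failed evaluations in a row."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold  # hypothetical value, tune per alert
        self.streak = 0  # current run of consecutive failures

    def record(self, failed):
        """Record one evaluation; return True if the alert should fire."""
        # Any successful evaluation resets the streak to zero.
        self.streak = self.streak + 1 if failed else 0
        return self.streak >= self.threshold


gate = ConsecutiveFailureGate(threshold=3)
# One isolated failure followed by a success never reaches the threshold;
# only the third failure in a row would fire.
states = [gate.record(f) for f in [True, False, True, True, True]]
print(states)
```

The trade-off of this kind of gate is slower detection of real outages in exchange for fewer alerts caused by transient errors.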

> These seem like one-off Kusto failures. I think we'll need to just make the alert less sensitive.

What is the evidence? (Just curious about Kusto errors)

garath, Jul 29 '22 18:07

:broken_heart: Metric state changed to alerting

Queue Insights has thrown an unhandled exception and failed to generate its check. This could be caused by invalid data in the Matrix of Truth, or some other component failing.

Wiki Page: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki?wikiVersion=GBwikiMaster&pagePath=/FR%20Operations/Wiki%20for%20Grafana%20Alerts/%5BAlerts%5D%20Queue%20Insights&pageId=956&_a=edit

Go to rule

:green_heart: Metric state changed to ok

Queue Insights has thrown an unhandled exception and failed to generate its check. This could be caused by invalid data in the Matrix of Truth, or some other component failing.

Wiki Page: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki?wikiVersion=GBwikiMaster&pagePath=/FR%20Operations/Wiki%20for%20Grafana%20Alerts/%5BAlerts%5D%20Queue%20Insights&pageId=956&_a=edit

Go to rule

Assigned to @melotic as this is his recently created alert...

MattGal, Aug 01 '22 16:08

> > These seem like one-off Kusto failures. I think we'll need to just make the alert less sensitive.
>
> What is the evidence? (Just curious about Kusto errors)

See this AI query

Kusto client failed to send a request to the service: The response ended prematurely.. 
Please provide the following information when contacting the Kusto team @ https://aka.ms/kustosupport :
DataSource='https://engsrvprod.kusto.windows.net/v1/rest/query',
DatabaseName='engineeringdata',
ClientRequestId='KD2RunQuery;5fad2bd4-5faf-4e45-ad3b-d7f1c862fd2b',
Timestamp='2022-08-01T11:05:28.6136850Z'.

I'm not sure exactly what this error means. It happens sporadically.

melotic, Aug 01 '22 16:08

Consider whether there is a reasonable way to avoid logging the exception in the first place.

Does the client code in Build Analysis catch or retry these events? If it doesn't, maybe it should.

garath, Aug 01 '22 17:08

PR is out to retry these Kusto exceptions: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/24754

melotic, Aug 11 '22 15:08
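
For readers unfamiliar with the pattern, retrying a sporadic failure like the one above can be sketched roughly as follows. This is not the code from the linked PR: `TransientQueryError`, `run_with_retry`, and all parameter values are hypothetical names and defaults for this example, and the real service would catch whatever exception type its Kusto client actually raises.

```python
import random
import time


class TransientQueryError(Exception):
    """Hypothetical stand-in for a transient client failure."""


def run_with_retry(query, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call query(); retry on TransientQueryError with exponential backoff.

    `sleep` is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except TransientQueryError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see (and log) the failure.
                raise
            # Delay doubles each attempt; jitter spreads out retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

With a wrapper like this, an exception is surfaced only when every attempt fails, so a single dropped response no longer trips the alert.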

PR has been merged to staging. Closing alert.

ilyas1974, Aug 15 '22 14:08