Production - [Alerting] Apple device failure rate alert

Open dotnet-eng-status[bot] opened this issue 3 years ago • 28 comments

:broken_heart: Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGMAC063.local} 83
  • FailureRate {Machine=DNCENGMAC074.local} 100
  • FailureRate {Machine=DNCENGMAC111.local} 85

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-d70761f3c7e84a6380e44943a2e583e6

dotnet-eng-status[bot] avatar Sep 12 '22 07:09 dotnet-eng-status[bot]

The Apple TV at DNCENGMAC061 stopped working. @akoeplinger, what was it you did? Did you try to reboot it? Maybe it didn't come back? I see that one as the flakiest:

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.iphone.open&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1015.amd64.iphone.perf&var-queue=osx.1100.arm64.appletv.open

premun avatar Sep 12 '22 07:09 premun

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] avatar Sep 12 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC069.local} 85

dotnet-eng-status[bot] avatar Sep 12 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] avatar Sep 12 '22 19:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 12 '22 22:09 dotnet-eng-status[bot]

@premun What should we do with this alert? It is noisy, and each time it triggers again it's different machines. From what I can see, by the time an ICM to reimage gets done, the machine would already have self-healed.

alexperovich avatar Sep 13 '22 00:09 alexperovich
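
One way to check whether the same machines keep recurring is to tally the machine names across the bot's alert comments. The following is a minimal Python sketch, assuming each alert line has the form `FailureRate {Machine=<name>} <rate>` as in the comments above. The helper names `parse_alert` and `machine_counts` are hypothetical and not part of any dnceng tooling.

```python
import re
from collections import Counter

# Matches alert lines such as "FailureRate {Machine=DNCENGMAC063.local} 83".
# The line format is assumed from the bot comments in this thread.
ALERT_LINE = re.compile(
    r"FailureRate \{Machine=(?P<machine>[^}]+)\}\s+(?P<rate>\d+)"
)


def parse_alert(comment: str) -> dict[str, int]:
    """Return {machine name: failure rate} for one alert comment."""
    return {
        m["machine"]: int(m["rate"]) for m in ALERT_LINE.finditer(comment)
    }


def machine_counts(comments: list[str]) -> Counter:
    """Count in how many alert comments each machine appears."""
    counts = Counter()
    for comment in comments:
        counts.update(parse_alert(comment).keys())
    return counts


# Machine lists copied from the first three alerting comments in this thread.
comments = [
    "FailureRate {Machine=DNCENGMAC063.local} 83\n"
    "FailureRate {Machine=DNCENGMAC074.local} 100\n"
    "FailureRate {Machine=DNCENGMAC111.local} 85",
    "FailureRate {Machine=DNCENGMAC069.local} 85",
    "FailureRate {Machine=DNCENGMAC091.local} 100",
]
counts = machine_counts(comments)
for machine, seen in counts.most_common():
    print(machine, seen)
```

A machine that shows up in many consecutive alerts is a candidate for a reimage ticket, while machines that appear once and then disappear fit the self-healing pattern described above.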

This alert hasn't fired once in 9 months as Apple devices were really stable, so I think there might be something going on with the queue if it started triggering now.

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1100.arm64.appletv.open

It might, though, be connected to some TCP issues we've seen lately. I will check something and come back.

premun avatar Sep 13 '22 08:09 premun

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 13 '22 10:09 dotnet-eng-status[bot]

Okay, DNCENGMAC091 is stuck on a setup screen, as per https://github.com/dotnet/arcade/issues/10664

I disabled it.

premun avatar Sep 13 '22 11:09 premun

The others are possibly https://github.com/dotnet/runtime/issues/75307, but that is very much an edge-case scenario; it would explain why the alert keeps flipping.

premun avatar Sep 13 '22 11:09 premun

IcM issue -> https://portal.microsofticm.com/imp/v3/incidents/details/334627306/home

oleksandr-didyk avatar Sep 13 '22 11:09 oleksandr-didyk

@premun I only rebooted the tvOS device attached to DNCENGMAC091 (shutdown didn't work, and the device showed up on other Macs, probably because it had Wi-Fi enabled). I didn't do anything with DNCENGMAC061.

akoeplinger avatar Sep 13 '22 12:09 akoeplinger

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 13 '22 22:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 14 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 14 '22 22:09 dotnet-eng-status[bot]

This is waiting on an MLS ticket.

alexperovich avatar Sep 15 '22 00:09 alexperovich

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 15 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 15 '22 22:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] avatar Sep 16 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 19 '22 09:09 dotnet-eng-status[bot]

The machine was rebooted last week but hasn't received any work since. We need to wait for it to get some work items to see whether it's fixed now.

dkurepa avatar Sep 19 '22 09:09 dkurepa

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 19 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 20 '22 09:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 20 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 21 '22 09:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 21 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

  • FailureRate {Machine=DNCENGMAC091.local} 100

dotnet-eng-status[bot] avatar Sep 22 '22 09:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

dotnet-eng-status[bot] avatar Sep 22 '22 21:09 dotnet-eng-status[bot]

The ICM has been resolved. The system is back online.

ilyas1974 avatar Sep 26 '22 20:09 ilyas1974