arcade Production - [Alerting] Apple device failure rate alert

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC063.local} 83
FailureRate {Machine=DNCENGMAC074.local} 100
FailureRate {Machine=DNCENGMAC111.local} 85

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-d70761f3c7e84a6380e44943a2e583e6

Sep 12 '22 07:09 dotnet-eng-status[bot]

The AppleTV at DNCENGMAC061 stopped working. @akoeplinger, what was it you did? Did you try to reboot it? Maybe it didn't come back up? I see that one as the flakiest:

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.iphone.open&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1015.amd64.iphone.perf&var-queue=osx.1100.arm64.appletv.open

Sep 12 '22 07:09 premun

:green_heart: Metric state changed to ok

Go to rule

Sep 12 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC069.local} 85

Go to rule

Sep 12 '22 14:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Go to rule

Sep 12 '22 19:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 12 '22 22:09 dotnet-eng-status[bot]

@premun What should we do with this alert? It is noisy, and each time it triggers again it's on different machines. From what I can see, by the time an IcM to reimage gets done, the machine would already have self-healed.

Sep 13 '22 00:09 alexperovich

This alert hasn't fired once in 9 months as Apple devices were really stable, so I think there might be something going on with the queue if it started triggering now.

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1100.arm64.appletv.open

It might, though, be connected to some TCP issues we've seen lately. I will check something and come back.

Sep 13 '22 08:09 premun

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 13 '22 10:09 dotnet-eng-status[bot]

Okay, DNCENGMAC091 is stuck on a setup screen, as per https://github.com/dotnet/arcade/issues/10664

I disabled it.

Sep 13 '22 11:09 premun

The others are possibly caused by https://github.com/dotnet/runtime/issues/75307. That is very much an edge-case scenario, which would explain why the alert keeps flipping.

Sep 13 '22 11:09 premun

IcM issue -> https://portal.microsofticm.com/imp/v3/incidents/details/334627306/home

Sep 13 '22 11:09 oleksandr-didyk

@premun I only rebooted the tvOS device attached to DNCENGMAC091 (shutdown didn't work, and the device showed up on other Macs, probably due to having Wi-Fi enabled). I didn't do anything with DNCENGMAC061.

Sep 13 '22 12:09 akoeplinger

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 13 '22 22:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 14 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 14 '22 22:09 dotnet-eng-status[bot]

This is waiting on an MLS ticket.

Sep 15 '22 00:09 alexperovich

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 15 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 15 '22 22:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Go to rule

Sep 16 '22 10:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 19 '22 09:09 dotnet-eng-status[bot]

The machine was rebooted last week but hasn't received any work since. We need to wait for it to get some work items to see whether it's fixed now.

Sep 19 '22 09:09 dkurepa

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 19 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 20 '22 09:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 20 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 21 '22 09:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 21 '22 21:09 dotnet-eng-status[bot]

:broken_heart: Metric state changed to alerting

FailureRate {Machine=DNCENGMAC091.local} 100

Go to rule

Sep 22 '22 09:09 dotnet-eng-status[bot]

:green_heart: Metric state changed to ok

Go to rule

Sep 22 '22 21:09 dotnet-eng-status[bot]

The IcM has been resolved. The system is back online.

Sep 26 '22 20:09 ilyas1974