Production - [Alerting] Apple device failure rate alert
:broken_heart: Metric state changed to alerting
Description and instructions for this alert
Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.
- FailureRate {Machine=DNCENGMAC063.local} 83
- FailureRate {Machine=DNCENGMAC074.local} 100
- FailureRate {Machine=DNCENGMAC111.local} 85
@dotnet/dnceng, please investigate
Automation information below, do not change
Grafana-Automated-Alert-Id-d70761f3c7e84a6380e44943a2e583e6
The Apple TV at DNCENGMAC061 stopped working. @akoeplinger, what did you do with it? Did you try to reboot it? Maybe it didn't come back. I see that one as the flakiest:
https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.iphone.open&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1015.amd64.iphone.perf&var-queue=osx.1100.arm64.appletv.open
:green_heart: Metric state changed to ok
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC069.local} 85
:green_heart: Metric state changed to ok
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
@premun What should we do with this alert? It is noisy, and each time it triggers again it's for different machines. From what I can see, by the time an IcM to reimage gets done, the machine would already have self-healed.
This alert hasn't fired once in 9 months as Apple devices were really stable, so I think there might be something going on with the queue if it started triggering now.
https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/mobileDevices/mobile-devices?orgId=1&from=now-3d&to=now&var-mobile_platform=apple&var-queue=osx.1015.amd64.appletv.open&var-queue=osx.1100.amd64.appletv.open&var-queue=osx.1100.arm64.appletv.open
It might, though, be connected to some TCP issues we've seen lately. I will check something and come back.
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
Okay, DNCENGMAC091 is stuck in a setup screen as per https://github.com/dotnet/arcade/issues/10664
I disabled it
The others are possibly https://github.com/dotnet/runtime/issues/75307, but that is very much an edge-case scenario, which would explain why the alert keeps flipping.
IcM issue -> https://portal.microsofticm.com/imp/v3/incidents/details/334627306/home
@premun I only rebooted the tvOS device attached to DNCENGMAC091 (shutdown didn't work, and the device showed up on other Macs, probably because it had Wi-Fi enabled). I didn't do anything with DNCENGMAC061.
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
This is waiting on an MLS ticket.
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:green_heart: Metric state changed to ok
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
The machine was rebooted last week but hasn't received any work since. We need to wait for it to pick up some work items to see whether it's fixed now.
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:broken_heart: Metric state changed to alerting
- FailureRate {Machine=DNCENGMAC091.local} 100
:green_heart: Metric state changed to ok
The IcM has been resolved. The system is back online.