ARC runners do not recover once an ephemeral runner's status is Failed
Describe the bug To conduct a POC, we are deploying ARC with a minimum of 2 runners.
However, after a few days, when a runner's state changes to "Failed" (as shown below), ARC no longer maintains the minimum runner count.
We tried recreating both the controller pod and the listener pod, but the runner count is still 1, not 2.
2025-02-26T15:03:46Z INFO EphemeralRunner Updating ephemeral runner status to Failed {"ephemeralrunner": {"name":"runner-np-medium-cshmg-runner-2lnqd","namespace":"arc-runners"}}
2025-02-26T15:03:46Z INFO EphemeralRunner Removing the runner from the service {"ephemeralrunner": {"name":"runner-np-medium-cshmg-runner-2lnqd","namespace":"arc-runners"}}
2025-02-26T15:03:46Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": {"name":"runner-np-medium-cshmg","namespace":"arc-runners"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
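For anyone trying to spot this condition from controller logs, here is a rough Python sketch that parses the "Ephemeral runner counts" line above and flags when the set has dropped below the configured minimum. The function names and the `min_runners` parameter are hypothetical, and treating `pending + running` as the "active" count is my own assumption, not ARC's definition.

```python
import json

# Sample taken from the controller log quoted in this issue.
SAMPLE_LINE = (
    '2025-02-26T15:03:46Z INFO EphemeralRunnerSet Ephemeral runner counts '
    '{"ephemeralrunnerset": {"name":"runner-np-medium-cshmg","namespace":"arc-runners"}, '
    '"pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}'
)

COUNT_KEYS = ("pending", "running", "finished", "failed", "deleting")


def parse_runner_counts(line):
    """Return the per-state counts from a counts log line, or None if absent."""
    marker = "Ephemeral runner counts"
    idx = line.find(marker)
    if idx == -1:
        return None
    start = line.find("{", idx)
    if start == -1:
        return None
    # raw_decode tolerates trailing text after the JSON object.
    payload, _ = json.JSONDecoder().raw_decode(line[start:])
    return {key: payload[key] for key in COUNT_KEYS}


def below_minimum(counts, min_runners):
    """Assumption: only pending and running runners count toward the minimum."""
    return counts["pending"] + counts["running"] < min_runners


if __name__ == "__main__":
    counts = parse_runner_counts(SAMPLE_LINE)
    print(counts, "below minimum:", below_minimum(counts, min_runners=2))
```

This only reads log output; it does not touch the cluster or change any ARC resources.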
Expected behavior The runner count should always be 2.
Runner Version and Platform
We observed this with several versions, including v2.322.0 and older ones.
OS of the machine running the runner? Linux
Workaround:
To get it working again, we have to uninstall and then reinstall the ARC runner scale set.
Could you please suggest/assist on this?