All runners offline - failed to acquire jobs
### Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
### Controller Version

0.9.2

### Deployment Method

Helm
### Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
### To Reproduce

1. Allow the controller to run as normal
2. Observe the listener pod constantly crashing
3. Observe that the runner set is online in the GitHub UI, but all runners are offline

The timing coincides roughly with the update of ghcr.io/actions/actions-runner from 2.317.0 to 2.318.0, but rolling back to the image built from 2.317.0 did not resolve it. The crash loop can be observed as sketched below.
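A minimal way to watch the listener crash loop, assuming a default Helm install where the listener runs in the arc-systems namespace (the namespace and label selector are assumptions; adjust them to your setup):

```shell
# Namespace and label selector assume a default gha-runner-scale-set
# install; adjust both to match your deployment.
kubectl -n arc-systems get pods
# Logs from the previously crashed listener container:
kubectl -n arc-systems logs -l app.kubernetes.io/component=runner-scale-set-listener --previous
```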
### Describe the bug

Up until yesterday this was working correctly. As of this morning, arc-runner-set-xxxx-listener crashes with the following error:

```shell
2024/07/31 08:15:48 Application returned an error: failed to handle message: failed to acquire jobs: failed to acquire jobs: Post "https://pipelinesghubeus7.actions.githubusercontent.com/WugTYvPOjBYXoVTZqLtkHQNo8dP79zLHH79vzLjE9k8ir38pq6//_apis/runtime/runnerscalesets/10/acquirejobs?api-version=6.0-preview": POST https://pipelinesghubeus7.actions.githubusercontent.com/WugTYvPOjBYXoVTZqLtkHQNo8dP79zLHH79vzLjE9k8ir38pq6//_apis/runtime/runnerscalesets/10/acquirejobs?api-version=6.0-preview giving up after 5 attempt(s)
```

The runner set remains online in GitHub, but all runners are offline.
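Since the failure is the listener's POST to the Actions service giving up after retries, a reachability check from inside the cluster can rule out DNS, proxy, or network-policy problems on our side; a rough sketch (the pod name and curl image are arbitrary choices):

```shell
# Spin up a throwaway curl pod; the curlimages/curl entrypoint is curl,
# so the args after -- are passed straight to curl. A completed TLS
# handshake and any HTTP response mean the endpoint is reachable.
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  -sv https://pipelinesghubeus7.actions.githubusercontent.com/
```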
### Describe the expected behavior

arc-runner-set-XXXX-listener should run without crashing, and runners should show as online in GitHub at https://github.com/organizations/OptAxe/settings/actions/runners.
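The offline status can also be confirmed via the API instead of the UI, e.g. with the GitHub CLI (this assumes a token with sufficient org scope):

```shell
# List the org's self-hosted runners and print each runner's status.
gh api /orgs/OptAxe/actions/runners --jq '.runners[] | {name, status}'
```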
### Additional Context

No values were changed in the controller deployment. The runner deployment has the following values (the template is part of the values file):

```yaml
githubConfigUrl: https://github.com/OurOrg
githubConfigSecret: github-pat
runnerGroup: arc-self-hosted-runners
minRunners: 1
maxRunners: 2
# Template needs to be set to use the latest docker:dind with iptables legacy
# See https://github.com/actions/actions-runner-controller/issues/3159#issuecomment-1906905610
template:
  spec:
    nodeSelector:
      node_pool: github-runners
    initContainers:
      - name: init-dind-externals
        image: image-from-ghcr.io/actions/actions-runner:2.317.0
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    containers:
      - name: runner
        image: image-from-ghcr.io/actions/actions-runner:2.317.0
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        resources:
          requests:
            memory: 5Gi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
          - name: DOCKER_IPTABLES_LEGACY
            value: '1'
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
```
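For completeness, the values above are applied with an ordinary chart upgrade; the release and namespace names here are placeholders for our real ones:

```shell
# Release name and namespace are placeholders; values.yaml contains the
# values shown above.
helm upgrade --install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners --create-namespace \
  -f values.yaml
```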
### Controller Logs

```shell
https://gist.github.com/WTPOptAxe/f57e05eeb0989a968f3b30ab584baada
```

### Runner Pod Logs

```shell
https://gist.github.com/WTPOptAxe/11be8a39ca690877e878cf539327561f
```
Hey, I'm going to close this issue since many improvements have been made and it seems this issue is no longer occurring. Thank you for reporting it. Please let us know if you are still experiencing this on the latest release.