actions-runner-controller
I am seeing runner pods being terminated randomly. The logs are no help
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.4
Helm Chart Version
No response
CertManager Version
No response
Deployment Method
Kustomize
cert-manager installation
Yes,
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support)
- [X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you are using webhook driven scaling)
Resource Definitions
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gh-runner-pvc
  namespace: actions-runner-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  storageClassName: standard
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: abc-k8s-action-runner
  namespace: actions-runner-system
spec:
  template:
    spec:
      organization: abc
      labels: ["abc-k8s-action-runner"]
      imagePullPolicy: IfNotPresent
      tolerations:
        - key: "gh_runner_only"
          operator: "Exists"
          #value: true
          effect: "NoSchedule"
      nodeSelector:
        role: action_runner
      resources:
        requests:
          memory: 2Gi
          cpu: 1
        limits:
          memory: 4Gi
          cpu: 2
      volumes:
        - name: docker
          persistentVolumeClaim:
            claimName: gh-runner-pvc
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: abc-autoscaled-action-runner
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: abc-k8s-action-runner # Name of RunnerDeployment
  minReplicas: 0
  maxReplicas: 6
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - repo-abc
To Reproduce
Run a job that takes more than 1 hour.
Observe that the runner gets terminated without a cancellation or any external event.
Describe the bug
The pod is being shut down, and the controller-manager logs the messages shown under Whole Controller Logs. The job shows this log: docker: Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?.
Describe the expected behavior
The pod should remain up throughout the job. There are some jobs running longer than 1 hour. We started seeing this behaviour recently.
Whole Controller Logs
2023-09-11T09:49:05Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z INFO runnerpod This runner pod seems to have been deleted directly, bypassing the parent Runner resource. Marking the runner for deletion to not let it recreate this pod. {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z INFO runner Removed finalizer {"runner": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:34Z INFO runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
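For anyone trying to read a timeline out of controller logs like the ones above, here is a minimal Python sketch. It assumes each line has the layout "timestamp LEVEL logger message JSON-fields" seen in this report; the names `parse_line` and `seconds_between` are hypothetical helpers for illustration, not part of actions-runner-controller.

```python
import json
import re
from datetime import datetime, timezone

# Assumed layout (taken from the log lines in this report):
# "<UTC timestamp> <LEVEL> <logger> <message> <JSON fields>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"(?P<msg>.*?)\s+(?P<fields>\{.*\})$"
)


def parse_line(line):
    """Return a dict for one controller log line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "ts": ts,
        "level": m["level"],
        "logger": m["logger"],
        "msg": m["msg"],
        "fields": json.loads(m["fields"]),
    }


def seconds_between(lines, first, second):
    """Seconds from the first line whose message contains `first`
    to the first line at or after it whose message contains `second`."""
    events = [e for e in map(parse_line, lines) if e is not None]
    start = next((e for e in events if first in e["msg"]), None)
    if start is None:
        return None
    end = next(
        (e for e in events if second in e["msg"] and e["ts"] >= start["ts"]),
        None,
    )
    if end is None:
        return None
    return (end["ts"] - start["ts"]).total_seconds()


# Two lines copied from the controller logs in this report.
SAMPLE = [
    '2023-09-11T09:49:05Z INFO runnerpod This runner pod seems to have been deleted directly, bypassing the parent Runner resource. Marking the runner for deletion to not let it recreate this pod. {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}',
    '2023-09-11T09:49:34Z INFO runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}',
]

if __name__ == "__main__":
    print(seconds_between(SAMPLE, "deleted directly", "stopped with a successful status"))
```

On the two sample lines this reports a 29 second gap between the "deleted directly" message and the "stopped with a successful status" message. It only orders events the controller already logged; it does not show what deleted the pod.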
Whole Runner Pod Logs
Executing: 77%|███████▋ | 43/56 [04:18<02:02, 9.40s/cell]docker: Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
Additional Context
No response
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I am seeing the same behavior as well. My runner pod keeps restarting after 30+ minutes have passed. I'm using 0.27.3 with cert-manager 1.9.1, deployed via Helm.