actions-runner-controller
RunnerSet does not always re-use `Available` PV
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.0
Helm Chart Version
0.22.0
CertManager Version
No response
Deployment Method
Helm
cert-manager installation
- No, we did not follow https://github.com/actions/actions-runner-controller/blob/master/docs/quickstart.md#prerequisites for installing cert-manager
- We followed https://cert-manager.io/docs/installation/helm/ and installed it from the official source
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contact any of the contributors and maintainers if your business is critical and therefore needs priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runner
  namespace: cicd--ci
spec:
  dockerEnabled: true
  ephemeral: true
  group: Default
  labels:
    - ci
  replicas: 3
  repository: <REPOSITORY>
  selector:
    matchLabels:
      app: ci
  serviceName: gha-runner
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-logs-container: runner
      labels:
        app: ci
    spec:
      containers:
        - name: runner
          env:
            - name: DISABLE_RUNNER_UPDATE
              value: "true"
            - name: RUNNER_ALLOW_RUNASROOT
              value: "1"
            - name: RUNNER_GRACEFUL_STOP_TIMEOUT
              value: "110"
            - name: STARTUP_DELAY_IN_SECONDS
              value: "30"
          resources:
            limits:
              cpu: "1.8"
              memory: 7Gi
            requests:
              cpu: "1.5"
              memory: 6Gi
        - name: docker
          volumeMounts:
            - mountPath: /var/lib/docker
              name: docker
      securityContext:
        fsGroup: 1001
      serviceAccountName: gha-runner
  volumeClaimTemplates:
    - metadata:
        name: docker
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
```
To Reproduce
1. Deploy the RunnerSet and let it work normally
2. `Available` PVs grow quickly over time (we reached 4.5k in 1 month); one way to count them is shown below
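For reference, this is roughly how we count the stranded volumes (a minimal sketch, assuming `kubectl` points at the affected cluster and `jq` is installed):

```bash
# Count PersistentVolumes currently in the Available phase
kubectl get pv -o json \
  | jq '[.items[] | select(.status.phase == "Available")] | length'
```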
Describe the bug
- We noticed `Available` PVs grow quickly over time and reached 4.5k in 1 month. This indicates the RunnerSet is not re-using PVs properly.
- We also noticed some PVs are indeed being re-used, e.g. a Runner created 10m ago is using a PV that is 18d old, but the majority of Runners just spin up new volumes (see the command below for how we cross-checked this).
- We use a custom runner image built on top of `docker.io/summerwind/actions-runner`. The only difference is that we installed a few additional libraries and binaries, e.g. `kubectl`, `helm`, `aws cli`, etc., and we are not using a custom entrypoint.
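To cross-check re-use, something like the following works (a sketch; the column names are just labels we chose):

```bash
# Show each PV's creation time alongside the PVC it is currently bound to.
# An old PV bound to a freshly created runner's claim indicates re-use;
# many young PVs indicate fresh provisioning instead.
kubectl get pv --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CLAIM:.spec.claimRef.name,CREATED:.metadata.creationTimestamp
```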
Describe the expected behavior
As described in the documentation and prior discussions, ARC should maintain a pool of persistent volumes that Runners re-use, instead of provisioning a new volume for most Runners.
Whole Controller Logs
https://gist.github.com/chaosun-abnormalsecurity/4d92b87f3807fcbaa279e1099200d20e
Whole Runner Pod Logs
https://gist.github.com/chaosun-abnormalsecurity/4879c98298f992698ee6824c9a2d4bb6
Additional Context
No response
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I ran into this problem today. It creates a new PV even though an available PV exists. I'm wondering whether it just needs time to unbind from the PV and become available again, and if not, whether this is a bug.
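If I understand the re-use mechanism correctly, a released PV only becomes `Available` (and therefore re-usable) once its claim reference has been cleared, so a delay there could look like the behavior above. One way to check whether PVs are stuck in that intermediate state (assuming `jq` is installed):

```bash
# List PVs in the Released phase together with the claim they still reference;
# these cannot be re-bound until the claimRef is cleared and they turn Available
kubectl get pv -o json \
  | jq -r '.items[] | select(.status.phase == "Released") | "\(.metadata.name)\t\(.spec.claimRef.name // "none")"'
```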
I believe this issue is a duplicate of #2282.