RunnerSet does not always re-use `Available` PV

Open chaosun-abnormalsecurity opened this issue 1 year ago • 3 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

0.27.0

Helm Chart Version

0.22.0

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

  • No we did not follow https://github.com/actions/actions-runner-controller/blob/master/docs/quickstart.md#prerequisites for installing cert-manager
  • We followed https://cert-manager.io/docs/installation/helm/ and installed it from the official source

Checks

  • [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with one of the contributors or maintainers if your business is critical enough that you need priority support)
  • [X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runner
  namespace: cicd--ci
spec:
  dockerEnabled: true
  ephemeral: true
  group: Default
  labels:
  - ci
  replicas: 3
  repository: <REPOSITORY>
  selector:
    matchLabels:
      app: ci
  serviceName: gha-runner
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-logs-container: runner
      labels:
        app: ci
    spec:
      containers:
      - env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_ALLOW_RUNASROOT
          value: "1"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "110"
        - name: STARTUP_DELAY_IN_SECONDS
          value: "30"
        name: runner
        resources:
          limits:
            cpu: "1.8"
            memory: 7Gi
          requests:
            cpu: "1.5"
            memory: 6Gi
      - name: docker
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker
      securityContext:
        fsGroup: 1001
      serviceAccountName: gha-runner
  volumeClaimTemplates:
  - metadata:
      name: docker
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi

To Reproduce

1. Deploy the RunnerSet and let it work normally
2. `Available` PVs accumulate quickly over time (we got 4.5k in 1 month)

Describe the bug

  1. We noticed that Available PVs accumulate quickly over time, reaching 4.5k in 1 month. This indicates that the RunnerSet is not re-using PVs properly.
  2. We also noticed that some PVs are indeed being re-used, e.g. a Runner that was created 10m ago is using a PV that is 18d old. But the majority of Runners just spin up new volumes.
  3. We use a custom runner image built on top of docker.io/summerwind/actions-runner. The only difference is that we installed a few additional libraries and binaries (e.g. kubectl, helm, and the AWS CLI); we are not using a custom entrypoint.
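
For anyone who wants to quantify the same symptom, here is a minimal sketch (not part of ARC) that counts PVs by phase from the parsed JSON of `kubectl get pv -o json`. The function name `summarize_pvs` and the assumption that reusable volumes report a capacity of exactly `200Gi` (matching the volumeClaimTemplates above) are illustrative only; adjust them for your cluster.

```python
from collections import Counter


def summarize_pvs(pv_list, storage_request="200Gi"):
    """Count PVs by phase and list Available PVs of the requested size.

    pv_list is the dict produced by parsing `kubectl get pv -o json`.
    storage_request is an assumption taken from the RunnerSet spec above.
    """
    phases = Counter()
    candidates = []
    for pv in pv_list.get("items", []):
        # status.phase is one of Available, Bound, Released, Failed, Pending
        phase = pv.get("status", {}).get("phase", "Unknown")
        phases[phase] += 1
        capacity = pv.get("spec", {}).get("capacity", {}).get("storage")
        if phase == "Available" and capacity == storage_request:
            # This PV looks like a candidate for re-use by a new runner pod
            candidates.append(pv["metadata"]["name"])
    return phases, candidates
```

To use it, save the `kubectl get pv -o json` output to a file, load it with `json.load`, and pass the result to `summarize_pvs`. A large and growing `Available` count alongside freshly created `Bound` PVs would match the behavior described here.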

Describe the expected behavior

As described in the doc and discussion, ARC should maintain a pool of persistent volumes to be re-used by Runners, instead of provisioning new ones for most of the Runners.

Whole Controller Logs

https://gist.github.com/chaosun-abnormalsecurity/4d92b87f3807fcbaa279e1099200d20e

Whole Runner Pod Logs

https://gist.github.com/chaosun-abnormalsecurity/4879c98298f992698ee6824c9a2d4bb6

Additional Context

No response

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot], Jan 11 '24 00:01

I ran into this problem today. It creates a new PV even though an Available PV exists. I'm wondering whether it needs time to unbind from the PV and become available again, or whether it's a bug.

waveofmymind, Jan 19 '24 07:01

I believe this issue is a duplicate of #2282.

rdepres, Jan 19 '24 11:01