
Ephemeral runner dependency caching

Open linde12 opened this issue 1 year ago • 15 comments

What would you like added?

I would like to be able to use something like the actions/cache action to cache dependencies installed with e.g. npm install (node_modules) in my workflow. I want to do this using ephemeral runners, and I want to store the cache on a persistent volume or similar, not on GitHub (as that would be slow).

Why is this needed?

Currently, steps like actions/setup-node and npm install make a lot of unnecessary requests and end up taking a lot of time. I would like to cache these things just as I do on GitHub-hosted runners with the actions/cache action so that the time spent is reduced.

I see that we can mount volumes to ephemeral runners today, but I'm not sure how, or if it's even possible, to get ARC to write to/load from the cache.
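For reference, a minimal sketch of what such a volume mount could look like in the values of ARC's gha-runner-scale-set Helm chart (the claim name and the pre-created RWX PVC are my assumptions, not something ARC provides out of the box):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        volumeMounts:
          # Persist the tool cache across ephemeral runner pods
          - name: tool-cache
            mountPath: /opt/hostedtoolcache
    volumes:
      - name: tool-cache
        persistentVolumeClaim:
          claimName: runner-tool-cache   # hypothetical pre-created RWX PVC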

linde12 avatar Jul 07 '23 11:07 linde12

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jul 07 '23 11:07 github-actions[bot]

Any news on this? I would also really benefit from something like this, but I use Maven.

Ueti avatar Jul 13 '23 07:07 Ueti

I've solved a similar problem for node modules with the help of https://verdaccio.org/. Deploy it as a DaemonSet on the worker nodes where the runners are executed, and point your jobs to pull dependencies through it. This setup worked pretty well.
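A minimal sketch of what pointing a job at such a proxy could look like (the hostname is hypothetical; 4873 is Verdaccio's default port):

  - name: Install dependencies via local Verdaccio
    shell: bash
    run: |
      # Route npm traffic through the node-local Verdaccio DaemonSet
      npm config set registry http://verdaccio.local:4873/
      npm install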

vmbobyr avatar Aug 12 '23 07:08 vmbobyr

I've solved a similar problem for node modules with the help of https://verdaccio.org/. Deploy it as a DaemonSet on the worker nodes where the runners are executed, and point your jobs to pull dependencies through it. This setup worked pretty well.

Yes, this helps in some cases. I've accomplished the same with Artifactory, but it's unnecessarily complex and you still have to deal with dependency resolution, which by itself can take more than a minute. Also, you are limited in what you can cache, unlike with the official cache action.

I am interested in a solution similar to the official cache action (preferably with the same API) so we can cache anything (just as we can on GitHub), without additional third-party software.

linde12 avatar Aug 12 '23 08:08 linde12

I solved this by using a ReadWriteMany PVC that every ephemeral runner attaches to upon startup:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: enterprise-runnerset-large
spec:
  replicas: 4
  image: $IMAGE
  dockerdWithinRunnerContainer: true
  enterprise: $ENTERPRISE
  labels:
  - ubuntu-latest
  selector:
    matchLabels:
      app: runnerset-large
  serviceName: runnerset-large
  template:
    metadata:
      labels: 
        app: runnerset-large
    spec:
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: "Always"
      terminationGracePeriodSeconds: 110
      containers:
      - name: runner
        env: 
          - name: RUNNER_GRACEFUL_STOP_TIMEOUT
            value: "90"
          - name: ARC_DOCKER_MTU_PROPAGATION
            value: "true"
        resources:
          limits:
            memory: "8Gi"
          requests:
            cpu: "2"
            memory: "8Gi"
        volumeMounts:
        - mountPath: /opt/hostedtoolcache
          name: tool-cache
        - mountPath: /runner/_work
          name: work
      volumes:
      - name: tool-cache
        persistentVolumeClaim:
          claimName: tool-cache-enterprise-runnerset-large-0
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: [ "ReadWriteOnce" ]
              storageClassName: "csi-ceph-cephfs"
              resources:
                requests:
                  storage: 5Gi
--- 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tool-cache-enterprise-runnerset-large-0
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app: runnerset-large
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 25Gi
  storageClassName: csi-ceph-cephfs
  volumeMode: Filesystem

View of the cache:

runner@enterprise-runnerset-medium-zj5wn-0:/$ ls -la /opt/hostedtoolcache/
total 1
drwxrwsr-x 22 root   runner  22 Sep 19 17:38 .
drwxr-xr-x  1 root   root    24 Sep 25 17:34 ..
drwxrwsr-x  3 runner runner   1 Aug  4 17:11 Java_Adopt_jdk
drwxrwsr-x  4 runner runner   2 Jul 25 15:49 Java_Corretto_jdk
drwxrwsr-x  3 runner runner   1 Aug  4 17:11 Java_IBM_Semeru_jdk
drwxrwsr-x  3 runner runner   1 Jul 14 15:45 Java_Oracle_jdk
drwxrwsr-x  5 runner runner   3 Aug 18 16:46 Java_Temurin-Hotspot_jdk
drwxrwsr-x  3 runner runner   1 Aug  4 17:13 Java_Zulu_jdk
drwxrwsr-x  3 runner runner   1 Jul 27 15:18 Miniconda3
drwxrwsr-x  5 runner runner   3 Sep 19 17:39 PyPy
drwxrwsr-x  9 runner runner   7 Sep 19 17:38 Python
drwxrwsr-x  6 runner runner   4 Sep 21 14:22 Ruby
drwxrwsr-x  3 runner runner   1 Jul  6 21:45 blobs
drwxrwsr-x  5 runner runner   3 Jul 26 20:49 buildx
drwxrwsr-x  4 runner runner   2 Jul 20 13:21 buildx-dl-bin
drwxrwsr-x  8 runner runner   9 Aug  4 20:46 dotnet
drwxrwsr-x  5 runner runner   3 Aug 30 19:33 go
drwxrwsr-x  3 runner runner   1 Jul 12 19:06 grype
-rw-rw-r--  1 runner runner 244 Jul  6 21:46 index.json
drwxrwsr-x  2 runner runner   0 Jul  6 21:46 ingest
drwxrwsr-x  4 runner runner   2 Jul 18 21:21 maven
drwxrwsr-x  9 runner runner   7 Aug 17 21:22 node
-rw-rw-r--  1 runner runner  30 Jul  6 21:46 oci-layout
drwxrwsr-x  3 runner runner   1 Jul 12 19:05 syft
runner@enterprise-runnerset-medium-zj5wn-0:/$ ls -la /opt/hostedtoolcache/node/
total 0
drwxrwsr-x  9 runner runner  7 Aug 17 21:22 .
drwxrwsr-x 22 root   runner 22 Sep 19 17:38 ..
drwxrwsr-x  3 runner runner  2 Jul 25 22:12 14.18.2
drwxrwsr-x  3 runner runner  2 Jul 20 18:53 16.14.0
drwxrwsr-x  3 runner runner  2 Jul 26 16:58 16.20.0
drwxrwsr-x  3 runner runner  2 Jul  3 11:00 16.20.1
drwxrwsr-x  3 runner runner  2 Jul 14 15:45 18.16.0
drwxrwsr-x  3 runner runner  2 Jul  6 14:38 18.16.1
drwxrwsr-x  3 runner runner  2 Aug 17 21:22 6.17.1

alec-drw avatar Sep 25 '23 17:09 alec-drw

I solved this by using a ReadWriteMany PVC that every ephemeral runner attaches to upon startup: […]
Hi @alec-drw - I set it up exactly like this, but I'm getting the error below in my GitHub Actions pipeline:

Download action repository 'actions/checkout@v3' (SHA:f43a0e5ff2bd294095638e18286ca9a3d1956744)
Error: Can't use 'tar -xzf' extract archive file: /runner/_work/_actions/_temp_cbd2a7be-cad1-4030-b170-e4737cdf2323/ca9fffe4-3f99-409f-a500-81e17f49794c.tar.gz. Action being checked out: actions/checkout@v3. return code: 2.

Did you face anything like this, or do you have any idea what might cause it?

immatureprogrammerr avatar Oct 15 '23 11:10 immatureprogrammerr

I ended up with something like this.

It is a blunt use of hostPath, but it should provide a fast cache.

You will need a script to populate and manage the cache directory (see the restore/store steps further down).

  spec:
    securityContext:
      runAsUser: 1001
      runAsGroup: 1001
      fsGroup: 1001
      fsGroupChangePolicy: "OnRootMismatch"
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        volumeMounts:
          - name: hostedtoolcache
            mountPath: /opt/hostedtoolcache
    volumes:
      - name: hostedtoolcache
        hostPath:
          # directory location on host
          path: /tmp/arc
          type: DirectoryOrCreate

So I guess we need some sort of actions/ARC cache helper to store and restore the files from local volumes.

Maybe actions/cache could be modified to use local storage instead of cloud storage.

There is an issue on actions/cache asking for alternative backends: https://github.com/actions/cache/issues/354

adiroiban avatar Oct 18 '23 12:10 adiroiban

@adiroiban how would that work with multiple nodes in the cluster? Each time a runner comes online it would get a different cache. My first pass used this approach, but having a persistent cache via a PV with RWX allows the same directory to be mounted every time.

alec-drw avatar Oct 18 '23 13:10 alec-drw

If you already have a PV with RWX, then this is not needed.

My workaround is only for simple hostPath storage.

It only provides super-fast persistent storage on the node, and it is only for testing.

You can add more complex logic for storing/restoring the cache... but that ends up reimplementing actions/cache.


At the start and end of a workflow I have something like this.

I am only caching "build" and "node_modules":

  - name: Restore cache
    shell: bash
    run: |
      # The "pending" file acts as a crude lock: wait briefly if
      # another job is mid-save before reading the cache.
      if [ -f /opt/hostedtoolcache/pending ]; then
        echo "Waiting for pending cache to finalize..."
        sleep 10
      fi

      if [ -d /opt/hostedtoolcache/build ]; then
        echo "Restoring cache."
        cp -r /opt/hostedtoolcache/build build
        cp -r /opt/hostedtoolcache/node_modules node_modules
      else
        echo "No cache found."
      fi

  - name: Store cache
    shell: bash
    run: |
      # Skip saving if another job is already mid-save.
      if [ ! -f /opt/hostedtoolcache/pending ]; then
        touch /opt/hostedtoolcache/pending
        echo "Saving cache..."
        rm -rf /opt/hostedtoolcache/build
        rm -rf /opt/hostedtoolcache/node_modules
        mv build /opt/hostedtoolcache/
        mv node_modules /opt/hostedtoolcache/
        rm /opt/hostedtoolcache/pending
      else
        echo "Not saving cache as there is a pending save."
      fi

adiroiban avatar Oct 18 '23 14:10 adiroiban

Yes, this helps in some cases. I've accomplished the same with Artifactory, but it's unnecessarily complex and you still have to deal with dependency resolution, which by itself can take more than a minute. Also, you are limited in what you can cache, unlike with the official cache action.

Can you please share details on how you achieved this with Artifactory? I am trying to run actions/setup-node and npm install against a GitHub Enterprise Server instance in China. The difficulty is in the resolution of the dependencies.

gayanak avatar Nov 16 '23 04:11 gayanak

One very hacky workaround that works in some scenarios (like a monorepo setup) is to use a custom builder image and bake the dependencies into it. Something like:

COPY --chown=runner:docker ./package.json /tmp/package.json
COPY --chown=runner:docker ./package-lock.json /tmp/package-lock.json
RUN cd /tmp && npm install

In our setup, this adds about 20-30s to download the (now larger) image to a fresh node, but after it's prewarmed it's pretty fast.
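A job can then seed its workspace from the prewarmed directory instead of resolving everything again; a minimal sketch, reusing the /tmp paths from the Dockerfile above:

  - name: Reuse prewarmed node_modules
    shell: bash
    run: |
      # Fall back to a clean install if the lockfile has drifted
      # from the one baked into the image.
      if cmp -s package-lock.json /tmp/package-lock.json; then
        cp -r /tmp/node_modules node_modules
      else
        npm ci
      fi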

dfr-exnaton avatar Jan 29 '24 22:01 dfr-exnaton

Is a solution that replaces the cloud storage of actions/cache with local storage possible?

We would like to switch to self-hosted runners, but they have a very low success rate at handling actions that use the cache.

Action:

    - name: Build, tag, and push image to container registry
      uses: docker/build-push-action@v4
      with:
        push: ${{ env.ACT != 'true' }}
        provenance: false
        tags: |
          ${{ env.DOCKER_USERNAME }}/${{ inputs.image_name}}:${{ inputs.app_version }}
          ${{ env.DOCKER_USERNAME }}/${{ inputs.image_name}}:${{ github.sha }}
        platforms: ${{ inputs.platforms }}
        context: ${{ inputs.docker_context || '.' }}
        file: ${{ inputs.docker_file || './Dockerfile' }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

Fails with

buildx failed with: error: failed to solve: Get "https://acghubeus1.actions.githubusercontent.com/8V4YnVoosB0k2Rieq40qvDIC8EkpoC2S2GIgTQFJs9ePMWvozj/_apis/artifactcache/cache?keys=buildkit-blob-1-sha256%3A2457c1c5bd028c46eab1f52756c9d700d6dc39a0f03443dd9fd2d739a38c1a89&version=693bb7016429d80366022f036f84856888c9f13e00145f5f6f4dce303a38d6f2": net/http: TLS handshake timeout

Reading the docs on https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows

When using self-hosted runners, caches from workflow runs are stored on GitHub-owned cloud storage. A customer-owned storage solution is only available with GitHub Enterprise Server.

If we change the action like this, the error disappears:

    - name: Build, tag, and push image to container registry
      uses: docker/build-push-action@v4
      with:
        push: ${{ env.ACT != 'true' }}
        provenance: false
        tags: |
          ${{ env.DOCKER_USERNAME }}/${{ inputs.image_name}}:${{ inputs.app_version }}
          ${{ env.DOCKER_USERNAME }}/${{ inputs.image_name}}:${{ github.sha }}
        platforms: ${{ inputs.platforms }}
        context: ${{ inputs.docker_context || '.' }}
        file: ${{ inputs.docker_file || './Dockerfile' }}

Unfortunately, most of the actions we use already rely on the GHA build cache.
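For what it's worth, buildx also has a local cache backend, so one hedged workaround is to write the build cache to a persistent volume mounted into the runner instead of the GitHub cache service (the cache path below is an assumption):

    - name: Build, tag, and push image to container registry
      uses: docker/build-push-action@v4
      with:
        push: true
        # Persistent path mounted into the runner pod (hypothetical)
        cache-from: type=local,src=/opt/hostedtoolcache/buildx-cache
        cache-to: type=local,dest=/opt/hostedtoolcache/buildx-cache,mode=max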

davidwincent avatar Apr 10 '24 05:04 davidwincent

Implementing caching on self-hosted runners is not that easy.

You might have your own homelab/on-premise bare-metal servers, or AWS-, Azure-, or Google-operated Kubernetes clusters, each with a different storage solution.

I have an on-premise bare-metal k8s cluster, so I am using OpenEBS Local PV Hostpath for storage, with a simple script to cache and restore.
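For context, the claim backing that setup could look something like this (openebs-hostpath is the StorageClass name OpenEBS installs by default for Local PV Hostpath; the size is an assumption):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-local-cache
spec:
  storageClassName: openebs-hostpath  # node-local, hostpath-backed
  accessModes:
    - ReadWriteOnce                   # local PVs are single-node only
  resources:
    requests:
      storage: 20Gi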

adiroiban avatar Apr 10 '24 09:04 adiroiban

Implementing caching on self-hosted runners is not that easy.

You might have your own homelab/on-premise bare-metal servers, or AWS-, Azure-, or Google-operated Kubernetes clusters, each with a different storage solution.

I have an on-premise bare-metal k8s cluster, so I am using OpenEBS Local PV Hostpath for storage, with a simple script to cache and restore.

Care to share?

davidwincent avatar Apr 10 '24 12:04 davidwincent

@davidwincent there is an actively maintained project for exactly that which I came across quite recently; I have not tried it myself, however: https://github.com/falcondev-oss/github-actions-cache-server
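The general approach there seems to be pointing the runner's cache client at the self-hosted server via an environment variable in the pod template; a hedged sketch (check that project's README for the exact variable and URL it expects, the in-cluster address here is hypothetical):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Hypothetical in-cluster address of the cache server
          - name: ACTIONS_CACHE_URL
            value: http://cache-server.arc-systems.svc.cluster.local/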

alec-drw avatar Apr 10 '24 13:04 alec-drw