Alternative storage backend

Open zen0wu opened this issue 5 years ago • 28 comments

The current storage backend for the cache is GitHub, which does not work well with self-hosted runners or in security-sensitive cases. Two issues arise here: security, and the fact that with self-hosted runners there are always options to store the cache much faster (locally, or in blob storage) rather than uploading/downloading gigabytes of files to and from GitHub.

zen0wu avatar Jun 21 '20 07:06 zen0wu

This would help our self-hosted runners that are on-prem a lot. Maybe we could create a plugin-style system for different types of storage backends.

peterfortuin avatar Jul 31 '20 08:07 peterfortuin

In my opinion, this certainly makes sense. It also has the advantage that you don't have to worry about GitHub's cache-size limits. This isn't especially difficult to implement, so I can work on it if necessary. What do you think? @dhadka Well, I understand that there are higher-priority issues at the moment, so I would like to complete those first.

smorimoto avatar Aug 05 '20 23:08 smorimoto

@zen0wu There is a workaround made by @shonansurvivors that currently only works with S3 and S3-compatible services: https://github.com/shonansurvivors/actions-s3-cache

smorimoto avatar Aug 05 '20 23:08 smorimoto

But I don't think that action will be much faster than this one. Its only real advantage over this action is probably avoiding the cache-size limits.

smorimoto avatar Aug 05 '20 23:08 smorimoto

@smorimoto Yeah, I was thinking of working on a generic "bring your own storage" Action during an upcoming company hackathon. My idea was to have plugins for the various storage providers (Azure, S3, Minio, etc.) with some parameters to let you control the lifetime of the file. For example, if we had a time-to-live (TTL) parameter, we could implement caching with a short TTL, and the upload-artifact and download-artifact actions with a longer TTL. A scheduled job could then be set up to run daily or weekly to scan for and remove old / unused files. This would give users much more control over the content and also eliminate many of the current restrictions (size limits, sharing between branches and repos, etc.).

dhadka avatar Aug 06 '20 03:08 dhadka
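The scheduled cleanup described above could be sketched roughly as follows. Everything here is an illustrative assumption, not an existing plugin: the bucket name `my-ci-cache`, the AWS CLI being configured on the runner, and GNU `date` for timestamp parsing.

```shell
#!/usr/bin/env sh
# Hypothetical TTL sweep for a bring-your-own-storage cache bucket.
# Deletes cache objects whose LastModified is older than TTL_DAYS.
TTL_DAYS=7
CUTOFF=$(( $(date +%s) - TTL_DAYS * 86400 ))   # oldest timestamp to keep

aws s3api list-objects-v2 --bucket my-ci-cache \
    --query 'Contents[].[Key,LastModified]' --output text |
while read -r key modified; do
  # GNU date parses the ISO-8601 timestamp returned by S3.
  if [ "$(date -d "$modified" +%s)" -lt "$CUTOFF" ]; then
    aws s3 rm "s3://my-ci-cache/$key"
  fi
done
```

Run weekly from a scheduled workflow, this keeps total bucket size bounded without any per-repo quota.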

@dhadka Oh! I think that's a pretty good idea. Everyone has probably thought about bringing the plugin concept to Actions, but no one has done it yet. It would have a good effect on the community if well-known Actions like these did it.

smorimoto avatar Aug 10 '20 21:08 smorimoto

A similar concern was brought up before: https://github.com/actions/cache/issues/279

Would love to see either option done (local or a general repository-style API for setting your own cache backends) - storing cache on GitHub is convenient but the limitations can be a deal-breaker if you're trying to use the cache as a way to introduce persistent directories between action runs, particularly when there might be 7 days between cache pulls.

judge2020 avatar Oct 04 '20 01:10 judge2020

If we could use self-managed disks, that would be awesome! For our developers, we want the CI/CD experience to be the same both in the cloud and self-hosted. Right now, caching on self-hosted runners would require our own solution.

meiswjn avatar Nov 10 '20 14:11 meiswjn

I would love the ability to use blob storage. I have a Rust repository and each matrix build has > 1GB of compiled files, so we can't complete a single build without causing cache eviction.

lilith avatar Dec 17 '20 18:12 lilith

Would love to see this too. Video game development with Unity creates huge Library folders, even for small-scale games (ours is >3 GB), and each platform needs a separate cache. I'll have to take a look at https://github.com/shonansurvivors/actions-s3-cache, because we basically can't use GitHub caching for now (I would also like to cache LFS files to decrease bandwidth costs).

RDeluxe avatar Feb 17 '21 12:02 RDeluxe

I ended up doing something really simple with a custom composite action in our repo, in case others find it useful. It does not support restore-keys yet, but it shouldn't be too hard to add.

https://gist.github.com/zen0wu/a0a7cd95fe3f2f550467c4428ef0f87c

zen0wu avatar May 28 '21 22:05 zen0wu
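For readers who want the shape of that approach without opening the gist, here is a rough sketch of the general technique (not the gist itself): derive a cache key from a lockfile hash, then stream a tarball to/from a bucket. The bucket `my-ci-cache`, `Cargo.lock`, and the `target` directory are illustrative assumptions.

```shell
#!/usr/bin/env sh
# Illustrative sketch only: bucket name, lockfile, and cache layout are
# assumptions, not the contents of the linked gist.
KEY="cache-$(sha256sum Cargo.lock | cut -d' ' -f1 | cut -c1-16)"

if aws s3api head-object --bucket my-ci-cache --key "$KEY.tgz" >/dev/null 2>&1; then
  # Cache hit: stream the tarball down and unpack it into the workspace.
  aws s3 cp "s3://my-ci-cache/$KEY.tgz" - | tar -xz
else
  # Cache miss: run the build, then save the result for the next run.
  cargo build
  tar -cz target | aws s3 cp - "s3://my-ci-cache/$KEY.tgz"
fi
```

Because the key embeds the lockfile hash, a dependency change naturally produces a new cache entry instead of overwriting the old one.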

I ended up with this: https://github.com/marketplace/actions/s3-cache

jackieli-tes avatar Jun 04 '21 14:06 jackieli-tes

Note that GH is working on increased storage sizes: https://github.com/github/roadmap/issues/66

Although that issue is planned for "Future", so there's no ETA for when it'll be completed.

judge2020 avatar Jun 06 '21 03:06 judge2020

@judge2020 https://github.com/actions/cache/discussions/497

smorimoto avatar Jun 06 '21 04:06 smorimoto

I don't think increasing the cache storage size will be enough, because you will always find someone for whom the defaults don't work.

As a case study, I am working on a Rust project which also compiles TensorFlow from source. In a single run, our Linux build generated a 4417MB cache while the MacOS build generated 2115MB. If the cache is limited to 10GB, only one PR can use the cache at a time and all others will be evicted. CI runs only take 10 minutes with a cache hit, but having to rebuild from scratch blows build times out to 50+ minutes.

Multiply that by several developers actively working on the project at the same time, all evicting each other's caches, and you have a severe developer-experience problem.

It would be much more convenient if my company could use our own S3 bucket for caching because then we can set the limit as high as we want and the only impact is a slightly larger AWS bill at the end of the month.

Michael-F-Bryan avatar Dec 28 '21 12:12 Michael-F-Bryan
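The eviction math in that comment can be checked directly, using the sizes reported above against GitHub's 10 GB per-repository cache limit:

```shell
# Sizes reported above, against the 10 GB repo-wide cache limit.
LINUX_MB=4417
MACOS_MB=2115
PER_PR_MB=$(( LINUX_MB + MACOS_MB ))   # 6532 MB for one PR's matrix
LIMIT_MB=$(( 10 * 1024 ))              # 10 GB limit, in MB
echo "PRs cached before eviction: $(( LIMIT_MB / PER_PR_MB ))"   # prints 1
```

A single PR's matrix consumes about two thirds of the limit, so a second active PR is enough to start evicting the first.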

Is there any plan to make this official? It would be very beneficial for the community.

Sytten avatar Jul 26 '22 13:07 Sytten

@Sytten I don't think anything is being planned in this direction for GitHub.com right now, because manually maintaining regions so that the cache works effectively would be very difficult in such cases.

Although we do have an option for custom storage in GHES (GitHub Enterprise Server), the possibility of us supporting this on GitHub.com is very low.

vsvipul avatar Oct 19 '22 10:10 vsvipul

> @Sytten I don't think anything is being planned in this direction for GitHub.com right now as maintaining regions manually so that cache can work effectively will be very difficult in such cases.
>
> Although, we do have option for custom storage in GHES (Github enterprise server), the possibility of us supporting this in Github.com is very low.

Why not just allow specifying an Azure Blob Storage account (or similar) at the org / enterprise level?

meiswjn avatar Oct 20 '22 06:10 meiswjn

@meiswjn We do use Azure Blob Storage to save Actions caches. How would allowing a custom Azure Blob Storage account help you additionally? Can you please elaborate?

vsvipul avatar Oct 25 '22 10:10 vsvipul

> @meiswjn We do use azure blob storage to save actions caches. How will allowing a custom Azure blob storage help you additionally? Can you please elaborate?

Sure. For legal reasons we are not allowed to use GitHub runners and GitHub storage. Export control, data privacy, internal security guidelines - you name it. Allowing a custom blob storage would give us many possibilities: host it in our own country (no more export-control and data-privacy issues), connect it via private endpoint (confidential data / internal processes), use our own level of encryption, manage billing ourselves, etc.

meiswjn avatar Oct 25 '22 12:10 meiswjn

Hi, my employer uses self-hosted runners on AWS (for security reasons). We'd like to use Actions caching for Rust intermediate build artifacts. It would be nice not to have to maintain an internal fork of this action in order to do so.

Fishrock123 avatar Nov 30 '22 00:11 Fishrock123

Just reviving this thread a little. We have been using https://github.com/tespkg/actions-cache for a while, but it really would not be that big a change to port it to this action. I am willing to do the work if someone from the GitHub team can confirm this is something they are interested in maintaining. I feel the arguments made here are pretty convincing, but I don't want to work for nothing.

Sytten avatar Feb 14 '23 20:02 Sytten

@vsvipul Any change in GitHub's policy on this topic?

Sytten avatar Mar 24 '23 17:03 Sytten

Is there any update on this? We're using self-hosted runners, and waiting to restore/save the cache to GitHub is a crime. We need local disk storage.

gaspo53 avatar Aug 02 '23 16:08 gaspo53

@gaspo53 Check out a fork of the above action at https://github.com/everpcpc/actions-cache, which implements Apache OpenDAL as a storage backend, allowing the use of S3 and comparable Azure/GCP services for the cache.

strophy avatar Aug 03 '23 02:08 strophy

You could also have a look at https://github.com/runs-on/cache, which is a drop-in replacement for actions/cache@v4 but supports S3 as a backend. It is especially useful if your runners are in AWS, since you can get unlimited cache and at least 300 MiB/s throughput.

crohr avatar Feb 14 '24 16:02 crohr