cache icon indicating copy to clipboard operation
cache copied to clipboard

Same cache key not shared between different pull requests

Open gyfis opened this issue 5 years ago • 38 comments

Hi, I'm not sure if this is a feature or a bug, but when we use the same cache key for different PRs (sequentially without a race condition), the second PR doesn't find the cache even though the keys are the same and the first PR completes the workflow successfully (including cache upload).

I looked through the actions/cache code and I think it may be related to the scope attribute of the ArtifactCacheEntry object. I weren't able to hack around the code to ignore this attribute, so I'm not 100% sure. The attribute does contain values that would point to the caches being scoped to different PRs, though (refs/pull/1660/merge and refs/pull/1661/merge on the two test pull requests I tried).

This is the setup we use:

      - uses: actions/cache@v1
        id: cache
        with:
          path: .github/vendor/bundle
          key: github-gems-${{ hashFiles('**/.github/Gemfile.lock') }}
          restore-keys: |
            github-gems-

The bundle is installing gems to the correct folder, since the cache is successfully fetched starting from the second commit on each pull request.

Let me know if I can provide more info. Thanks!

gyfis avatar Nov 10 '19 21:11 gyfis

Yes, it sounds like this behavior is caused by the scope. The PR refs/pull/1660/merge has read-write permissions to caches in the scope refs/pull/1660/merge and read-only permissions on the default branch scope (typically master). PR refs/pull/1661/merge does not have permissions to read the cache from refs/pull/1660/merge and vice versa. Essentially, you would need some action to trigger after merging the change into master which saves the cache. Once saved to the default (master) branch scope, it will be readable by PRs.

dhadka avatar Nov 10 '19 23:11 dhadka

Interesting! I’m glad there’s a way for it to work in its current state, but I feel like sharing (writable) caches between PRs would also be nice. Are you thinking about expanding this functionality, e.g. by making the scope a parameter of the cache action?

gyfis avatar Nov 11 '19 05:11 gyfis

I want to confirm that adding the cache persist to master push indeed gives pull request access to the same cache. This is the workflow I used (filename cache_persist.yml):

name: Cache persist

on: 
  push:
    branches:
      - master

env:
  BUNDLE_GEMFILE: .github/Gemfile

jobs:
  cache_persist:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v1

      - uses: actions/setup-ruby@v1
        with:
          ruby-version: '2.6'

      - uses: actions/cache@v1
        with:
          path: .github/vendor/bundle
          key: github-gems-${{ hashFiles('**/.github/Gemfile.lock') }}
          restore-keys: |
            github-gems-

      - name: bundle install
        run: |
          gem install bundler --no-document
          bundle check --path=vendor/bundle || bundle install --jobs=4 --retry=3 --path=vendor/bundle

Some good next steps for this issue:

  • acknowledging that this is the expected behavior by updating README.md or another document describing the scope functionality and the possibility to add read-only cache by using the master push, (and closing this issue), or
  • making the cache scope mechanism publicly modifiable with appropriate description and API, (and closing this issue), or
  • (the most low-effort one) using this issue as the reference for this functionality, and closing it with such confirmation

@dhadka I’m down for adding a PR to a documentation somewhere describing what’s up with the scoping, if you approve.

gyfis avatar Nov 11 '19 08:11 gyfis

There is some documentation on scopes in the help docs - https://help.github.com/en/actions/automating-your-workflow-with-github-actions/caching-dependencies-to-speed-up-workflows#cache-scope

The scope was designed this way to prevent poisoning the cache of other branches. For example, without scoping, someone could create a PR and inject malicious libraries into the cache, which are subsequently read by other branches.

@chrispat I think we can help make the documentation clearer on how the scope works, particularly with respect to the benefit of creating the cache in the default branch so it can be shared among PRs.

dhadka avatar Nov 11 '19 18:11 dhadka

Actually, the linked docs is slightly wrong in saying "You can restore a cache created in any branch of a repository, including base branches of forked repositories."

dhadka avatar Nov 11 '19 18:11 dhadka

I've created an internal issue to update the help docs.

dhadka avatar Nov 14 '19 14:11 dhadka

@gyfis does the cache persist workflow that you posted copy the cache from a feature branch to master?

YashMathur avatar Nov 14 '19 21:11 YashMathur

@YashMathur I wouldn’t say copy, really - the push if: master workflow is simply using the same cache as the pull request ones from feature branches, and in so making it available for newer pull request workflows.

gyfis avatar Nov 14 '19 23:11 gyfis

@gyfis If I understand this correctly, you're running the bundle install every time you merge a PR to master so the cache lives in master's scope. It can then be accessed by other branches.

I don't think the Cache persist workflow will be able to restore any cache that is created by pull requests. Is that correct?

YashMathur avatar Nov 14 '19 23:11 YashMathur

@YashMathur Yes, that’s exactly right.

This may be obvious but if the cache doesn’t change often, all workflows are fast (PR workflows read the master cache, and the master push workflows also read the master cache). If we don’t count parallelism, a new cache key in this setup is build exactly two times - once in a new feature branch that changes it, and once in a master push branch after merge - and then it’s saved and accessible everywhere.

(Trying to paraphrase the docs here to see if I’m understanding them correctly): Master branch workflow never reads any cache from feature branches, and there’s no cache passing between feature branches, either - that way no branch can pollute cache, unless merged into master.

gyfis avatar Nov 15 '19 06:11 gyfis

WHOAH -- I will just add on that I found this very confusing and have been spending a lot of hours trying to figure out why my cache hasn't been working as I'd expected. I definitely think the docs could be clearer that the cache is not accessible across branches by default, and adding an example of the job shown here as a way to achieve cache-sharing would be super helpful.

Thanks to those in the thread for laying this out so clearly -- this was driving me bananas coming from gitlab where cross-branch caching is the default. Since it takes about an hour to build the cache for my project, this is going to be a huge win for us!

mikekaminsky avatar May 30 '20 15:05 mikekaminsky

Here is the excerpt from the documentation

A workflow can access and restore a cache created in the current branch, the base branch (including base branches of forked repositories), or the default branch (usually master). For example, a cache created on the default branch master would be accessible from any pull request. Also, if the branch feature-b has the base branch feature-a, a workflow triggered on feature-b would have access to caches created in the default branch (master), feature-a, and feature-b.

Access restrictions provide cache isolation and security by creating a logical boundary between different workflows and branches. For example, a cache created for the branch feature-a (with the base master) would not be accessible to a pull request for the branch feature-b (with the base master).

I think it pretty clearly lays out the access restrictions but I would be happy to take feedback on how to improve it.

I think it might be possible to allow for sharing writable caches across PRs targeting the same base within a repo, however, I don't think I would ever allow that for PRs coming from forks of public repos.

chrispat avatar Jun 01 '20 15:06 chrispat

@chrispat

I think it might be possible to allow for sharing writable caches across PRs targeting the same base within a repo

My assumption (before finding this thread) was that caches would be shared between branches in a single repository. I kept wondering why my merge commits were not using cache even though the cache key was identical to the parent commit's build.

I would definitely love to see caches be shared between branches. I can understand wanting to avoid fork PRs from "poisoning the cache", but I think it should be safe for all branches in a given repository to use a shared cache.

EDIT: A compromise might be that when a feature branch is merged into the default branch it's cache could be merged into the default branch's cache, and then the least recently used of this cache can be purged to bring it down under the limit. Worst case this is no worse than now, but best case we can use the same cache as the final commit in the feature branch.

dkubb avatar Jul 03 '20 18:07 dkubb

I think it might be possible to allow for sharing writable caches across PRs targeting the same base within a repo, however, I don't think I would ever allow that for PRs coming from forks of public repos.

@chrispat This would be great for us. I would prefer to not have to trick the flow by checking into master after each test run to get this kind of functionality. I would imagine forks are an edge case that few people care about in practice.

jfo84 avatar Jul 28 '20 10:07 jfo84

This issue is stale because it has been open for 365 days with no activity. Leave a comment to avoid closing this issue in 5 days.

github-actions[bot] avatar Dec 30 '21 08:12 github-actions[bot]

I don't think this issue is solved yet... but since GitHub has a way of approving running checks for external PRs, there may be a possibility to condition access to cache in some way as well

gyfis avatar Dec 30 '21 08:12 gyfis

This is still a problem for us too. We don't want to use restore-keys because we would like to use npm ci. We would like to share node module caching across all branches with the package-lock hash as the key to minimize installing packages as some of our private packages cost per install.

flippidippi avatar Feb 10 '22 20:02 flippidippi

I'm sorry to spam this issue, but I have a question that is related to caches in pull requests.

The question is about the quota that caches for pull requests consume. When a pull request build uses GitHub Actions Cache, is the quota consumed from the upstream repository where the Pull Request was created, or does the pull request build consume the quota of the forked repository?

I have posted the question in GitHub Community forum for GitHub Actions: https://github.community/t/how-does-the-github-actions-cache-size-limit-work-for-pull-requests/240546/3 .

Would someone be able to reply to this question? Thanks.

(update: #534 seems to be related to my question, I'm sorry about the noise.)

lhotari avatar Mar 25 '22 07:03 lhotari

The cache lives in the repo the workflow runs are in which in the case of a pull request from a fork is the upstream repo.

chrispat avatar Mar 25 '22 12:03 chrispat

This is still a problem for us too. We don't want to use restore-keys because we would like to use npm ci. We would like to share node module caching across all branches with the package-lock hash as the key to minimize installing packages as some of our private packages cost per install.

If the branch your PR is targeting has the same package lock has as your branch then it should pull the cache from the branch the PR is targeting. The way the cache scopes work is you get a R/W to your scope (your branch) and a RO to the PR target branch and the repos default branch.

chrispat avatar Mar 25 '22 12:03 chrispat

@chrispat @N-Usha looks like the doc is still not super helpful in calling out the scopes around PR branches. Should we update the doc with details for PR specifically? Another possibility now that we have list API is to present a better message as to why no cache was not found (i.e. was it key, or the scope or the version)

bishal-pdMSFT avatar Jul 01 '22 08:07 bishal-pdMSFT

This cache miss across different branches is creating anti-patterns in the way people work. It has been my experience that people do not create different branches or smaller pull requests because the build would start from scratch!

This should be fixed. It is not just a miss-documentation.

aminya avatar Aug 04 '22 06:08 aminya

This is something that we are missing too. The caches being scoped to the PR's merge commit branch leads to a lot of wasted action minutes in our setup too. We are using the org enforced required workflows(Beta right now), which at this point do not run on pushes but only on PR's. With the current cache scoping we are basically unable to use any kind of caching.

avivek avatar Mar 18 '23 11:03 avivek

Screenshot from 2023-03-30 19-07-42 It should be obvious from the above screenshot that this is huge waste of time for workflows as well as waste of space for github. The cache key is not chosen randomly. It means something regarding the contents of the cache. There should be a configuration option that allows a repository to opt-in to trusting its own cache across branches.

moh-eulith avatar Mar 30 '23 23:03 moh-eulith

I was also wondering why my caching wouldn't work for different branches/PRs, until I eventually stumbled upon this issue.

This is a particular issue when using CMake builds with vcpkg (C++) and vcpkg's binary caching. For every new branch or PR, it kicks off vcpkg's build process which source-builds the required dependencies, and that takes over an hour on the CI runner.

Is there any way to share the caches across branches / PRs yet?

patrikhuber avatar Apr 10 '23 10:04 patrikhuber

Yea, it is absurd restriction. I understand this is enabled by default to avoid some cache poisoning, but you should be able to opt-out. Maybe github wants your build to run longer in order to charge you more money 😄 Otherwise I am forced to write my own cache workflow which will sync the content of the cache dir to some persistant external storage. 😞

techi602 avatar Apr 14 '23 10:04 techi602

Also need to voice my concern here. It's great that Github provides security support wheels for every repository on Github with a thoughtful set of sane defaults but additionally expose the ability to control these things! That said, If we can understand the security implications of reusing caches, we can also understand how to layer and separate them.

The impact across every repo seems detremental: developers wait longer for PR's to land (wastes time) and pays more for accrued CI minutes (wastes money).

A solution that comes to mind would be adding a scope or flag for how the cache would be accessible (type: private - only accessible to same ref or perhaps scope: $ref or glob).

jbergstroem avatar May 17 '23 13:05 jbergstroem

@t-dedah @vsvipul A lot of people including me have issues with the current limitation on cache scope. Could you please provide some explanation on what's your plan for addressing this?

ImanMahmoudinasab avatar Jul 20 '23 21:07 ImanMahmoudinasab

I created a github action that only checks the cache instead of checking and restoring it. It can be useful to run as a job when you merge to a main branch so that consequent PR's can use the github cache action to restore a package cache.

Take a look here: https://github.com/jbergstroem/cache-key-exists

jbergstroem avatar Jul 21 '23 05:07 jbergstroem

I created a github action that only checks the cache instead of checking and restoring it. It can be useful to run as a job when you merge to a main branch so that consequent PR's can use the github cache action to restore a package cache.

Take a look here: jbergstroem/cache-key-exists

Your action can be replaced with actions/cache/restore:

lookup-only - If true, only checks if cache entry exists and skips download. Default: false

  - uses: actions/cache/restore@v3
    with:
      path: path/to/dependencies
      key: ${{ runner.os }}-...
      lookup-only: true # <---- THIS

miluoshi avatar Jul 25 '23 12:07 miluoshi