[CI] bitcoin cache download occasionally fails
on occasion, restoring the cache that holds the bitcoin binaries for tests will fail as if the cache does not exist. checking manually though, the cache is indeed listed, so it's unclear where the error is coming from.
hypothesis is that with more devs/prs running workflows that store data in the cache, we're hitting the storage limit more frequently, and the bitcoin cache is being evicted while a test is trying to load it.
i think there are 2 options we can explore:
- change to use a simple `curl` to download a bitcoin binary archive per test. i suspect this may add some time overall to the workflow, and it does add another moving part per test that may fail in unexpected ways.
- add a more robust "does the cache exist?" check, and if the cache is not present, immediately try to recreate it (this may be challenging though, since it could lead to an infinite loop of creating the cache).
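as a rough sketch of the first option (the version, URL, and platform triple below are illustrative and would need to be verified against the real bitcoincore.org layout), the per-test download could be a workflow step like:

```yaml
# sketch: fetch the bitcoind archive directly per test instead of
# relying on the actions cache. version/url are illustrative.
- name: download bitcoin binaries
  shell: bash
  run: |
    BTC_VERSION="25.0"
    curl -fSL --retry 3 --retry-delay 5 \
      -o "bitcoin-${BTC_VERSION}.tar.gz" \
      "https://bitcoincore.org/bin/bitcoin-core-${BTC_VERSION}/bitcoin-${BTC_VERSION}-x86_64-linux-gnu.tar.gz"
    tar -xzf "bitcoin-${BTC_VERSION}.tar.gz"
```

the `--retry` flags would soften transient network failures, but this is exactly the extra moving part mentioned above.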
cc @kantai @fdefelici
I did some analysis on this issue and here are the findings that I hope could help.
Root Cause
We currently use a caching strategy where a "CACHE CHECK/SAVE" step ensures availability before subsequent "CACHE RESTORE" steps are executed. However, due to GitHub's caching policy (i.e., eviction when storage limits are exceeded), it’s possible for a cache to be evicted between these steps.
This leads to errors like:
Failed to restore cache entry. Exiting as fail-on-cache-miss is set. Input key: stacks-core-bitcoin-binaries-25.0
This message comes from our custom wrapper action: https://github.com/stacks-network/actions/blob/main/stacks-core/cache/bitcoin/action.yml
Cache Analysis
Based on the cache usage in https://github.com/stacks-network/stacks-core/actions/caches, we can categorize our caches into two main groups:
- Stable caches: These include bitcoind binaries (e.g. stacks-core-bitcoin-binaries-25.0). They are reused across many PRs and rarely change.
- Temporary caches: These include test archive data (e.g. stacks-core-cc5724dfdc60a43c1b0916668bf63b776e6c88f1-test-archive). These are PR-specific and often larger in size.
The temporary caches are more demanding in terms of storage, increasing the risk of stable caches being evicted under storage pressure.
Potential Solutions
Option 1: Conservative Approach Modify our wrapper action so that, if cache restore fails, it immediately recreates the cache.
- Simple and quick to implement.
- Could lead to race conditions or infinite loops if multiple jobs try to recreate the cache simultaneously.
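As a sketch of what the wrapper change could look like (assuming it is built on `actions/cache/restore` / `actions/cache/save`; the `download-bitcoin.sh` script name is hypothetical):

```yaml
# Sketch: restore the cache, and recreate it on a miss instead of failing.
- name: Restore bitcoin cache
  id: restore
  uses: actions/cache/restore@v4
  with:
    path: bitcoin/
    key: stacks-core-bitcoin-binaries-25.0
    fail-on-cache-miss: false

# Hypothetical step that re-downloads the binaries when the cache is gone.
- name: Rebuild binaries on cache miss
  if: steps.restore.outputs.cache-hit != 'true'
  shell: bash
  run: ./download-bitcoin.sh

- name: Save recreated cache
  if: steps.restore.outputs.cache-hit != 'true'
  uses: actions/cache/save@v4
  with:
    path: bitcoin/
    key: stacks-core-bitcoin-binaries-25.0
```

Note that if several jobs hit the miss at once, each would run the rebuild step, which is the race condition noted above.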
Option 2: Smarter Cache Management Since we have “two speeds” of caches (stable vs temporary), we could try to handle them accordingly.
GitHub cache management doesn’t yet allow grouping or prioritizing caches natively, but we can approximate this behavior by:
- Keeping stable caches persistent until explicitly updated (e.g. new Bitcoin version).
- Cleaning up temporary caches proactively (e.g. when a PR is closed).
This would reduce storage pressure and help avoid evicting critical caches.
To support this, we can explore using the gh cache CLI, which seems to provide useful commands to list and delete cache entries.
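As an illustration of how the two groups could be told apart by key (the `classify_cache_key` helper and the exact key patterns are assumptions based on the examples above; in CI the keys would come from `gh cache list` and deletion would use `gh cache delete`, with flags depending on the `gh` version):

```shell
# Hypothetical helper: classify a cache key as "stable" or "temporary"
# based on the naming patterns observed in this repo.
classify_cache_key() {
  case "$1" in
    stacks-core-bitcoin-binaries-*) echo "stable" ;;
    stacks-core-*-test-archive)     echo "temporary" ;;
    *)                              echo "unknown" ;;
  esac
}

# In CI, this loop would instead be fed by something like:
#   gh cache list --limit 100 --json key --jq '.[].key'
# and each temporary key removed with: gh cache delete "$key"
for key in \
  "stacks-core-bitcoin-binaries-25.0" \
  "stacks-core-cc5724dfdc60a43c1b0916668bf63b776e6c88f1-test-archive"
do
  echo "$key -> $(classify_cache_key "$key")"
done
```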
For example, this repo implements a scheduled cache eviction workflow.
We could take a similar approach:
- try to delete temporary caches when the related PR is closed
- possibly adjust cache naming patterns to include PR IDs, making them easier to identify and delete
- if needed, also apply a scheduled deletion of old temporary cache entries
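A minimal sketch of the PR-close cleanup could look like this (the workflow name is hypothetical, and the `--ref`/`--json` flags assume a recent `gh` version):

```yaml
# Sketch: delete a PR's temporary caches when the PR is closed.
name: cleanup-pr-caches
on:
  pull_request:
    types: [closed]
permissions:
  actions: write   # required to delete caches
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete caches created for this PR
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh cache list --repo "${{ github.repository }}" \
            --ref "refs/pull/${{ github.event.pull_request.number }}/merge" \
            --json id --jq '.[].id' |
          while read -r id; do
            gh cache delete "$id" --repo "${{ github.repository }}"
          done
```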
As a side note, if we need more flexibility in the future, we might consider an external caching solution (e.g., Amazon S3) for stable binaries.
option 1 was something i looked into when we moved to using caches for shared data - the biggest challenge was the possibility of an endless loop, where the cache doesn't exist, so it is created, then deleted/evicted, created again, and so on. i'm going off of memory, so i may be wrong - but i think at the time the github actions UI complained about how it was attempted (calling out the endless loop).
option 2 seems like the better option, but not without tradeoffs:
- there really is only 1 stable cache we use today, bitcoin - the other caches for nextest etc are created on PR open/commit based on the commit hash. so any change would really only apply to bitcoin today (but we'd have the logic if that changes in the future).
- the scheduled eviction could work - but we'd have to also accept that it may delete caches that we still want (and would need to recreate). consider a PR that has a cache with a failing required test. today, we keep the cache in order to improve retry speed without needing to rebuild the entire cache (~15-20 minutes on avg). if that cache is evicted, and the failing job retried - it would need a manual retry of all jobs (to recreate the cache), not just the failing one. one way around this (but could be messy/duplicative) is to check/create the cache before each test (which has its own issues because we run a lot of tests in parallel using the commit hash as the key). what is likely to occur here is we'd have competing tasks creating the cache, all racing to save it and start a test, while other jobs are overwriting that cache and starting their own tests.
- if we modified the trigger from a scheduled task to delete caches to run on merge, that could solve the above issues, but not without challenges - since the cache key uses a commit hash, we'd have to tie a cache key back to a PR's commits before evicting them. i like your idea about adding the PR id though - perhaps we could filter by PR id on merge, and evict all matching caches.
there may be a middle ground for option 2, so i'll start looking into solutions for that - meanwhile i could submit a PR to adjust the bitcoin cache key to get around this error for the time being (unless the current mitigation is acceptable?)
i'll start working on these ideas in some forks, will share more once i have some initial changes setup.
edit: using an external provider like s3 has come up in the past - and it would work, but it would require either public write access, or require anyone running the ci in a fork to also setup s3 and pay for it. I don't think that's an idea we should be pursuing just yet (a similar argument has been made to use custom runners like buildjet). additionally, the benefit of using a cache vs block storage is that it's a bit faster to download the data to a runner - we could probably do the same without s3/cache by simply modifying jobs to download artifacts from a job. this would be a bit slower on average i think, but it is an option.
i've gone over the different options here @fdefelici - i really like the "smart" caching strategy, and it was my first choice as well.
however, i don't think it will be technically possible using github's cache (which i think we need to keep, else it will add additional burden on forks and other contributors). the reason i don't think it's possible is that only the default branch, or a cache made from a parent branch may be used by a source branch in a PR.
so, caches created from master and develop would be fine for this strategy - but then we may have the issue where those caches are evicted, and we'll have 2 options:
- somehow we trigger a job to recreate the missing cache from the target branch (i.e. in a PR's workflow, if the cache is missing - we then have to re-trigger the cache creation from the target branch). at best, this approach would be quite messy and i would imagine prone to race conditions as you outlined in your first option above. e.g. 2 pr's detect the missing cache, both attempt to recreate - at least one of those PR's will fail status checks as a result.
- schedule a task that creates the caches `n` times per day from the target branches. this could work, but it's possible that there may be an extended amount of time where the cache is missing, and jobs are failing until the job is scheduled to run (or is triggered manually). this could possibly be mitigated if we add the functionality described in (1), but we're still left with an unresolved race.
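for reference, the scheduled variant could be sketched like this (the cron cadence and workflow name are illustrative, and the download step is hypothetical):

```yaml
# sketch: periodically recreate the stable bitcoin cache from the
# target branch so evictions are repaired without manual re-runs.
name: refresh-bitcoin-cache
on:
  schedule:
    - cron: "0 */6 * * *"   # 4 times per day; cadence is illustrative
  workflow_dispatch: {}
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - name: check for existing cache
        id: cache
        uses: actions/cache/restore@v4
        with:
          path: bitcoin/
          key: stacks-core-bitcoin-binaries-25.0
      # hypothetical step to fetch the binaries when the cache was evicted
      - name: re-download binaries
        if: steps.cache.outputs.cache-hit != 'true'
        run: ./download-bitcoin.sh
      - name: save recreated cache
        if: steps.cache.outputs.cache-hit != 'true'
        uses: actions/cache/save@v4
        with:
          path: bitcoin/
          key: stacks-core-bitcoin-binaries-25.0
```

even with this in place, there's still a window between eviction and the next scheduled run where PR jobs would fail.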
I'm left thinking we avoid complicating this and stick with the "band-aid" approach i PR'ed in https://github.com/stacks-network/actions/pull/74. since each PR will create a new cache for each commit sha, it's quite unlikely that these caches will be evicted by the time they're used (it's possible they may be missing if a re-run is required several days later, but re-running all jobs i think is acceptable in that case).
combine this with the idea of removing older cache entries (or even just caches with sha's that have merged) - and i think that's the best approach here based on what we can do with the tools available.
I'll start looking into that idea of using the cli to remove caches - i can think of 1 edge case where it may become a little tricky (the sha's in develop that would be checked will also all be present in release branches - so it's possible a release build could have its caches deleted during a release workflow). i think we can work around cases like this though, likely by using your idea of setting cache keys with a PR number.
https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache
Ok, let's start with the "clean-up" approach - it can certainly help with the "cache pressure" without overcomplicating things.