postgrest icon indicating copy to clipboard operation
postgrest copied to clipboard

ci: Improve caching

Open wolfgangwalther opened this issue 1 year ago • 28 comments

Plenty of text in the commit messages. Hopefully this also fixes the currently failing loadtest, but I'm not sure about that.


TODO:

  • [x] Save nix cache only on main
  • [ ] Split stack/cabal caches into dependency and working directory caches
  • [ ] Reduce cache size
  • [ ] Add PR caching for postgrest-build
  • [ ] Garbage collect nix/stack/cabal caches?
  • [ ] Purge old caches proactively when saving new cache?

wolfgangwalther avatar Feb 06 '24 18:02 wolfgangwalther

Indeed, loadtest is fixed that way. Not sure whether that's a one-time thing or not, but still a good sign.

A few more observations, looking at https://github.com/PostgREST/postgrest/actions/caches. Since I'm not sure whether that's publicly available, some key facts:

  • There's a big warning on top, that reads: "Approaching total cache storage limit (53.35 GB of 10 GB Used). Least recently used caches will be automatically evicted to limit the total cache storage to 10 GB."
  • Those 53 GB have been created since yesterday, so that's not going to be sustainable.
  • macOS-all: 3.1 GB
  • Linux-test: 2.7 GB
  • Linux-style: 1.4 GB
  • Linux-common: 2.7 GB
  • Stack/cabal caches between 300MB and 1GB each

test and style are the new split in this MR. common is the old cache from before. Clearly, splitting test and style does not make any difference in terms of cache size - everything in style is also in test. On the flip-side, merging the loadtest into it is great, the cache is not really bigger.

Overall, we only have 10 GB, so we need to use the cache a lot less. I think we should do the following:

  • Only store caches on the main branch.
  • Restore caches on main and in PRs.
  • The caches will have fixed names and will be replaced by a new commit on main. No prefix matching.
  • Merge style and test again.

This way we can use the 10 GB we have for:

  • 3.1 GB for macos
  • 2.7 GB for tools
  • 2.3 GB for stack
  • 1 GB for Cabal
  • Sum: 9.1 GB

By only storing caches when on main we avoid having the main cache be evicted when a PR changes dependencies. This will make CI fast for those PRs which only touch code, those will use the cache. PRs which touch dependencies or nix will be slower, because they only use cachix. This seems like the most useful thing to do here.

Edit: The above calculation does not take the "static" cache into account, which is another 1 GB. So the cache would already be at it's maximum with those...


To avoid having to rebuild the dynamic postgrest build in every nix job, before we run all the tests on it, we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs. This comes in at around 200 MB, so we should still have room for those, even when pushing them in PRs. We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction. I.e. caches from main should always be evicted last, PR cached first. Will need to figure out whether that's possible.

wolfgangwalther avatar Feb 06 '24 19:02 wolfgangwalther

we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs

Ooh, this one's good. Should've thought of this myself. Looks like something action/cache can easily handle.

We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction.

The current eviction policy is that the oldest caches are deleted first, so all we could do is to create the big Nix cache last somehow. Let's see if we do hit that limit first.

  • Garbage collect nix/stack/cabal caches?

On the one hand, it makes sense for Nix at lease, on the other — there were full Nix cache rebuilds and from what I've noticed, the cache size doesn't seem to increase significantly, if ever; it was 3-something GB all the time.

  • Stack/cabal caches between 300MB and 1GB each

Speaking of Stack caches, somehow Linux cache is 380M, MacOS is 680M and Windows is 1GB, I wonder why?

develop7 avatar Feb 07 '24 11:02 develop7

Speaking of Stack caches, somehow Linux cache is 380M, MacOS is 680M and Windows is 1GB, I wonder why?

Yeah, that's something I wondered about, too. I tried stack locally, but I'm getting sizes of more than 2 GB for my ~/.stack folder. This might be what we're seeing on Windows as a slightly compressed cache, though...

I think the difference is that stack would download GHC in those cases, and we'd cache that, too. Which, I think, doesn't make much sense - downloading and installing GHC via stack is by far the fastest piece. Caching the dependencies is the important bit, though.

Maybe we can do better with some combination of ghcup and --system-ghc or so.

wolfgangwalther avatar Feb 07 '24 11:02 wolfgangwalther

The current eviction policy is that the oldest caches are deleted first, so all we could do is to create the big Nix cache last somehow. Let's see if we do hit that limit first.

The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

wolfgangwalther avatar Feb 07 '24 11:02 wolfgangwalther

The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

Interestingly, all PR caches from yesterday have already been evicted while main caches are still there. So maybe there is already some built-in prio between those.

wolfgangwalther avatar Feb 07 '24 11:02 wolfgangwalther

Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.

wolfgangwalther avatar Feb 07 '24 11:02 wolfgangwalther

Interestingly, all PR caches from yesterday have already been evicted while main caches are still there.

Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

develop7 avatar Feb 07 '24 12:02 develop7

Maybe we can do better with some combination of ghcup and --system-ghc or so.

We could even set stack up with ghcup, and configure stack in a way it'll use ghcup to install GHC (actually ghcup bootstrap script does it unless you ask it to), and if we cache .ghcup separately, we could share cached GHC between cabal and stack builds. With GHC tarballs are 200 to 300M big, times 3 platforms, this would result in another ~1GB saved cache.

develop7 avatar Feb 07 '24 12:02 develop7

we could share cached GHC between cabal and stack builds.

Not really, they use different GHC versions. Stack is on GHC 9.4.5, Cabal on 9.4.8, 9.6.4 and 9.8.1.

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.

Yeah, I think I've suggested it somewhere in comments to nix-cache-related PRs. The eviction policy doesn't mention access time, though, but that seems like the best effort.

develop7 avatar Feb 07 '24 13:02 develop7

Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

Ah, I thought we'd have the case with this PR. But this PR only re-uses caches from main, so that just kept those and deleted everything else...

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

Not really, they use different GHC versions

Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 is in Stackage for a month or so

develop7 avatar Feb 07 '24 13:02 develop7

Excellent point on caching GHC and dependencies in non-nix builds separately

develop7 avatar Feb 07 '24 13:02 develop7

Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 is in Stackage for a month or so

But not available on FreebBSD, yet:

https://github.com/PostgREST/postgrest/pull/3155#discussion_r1451744469

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

So the idea would be to consistently:

  • Install GHC / Stack / Cabal via ghcup. Every time, no cache.
  • Cache dependencies for Stack/Cabal.
  • Cache the actual build for Stack/Cabal separately. This avoids invalidating the dependencies cache for regular code changes.

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

We could probably also just throw out the Linux x64 cabal build with GHC 9.4.8. We are building with 9.4.8 in nix, with 9.4.5 via stack on various platforms and with 9.4.8 on arm via Cabal. That should be enough for 9.4.x series...

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

@develop7 do you see any way of "not storing the cache" with cache-nix-action? I think this is possible in v5, but we have reverted to v4. Maybe we should try v5 again - might not have been buggy at all, maybe we just had our caches invalidated for other reasons or so?

wolfgangwalther avatar Feb 07 '24 13:02 wolfgangwalther

do you see any way of "not storing the cache" with cache-nix-action?

@wolfgangwalther perhaps, use restore action only? That's what the cache-nix-action is using internally anyway.

develop7 avatar Feb 07 '24 14:02 develop7

@develop7

perhaps, use restore action only? That's what the cache-nix-action is using internally anyway.

Ah, right I remember why I looked at this and then didn't do it: This would require us to put the save part manually in each place where we currently use the setup-nix action. I didn't find a way to add a "post job hook" or something like that in a composite action, yet.

However, v5 of cache-nix-action has the save input which can control this. Using this would allow to keep things simple - just using setup-nix would be enough.

wolfgangwalther avatar Feb 07 '24 17:02 wolfgangwalther

I looked at the "Build MacOS (Nix)" job a little bit, because it by far takes the most time to run (compared to other nix jobs). Both on main (with cache) and in this PR (without cache, because the key was changed). The numbers are:

  • The last run on main took 14m 2s. 13m 37s were spent in the "Setup Nix Action" - most of that to download the actions cache of around 3.8 GB.
  • The last run in this PR took 15m 26s, of which 12m 26s were spent downloading from cachix. Nothing was built. "Setup Nix Action" took 2m 46s.

Assuming the 2m 46s are the "base" time for nix setup without cache download, downloading the actions cache took around 11m here. Compare that to 12m 30s for cachix... and I start to wonder why do we bother with caching this stuff via github actions at all? Cachix seems fast enough to me for that purpose.

I can see the github actions cache be useful to:

  • cache stack and cabal
  • cache postgrest-build for later stages of the same workflow

But I seriously question how much we are gaining from caching nix in the actions cache again.

As an experiment, I added a commit to replace the actions cache for this job with something else: Basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time..

wolfgangwalther avatar Feb 07 '24 20:02 wolfgangwalther

I start to wonder why do we bother with caching this stuff via github actions at all?

Nix caching does shave off a minute or two per Nix-involving job, which shortens the feedback loop accordingly; usually to these two minutes, due to CI jobs running in parallel. For the macOS case, where there is a single cache consumer job and low-performing runner, it certainly does make sense to drop cache altogether.

This is my first approach to GitHub Actions and their cache in particular, and, being honest, it is performing worse than I've expected, especially Windows/Mac runners (download speed is barely 100Mb/sec (down to abysmal 20MB/sec even), while in Linux runners it's 380Mb/s sometimes). With the dedicated infrastructure, I expect the improvement to be more significant, but currently it might not be worth the effort.

develop7 avatar Feb 08 '24 15:02 develop7

As an experiment, I added a commit to replace the actions cache for this job with something else: Basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time..

So that didn't work at all in CI, while locally it was fine. I have no idea, yet, what's happening there.

I didn't find a way to add a "post job hook" or something like that in a composite action, yet.

This is done now - I was able to keep it all in the setup-nix action.

wolfgangwalther avatar Feb 08 '24 21:02 wolfgangwalther

On the one hand, it makes sense for Nix at lease, on the other — there were full Nix cache rebuilds and from what I've noticed, the cache size doesn't seem to increase significantly, if ever; it was 3-something GB all the time.

After #3142 was merged the static build as failing on main with a full disk. It seems like the problem is this:

  • With the actions cache approach, the full cache is downloaded - including all the old derivations.
  • Then nix-build builds the new derivations.

This means that at some point both old and new derivations will be on disk. Compare that to the pure-cachix approach, where the old derivations will never be downloaded and just the new derivations will be built - much smaller chance of running the disk full.

In fact, I think for nix the actions cache is actually harmful, not helpful.

Nix caching does shave off a minute or two per Nix-involving job, which shortens the feedback loop accordingly; usually to these two minutes, due to CI jobs running in parallel. For the macOS case, where there is a single cache consumer job and low-performing runner, it certainly does make sense to drop cache altogether.

This is considering the best-case: Nothing changes and the full actions cache can be re-used. Yes, in this case it can be faster. But as more derivations need to be rebuilt, there are no advantages anymore. Consider the other extreme: A full rebuild. Prefix matching the old cache would mean we first load all the old derivations - and then start the full rebuild. This results in the full disk issue mentioned above, but also is just a waste of time - the actions cache should have never been downloaded, waste of time.


I just merged one of the commits to main which reduces the number of prefixes to restore caches from. This seems to fix it for the moment - but I think that's not a full fix, yet. It just works now, because the caches changed id and thus no caches were restored. We still have the prefix-based restore on main though and will run into the same problems very soon, possibly on the next commit, when the cache is restored again.

wolfgangwalther avatar Feb 09 '24 18:02 wolfgangwalther

I think what we should do is:

  • Use github actions cache only for non-nix stuff. stack/cabal dependencies and code, but maybe also dist-newstyle for the nix-based test environment.
  • Use cachix for nix.
  • Improve cachix performance by drastically reducing the closure size of our tools and dependencies.
  • Improve some of the jobs by not downloading everything from cachix, but only what is required. I think this could work for the full macos build and for the static build.

wolfgangwalther avatar Feb 09 '24 18:02 wolfgangwalther

It just works now, because the caches changed id and thus no caches were restored. We still have the prefix-based restore on main though and will run into the same problems very soon, possibly on the next commit, when the cache is restored again.

This turned out to be true - the next commit was failing with exactly the same error. I now pushed a commit to just remove the actions cache for nix entirely. This should fix CI for now and we can then investigate further how to improve CI performance here.

wolfgangwalther avatar Feb 10 '24 11:02 wolfgangwalther

I merged some of the simple fixes that I made along the way here, to main already.

wolfgangwalther avatar Feb 10 '24 19:02 wolfgangwalther

I have a lot of open TODOs, there is nothing to merge from me right now. The code still in this PR is not done from my POV, yet.

Plus: Literally, the MacOS job is failing in this state...

wolfgangwalther avatar Feb 14 '24 16:02 wolfgangwalther

I mean, the parts you have done already could be extracted to separate PR and merged, making this PR smaller and easier to review in the end.

develop7 avatar Feb 14 '24 17:02 develop7

I focused on improvements to the stack and cabal caching here and decided against trying to play with more nix-related github actions caching. Instead I will try https://github.com/PostgREST/postgrest/pull/3364#discussion_r1545788821 to speed up the macos nix job.

wolfgangwalther avatar May 20 '24 11:05 wolfgangwalther