atom icon indicating copy to clipboard operation
atom copied to clipboard

Always restore cache

Open aminya opened this issue 5 years ago • 61 comments

Edit: turns out npm ci deletes all the restored cache. This PR means we should decide between the reliability of CI vs the time it takes to bootstrap

We should restore the node_modules cache even if we change package.json. npm install is smart enough to not redo everything from scratch! In this case, however, we should not skip the bootstrap phase completely. It will automatically become faster due to npm's smartness.

This branch is based on #31. Needs rebase and adding it for other platforms for final merging.

aminya avatar Jul 10 '20 03:07 aminya

Okay, we can try this.

Better to use fallbacks though. Hits the more specific match if it can, then falls back to a more generic/loose match if possible.

Keep everything in the key: field.

Add restoreKeys: for the more generic/looser matches.

https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#restore-keys

DeeDeeG avatar Jul 10 '20 03:07 DeeDeeG

Something like this:

          key: 'npm | "$(Agent.OS)" | "$(buildArch)" | apm/package.json, apm/package-lock.json, script/vsts/platforms/windows.yml'
          restoreKeys: |
             npm | "$(Agent.OS)" | "$(buildArch)" | apm/package.json, apm/package-lock.json
             npm | "$(Agent.OS)" | "$(buildArch)"

Oh and apparently we can cache the "system global npm package cache" somehow, not just the installed node_modules.

DeeDeeG avatar Jul 10 '20 03:07 DeeDeeG

Something like this:

          key: 'npm | "$(Agent.OS)" | "$(buildArch)" | apm/package.json, apm/package-lock.json, script/vsts/platforms/windows.yml'
          restoreKeys: |
             npm | "$(Agent.OS)" | "$(buildArch)" | apm/package.json, apm/package-lock.json
             npm | "$(Agent.OS)" | "$(buildArch)"

Oh and apparently we can cache the "system global npm package cache" somehow, not just the installed node_modules.

Done in https://github.com/atom-ide-community/atom/pull/33/commits/b766569979bf04513db55cd51aafaef938212e23

aminya avatar Jul 10 '20 04:07 aminya

Oh and apparently we can cache the "system global npm package cache" somehow, not just the installed node_modules.

https://stackoverflow.com/questions/14836053/how-can-i-change-the-cache-path-for-npm-or-completely-disable-the-cache-on-win

npm config set cache [some_dir_in_the_working_directory_of_the_pipeline]

If we set this we can cache one more part of CI to run strictly from Azure servers. (Avoid network calls to npmjs.com).

DeeDeeG avatar Jul 10 '20 04:07 DeeDeeG

Done in https://github.com/atom-ide-community/atom/pull/33/commits/b766569979bf04513db55cd51aafaef938212e23

Nice

DeeDeeG avatar Jul 10 '20 04:07 DeeDeeG

Oh and apparently we can cache the "system global npm package cache" somehow, not just the installed node_modules.

https://stackoverflow.com/questions/14836053/how-can-i-change-the-cache-path-for-npm-or-completely-disable-the-cache-on-win

npm config set cache [some_dir_in_the_working_directory_of_the_pipeline]

If we set this we can cache one more part of CI to run strictly from Azure servers. (Avoid network calls to npmjs.com).

Where do we need the global npm cache? It is better to first fix the current caching system before trying new folders.

aminya avatar Jul 10 '20 04:07 aminya

We still have cache misses: https://dev.azure.com/atomcommunity/atomcommunity/_build/results?buildId=117&view=logs&j=2985f0af-e798-5fdc-91b8-be9f0a3685c5&t=d0984c8c-0243-5740-7c87-f72756b34e5a&l=36

aminya avatar Jul 10 '20 04:07 aminya

Where do we need the global npm cache?

https://docs.microsoft.com/en-us/azure/devops/pipelines/build/variables?view=azure-devops&tabs=yaml#agent-variables

$(Agent.BuildDirectory)/npmcache?

It is better to first fix the current caching system before trying new folders.

:+1: Agreed.

DeeDeeG avatar Jul 10 '20 04:07 DeeDeeG

We still have cache misses

Maybe the new runs will upload their caches as the strict identifier AND the loose identifiers. So we might have plenty of chances to hit cache going forward. I'd like to see if that's what happens at the cache save for the current run.

By the way, we should make sure it still bootstraps after an inexact cache hit, right?

I think we are already set to do that, because cacheHitVar will be set to inexact, not true. (Bootstrap only skips if all three cache hit vars resolve to true).

DeeDeeG avatar Jul 10 '20 04:07 DeeDeeG

I removed bootstrap skipping in https://github.com/atom-ide-community/atom/pull/33/commits/6413b1591f491cc044776c06223cad99734be80b. If we want to use cache when we change package.json, etc, we need to remove this line.

aminya avatar Jul 10 '20 04:07 aminya

If we get an exact cache restore for all three, then we do want to skip bootstrapping IMO.

It would be a plain win to revert that, IMO. Bootstrap skipping is nice.

It only skips when all three are perfectly exactly matched. I was thinking ahead to inexact hits/restores I guess, but never followed up. You are on a roll with these by the way.

We are going to give upstream a bit of a heart attack if we try to post these to them though, lol. No one likes an outside contributor saying "Hey, I can contribute a change to the thing that lets you know if you broke your app!"

And even for our own sake we should watch this closely that it doesn't, well, break our CI to do inexact cache restores and re-bootstrap on top of that.

Anyway, if it interests us, no reason we can't do it. Upstream and us don't have to say exactly close. Just close enough that we can track and integrate on top of their work. And hopefully upstream stuff to benefit the widest community possible. (And hey, upstreaming helps to ensure they don't break our stuff, lol. So we make it "our [collective] stuff" and help them out. Open source is beautiful.)

DeeDeeG avatar Jul 10 '20 04:07 DeeDeeG

If we get an exact cache restore for all three, then we do want to skip bootstrapping IMO.

It would be a plain win to revert that, IMO. Bootstrap skipping is nice.

It only skips when all three are perfectly exactly matched.

OK. I reverted that. I changed the target branch for simpler diff.

aminya avatar Jul 10 '20 04:07 aminya

Maybe we can cache Windows build tools and dependencies... I'm just afraid we're going to have nothing but 39 cache steps pretty soon.

DeeDeeG avatar Jul 10 '20 05:07 DeeDeeG

Maybe we can cache Windows build tools and dependencies... I'm just afraid we're going to have nothing but 39 cache steps pretty soon.

Those don't take much time. I prefer caching for something that takes a lot of time. If we could write the cache steps in a more compact way, we could try that.

see https://github.com/atom-ide-community/atom/issues/1#issuecomment-656485638

aminya avatar Jul 10 '20 05:07 aminya

I just realized we haven't run the PR pipeline on master in... a very long time. So there is no cache saved for master branch. Which drastically reduced the chance of a cache hit. I'm running the PR pipeline against master now.

I think we should manually run the PR Pipeline for master branch every time we merge something that touches a file in script/vsts/platforms/*.yml

Edit: That would be hard to remember, though. We may as well run it on a schedule. Once every day maybe.

DeeDeeG avatar Jul 10 '20 19:07 DeeDeeG

@DeeDeeG We can set a pull request trigger for the master branch. That should be possible in Azure UI.

aminya avatar Jul 10 '20 23:07 aminya

@aminya Good idea! It should also be possible in the yaml if you prefer. Something like

pr:
- master

https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#pr-triggers

DeeDeeG avatar Jul 11 '20 00:07 DeeDeeG

I would like to look at caching the npm package cache, because "why not?"

Something like:

- task: Cache@2 [restore the npm package cache here]
- script:
  npm config set cache $(Agent.BuildDirectory)/CI_npmcache
  [do "npm install" here]

Should it be in a separate PR if I get it working?

DeeDeeG avatar Jul 11 '20 05:07 DeeDeeG

Should it be in a separate PR if I get it working?

OK Sure, Make a PR to this branch (always_restore_cache). We can see if it speeds up the things.

aminya avatar Jul 11 '20 05:07 aminya

Here it is: https://github.com/atom-ide-community/atom/pull/43

DeeDeeG avatar Jul 11 '20 07:07 DeeDeeG

For the historical record: to the best of my knowledge, this PR was at commit https://github.com/atom-ide-community/atom/commit/e1929b3a882f294c837b4cf75b1e2263e60b9dfe before #46 was merged into master, and before this PR was then rebased on top of master again.

For convenience (and to keep that commit and its parents from being pruned), this branch exists to mark said commit, at least for now: pr-33-before-pr-46

DeeDeeG avatar Jul 16 '20 14:07 DeeDeeG

@DeeDeeG So are you sure that if we change package.json, the bootstrap step is not skipped here?

aminya avatar Jul 25 '20 17:07 aminya

@DeeDeeG bump. If the answer is yes, I want to merge this.

aminya avatar Jul 26 '20 21:07 aminya

Looking into it to get the proper, exact answer. I have forgotten the detail since this was last brought up.

DeeDeeG avatar Jul 26 '20 21:07 DeeDeeG

Off-topic:

While doing the research to give you a solid answer here, I came across this, which seems weird and is probably a bad idea.

https://github.com/atom/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/bootstrap#L35-L36

Modifying the PATH to remove MSBUILD.exe? Why?? I don't think this is a good idea.

One of 212 (!!!) commits in this massive refactoring PR with multiple authors: https://github.com/atom/atom/pull/12410

No real explanation is given in the PR that I could see at a glance, though the comments/commits there are many. I'd like to see if deleting this helps with #44.

DeeDeeG avatar Jul 26 '20 22:07 DeeDeeG

The PR is made by the main author of Atom. I can't complain. 😄

I think the reason is to prevent MSBuild to build the native apps. https://github.com/atom/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/lib/delete-msbuild-from-path.js#L14

aminya avatar Jul 26 '20 23:07 aminya

@DeeDeeG So are you sure that if we change package.json, the bootstrap step is not skipped here?

Yes. It will be an "inexact cache hit"; Even just one "inexact" or missed cache restore is enough to trigger bootstrapping.


Full technical breakdown of what's going to happen with caching/bootstrapping:

If we change [repo_root]/package[-lock].json, apm/package[-lock].json, or script/package[-lock].json, then bootstrapping still happens.

This line is the condition for if bootstrapping will happen. Bootstrapping will happen unless all three caches (all three node_modules folders) were restored with an exact match. [1] [2] https://github.com/atom-ide-community/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/vsts/platforms/templates/bootstrap.yml#L31

If there were any missed or "inexact" cache restores, any node_modules folders that were able to be restored are there when bootstrapping begins. (You could have apm/node_modules but no script/node_modules, for example.)

Then, there is a check that the "dependency fingerprint" recorded in [repo_root]/node_modules/.dependencies-fingerprint (possibly restored from cache) matches your current repo/machine on the following details:

If this check fails, all the cache-restored node_modules folders are cleaned away (deleted) and bootstrap happens fresh.

https://github.com/atom-ide-community/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/bootstrap#L31-L33

(Total cleaning like this will only happen if a [repo_root]/node_modules/.dependencies-fingerprint file is found, and happens to be "outdated"/not a match per the above set of details. So, if [repo_root]/node_modules has been restored AND Electron was updated, or CI's Node version was updated, or the bundled apm version string was updated. Otherwise, cache-restored node_modules folders will NOT be cleaned before bootstrapping.)

Thankfully, script/node_modules will always be cleanly installed if boostrapping happens, due to the ci command... https://github.com/atom-ide-community/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/lib/install-script-dependencies.js#L13 But in that case, restoring it when either of the other two caches was missed or "inexact" is totally pointless. It will be restored, then cleaned and reinstalled with npm ci.

I am concerned that apm's native modules won't be rebuilt, since it isn't installed with npm ci, but rather npm install. https://github.com/atom-ide-community/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/lib/install-apm.js#L9-L12 In my experience, an existing apm installation doesn't get its native code rebuilt if you do npm install in the apm/ directory.

But then when apm is used to build core packages, the apm ci command is used, so those do get rebuilt again.

https://github.com/atom-ide-community/atom/blob/09e4af2d99f0c488face9888af6dde602d95e344/script/lib/run-apm-install.js#L14

Again, since ci is being used, restoring the [repo_root]/node_modules directory is also a waste of time if either of the other two caches is missed or "inexact". Restoring, deleting and reinstalling is slower than just fresh installing.

DeeDeeG avatar Jul 27 '20 05:07 DeeDeeG

I want to think through if there is a reasonable way to add conditional cache restores based on other cache hits being exact. So that we don't restore a cache just to have npm ci delete everything we restored and reinstall it over again.

I'm also somewhat concerned we can have a situation where CI is stuck with a stale cached install of apm that won't be rebuilt. This could lead to errors like this:

Error: The module 'module_name_here'
was compiled against a different Node.js version using
NODE_MODULE_VERSION [older/newer ver]. This version of Node.js requires
NODE_MODULE_VERSION [currently installed ver]. Please try re-compiling or re-installing

Or other subtle behavior differences of apm.

Generally want to think over and confirm this approach isn't counter-productive and ultimately could even be slower. I'm not sure yet.

DeeDeeG avatar Jul 27 '20 05:07 DeeDeeG

Here is a list of possible outcomes when there are inexact cache hits, or cache misses:

List of possible outcomes

All Exact Hits

st at mt == skip bootstrapping ✅

One or more inexact cache hits (no cache misses)

st ai mt == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌ st at mi == re-run, scripts restore-wipe-reinstalled ❌, apm exact and not rebuilt ❌, main restore-wipe-reinstalled ❌ st ai mi == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌

si at mt == re-run, scripts restore-wipe-reinstalled ❌, apm exact and not rebuilt ❌, main restore-wipe-reinstalled ❌ si ai mt == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌ si at mi == re-run, scripts restore-wipe-reinstalled ❌, apm exact and not rebuilt ❌, main restore-wipe-reinstalled ❌ si ai mi == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌

One or more cache misses (no inexact cache hits)

st am mt == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main restore-wipe-reinstalled ❌ st at mm == re-run, scripts restore-wipe-reinstalled ❌, apm exact and not rebuilt ❌, main installed fresh ✅ st am mm == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main installed fresh ✅

sm at mt == re-run, scripts installed fresh ✅, apm exact and not rebuilt ❌, main restore-wipe-reinstalled ❌ sm am mt == re-run, scripts installed fresh ✅, apm installed fresh ✅, main restore-wipe-reinstalled ❌ sm at mm == re-run, scripts installed fresh ✅, apm exact and not rebuilt ❌, main installed fresh ✅

Three cache misses

sm am mm == re-run, scripts installed fresh ✅, apm installed fresh ✅, main installed fresh ✅ (This is a standard clean bootstrap)

Combination of cache misses and inexact cache hits

One inexact cache hit, one or two cache misses

si at mm == re-run, scripts restore-wipe-reinstalled ❌, apm exact and not rebuilt ❌, main installed fresh ✅ si am mt == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main restore-wipe-reinstalled ❌ si am mm == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main installed fresh ✅

st ai mm == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main installed fresh ✅ sm ai mt == re-run, scripts installed fresh ✅, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌ sm ai mm == re-run, scripts installed fresh ✅, apm inexact and not rebuilt ❌, main installed fresh ✅

st am mi == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main restore-wipe-reinstalled ❌ sm at mi == re-run, scripts installed fresh ✅, apm exact and not rebuilt ❌, main restore-wipe-reinstalled ❌ sm am mi == re-run, scripts installed fresh ✅, apm installed fresh ✅, main restore-wipe-reinstalled ❌

Two inexact hits, one cache miss

si ai mm == re-run, scripts restore-wipe-reinstalled ❌, apm inexact and not rebuilt ❌, main installed fresh ✅ si am mi == re-run, scripts restore-wipe-reinstalled ❌, apm installed fresh ✅, main restore-wipe-reinstalled ❌ sm ai mi == re-run, scripts installed fresh ✅, apm inexact and not rebuilt ❌, main restore-wipe-reinstalled ❌

s[?] means the script/node_modules directory cache a[?] means the apm/node_modules directory cache m[?] means the "main" or [repo_root]/node_modules directory cache.

[?]t means "true" or exact cache hit. [?]i means "inexact"; an inexact cache hit [?]m means "miss"; a cache miss

✅ means something good happened, with respect to reproducible builds with no stale packages. Also, this check mark means no time was wasted restoring a cache that would get wiped out during the bootstrap. ❌ means something non-optimal happened. Either apm's native code isn't rebuilt (could become stale and not match the state of the repo/CI environment), OR time is wasted restoring a cache that will be wiped during bootstrapping.

DeeDeeG avatar Jul 27 '20 15:07 DeeDeeG

As that list of possibilities shows, allowing any inexact cache hits has a downside. Due to fairly consistent use of npm ci in the bootstrap script in a CI environment, any inexact restored cache is simply wiped out before being reinstalled clean. Thus, restoring the [repo_root]/node_modules or script/node_modules caches are a waste of time if any cache is missed or inexact.

We can "solve" this by allowing bootstrapping to be skipped if all three caches are restored, regardless of if they were exact or inexact. Currently only three exact cache hits will skip bootstrapping. The downside: How will we know if stale dependencies are causing our CI failures? How will we have confidence in the results? How will we debug anything?

In any case, this restoreKey makes no sense: npm | "$(Agent.OS)" | "$(BUILD_ARCH)". If there is an outdated package.json or package-lock.json, there is no reason to restore the cache. Either it is wiped away by npm ci, or we have restored incorrect dependencies and are going to use them in the final build/test.

npm ci isn't used with apm right now, because it seems to conflict with the "dedupe" postinstall script that the atom-package-manager package runs right after being installed. This means restored apm caches aren't wiped out, so perhaps the cache restore isn't a waste of time here... but that could be a problem with inexact cache restores, as native code is not rebuilt when running a simple npm install in a directory where the packages have already been installed to node_modules. Allowing inexact restore of apm/node_modules might mean we error out with a "wrong node version" error. Or other related, but subtler wrong behavior.

DeeDeeG avatar Jul 27 '20 15:07 DeeDeeG