gradle-node-plugin icon indicating copy to clipboard operation
gradle-node-plugin copied to clipboard

Make npmInstall and npmSetup tasks cacheable

Open bercik opened this issue 5 years ago • 40 comments

Currently npmInstall and npmSetup tasks are not cacheable by gradle build cache, thus when running CI pipelines on different machines those tasks needs to fully run every time which consumes a lot of time.

Please consider making those tasks cacheable. Instructions can be found here: https://docs.gradle.org/current/userguide/build_cache.html#enable_caching_of_non_cacheable_tasks

bercik avatar Mar 06 '20 13:03 bercik

Long story short, those tasks don't cache very well, however, if you can get a local mirror of node and npm you should be able to get the same experience for nodeSetup/npmSetup.

And if you're using webpack or similar you can set up a task for that with inputs/outputs and cache it, which should help speed things up

deepy avatar Mar 06 '20 13:03 deepy

Let's have a look to these tasks (I add nodeSetup to the list):

  • nodeSetup download Node.js. It uses for that a Gradle repository that automatically caches downloaded artifacts (exactly the same way as it does for Maven artifacts). So it is downloaded only once, but when the task is not up-to-date, it unpacks the archive (which is not very different in terms of performance as using a cache).
  • npmSetup uses npmInstall to install a specific version of npm. See below.
  • npmInstall: as of version 5 (as far as I remember), it caches the downloaded packages to avoid downloading them each time they are installed. This task is also already cached, not by Gradle directly, but by npm.

All those tasks are not directly cacheable by Gradle, but they rely on tools that use cache.

I know this have some limitations. For instance, if the build runs inside a container whose file system is not persistent, the cache will not be reused for next builds by default. This will be also the case for Gradle, but this can be fixed by changing its cache configuration. It would be necessary to do the same thing for also for npm and it is much simpler if all the cache is managed by Gradle. Once the Gradle cache is configured, everything can be cached.

We can assume (it would be interesting to test it though), that npmInstall for instance would be even faster if it was fully cached by Gradle. Indeed, npm install does not only extract the package contents. It also executes some post install scripts (for instance a node-gyp compile). Caching the node_modules directory would enable us to skip this whole process. But I think it could cause multiple issues:

  • Do the post install scripts make change only to the node_modules directory? I don't really know whether there are some conventions regarding this behavior. If it's not true, this means that each npmInstall user would be in charge of configuring perfectly the task output to ensure the cached result is perfectly right.
  • As far as I know, the node_modules directory does not contain exactly the same thing in the different platforms. For instance, node-gyp packages contain some platform-specific binaries. We should probably add the platform as a cache key to ensure the node_modules directory will not be reused across different platforms.
  • There are probably other things I didn't think about.

I think we should make some experiments to ensure that make them cacheable really makes sense and does not create some unwanted side effects.

bsautel avatar Mar 08 '20 15:03 bsautel

nodeSetup and npmSetup is going to be interesting to cache as they both access the same directories and create symbolic links (which are going to be replaced with the file they point to if we use the Gradle cache).

Luckily even if you run on an entire clean machine nodeSetup and npmSetup aren't that heavy, especially if you got distBaseUrl and .npmrc pointing to local mirrors or ones close to you.

deepy avatar Mar 09 '20 07:03 deepy

Thanks for replying.

You are right, nodeSetup and npmSetup don't seem to heavy, but the npmInstall does. We did the following in our project:

  1. We've added custom npmCi task that caches node_modules folder:
task npmCi(type: NpmTask, dependsOn: [npmSetup]) {
    if (System.env['CI'] != null && System.env['CI'] == "TRUE") {
        args = ['ci']
    } else {
        args = ['install']
    }

    inputs.file(file("${project.projectDir}/frontend/package.json")).withPathSensitivity(PathSensitivity.RELATIVE)
    inputs.file(file("${project.projectDir}/frontend/package-lock.json")).withPathSensitivity(PathSensitivity.RELATIVE)

    outputs.dir "${project.projectDir}/frontend/node_modules"
    outputs.cacheIf { true }
}
  1. In .gitlab-ci we are running the task as follows: ./gradlew npmCi --build-cache
  2. We are caching .gradle folder in gitlab-ci:
  cache:
    paths:
      - .gradle

What do you think of that solution? It seems to be working, but I'm a little afraid about the possible problems you've mentioned above.

bercik avatar Mar 09 '20 08:03 bercik

Hi,

I've come up with the following code and it works for me:

npmInstall {
    inputs.files(project.files('package.json'))
    outputs.dir("${project.buildDir}/node_modules/")
}

task buildUI(type: NpmTask, dependsOn: npmInstall) {
    inputs.files(project.files('src'))
    outputs.dir("${project.buildDir}/dist/")
    args = ['run', 'build']
}

build.dependsOn buildUI

I'm running the gradle build and it's running npm install only when the package.json is changed and it runs npm build when the src folder is changed.

nukesz avatar Apr 02 '20 14:04 nukesz

@nukesz normally you don't need to configure the npmInstall as you did. The default configuration already contains those inputs and outputs.

I also would advice you to add the node_modules directory as an input of your buildUi task since a change in your project library will probably change the build output. I would also add the package.json file since you rely on its scripts when running run build. You can also use the NpxTask to avoid needing to declare the build script in the package.json file.

@nukesz, you are not talking about the same thing as @bercik. This issue is about caching, not just task execution avoidance. Task execution avoidance is the fact of skipping the task execution if the inputs did not change since last build and the outputs are still present and did not change. Caching is one step further. As Gradle knows which are the input and outputs of the task, it can save the output in a cache (which can be shared between multiple computers) and reuse the same output stored in the cache instead of running the task when the inputs are identical.

bsautel avatar Apr 03 '20 12:04 bsautel

@bercik is your experiment successful? Does it work as expected? Do you use only local cache or do you share the cache between multiple hosts?

bsautel avatar Apr 03 '20 12:04 bsautel

I would recommend merging npmInstall and build tasks. For example by using in package.json:

"build": "npm ci && ng build

This gives a single task in Gradle. So if the result (the compiled webapp) is already available, the build server can jump over setting up the node_modules directory. Which can save a lot of time.

And npm install should not be used, as it is not reproducible. Rather use npm ci.

remmeier avatar Apr 23 '20 06:04 remmeier

@remmeier if the compiled web app is already available then both npmInstall (which can be configured to use npm ci) and your build task would be skipped as your build task is already up-to-date.

Though in your case there's a caveat with npm ci, which is that it'll delete your node_modules before running.

deepy avatar Apr 23 '20 07:04 deepy

does npm ci not do up-to-date checking? Actually in our project we make use of yarn, but apply the same techniques. This works well. Was hoping for NPM to get to the same level in that regard. With our single task yarn setup we manage to save around 30-60 second with every build per frontend project (which adds up, in particular for multi-project builds).

(there is a second new npm-based project on my side, i probably have some more insights soon in that regard)

remmeier avatar Apr 23 '20 08:04 remmeier

I did some experiments in another issue to know which one of Gradle up-to-date checking or npm install up-to-date checking and it appeared that Gradle was much faster than npm install (see here for Linux and here for Windows). Of course, it probably also depends on the project size (mainly the number of dependencies). But this was not really the result we were expecting.

I did not do this experiment with npm ci nor with yarn, that would be interesting to do it.

bsautel avatar Apr 23 '20 08:04 bsautel

And it sounds like npm ci always starts by removing the node_modules directory. So it does not seem to do any up-to-date checking attempt.

Here is the first line that is printed by npm ci

npm WARN prepare removing existing node_modules/ before installation

bsautel avatar Apr 23 '20 08:04 bsautel

I will have a look later and file a bug report if it really does not to up-to-date checking. I really like yarn & Gradle...

npmInstall and gradle cache restore will heavily depend on where the data is coming from. yarn is able to cache it locally, so it can be fast. The same for gradle. A close-by Gradle remote cache is also not so bad. But the public NPM registry should be avoided at all cost I think for anybody in need of r robust/fast builds. Your Gradle cache results look good for sure.

remmeier avatar Apr 23 '20 08:04 remmeier

the one things really good at yarn is that yarn install is super fast if all is up-to-date (around a second).

remmeier avatar Apr 23 '20 08:04 remmeier

the one things really good at yarn is that yarn install is super fast if all is up-to-date (around a second).

I don't think realistically it can ever beat gradle doing up-to-date checks on your compile webapp task unless you either have a massive app or no dependencies.

deepy avatar Apr 23 '20 10:04 deepy

I did this experiment again using an Angular project with a multiple dependencies (902 packages at node_modules root) on an Ubuntu 20.04 laptop.

Note that I used the version 3 of the plugin (which will be released soon) because I discovered an issue in the version 2.2.3 regarding the yarn task outputs declaration (the task does not run when I remove the node_modules directory, whereas it should).

Command Command execution time Gradle up-to-date check
yarn 1 second 1 second
npm install 8 seconds 1 second
npm ci 40 seconds 1 second

We can see that yarn's up-to-date checking is fast, but not faster that Gradle. For npm install and npm ci, Gradle is clearly faster.

bsautel avatar Apr 26 '20 16:04 bsautel

I could investigate the issue I encountered, and it appears that the up-to-date checking of the yarn task works as expected, I probably did an error in my experiment.

But I discovered that the node_modules directory is declared twice as output, once as a directory and once as a file tree (unless you use the node_modules output filter). It is useless and slows down the Gradle's up-to-date checking. It takes 3 seconds instead of 1 (with the version 3 of the plugin).

I created the issue #96 and fixed it. It now also takes 1 second with the version 2.2.3 and the fix.

bsautel avatar Apr 27 '20 07:04 bsautel

What about Yarn tasks ( YarnSetupTask, YarnInstallTask, YarnTask)? Can build cache be enabled for them?

sergeykad avatar Jul 05 '21 18:07 sergeykad

Your own ad-hoc YarnTasks can absolutely be cached, where I work all our npm run dist takes nearly a minute each (and we got 4 in one project) They cache extremely well, especially when paired with a remote build-cache

deepy avatar Jul 05 '21 20:07 deepy

The problem is that yarnSetup, npmSetup and nodeSetup reference absolute paths, busting the cache. Just don't do that. gradle shouldn't be installing those things anyway.

To fix caching, add this to your root build.gradle:

plugins {
  id "com.github.node-gradle.node" // specify version!
}

import com.github.gradle.node.task.NodeSetupTask
import com.github.gradle.node.yarn.task.YarnSetupTask
import com.github.gradle.node.npm.task.NpmSetupTask

afterEvaluate {
  // disable all setup tasks because they bust the cache
  allprojects {
    [YarnSetupTask.class, NpmSetupTask.class, NodeSetupTask.class].forEach({ setupTask ->
      tasks.withType(setupTask).configureEach({ task ->
        task.configure {
          onlyIf { false }
        }
      })
    })
  }
}

doctorpangloss avatar Jan 25 '22 23:01 doctorpangloss

There's some other workarounds in https://github.com/gradle/gradle/issues/3525 as well, but with symbolic links being resolved to their file instead of being stored I'm not sure we ever could support the build cache in those tasks.

So if you're using download = false you'll get the same effect as the snippet above (but lose out on perfectly reproducible builds)

deepy avatar Jan 26 '22 16:01 deepy

Bumping the thread, since the question I've run into seems to be related.

On Gradle 7.4, npmSetup task shows the following deprecation warning:

Cannot access output property 'npmDir' of task ':npmSetup' (see --info log for details). Accessing unreadable inputs or outputs has been deprecated. This will fail with an error in Gradle 8.0. Declare the task as untracked 
by using Task.doNotTrackState(). Consult the upgrading guide for further information: https://docs.gradle.org/7.4/userguide/upgrading_version_7.html#declare_unreadable_input_output

This is shown only on Windows - on Linux, even in WSL, there's no problem. I guess the source of the error is the symbolic links in bin, since they can't be opened by the ordinary Windows programs too. Does this mean that npmSetup on Windows will always be effectively uncacheable?

Cerber-Ursi avatar Oct 31 '23 08:10 Cerber-Ursi

@Cerber-Ursi I'm unable to reproduce that issue using the example project, with or without download enabled, with or without a npmVersion set https://github.com/node-gradle/gradle-node-plugin/blob/main/examples/simple-node/npm/build.gradle

Can you open a separate issue and provide a standalone project that reproduces the issue?

deepy avatar Dec 19 '23 16:12 deepy

in my experience since my last update, it's better to never cache npmInstall and npmSetup. they are too large for gradle to cache efficiently anyway. however, the yarn-state or yarn-integrity files or similar should be specified as outputs, and if you want to be extra disciplined, node_modules must be set to read only. with yarn, use yarn 2+ and plug'n'play if you want caching.

doctorpangloss avatar Dec 19 '23 17:12 doctorpangloss

Interesting. I've tried to do this again and wasn't able to reproduce the original problem either, including in the real project where it was initially noticed. Will try again later, not sure what exactly has changed in between.

Cerber-Ursi avatar Dec 20 '23 05:12 Cerber-Ursi

EDIT: I made a mistake, the information below is incorrect.


~I tried to enable caching by specifying outputs.cacheIf { true }~

tasks.npmInstall {
    outputs.cacheIf { true }
}

~I think it's relevant that I also have a custom NpmTask~

val npmRunBuild by tasks.registering(NpmTask::class) {
  npmCommand.set(listOf("run", "build"))
  inputs.files("package.json", "webpack.config.js")
    .withPathSensitivity(RELATIVE)
  inputs.dir("src/main")
    .withPathSensitivity(RELATIVE)
  inputs.dir(layout.projectDirectory.dir("node_modules"))
    .withPathSensitivity(RELATIVE)
    outputs.dir("dist")
}

~If I enable additional build cache logging (add org.gradle.caching.debug=true to gradle.properties) and run ./gradlew clean && ./gradlew assemble I can see this warning on the npmInstall task~

> Task :npmInstall
[...]
Non-cacheable because Gradle does not know how file 'node_modules/.package-lock.json' was created (output property 'nodeModulesPackageLock'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

~Specifying the package-lock file as an output seems to help:~

tasks.npmInstall {
    outputs.file(layout.projectDirectory.file("node_modules/.package-lock.json"))
        .withPropertyName("nodeModulesPackageLock")
    outputs.cacheIf { true }
}

~I can then run~

~1. ./gradlew assemble > assemble1.log~ ~2. ./gradlew clean -q~ ~3. ./gradlew assemble > assemble2.log~

~In assemble2.log I can see that npmInstall is cached.~

> Task :npmInstall FROM-CACHE

adam-enko avatar Jan 30 '24 11:01 adam-enko

tasks.npmInstall {
    outputs.file(layout.projectDirectory.file("node_modules/.package-lock.json"))
        .withPropertyName("nodeModulesPackageLock")
    outputs.cacheIf { true }
}

I don't think this does what you think it does. Unless npmInstall contains node_modules/ as an output directory...

  1. delete node_modules/
  2. run npmInstall
  3. observe gradle will restore just node_modules/.package-lock.json from the cache

doctorpangloss avatar Jan 30 '24 17:01 doctorpangloss

I don't think this does what you think it does.

You're right! It wasn't caching as I expected, I misunderstood what was going on.

adam-enko avatar Jan 30 '24 18:01 adam-enko

Hi 👋

I've add an example test of the behaviour I'd like to see in https://github.com/node-gradle/gradle-node-plugin/pull/301

Currently I see warning messages from Gradle that explain why npmInstall can't be cached:

Non-cacheable because Gradle does not know how file 'package-lock.json' was created (output property 'packageLockFileAsOutput'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

Non-cacheable because Gradle does not know how file 'node_modules/.package-lock.json' was created (output property 'nodeModulesPackageLock'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

From a quick look removing these properties as outputs should work

  • packageLockFileAsOutput conflicts with the input property, packageLockFileAsInput
  • nodeModulesPackageLock should be tracked by the getNodeModulesDirectory output

Do these changes seem reasonable? I'm not familiar with NPM development so I would appreciate any insights.

adam-enko avatar Jan 31 '24 09:01 adam-enko

Conceptually you've got the right idea @adam-enko, but there's a few things that makes npm install cache extremely badly or outright broken

Let's start with the easy one, Gradle currently breaks symlink and this makes node_modules unhappy, this is currently being worked on and might get resolved by the linked PR in https://github.com/gradle/gradle/issues/3525

Next one is harder because I don't have the data any longer and never thought to create benchmarks, but when your node_modules becomes "moderately large" you may find that Gradle spends minutes doing the up-to-date check This is especially funny if it turns out to be out-of-date and you then have to wait for npm install to do the same up-to-date check

These might be fine locally, but if you have a remote build cache then publishing a 1-2GB node_modules is likely to cause issues And even at 100-200MB, the cost of pulling it from the cache will likely be bad, especially if you have a npm mirror close to you

And with a remote build cache, what happens if you pull a node_modules using native components built for the wrong platform? Adding the platform to the cache key by default would help this, but unless your node_modules has no native components then I suspect caching it won't make much sense for the above reasons

So to list the problems I remember:

  • npm install already does this check and is pretty fast at this point
  • Gradle's up-to-date checking can get really slow with large outputs and node_modules can easily become large
  • Some tools in the JS ecosystem happily read and write to a folder of their own in node_modules, this used to upset me but I can't find any modern tools doing this so this might not be relevant any longer
  • Every project I've personally come in contact with has included native components, making node_modules platform dependent, this used to upset me because a lot of these didn't build on FreeBSD
  • Caching the build of your app tends to make more sense as it's generally slow and npm install on a moderately up-to-date node_modules can be very quick

With these in mind, most people who provided feedback (and most people polled) preferred the solution where Gradle mostly leaves everything to npm and just keeps track of whether it has run successfully with the current package.json (and lockfiles) and only expecting node_modules/.package-lock.json as output See the fastNpmInstall for more information https://github.com/node-gradle/gradle-node-plugin/blob/16ecc0fa492b65dfa46a559f432ade45ce500b8e/src/main/kotlin/com/github/gradle/node/NodeExtension.kt#L127-L134

So to wrap this up, do you really want to cache node_modules or are you ok with recreating it initially and having it be reported up-to-date like other tasks? i.e. do you want it cached or are you interested in fast builds that work well with the remote cache but with the downside that your CI needs to do an npm install?

deepy avatar Jan 31 '24 10:01 deepy