build-image icon indicating copy to clipboard operation
build-image copied to clipboard

Better error handling of `nvm` and `nodejs.org/dist`

Open ehmicky opened this issue 5 years ago • 7 comments

We run nvm install to check what the latest matching Node.js version is (e.g. NODE_VERSION 12 would match 12.18.3), and download it unless it is available locally:

https://github.com/netlify/build-image/blob/450c24b893036c2813a0b0d0a2a05bcdd81935a4/run-build-functions.sh#L221

Even if the desired Node.js version is already available locally, we still need to use nvm install (and make an HTTP request) to check whether a newer matching Node.js version is available. However, if the command fails because nodejs.org is down, we should default to using a local Node.js version instead, providing it matches the desired Node.js version. At the moment, we fail instead, which created an outage on 2020-08-30.

https://github.com/netlify/build-image/blob/450c24b893036c2813a0b0d0a2a05bcdd81935a4/run-build-functions.sh#L233

ehmicky avatar Aug 31 '20 18:08 ehmicky

Doing exactly what @ehmicky describes would have greatly reduced (perhaps to the extent that nobody noticed - us OR customers) the impact of our incident last night. It exactly describes the root cause of the "outage" which paged several people and wasted a half hour of their night since it was neither caused by us nor actionable by us.

cf #incident-2021-02-19-high-percentage-of-build-failures and https://www.notion.so/netlify/2021-02-18-High-percentage-of-build-failures-880aebbdffa549d28d7316e6ad062487

cc @verythorough this may get upped in priority as a followup from the outage, not sure yet.

nolessafool avatar Feb 19 '21 16:02 nolessafool

From discussing with @fool and @keiko713, we are proposing tweaking the solution above. Instead of always running the default Node.js version, we would instead try to run the same major Node.js version that users are choosing.

From fairly recent data collection, it appears that:

  • Almost all of our users are not specifying the Node.js minor/patch version, only the major version
  • Most of our users are using the currently supported Node.js major releases, plus the last major release before that.

Based on this, we could:

  • [ ] Download the latest version of Node.js 10, 12, 14 and 15 using nvm, when Dockerfile is built. I.e. those versions would be cached in the build image.
  • [ ] When a build is using nvm install that fails, check the error message (alternative: ping https://nodejs.org/dist. Note: I have a library that does just that, which includes automatic retries) to see if nodejs.org/dist is down.
  • [ ] If it is down, then use the cached Node.js version (or the closest to it).

This would increase the build image uncompressed size by 100MB to 150MB.

ehmicky avatar Feb 19 '21 17:02 ehmicky

Linking for context https://github.com/actions/setup-node#v2. In v2 setup-node:

  1. Keep images up to date with latest version of each LTS
  2. Use their own repository to fetch versions
  3. Fallback to fetch from https://nodejs.org/dist
  4. Cache versions locally too

So it seems the suggested solution is to do the same except 2. which sounds good to me

erezrokah avatar Feb 21 '21 08:02 erezrokah

Another option could also be to use the NVM_NODEJS_ORG_MIRROR environment variable which allows adding mirrors. This would remove any need to cache the Node.js binaries ourselves, by relying on mirrors instead.

ehmicky avatar Feb 17 '22 14:02 ehmicky

Note: this problem also happens ~20 times per day due to transient network errors when nvm is fetching https://nodejs.org/dist/index.json, which would also be solved by this issue.

ehmicky avatar Feb 18 '22 21:02 ehmicky

Draft PR at https://github.com/netlify/build-image/pull/750 for adding a mirror.

ehmicky avatar Mar 03 '22 17:03 ehmicky

#750 was closed due to security concerns with using external mirrors. This means we should instead go with caching the binaries, as explained in the above comments.

ehmicky avatar Mar 24 '22 15:03 ehmicky