Integrity checks for R2 migration

Open UlisesGascon opened this issue 2 years ago • 5 comments

TL;DR

We will change the way we serve the binaries, so we want to ensure that the binaries are properly migrated. Additionally, we can take this opportunity to have some scripts (potentially GH actions) that we can use to check if the binaries are fine and the releases are correct.

Historical Context

We have been suffering from cache problems for a while:

  • https://github.com/nodejs/build/issues/3424
  • https://github.com/nodejs/TSC/issues/1416
  • https://github.com/nodejs/build/issues/3410

It seems like the long-term solution will be to relocate the binaries to R2:

  • https://github.com/nodejs/build/issues/3461

Implementation

I started building a simple GitHub Action that collects all the releases and generates the URLs for all the available binaries. It then performs a basic HTTP request using curl to check the response headers. After that, it generates some metrics based on this and presents a simple report in markdown format.
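
Roughly, the check works like the sketch below. This is a hedged illustration, not the action's actual code: the per-release file names are an illustrative subset (the real action covers every available binary per release and renders a markdown report), and it assumes `jq` is available.

```bash
#!/usr/bin/env bash
# Rough sketch of the availability check: list releases from the public
# index, build a few representative download URLs per release and inspect
# only the response headers (no body is downloaded).
set -euo pipefail

BASE="https://nodejs.org/dist"

# Every release version, e.g. "v20.5.1".
versions=$(curl -fsSL "$BASE/index.json" | jq -r '.[].version')

for v in $versions; do
  # Illustrative subset of per-release artifacts.
  for f in "SHASUMS256.txt" "node-$v.tar.gz" "node-$v-linux-x64.tar.xz"; do
    # -I sends a HEAD request: headers only, then the connection is closed.
    status=$(curl -sI -o /dev/null -w '%{http_code}' "$BASE/$v/$f")
    echo "$status $BASE/$v/$f"
  done
done
```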

When I presented this proof of concept in Slack, the collaborators provided super useful feedback and suggested features that we could implement.

Current approach

The idea of using a cron job to collect availability metrics may not be very effective for the cache-issue scenario, but there are many features that can still be valuable to us.

Features requested/ideas

  • Add support for iojs.org/dist as NVM depends on it (@ljharb)
  • Verify the R2 cutover (@flakey5 @MattIPv4 @ovflowd)
  • Store the SHAs for the files and validate that they do not change (@MattIPv4)
  • Check that the SHASUMS256 files are correctly signed (@UlisesGascon)
  • Check the binaries (@MattIPv4 @UlisesGascon) (a minimal sketch of these checks follows this list)
    • Checksums match the release SHASUMS256
    • Binaries described in the SHASUMS256 are available
    • Binaries are excluded from malware databases, using VirusTotal
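
As a very rough illustration of the signature and checksum checks, here is a minimal sketch; the version and file name are examples only, and it assumes the Node.js release keys are already imported into the local GPG keyring.

```bash
#!/usr/bin/env bash
# Minimal sketch of the SHASUMS256 signature check and the binary checksum
# check. VERSION and FILE are illustrative; the release keys are assumed to
# already be in the local GPG keyring.
set -euo pipefail

VERSION="v20.5.1"                       # example release
FILE="node-$VERSION-linux-x64.tar.xz"   # example binary
BASE="https://nodejs.org/dist/$VERSION"

curl -fsSLO "$BASE/SHASUMS256.txt"
curl -fsSLO "$BASE/SHASUMS256.txt.asc"
curl -fsSLO "$BASE/$FILE"

# 1. The .asc file is the clearsigned checksum list: gpg fails if it was not
#    signed by a key in the keyring or was altered after signing.
gpg --verify SHASUMS256.txt.asc

# 2. Confirm the downloaded binary matches its entry in the checksum list.
grep " $FILE\$" SHASUMS256.txt | sha256sum -c -
```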

I will request to transfer the repo to the Node.js org when the code is stable and documented; the current code is quite hacky.

Next steps

I have started to consolidate the feedback into issues:

  • [x] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/4
  • [x] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/9
  • [x] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/7
  • [x] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/6
  • [ ] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/8
  • [ ] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/3
  • [ ] https://github.com/nodejs/admin/issues/821
  • [ ] https://github.com/UlisesGascon/nodejs-distribution-system-monitoring/issues/5

Discovery

Some things bubbled to the surface while implementing the systematic checks:

  • https://github.com/nodejs/build/issues/3468
  • https://github.com/nodejs/build/issues/3463

UlisesGascon avatar Aug 22 '23 08:08 UlisesGascon

While I appreciate the effort I have some concerns.

I think you're trying to check two separate issues:

  1. The integrity of the files. e.g. are the SHASUMS properly signed and do the files match the SHAs?
  2. Whether the URL(s)/webserver is responding.

We currently do a very limited version of 1. in validate-downloads which only checks the binaries for the most recent versions of Node.js 16, 18 and 20 using jenkins/download-test.sh. It runs once per day (or on demand if manually run in Jenkins).

Cases where the files do not match the SHAs published in the SHASUMS:

  • Something went wrong in the release process. This only needs to be a one time check.
  • The files were not uploaded fully to the server (e.g. the disk filled up). Again only needs to be a one time check validating the file was uploaded correctly.
  • The webserver/cache service is misbehaving.
  • Someone or some process with access inadvertently tampers with the files. We mitigate this by gating access -- even releasers do not have permissions to change the releases on the server once seven days have passed (the seven-day window was originally because some platforms (e.g. arm32) were slow and released after the other platforms -- we haven't actually had phased platform releases in a long time; I think we even removed the bits from the release guide that mentioned this).
  • The infrastructure has been compromised and a malicious actor tampers with the files. In this case they'd likely be able to also modify the SHASUMS files. In mitigation we also fully publish the signed SHASUMs in the release blog posts on the website, so an attacker would also need to compromise the website and the website's GH repository.

For 2. we already know that we have cache purge issues that can affect any number of the download URLs -- extra monitoring that checks every existing asset URL over HTTP would contribute negatively to the server load (even if it retrieves just the headers, since connections still need to be made to the server).

> I started building a simple GitHub Action that collects all the releases and generates the URLs for all the available binaries. It then performs a basic HTTP request using curl to check the response headers.

I hope this has rate limiting implemented -- this will be hundreds of files/HTTP requests.
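
For reference, one hypothetical way to keep such a crawl gentle is to issue the header-only requests sequentially with a pause between them; `urls.txt` below is a placeholder for the generated URL list, not something the action necessarily uses.

```bash
# Hypothetical throttling sketch: sequential header-only requests with a
# pause between them; urls.txt stands in for the generated URL list.
while read -r url; do
  curl -sI -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
  sleep 0.5   # roughly two requests per second
done < urls.txt
```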

richardlau avatar Aug 22 '23 12:08 richardlau

Thanks a lot for the feedback @richardlau! :)

> We currently do a very limited version of 1. in validate-downloads which only checks the binaries for the most recent versions of Node.js 16, 18 and 20 using jenkins/download-test.sh. It runs once per day (or on demand if manually run in Jenkins).

I was not aware of this job, and it basically covers a lot of the things that I was expecting to cover, so there are fewer things on my to-do list. 👍

> Cases where the files do not match the SHAs published in the SHASUMS:

Only one case is relevant here: the infrastructure has been compromised and a malicious actor has tampered with the files.

We can check if the shasum files were modified. I already collect and update them when new releases are added. You can find them here. Then I can check if any of the checksums have changed and/or if the signatures are valid (in case of additions, aka new releases).

This way, we ensure that immutability is still in place and that there is no tampering with the new additions. The number of HTTP requests is quite low because the binary checksums are collected from the SHASUMS; the script only downloads the SHASUM files.

This can be a weekly job, executed on the weekends.
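
A hypothetical sketch of that weekly immutability check is below; the `shasums/<version>/` layout is an assumption for illustration, not the actual repository structure.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the weekly immutability check: compare each stored
# SHASUMS256.txt against a fresh download and flag any release whose
# published checksums changed. The shasums/<version>/ layout is assumed.
set -euo pipefail

BASE="https://nodejs.org/dist"

for stored in shasums/*/SHASUMS256.txt; do
  version=$(basename "$(dirname "$stored")")   # e.g. v20.5.1
  if ! curl -fsSL "$BASE/$version/SHASUMS256.txt" | diff -q "$stored" - >/dev/null; then
    echo "WARNING: SHASUMS256.txt for $version changed upstream"
  fi
done
```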

> For 2. we already know that we have cache purge issues that can affect any number of the download URLs -- extra monitoring that checks every existing asset URL over HTTP would contribute negatively to the server load (even if it retrieves just the headers, since connections still need to be made to the server).

> I hope this has rate limiting implemented -- this will be hundreds of files/HTTP requests.

It ran over the weekends for a while, and I have already removed the cron schedule. However, it can still be executed manually, either on a local machine or by triggering the workflow in GitHub. I believe we can use this script for the R2 migration to ensure that all the binaries are transferred and that all the URLs are functioning correctly. Please note that the script only checks the headers and closes the connection; it does not attempt to download the binaries.

UlisesGascon avatar Aug 22 '23 13:08 UlisesGascon