Go Modules in packages

Open samuelkarp opened this issue 2 years ago • 0 comments

In order to provide core functionality for the operating system, Bottlerocket includes a number of third-party software packages. Bottlerocket builds these software packages from source code and stores the source code in a “lookaside” cache. The lookaside cache helps ensure that builds are reproducible and reduces the load that Bottlerocket builds place on upstream source repositories. Bottlerocket’s build system (buildsys) expects sources to be declared under the package.metadata.external-files key of an individual package’s Cargo.toml. By default, it downloads the specified source code from the lookaside cache; during package development it can optionally fall back to the upstream source. Buildsys then runs each package build in an environment with network access disabled, which prevents build scripts from retrieving any source code that was not declared in the Cargo.toml file and downloaded from the lookaside cache.

A small (but growing) number of the third-party software packages included in Bottlerocket are written in the Go programming language. These include components critical to running containers such as runc, containerd, Docker (in some variants), and the orchestrator agents (Kubelet and ECS agent).

In the early days of the Go ecosystem, “vendoring” was common. Vendoring is the practice of a software project copying the source code of its dependencies into its own source tree. Vendoring became popular in the Go ecosystem because the Go tooling (formerly) provided little other support for managing dependencies; copying was a reliable way to ensure the correct versions of dependencies were available when compiling a project. More recent versions of Go added native dependency management with “modules”, in which the Go tool is responsible for retrieving the necessary dependency sources at build time. Modern Go programs use Go modules and declare their dependencies in go.mod and go.sum files; some maintain “legacy” vendor directories, but the ecosystem is gradually moving away from vendoring.

Most of the Go programs included in Bottlerocket (runc, containerd, Docker, Kubelet, ECS agent) currently employ the vendoring practice upstream. Consequently, the process to ingest sources for these programs is fairly simple: copy the upstream source tarball published by the project into Bottlerocket’s lookaside cache and build from there. However, some upstream dependencies are more modern and do not vendor their own dependencies. For these we need to answer the question: how do we retrieve and store those dependencies so that they can be used during a build?

Option A

A small number of Bottlerocket programs today do not use vendoring and instead declare individual dependencies in their Cargo.toml and store them in the lookaside cache. This works okay for programs with a small number of dependencies, or for programs whose set of dependencies and their versions do not change frequently. Option A would be to continue this practice: manually resolve each dependency, retrieve the source tarball, upload it to the source cache, and declare it in the package’s Cargo.toml.

Option B

The Go tool has a built-in mechanism to retrieve dependencies for a program and materialize them into a source tree, effectively translating the build-time retrieval process into the more familiar vendoring process. go mod vendor creates a new vendor folder containing all of those dependencies, and uses the go.sum file as well as (optionally) an online checksum database to verify each dependency during download. Instead of manually resolving each dependency, retrieving the source tarball, uploading to the lookaside cache, and duplicating the go.mod/go.sum information in the Cargo.toml file, Option B would be to run go mod vendor, create a tarball of the resulting vendor folder, and upload that single tarball to the lookaside cache. This reduces the overhead of maintaining a package and turns an update into a two-step process (get new upstream sources and run go mod vendor) as opposed to the N-step update required for Option A (where N is the total number of dependencies). However, it also changes the lookaside cache from a cache (that can be skipped) into a canonical storage location for a transformed artifact.

Option C (my recommendation)

Builds of packages in Bottlerocket are orchestrated by a system called buildsys. Buildsys is built on top of Cargo (for dependency tracking) and Docker (for build isolation) and is the component responsible for retrieving source code from the source cache. Go’s module support has a similar ability to retrieve cached sources from a hosted Go module proxy (one is provided by Google and is the default in the Go ecosystem) through the go mod download and go mod vendor commands. Option C would involve modifying buildsys to allow it to retrieve Go modules from the default Go module proxy and/or from upstream git source repositories and map those into the build environment that it generates.
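For a sense of what retrieving a module from a proxy involves, the sketch below builds the download URL for a module version following the GOPROXY protocol, in which a module's source archive lives at <proxy>/<module>/@v/<version>.zip and uppercase letters in the path are written as "!" followed by the lowercase letter. This is not the proof-of-concept code, and how buildsys would actually integrate such requests is not shown here; the helper names escapeCase and proxyURL are hypothetical, and the escaping is simplified (the real golang.org/x/mod/module package also rejects invalid paths).

```go
package main

import (
	"fmt"
	"strings"
)

// escapeCase applies the module proxy's case encoding: every uppercase ASCII
// letter is replaced by '!' followed by its lowercase form. Unlike the real
// golang.org/x/mod/module helpers, this sketch does not validate its input.
func escapeCase(s string) string {
	var b strings.Builder
	for _, r := range s {
		if 'A' <= r && r <= 'Z' {
			b.WriteByte('!')
			b.WriteRune(r + 'a' - 'A')
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// proxyURL returns the URL of one artifact for module@version on a module
// proxy. ext is "zip" (source archive), "mod" (go.mod file), or "info"
// (version metadata).
func proxyURL(proxy, module, version, ext string) string {
	return strings.TrimSuffix(proxy, "/") + "/" +
		escapeCase(module) + "/@v/" + escapeCase(version) + "." + ext
}

func main() {
	const defaultProxy = "https://proxy.golang.org"
	fmt.Println(proxyURL(defaultProxy, "github.com/BurntSushi/toml", "v1.0.0", "zip"))
	fmt.Println(proxyURL(defaultProxy, "golang.org/x/sys", "v0.1.0", "mod"))
}
```

Since the URL is a pure function of the module path and version already recorded in go.mod and go.sum, a fetcher along these lines would not need any additional per-dependency declarations.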

While Go modules provide built-in source reproducibility through checksum verification of the sources, this approach does add a new dependency on a Go module proxy like the default one (or on the upstream git repositories if the proxy is skipped).
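The checksum verification mentioned above is based on the "h1:" hashes recorded in go.sum. As a rough illustration of what is being verified, the sketch below reimplements the h1 scheme using only the standard library: a SHA-256 over a sorted list of per-file SHA-256 digests and file names, base64-encoded. The authoritative implementation is Hash1 in golang.org/x/mod/sumdb/dirhash; the function name hashH1 and the sample file contents here are made up, and real module hashes use file names of the form module@version/path.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"sort"
)

// hashH1 computes an "h1:" style hash over a set of files, given as a map
// from file name to content. It hashes each file with SHA-256, writes one
// "<hex digest>  <name>\n" line per file in sorted name order into an outer
// SHA-256, and base64-encodes the result.
func hashH1(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	outer := sha256.New()
	for _, name := range names {
		sum := sha256.Sum256(files[name])
		fmt.Fprintf(outer, "%x  %s\n", sum, name)
	}
	return "h1:" + base64.StdEncoding.EncodeToString(outer.Sum(nil))
}

func main() {
	// Hypothetical module contents, for illustration only.
	files := map[string][]byte{
		"example.com/a@v1.0.0/go.mod": []byte("module example.com/a\n"),
		"example.com/a@v1.0.0/a.go":   []byte("package a\n"),
	}
	fmt.Println(hashH1(files))

	// Any change to any file changes the hash.
	files["example.com/a@v1.0.0/a.go"] = []byte("package a // modified\n")
	fmt.Println(hashH1(files))
}
```

Because the hash depends only on file names and contents, it gives the same answer whether the sources came from the proxy or directly from an upstream git repository, which is what allows either to be used without weakening the integrity guarantee.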

I’ve written a proof of concept of this option and will open a pull request demonstrating it.

samuelkarp, Apr 01 '22 03:04