Suggestion: better approach to package and version management

Open PeterMylemans opened this issue 3 years ago • 27 comments

Also relates to #153 and the clean up already done by @mattmoor (thanks btw!). During further investigation I've stumbled on some work in rules_pkg github repository [1] that looks like an improved version of the deb management that is currently in the distroless workspace.

Advantages:

  • It has a nice tool [2] that automates rewriting the workspace using buildozer instead of regular expressions, which are a poor man's solution.
  • A lot more readable compared to the current approach in distroless
  • Performance of downloading dependencies is much better and more stable due to the use of builtin bazel workspace rules instead of a custom python program.
  • Support for 3rd party debian repositories when required. This could open up distroless to support more recent versions of the software stacks included in the scope, while still using the stable foundation of debian for the base packages.
  • Fixes the two step build process (no longer required to build a package_manager due to workspace rules usage)

Things to improve:

  • it is not possible at the moment to have multiple sources (e.g. debian, debian-updates and debian-security) while evaluating the most recent version, so the code needs to be adapted to support that. The work required is not too difficult. A syntax proposal is outlined below.
  • from what I can tell it is not packaged in rules_pkg, but instead lives as a separate workspace in the same git repo.
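To make the multiple-sources point a bit more concrete: picking "the most recent version" across debian, debian-updates and debian-security mostly comes down to comparing Debian version strings. The following is only a rough Python sketch, not code from rules_pkg. It follows the comparison rules as I understand them from Debian policy (epoch, then upstream version, then revision, with `~` sorting before everything), assumes ASCII version strings, and all helper names and the `candidates` data are made up for illustration.

```python
from functools import cmp_to_key


def _order(ch):
    """Sort weight of one character inside a non-digit run."""
    if ch == "~":
        return -1           # '~' sorts before everything, even end of string
    if ch.isdigit():
        return 0
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256    # punctuation sorts after letters


def _compare_part(a, b):
    """Compare two upstream (or revision) strings; returns <0, 0 or >0."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit runs character by character.
        while (i < len(a) and not a[i].isdigit()) or (j < len(b) and not b[j].isdigit()):
            wa = _order(a[i]) if i < len(a) else 0
            wb = _order(b[j]) if j < len(b) else 0
            if wa != wb:
                return wa - wb
            i += 1
            j += 1
        # Compare the digit runs numerically; an empty run counts as 0.
        si, sj = i, j
        while i < len(a) and a[i].isdigit():
            i += 1
        while j < len(b) and b[j].isdigit():
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return na - nb
    return 0


def _split(version):
    """Split '[epoch:]upstream[-revision]' into (epoch, upstream, revision)."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a, b):
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return ea - eb
    return _compare_part(ua, ub) or _compare_part(ra, rb)


def newest(candidates):
    """candidates: list of (source, version) tuples; returns the newest one."""
    return max(candidates, key=cmp_to_key(lambda x, y: compare_versions(x[1], y[1])))


# Hypothetical data: the same package seen in two sources.
candidates = [
    ("buster", "10.3+deb10u5"),
    ("buster-updates", "10.3+deb10u6"),
]
print(newest(candidates))
```

The update tool would run something like `newest` once per package and write the winning URL and sha256 into the workspace, so the build itself never has to compare versions.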

@mattmoor @chanseokoh @aiuto feel free to review this idea. I'm happy to start a PR in either repository if we agree this is a good way forward.

References:
[1] https://github.com/bazelbuild/rules_pkg/blob/main/deb_packages/WORKSPACE
[2] https://github.com/bazelbuild/rules_pkg/tree/main/deb_packages/tools/update_deb_packages

http_file(
    name = "buster_archive_key",
    sha256 = "9c854992fc6c423efe8622c3c326a66e73268995ecbe8f685129063206a18043",
    urls = ["https://ftp-master.debian.org/keys/archive-key-10.asc"],
)
http_file(
    name = "buster_security_archive_key",
    sha256 = "4cf886d6df0fc1c185ce9fb085d1cd8d678bc460e6267d80a833d7ea507a0fbd",
    urls = ["https://ftp-master.debian.org/keys/archive-key-10-security.asc"],
)

deb_packages(
    name = "debian",
    arch = "amd64",
    distro = "buster",
    packages = {
        "base-files": "http://ftp.debian.org/debian/pool/main/b/base-files/base-files_10.3+deb10u6_amd64.deb",
        "busybox": "http://ftp.debian.org/debian/pool/main/b/busybox/busybox_1.30.1-4_amd64.deb",
        "ca-certificates": "http://ftp.debian.org/debian/pool/main/c/ca-certificates/ca-certificates_20200601~deb10u1_all.deb",
        "fontconfig-config": "http://ftp.debian.org/debian/pool/main/f/fontconfig/fontconfig-config_2.13.1-2_all.deb",
    },
    packages_sha256 = {
        "base-files": "ed640f8e2ab4e44731485ac7658a269012b9318ec8c6fb7b2b78825a624a9939",
        "busybox": "1e32ea742bddec4ed5a530ee2f423cdfc297c6280bfbb45c97bf12eecf5c3ec1",
        "ca-certificates": "794bd3ffa0fc268dc8363f8924b2ab7cf831ab151574a6c1584790ce9945cbb2",
        "fontconfig-config": "9f5d34ba20eb156ef62d8126866a376be985c6a83fdcfb33f12cd83acac480c2",
    },
    sources = [
        "buster_archive_key http://ftp.debian.org/debian buster main",
        "buster_archive_key http://ftp.debian.org/debian buster-updates main",
        "buster_security_archive_key http://ftp.debian.org/debian-security buster/updates main",
    ],
)

PeterMylemans avatar Sep 30 '20 13:09 PeterMylemans

I think this is a fantastic idea. In this case, if there's a reliable and well-maintained external facility, I think there's no need to maintain a custom solution. Completely removing the custom python has been what I've been hoping for a very long time.

from what I can tell it is not packaged in rules_pkg, but instead lives as a separate workspace in the same git repo.

I'm not really familiar with the Bazel ecosystem. What does it mean? Is it like "experimental", "alpha", or "use at your own risk"? I think it can be troublesome if it's not well maintained.

chanseokoh avatar Sep 30 '20 13:09 chanseokoh

cc @dlorenc

mattmoor avatar Sep 30 '20 14:09 mattmoor

I have the same question about maintainability. Currently it is not part of the published package and should be consumed as a git source repository dependency with a prefix to include only the deb_packages subdirectory.

Honestly, I would have expected to have a rules_deb_repository repo with its own release cycle, tagging and packaging. But maybe I'm just biased towards a polyrepo approach.

In any case I'm not a Bazel community expert either.

PeterMylemans avatar Sep 30 '20 14:09 PeterMylemans

@petermylemans can we ask them to publish the rule and support it?

chanseokoh avatar Sep 30 '20 16:09 chanseokoh

The current content of rules_pkg is confusing, and I would like to fix that. Most repos under bazelbuild (https://github.com/bazelbuild) named rules_X contain rules for creating X. The few that are for consuming rather than producing have entirely different names or some modifier (e.g. rules_jvm_external).

So, I would like to see rules_pkg just contain rules to create packages, not consume them. Along that line, anything for building docker images should be in rules_docker because people do work on that. Also, I don't have the knowledge to review anything involving docker, either producing or consuming. Someone else must own it.

So, where do we put improved consumption rules?

  • If there was someone willing to really own rules_pkg/deb_packages/..., that would be fine for the short term, but I can't offer any help beyond modifying the CODEOWNERS file to auto-assign PRs to someone.
  • Can we put improvements in distroless?
  • Should we spin up a new repository?
    • under bazelbuild requires a Googler to be the primary owner. Someone who knows the problem space will have to do the work on that
    • in a new place - that is fine with me. We just have to figure out a copyright-friendly way to reuse the work in rules_pkg/deb_packages so we don't have to clean-room reinvent it.

Basically, I am fine with any solution, but I can't do any of the work, for the reasons above.

aiuto avatar Sep 30 '20 16:09 aiuto

Sounds to me that the deb repository rule should live in rules_docker.

They already have rules to deal with installing packages into containers using package manager runtimes such as Apt.

However, these seem to "execute" the package manager in a running container using a docker runtime. So the result is not always idempotent, depending on the installation process. Functionally the resulting image will be the same, but digests might differ. In my experience, it is a lot easier to manage transitive dependencies this way, but it also pulls in more dependencies than is needed for constructing an application runtime.

We could go in two phases:

  1. Put the rule in distroless
  2. Migrate the rule to rules_docker once the improvements are hammered out.

PeterMylemans avatar Oct 01 '20 11:10 PeterMylemans

Sounds to me that the deb repository rule should live in rules_docker

cc @dmarting (once upon a time, they were built in/for rules_docker 😉 )

mattmoor avatar Oct 01 '20 13:10 mattmoor

Isn't it that deb_packages has nothing to do with Docker? It just downloads deb files from a package mirror and makes them available to use, which other users outside the container context can leverage, right? I may be missing something, but I am not sure why it should live in rules_docker.

We could go in two phases:

  1. Put the rule in distroless
  2. Migrate the rule to rules_docker once the improvements are hammered out.

You mean basically like forking the code into distroless (and then wait for it to be upstreamed, and migrate once it's officially supported)? If so, I am not sure it's a sustainable solution to us (distroless).

chanseokoh avatar Oct 01 '20 13:10 chanseokoh

Yeah, in practice it takes a very long time to unfork these kinds of things and rely on the new upstream. Bazel not solving the diamond dependency problem exacerbates this.

mattmoor avatar Oct 01 '20 14:10 mattmoor

I realize now I did not directly address @chanseokoh's question "I'm not really familiar with the Bazel ecosystem. What does it mean? It is like "experimental", "alpha", or "use at your own risk"? I think it can be troublesome if it's not well maintained."

github.com/bazelbuild/rules_pkg/deb_packages exists but no one is maintaining it in any way. The only work being done in rules_pkg is on the low level packaging side (making tarballs, RPMs and debs). IMO, deb_packages is neither experimental, alpha, nor use-at-your-own-risk. It is classic abandonware.

aiuto avatar Oct 01 '20 14:10 aiuto

Also agree...I've been trying my best to propose a move away from Bazel not just because of this, but the mere impossibility of air gapped/dark site dependency inclusion and injection of vulnerabilities from its poor tracking of dependencies.

The more you build in go on Bazel for example...the more vulnerabilities in the actual container entrypoint it exponentially propagates. It only exists in my environment for this and envoy...and is the sole reason for even having openJDK and its slew of dependencies....so much for a microservices minimalist mentality.

A deb packaging...rootfs creation..output process would allow anyone to use any package manager or process of their choice, so if I want Dockerfiles to create the base image I can as an example and use buildah to package it as an OCI tar.

smijolovic avatar Oct 01 '20 19:10 smijolovic

@chanseokoh You are right, deb_repository handling is not specific to docker / containers. I was only suggesting it to live there, because it is the main use case today. But I guess it can be extended to any kind of software that requires consuming deb packages.

Maybe it is better to spin up a new repo: e.g. rules_deb_repository under the same license (Apache 2.0). I'm ok to do this, maintain it and publish it e.g. under github.com/petermylemans/rules_deb_packages and add you as collaborator to avoid the "bus factor".

Maybe it would be better under an organization: in any case the repo can always be transferred should it become a requirement.

PeterMylemans avatar Oct 02 '20 10:10 PeterMylemans

Pinging @jayconrod here, in case he has insights.

justaugustus avatar Oct 03 '20 04:10 justaugustus

I've done most of the small changes (the current version of rules_pkg/deb_packages has issues with the current version of bazel) and updates required here: https://github.com/petermylemans/rules_deb_packages. That could replace distroless' "python module" with an improved module that makes use of bazel builtin support for downloading (and caching) remote archives.

You can have a look at the example and/or README.md

But what still bothers me though is related to the fact that bazel promotes including all dependencies as source. People seem to solve this by providing a dependencies macro (the deps.bzl / repositories.bzl / ...), for consuming projects.

@mattmoor I've hijacked some of the approach as used in rules_docker. But I still got the issue that e.g. github.com/bazelbuild/buildtools has done a recent change that makes it incompatible with some older versions of gazelle (and rules_docker by extension). This results in a delicate balance of versions...

For tooling this seems a bit strange to me. I would expect to be able to just include prebuilt binaries for the tools used in consuming applications (like distroless). So they don't need to bother with managing dependencies for a simple tool used for keeping versions up to date. But then I get a chicken and egg problem at the rules_deb_packages side, as its dependencies macro would need to provide http_archive repo rules for its own supporting binaries?

Does anybody have experience dealing with this, or am I proposing crazy ideas here?

PeterMylemans avatar Oct 03 '20 18:10 PeterMylemans

I wonder how feasible it might be to use something like goreleaser to build binaries and attach them as release artifacts, and then have a downstream step construct a .bzl file that folks could pull down for that release in workspace 🤔

Generally it is possible (and in the case of WORKSPACE tooling required) to pull down prebuilt binary tooling, but you want to make sure you build it for all the downstream platforms that it runs on (I recently hit this with linux/arm64).

Seems like it'd be a fun pattern now that github actions exist, they didn't the last time I was deep in Bazel land.

mattmoor avatar Oct 03 '20 19:10 mattmoor

I used a regular shell script for now, but it implements the same idea. Another point off the list. :smile:

Next up: I'll go for a draft PR to see if the approach is what we want or not.

PeterMylemans avatar Oct 10 '20 17:10 PeterMylemans

@mattmoor @chanseokoh can you have a quick look at the draft PR to see if it's on track? The WORKSPACE file seems to grow somewhat, but I guess that is normal due to the amount of debian packages and architectures being processed.

The "magic" of selecting the right package has been moved to the update deb packages process, so basically the urls and sha256 are stored in the WORKSPACE file.

Deb downloads are a LOT faster and more stable. I suspect that switching to the main debian CDN has something to do with that (as it is no longer proxying through snapshots).

PeterMylemans avatar Oct 10 '20 19:10 PeterMylemans

Deb downloads are a LOT faster and more stable. I suspect that switching to the main debian CDN has something to do with that (as it is no longer proxying through snapshots).

One of the nice, presumably intended, consequences of distroless using snapshot.debian.org is that one can recreate a given distroless image even when the main Debian release archives have moved on.

Will the changes proposed in #614 remove that?

joshuagl avatar Nov 02 '20 13:11 joshuagl

In the month of October - distroless builds were down for about 10 days. Meanwhile, debootstrap worked flawlessly every single day. This is where the arguments about debian builds just don't hold water. This bazel process is frankly the most unstable and non-reproducible build process I have seen in years...not to mention a nightmare for air-gapped.

I still don't understand why this is necessary. This should move to a much simpler minbase debootstrap, package removal, and injections of cacerts/group/passwd/nsswitch/os-release files. The process for building debian-base and debian-iptables is stable. That actually works. If it ain't broke....

It's VERY troubling how baseimages has moved to an unstable process with a bloated package manager that increases the security threat profile tremendously and requires internet connectivity to build. This doesn't bode well for the future longevity of kubernetes if this is a dependency.

smijolovic avatar Nov 02 '20 17:11 smijolovic

Deb downloads are a LOT faster and more stable. I suspect that switching to the main debian CDN has something to do with that (as it is no longer proxying through snapshots).

One of the nice, presumably intended, consequences of distroless using snapshot.debian.org is that one can recreate a given distroless image even when the main Debian release archives have moved on.

Will the changes proposed in #614 remove that?

Deb packages remain available, if only from the pool in http://archive.debian.org/ instead of deb.debian.org. That is why both are included in mirrors for the pool urls. The snapshot basically handles the caching of coherent "release package files", but that is mostly useful when using a package manager such as apt or the apt-simulator in debootstrap. The apt-simulator in distroless followed the same practice at build time. Now this resolution of packages is done at "update tool" time, so the build can work with a fixed set of explicitly versioned dependencies (instead of a dynamic, but stable one).

In case of distroless using the bazel build tool, it might as well download (and cache) the pinned versions directly from the debian mirror pool and validate the sha256 sums for correctness. This is pretty similar to how a http_file repository works and that matches better with how bazel is designed to work as well. This would also fit in nicely with how Bazel deals with airgapped builds today.
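Roughly, the idea is the following. This is only a Python illustration of "try the pinned URL on each mirror and check the sha256", not what Bazel does internally; `fetch_pinned`, the injected `fetch` callable and the example URLs are all made up for the sketch.

```python
import hashlib


def fetch_pinned(urls, expected_sha256, fetch):
    """Try each mirror in order; return the first payload whose SHA-256 matches.

    `fetch` is any callable that takes a URL and returns bytes (or raises
    OSError). It is injected so the sketch does not depend on an HTTP client.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:
            errors.append(f"{url}: {exc}")
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest == expected_sha256:
            return data
        errors.append(f"{url}: sha256 mismatch ({digest})")
    raise RuntimeError("no mirror produced the pinned artifact:\n" + "\n".join(errors))


# Usage with a fake downloader: the first mirror 404s, the second one works.
payload = b"fake deb contents"
pinned = hashlib.sha256(payload).hexdigest()


def fake_fetch(url):
    if "deb.debian.org" in url:
        raise OSError("404 Not Found")
    return payload


urls = [
    # Hypothetical package paths, for illustration only.
    "http://deb.debian.org/debian/pool/main/e/example/example_1.0_amd64.deb",
    "http://archive.debian.org/debian/pool/main/e/example/example_1.0_amd64.deb",
]
assert fetch_pinned(urls, pinned, fake_fetch) == payload
```

Because the hash is pinned, it does not matter which mirror ends up serving the file; a mirror that serves something else is simply skipped.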

While I can understand @smijolovic's frustration with the current system, leaving bazel behind is akin to rewriting most of what is in the distroless project. That is a lot to ask from a project that is mostly community driven. My honest opinion (as someone who is not responsible for this repo) on this: since distroless' inception new tools have come to light that look promising, but are at the same time still finding their place in the ecosystem. I'm usually the first to rewrite things and move forward (and I see a lot of merit in tools like buildah or cloud native buildpacks), but in this case I would proceed with some caution: lots of promises of "the new silver bullet in container building", but time will tell.

That is why I would fix the python based packager now (that broke bazel's builtin repository handling to some degree) and keep an eye out for the future of alternative build tools.

PeterMylemans avatar Nov 08 '20 16:11 PeterMylemans

Thanks for the response Peter. If I'm understanding your explanation correctly, I don't think it's quite working as you expect.

If I clone your repo and run $ bazel build //base:static_nonroot_amd64_debian10 I get the following (trimmed) error:

ERROR: no such package '@packages_amd64_debian9//debs': java.io.IOException: Error downloading [http://deb.debian.org/debian/pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u265-b01-0+deb9u1_amd64.deb, http://deb.debian.org/debian-security/pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u265-b01-0+deb9u1_amd64.deb, http://archive.debian.org/debian/pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u265-b01-0+deb9u1_amd64.deb, http://archive.debian.org/debian-security/pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u265-b01-0+deb9u1_amd64.deb] to /home/joshuagl/.cache/bazel/_bazel_joshuagl/6b17bc35729cb244efa1a69ee18e7f1c/external/packages_amd64_debian9/debs/1a576428b61c9671cab4072f6d7b1b70027307be882ea0e4ed23ed0c6683e3d2.deb: GET returned 404 Not Found
INFO: Elapsed time: 2.397s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
    currently loading: base

I believe this means openjdk-8-jdk-headless_8u265-b01-0+deb9u1_amd64.deb is not available from archive.debian.org or deb.debian.org?

If I first run bazel run update_deb_packages the build is able to proceed.

In the diff of packages_amd64_debian9.bzl I see pool/updates/main/o/openjdk-8/openjdk-8-jre-headless_8u265-b01-0+deb9u1_amd64.deb has been replaced with pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u272-b10-0+deb9u1_amd64.deb.

joshuagl avatar Nov 09 '20 12:11 joshuagl

Mmm I'll have to look into this later this week.

Thanks for the heads-up: I'll come back on this.

PeterMylemans avatar Nov 09 '20 15:11 PeterMylemans

Is there a way with:

bazel build //package_manager:dpkg_parser.par

To specify only a certain image to load? Most often the build fails while loading all of the packages for every architecture and Debian version.

smijolovic avatar Nov 10 '20 06:11 smijolovic

FWIW my understanding of the Debian package archives is as follows.

  • deb.debian.org provides mirrors of the archives on ftp.debian.org, which host the current releases (Jessie/8, Stretch/9, and Buster/10)
  • archive.debian.org hosts older releases (before Jessie/8)
  • a package repository (ftp. or archive.) has package indices (Packages.[gz|xz|bz2]) per release-repository-architecture combination, in release/repo/arch, e.g. debian/dists/buster/main/binary-amd64
  • The packages themselves all live in the pool directory, in sub-directories by <repository>/<beginning of package name>/<package name>, e.g. ca-cacert packages for all current releases live in debian/pool/main/c/ca-cacert
  • The pool contains all versions of a package referenced by a package index (Packages.[gz|xz|bz2])
  • When a package version is no longer listed by a package index, it is reaped from the pool

Thus, per my understanding at least, without using snapshot.debian.org it will only ever be possible to fetch the most recent version of a package for a release.
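The layout described above can be written down as a small Python sketch. The function names are mine and purely illustrative. Two details are assumptions beyond the list: as far as I know the pool directory is named after the source package (which is why fontconfig-config sits under f/fontconfig in the URLs earlier in this thread), and source packages starting with "lib" are sharded by their first four characters rather than the first one.

```python
def pool_dir(source_package, component="main", root="debian"):
    """Directory in the pool that holds every .deb built from a source package."""
    # "lib*" packages use a four-character shard (e.g. libx), others one character.
    if source_package.startswith("lib"):
        shard = source_package[:4]
    else:
        shard = source_package[:1]
    return f"{root}/pool/{component}/{shard}/{source_package}"


def index_dir(release, component, arch, root="debian"):
    """Directory that holds the Packages.[gz|xz|bz2] index for one combination."""
    return f"{root}/dists/{release}/{component}/binary-{arch}"


print(pool_dir("ca-cacert"))
print(pool_dir("fontconfig"))
print(index_dir("buster", "main", "amd64"))
```

The index under `dists/` decides which files in the pool are still referenced, which is why a pinned pool URL can disappear once the index moves on to a newer version.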

I tried to find some document(s) that described the above but didn't have much luck. There is some description of the Debian repository format here, but it does not describe all of the above: https://wiki.debian.org/DebianRepository/Format

joshuagl avatar Nov 10 '20 11:11 joshuagl

@joshuagl I can confirm that you are correct. :+1: Good catch and thanks for the investigation! The ftp behaviour was a bit surprising to me to be honest.

I'll rework the PR to use the snapshot repos as a mirror instead of the regular deb.debian.org.

PeterMylemans avatar Nov 11 '20 07:11 PeterMylemans

Work and life caught up with me in the last months, but I did have a look at the snapshot option.

After going back and forth a bit, using snapshots always ends up quite close to what we have already.

Maybe best to revisit this after the multi-arch changes, as they are bound to conflict anyway.

PeterMylemans avatar Jan 31 '21 16:01 PeterMylemans

😯 They actually fixed a bug on snapshot.debian.org: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=960304

Hopefully this gives more stability.

chanseokoh avatar Aug 04 '21 13:08 chanseokoh