tracker: logically bound app images
Logically bound images
Current documentation: https://containers.github.io/bootc/experimental-logically-bound-images.html
Original feature proposal text:
We should support a mechanism where some container images are "lifecycle bound" to the base bootc image.
A common advantage/disadvantage of the below is that the user must manage multiple container images for system installs - e.g. for a disconnected/offline install they must all be mirrored, not just one.
In this model, the app images would only be referenced from the base image as .image files or an equivalent.
This contrasts with physically bound images.
bootc logically bound flow
bootc upgrade follows a flow like:
- fetch new base image
- Read its root filesystem, discover logically bound images
- Pull those images into a new bootc-owned container storage (xref https://github.com/containers/bootc/pull/724)
- Garbage collect images which are not referenced by any root (e.g. pending/current/rollback).
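A rough sketch of what this amounts to operationally (the bootc-owned storage path is an assumption here, not something fixed by this proposal):

```shell
# fetch and stage the new base image
bootc upgrade
# bootc then discovers the bound images in the new root and pulls each one,
# roughly equivalent to:
podman --root /usr/lib/bootc/storage pull quay.io/testos/someimage@sha256:12345
```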
Current design: symlink to .image or .container files
Introduce `/usr/lib/bootc/bound-images.d`, a directory of symlinks to `.image` or `.container` files.
Pros:
- Straightforward to implement
- The admin only needs to bump a `:sha256` digest in one place to update
Cons:
- Handling systemd specifiers is tricky; we will error out on them
- No separation of concerns: an `.image` file is intended to pull images, not to be parsed by an external tool for a separate purpose.
- Updates to Quadlet may break the process and/or add a (potential) continuous maintenance burden for bootc (i.e., "chasing/reimplementing new features").
- Forces users to use Quadlet even if they have no use for pulling images under systemd.
Note: we expect the .image files to reference images by digest or immutable tag. There is no mechanism to pull images out of band.
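As a concrete sketch (the unit name is illustrative; the digest is the placeholder used elsewhere in this issue), a base image build might do:

```dockerfile
# Containerfile: ship the quadlet file and bind it to the bootc lifecycle
COPY my-app.image /usr/share/containers/systemd/my-app.image
RUN mkdir -p /usr/lib/bootc/bound-images.d && \
    ln -s /usr/share/containers/systemd/my-app.image \
          /usr/lib/bootc/bound-images.d/my-app.image
```

where `my-app.image` pins the image by digest, per the note above:

```ini
[Image]
Image=quay.io/testos/someimage@sha256:12345
```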
Other alternatives considered
New custom config file
A new TOML format in `/usr/lib/bootc/bound-images.d`, of the form e.g. `01-myimages.toml`:
```toml
images = ["quay.io/testos/someimage@sha256:12345", "quay.io/testos/otherimage@sha256:54321"]
authfile = "/etc/containers/my-custom-auth.json"
```
Pros:
- Easy to just list multiple images vs one image per `.image` file
- TOML format is used by other bootc tooling and some of the container config formats
Cons:
- New file format relating to container images
- May need in the general case to support many of the existing options in `.image` files
- The admin will need to bump a `:sha256` digest in two places to update in general (both in a `.container` or `.image` and the custom `.toml` here)
Parse existing .image files
Pros:
- Well known
- Spec -> pull translation exists
- Existing spec handles most image pull fields
Cons:
- Would need to extend the spec to include a new `bootc=bound` or equivalent opt-in
- Handling specifiers is tricky
- Implementation complicated w.r.t. managing systemd
What would happen under the covers here is that bootc would hook into podman and:
- disallow GC of these images even if the unit isn't running (for all deployments)
- Fetch new images (from the new base container image) on `bootc upgrade`
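For the GC-disallow part, one mechanism floated later in this thread is to use `podman create` to pin: a created (but never started) container keeps its image from being pruned, since podman will not remove images referenced by a container. A hypothetical sketch:

```shell
# Sketch: pin a bound image by creating a placeholder container for it;
# `podman image prune` will not remove images that are in use by a container
podman create --name bootc-pin-someimage \
    quay.io/testos/someimage@sha256:12345 /bin/true
```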
TODO:
- [x] docs
- [x] CI test
- [x] PR to fedora bootc examples
- [x] Ensure compatibility with bootc-image-builder https://github.com/containers/bootc/issues/715
- [x] install path with `bootc install to-filesystem` - simple scenario w/out pull secret?
- [x] install path w/pull secret embedded in bootc image? podman pull happens from bootc container
- [x] install path w/bootc-image-builder where it pre-pulls images, demonstrated e2e w/konflux; we probably need to enable a model where bound images are provided in a mirror location or OCI directory
In the automotive world we often think of containers as two possible things. Either they come with the system, and are updated atomically with it, or they are separately installed. The way we expect this to work is for the system ones to be installed in a separate image store that is part of the ostree image. And then the "regular" containers will just be stored in /var/lib/containers.
The automotive sig manifests ship a storage.conf that has:
```toml
[storage.options]
additionalimagestores = [ "/usr/share/containers/storage" ]
```
Then we install containers in the image with osbuild like:
```yaml
- type: org.osbuild.skopeo
  inputs:
    images:
      type: org.osbuild.containers
      origin: org.osbuild.source
      mpp-resolve-images:
        images:
          - source: registry.gitlab.com/centos/automotive/sample-images/demo/auto-apps
            tag: latest
            name: localhost/auto-apps
  options:
    destination:
      type: containers-storage
      storage-path: /usr/share/containers/storage
```
This was part of the driver for the need for composefs to be able to contain overlayfs base dirs (overlay nesting). Although that is less important if containers/storage also uses composefs.
I love the idea of additional stores for this.
Quadlet supports `.image` files now, which can be directly referenced in `.container` files. Maybe that's a way to achieve a similar effect.
The `.image` files don't yet (easily) allow for pulling into an additional store, but this could be a useful feature.
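For example (hypothetical unit names; assumes a Quadlet new enough to support `.image` units):

```ini
# my-app.image: declares the pull
[Image]
Image=quay.io/my/app:latest
```

```ini
# my-app.container: references the .image unit by file name
[Container]
Image=my-app.image
```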
Cc: @ygalblum
> Then we install containers in the image with osbuild like:

So IMO this issue is exactly about having `bootc install` and `bootc update` handle these images. Because as is today, needing to duplicate the app images in an osbuild manifest is...unfortunate. With this proposal, when osbuild is making a disk image, it'd use `bootc install` internally to the pipeline, and we wouldn't need to re-specify the child container images out of band of the "source of truth" of the parent image.
> > Then we install containers in the image with osbuild like:
>
> So IMO this issue is exactly about having `bootc install` and `bootc update` handle these images. Because as is today, needing to duplicate the app images in an osbuild manifest is...unfortunate. With this proposal, when osbuild is making a disk image, it'd use `bootc install` internally to the pipeline, and we wouldn't need to re-specify the child container images out of band of the "source of truth" of the parent image.
I understand that, and I merely pointed out how we currently do it in automotive, not how it would be done with bootc.
Instead, what I propose is essentially:
Dockerfile:

```dockerfile
FROM bootc-base
RUN podman --root /usr/lib/containers/my-app pull quay.io/my/app
ADD my-app.container /etc/containers/systemd
```

my-app.container:

```ini
[Container]
Image=quay.io/my/app
PodmanArgs=--storage-opt=overlay.additionalimagestore=/usr/lib/containers/my-app
```
And then you have an osbuild manifest that just deploys the above image like any normal image.
Of course, instead of open-coding the commands like this, a tool could do the right thing automatically.
You might also want the tool to tweak the image name in the quadlet to contain the actual digest so we know that the exact right image version is used every time.
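For instance, a hypothetical build-time step could resolve the tag to a digest and rewrite the quadlet (the paths and image name are illustrative):

```shell
# Resolve the current digest of the image and pin the quadlet to it
digest=$(skopeo inspect --format '{{.Digest}}' docker://quay.io/my/app)
sed -i "s|^Image=quay.io/my/app.*|Image=quay.io/my/app@${digest}|" \
    /etc/containers/systemd/my-app.container
```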
It's also interesting to reflect on the composefs efficiency in a setup like this.
If we use composefs for the final ostree image, we will get perfect content sharing, even if each of the individual additional image stores uses its own composefs objects dir, and even if no effort is made to share object files between image store directories, because all the files will eventually be deduplicated as part of the full ostree composefs image.
In fact, we will even deduplicate files between image stores that use the traditional overlayfs or vfs container store formats.
In fact, maybe using the vfs backend is the right approach here? It is a highly stable on-disk format, and it's going to be very efficient to start such a container. And we can ignore all the storage inefficiencies, because they are taken care of by the outer composefs image.
> my-app.container:
>
> ```ini
> [Container]
> Image=quay.io/my/app
> PodmanArgs=--storage-opt=overlay.additionalimagestore=/usr/lib/containers/my-app
> ```
Just wanted to note that --storage-opt is a global argument. So, the key to use is GlobalArgs instead of PodmanArgs.
I wonder if we should tweak the base images to have a standardized /usr location for additional image store images.
/usr/lib/containers/storage?
@rhatdan Yeah, that sounds good to me. Can we perhaps just add it always to our /usr/share/containers/storage.conf file?
You want that in the default storage.conf in containers/storage?
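Concretely, that would presumably mean something like this in the default /usr/share/containers/storage.conf (illustrative; the path is the one proposed above):

```toml
[storage.options]
additionalimagestores = [
  "/usr/lib/containers/storage",
]
```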
If you set up an empty additional store you need to precreate the directories and lock files. This is what we are doing to set up an empty additional store. We should fix this in containers/storage to create these files and directories if they do not exist.
```dockerfile
RUN mkdir -p /var/lib/shared/overlay-images \
             /var/lib/shared/overlay-layers \
             /var/lib/shared/vfs-images \
             /var/lib/shared/vfs-layers && \
    touch /var/lib/shared/overlay-images/images.lock && \
    touch /var/lib/shared/overlay-layers/layers.lock && \
    touch /var/lib/shared/vfs-images/images.lock && \
    touch /var/lib/shared/vfs-layers/layers.lock
```
@rhatdan Would it maybe be possible instead to have containers/storage fail gracefully when the directory doesn't exist?
Yes that is the way it should work. If I have time I will look at it. Basically ignore the storage if it is empty.
Actually I just tried it out: as long as the additional image store directory exists, the store seems to work. No need for those additional files and directories.
```console
# cat /etc/containers/storage.conf
[storage]
driver = "overlay"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"

[storage.options]
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}
additionalimagestores = [
  "/usr/lib/containers/storage",
]
```

The additional store directory is empty:

```console
# ls -l /usr/lib/containers/storage/
total 0

# podman info
...
```

So podman will write to the empty directory and create the missing content:

```console
# ls -lR /usr/lib/containers/storage/
/usr/lib/containers/storage/:
total 4
drwx------. 2 root root 4096 Nov 24 07:03 overlay-images

/usr/lib/containers/storage/overlay-images:
total 0
-rw-r--r--. 1 root root 0 Nov 24 07:03 images.lock
```

If the file system is read-only, it fails:

```console
# podman info
Error: creating lock file directory: mkdir /usr/lib/containers/storage/overlay-images: read-only file system
```
So, I've been thinking about the details around this for a while, in particular about the best storage for these additional image directories. The natural approach would be to use the overlay backend, as we can then use overlay mounts for the actual container, but this has some issues.
First of all, historically, ostree doesn't support whiteout files. This has been recently fixed, although even that fix requires adding custom options to ostree. In addition, if ostree is using composefs, there are some issues with encoding both the whiteouts as well as the overlayfs xattrs in the image. These are solved by the overlay xattr escape support I have added in the most recent kernel, although we don't yet have that backported into the CS9 kernel.
However, I wonder if using overlay directories for the additional image dir is even the right approach? All the files in the additional image dir will anyway be deduplicated by ostree, so maybe it would be better if we used an approach more like the vfs backend, where each layer is completely squashed (and we then rely on the wrapping ostree to de-duplicate these). Such a layer would be faster to set up and use (since it is shallower), and it would fix all the issues regarding whiteouts and overlay xattrs.
I see two approaches for this:
- Use overlay backend with composefs format. This moves all the xattrs and whiteouts into the composefs image file, which will work fine in any ostree image
- Teach the overlay containers/storage backend the ability to squash individual layers, and then do this for all the images in the additional image store.
Opinions?
So there's two totally different approaches going on here (and the second approach has two sub-approaches):
Physically embed the app images in the base image
In this model, bootc upgrade and bootc rollback will also upgrade/rollback the system images "naturally", the same way as any other files. (There's a lot of discussion above about the interactions with whiteouts/composefs/etc. though)
From the UX point of view, a really key thing is there is one container image - keeping the problem domain of "versioning/mirroring" totally simple.
However...note that this model "squashes" all the layers in the app images into one layer in the base image, so on the network, if e.g. the base image used by an app changes, it will force a re-fetch of the entire app (all its layers), even if some of the app layers didn't change.
I think there's also the converse problem - unless we very carefully ensure that the podman pull or equivalent that generates the layer is fully reproducible (e.g. timestamps) it means any updates to the base image will generate a different squashed app layer, which is also quite problematic. (Forcing a new storage in the registry)
In other words, IMO this model breaks some of the advantages of the content-addressed storage in OCI by default. We'd need deltas to mitigate.
(For people using ostree-on-the-network for the host today, this is mitigated because ostree always behaves similarly to zstd:chunked and has static deltas; but I think we want to make this work with OCI)
Longer term though, IMO this approach clashes with the direction I think we need to take for e.g. configmaps - we really will need to get into the business of managing more than just one bootable container image, which leads to:
Reference the app images
A common advantage/disadvantage of the below is that the user must manage multiple container images for system installs - e.g. for a disconnected/offline install they must all be mirrored, not just one.
I am sure someone has already invented this, but I think we should support a "rollup" OCI artifact that (much like a manifest list) is just a pointer to a bunch of other container images. A bit like the OCP "release image" except not an executable itself.
Then tools like skopeo copy would know how to recurse into it and mirror all the sub-images, and bootc install could honor this image. bootc would learn about this too, so bootc upgrade would find all the things.
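For contrast, today a disconnected install has to mirror each image explicitly (the registry names here are hypothetical); the rollup artifact would let tooling do this recursively from a single reference:

```shell
# Status quo: mirror the base and every app image by hand
skopeo copy docker://quay.io/example/bootc-base:1.0 \
            docker://mirror.internal/bootc-base:1.0
skopeo copy docker://quay.io/example/app@sha256:12345 \
            docker://mirror.internal/app@sha256:12345
```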
Loose binding
In this model, the app images would only be referenced from the base image as .image files.
We would teach bootc install (i.e. at disk write time) to support "pre-pulling" container images referenced by /usr/share/containers/systemd/*.image files in the tree (using the credentials embedded in the base image) - but physically the container images live in /var in the final installed filesystem.
(There's an interesting sub-question here of whether we do this by default for .image files we find)
Anyways though, here these images are disconnected from the base image lifecycle; bootc upgrade/rollback would not affect them. They can be fully uninstalled (though to do so the .image file would need to be masked). Updates to them work by fetching from the registry directly.
A corollary to this is that for e.g. disconnected installs, the user must mirror all the application container images too.
This for example is the model used AFAIK by Fedora Workstation when flatpaks are installed - they are embedded in the ISO, but live in /var.
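A sketch of the loose-binding flow (the discovery and pull mechanics here are assumptions, not settled design):

```shell
# At disk-write time, bootc would discover /usr/share/containers/systemd/*.image
# in the target tree and pre-pull them, roughly equivalent to running this
# against the target root:
podman --root /var/lib/containers/storage pull quay.io/my/app:latest
```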
Strict binding
A key aspect of this would be that like "loose binding", the container images would be fetched separately from a registry. For disconnected installs, the admin would need to mirror them all. But we wouldn't lose all the efficiency bits of OCI.
This is what I was getting at originally; the images would still live in /var/lib/containers (I think), but bootc upgrade would enforce that the referenced .image files in the new root are pre-fetched before the next boot.
Hmm...more generally really I think we may need to drive something into podman where instead of .image files effectively expanding into an imperative invocation of podman pull, things like podman image prune would at least optionally know how to not prune the images. On a bootc system, we'd make sure to wire things up so that podman would avoid pruning images referenced from .image files in both the booted root and the rollback.
That said, again once we switch to podman storage for bootc then it may just make more sense to physically locate the images in the bootc container storage and have bootc own all updates/GC.
> I am sure someone has already invented this, but I think we should support a "rollup" OCI artifact that (much like a manifest list) is just a pointer to a bunch of other container images. A bit like the OCP "release image" except not an executable itself.
I saw this go by: https://opencontainers.org/posts/blog/2023-07-07-summary-of-upcoming-changes-in-oci-image-and-distribution-specs-v-1-1/#2-new-manifest-field-for-establishing-relationships Although, it seems like it's almost the inverse of what we want here. I guess in the end, maybe things like "super image" are just a special case of manifest lists.
Some discussion about this on the podman side in https://github.com/containers/podman/issues/22785
One discussion that intersects with parts of this issue happened in https://github.com/containers/podman/discussions/18182#discussioncomment-5925088. In short: we discussed how we can mark images to be un-removable.
Implemented in https://github.com/containers/bootc/pull/659.
@ckyrouac and I had a discussion and came up with a new proposed design. The original comment is edited but in a nutshell we propose to create /usr/lib/bootc/bound-images.d which is a set of symlinks to existing .image files. We will error out if we detect systemd specifiers in use.
> which is a set of symlinks to existing `.image` files. We will error out if we detect systemd specifiers in use.
I think binding it to Quadlet will hurt long-term maintenance as it breaks separation of concerns. Having a dedicated file or set of files in a `.d` directory with a simple syntax doesn't run into the technical troubles. There are use cases outside of running containers under systemd which shouldn't be forced to fiddle with Quadlet.
I won't block, but I still think that a new config file is cleaner and easier to maintain long-term.
> I think binding it to Quadlet will hurt long-term maintenance as it breaks separation of concerns.
Can you edit the comment at the top that has the set of Pros/Cons and clarify your concerns there?
One thing I found compelling about the symlink approach is when I realized this downside with the separate file:
> The admin will need to bump a `:sha256` digest in two places to update in general (both in a `.container` or `.image` and the custom `.toml` here)
However overall, I think we have explored the space pretty well and just need to make a decision and since @ckyrouac is doing the work I think it's his call, unless you have more information to add to the Pros/Cons.
> Can you edit the comment at the top that has the set of Pros/Cons and clarify your concerns there?
Thanks, done :heavy_check_mark:
> The admin will need to bump a `:sha256` digest in two places to update in general (both in a `.container` or `.image` and the custom `.toml` here)
That's already the case. The proposal doesn't cover `.container` or `.kube` files, so admins are forced to move things into `.image` files.
> That's already the case. The proposal doesn't cover `.container` or `.kube` files, so admins are forced to move things into `.image` files.
This was only lightly touched on but it does currently cover .container files - we would parse those and find their referenced Image= and I expect that to be the default case. Handling .kube directly would be an obvious extension to that as well.
One other thing to ponder here is related to https://github.com/containers/bootc/issues/518
Basically if you look at this from a spec/status perspective; we effectively have a clear spec that is readable by external tooling: the "symlink farm". It's not reflected in bootc status - should it? I'm not sure.
What we don't directly have is status; while I think we'll end up doing the podman create in order to pin, that still allows things like podman system reset or just a plain podman rm -f. Should bootc also expose a status for this in our status fields? I think so.
Perhaps the status is just a boolean pinnedImagesValid: true|false (or maybe we go all the way to a condition).
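A minimal sketch of what that could look like in `bootc status` output (the field is hypothetical, just mirroring the suggestion above, not an implemented API):

```yaml
status:
  # hypothetical: true when all bound images referenced by the booted and
  # rollback roots are present in the bootc-owned storage
  pinnedImagesValid: true
```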
I also wonder if we may need an explicit verb to re-synchronize in the case of a podman system reset? Or maybe just typing bootc upgrade again should do that.