[rfc] OCIv2 implementation

Open cyphar opened this issue 7 years ago • 29 comments

I have some proposal ideas for the OCIv2 image specification (it would actually be OCIv1.1, but that is a less-cool name for the idea), and they primarily involve swapping out the lower levels of the archive format for something better designed (along the same lines as restic or borgbackup).

We need to implement this as a PoC in umoci before it's proposed to the image-spec proper so that we don't get stuck in debates over whether it has been tested "in the wild" -- which is something that I imagine any OCI extension is going to go through.

cyphar avatar Sep 08 '18 14:09 cyphar

As an aside, it looks like copy_file_range(COPY_FR_DEDUP) wasn't merged. But you can use ioctl(FICLONERANGE) or ioctl(FIDEDUPERANGE) (depending on which is the more correct way of doing it -- I think FICLONERANGE is what we want). If that isn't enough we can always revive the patch, since one of the arguments against it was that nobody needed partial-file deduplication -- but we need it now for OCIv2 to have efficient deduplicated storage.
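
For reference, here is roughly what the FICLONERANGE call looks like from Go via golang.org/x/sys/unix -- a sketch only, with made-up file names and offsets:

```go
// Sketch: clone one aligned range from src into dst with FICLONERANGE.
// File names and offsets are hypothetical; error handling is minimal.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src, err := os.Open("chunk.bin") // hypothetical source file
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.OpenFile("image.bin", os.O_RDWR, 0) // hypothetical destination
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// Src_offset, Src_length and Dest_offset must all be aligned to the
	// filesystem block size, and both files must be on the same filesystem.
	err = unix.IoctlFileCloneRange(int(dst.Fd()), &unix.FileCloneRange{
		Src_fd:      int64(src.Fd()),
		Src_offset:  0,
		Src_length:  4096, // one block, assuming a 4 KiB block size
		Dest_offset: 8192,
	})
	if err != nil {
		log.Fatal(err) // EINVAL usually means misalignment
	}
}
```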

cyphar avatar Sep 08 '18 15:09 cyphar

FICLONERANGE needs to be block-aligned (unsurprisingly), but unfortunately the alignment requirement applies to both source and destination. This means that if our chunk boundaries don't line up with the filesystem block size, we will get very few ranges we can actually clone.

On the plus side, for small files we can just use reflinks.

cyphar avatar Sep 08 '18 19:09 cyphar

Some things that should be tested and discussed:

  • How bad is the Merkle tree hit? Should each individual file be linked from a map (or a packfile) of some kind to avoid really tall trees? How deep can a typical distribution's filesystem tree get? Each dereference can be quite expensive (especially if it involves a pull -- though I would hope that HTTP/2 server push would mitigate this somewhat). A sketch of this structure follows the list.

  • What sort of chunk size is optimal?

  • How should we implement canonical-representation checking? This should be a hard failure when trying to use an image, to stop incompatible tools from doing something wrong.

  • As a point of comparison, it would be interesting to measure how much transfer-deduplication gain we can get from content-defined chunking.

  • Do we need to define a new rootfs type other than layered for this change? Layers are something we should probably drop -- but maybe we should structure the format around a "snapshot" concept in case people still want snapshots.
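
To make the tree-height question concrete, here is a minimal sketch of the kind of structure I have in mind (names, types, and the JSON stand-in for a real canonical encoding are illustrative only, not a proposed wire format):

```go
// Sketch of a Merkle-tree directory object: children are referenced by
// digest, so resolving a path costs one dereference per level. Names,
// types and the JSON encoding are illustrative, not a proposed format.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// TreeEntry points at a child object by digest, like a git tree entry.
type TreeEntry struct {
	Name   string `json:"name"`
	Mode   uint32 `json:"mode"`
	Digest string `json:"digest"` // digest of a subtree or a file's chunk list
}

// Tree is one node; its own digest is the hash of its canonical encoding,
// so entries must be kept sorted for the representation to be canonical.
type Tree struct {
	Entries []TreeEntry `json:"entries"`
}

func (t Tree) Digest() string {
	enc, _ := json.Marshal(t) // JSON stands in for a real canonical encoding
	return fmt.Sprintf("sha256:%x", sha256.Sum256(enc))
}

func main() {
	bin := Tree{Entries: []TreeEntry{{Name: "bash", Mode: 0o755, Digest: "sha256:..."}}}
	root := Tree{Entries: []TreeEntry{{Name: "bin", Mode: 0o40755, Digest: bin.Digest()}}}
	// Resolving /bin/bash walks root -> bin -> bash: one dereference per
	// level, each potentially a registry round-trip if not cached.
	fmt.Println("root:", root.Digest())
}
```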

cyphar avatar Sep 19 '18 05:09 cyphar

I like your inspiration from https://github.com/restic/restic; I think there is a good argument that the chunks and content-addressable storage ought to be compatible with https://github.com/systemd/casync too.

vbatts avatar Sep 19 '18 20:09 vbatts

I will definitely look into this, though it should be noted (and I think we discussed this in person in London) that while it is very important for fixed chunking parameters to be strongly recommended in the standard (so that all image builders create compatible chunks for inter-distribution deduplication), I think they should remain configurable so that we have the option of transitioning to different algorithms in the future.

Is there a paper or document that describes how casync's chunking algorithm works? I'm looking at the code: it uses buzhash (which apparently has a Go implementation), but it's not clear to me what the chunk-boundary condition in shall_break means (I can see that it's (v % c->discriminator) == (c->discriminator - 1), but I don't know what that implies).
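
My best guess at the semantics, sketched in Go (the random values are a stand-in for buzhash output, so this is my speculation, not casync's actual code):

```go
// Guess: cut a chunk whenever the rolling hash v, reduced modulo the
// "discriminator", hits one fixed residue. For a roughly uniform hash that
// fires with probability 1/discriminator per byte, so the discriminator is
// effectively the target average chunk size (before min/max bounds apply).
package main

import (
	"fmt"
	"math/rand"
)

func shallBreak(v, discriminator uint32) bool {
	// Any fixed residue would do; casync happens to test discriminator-1.
	return v%discriminator == discriminator-1
}

func main() {
	const discriminator = 64 * 1024 // target ~64 KiB average chunks
	boundaries := 0
	for i := 0; i < 1<<24; i++ {
		// Random values stand in for buzhash over a sliding window.
		if shallBreak(rand.Uint32(), discriminator) {
			boundaries++
		}
	}
	fmt.Println(boundaries) // ~256, i.e. (1<<24)/discriminator
}
```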

I'm also quite interested in the serialisation format. Lennart describes it as a kind of random-access tar that is also reproducible (and contains all filesystem information in a sane way). I will definitely take a look at it. While I personally like using a Merkle tree because it's what git does and is kind of what makes the most sense IMO (plus it is entirely transparent to the CAS), I do see that having a streamable system might be an improvement too.

cyphar avatar Sep 20 '18 02:09 cyphar

As an aside, since we are creating a new serialisation format (unless we reuse casync) we will need to implement several debugging tools because now you will no longer be able to use tar for debugging layers.

cyphar avatar Sep 20 '18 02:09 cyphar

I've already talked with @cyphar about it, but I'll comment here as well so to not lose track of it. The deduplication could also be done only locally (for example on: XFS with reflinks support). So that network deduplication and local storage deduplication could be done separately.

I've played a bit with FIDEDUPERANGE here: https://github.com/giuseppe/containers-dedup
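
For the curious, the core of the approach is a single ioctl; a sketch of the kind of call involved, using golang.org/x/sys/unix (paths are made up, and this is not the actual containers-dedup code):

```go
// Sketch of the core FIDEDUPERANGE call via golang.org/x/sys/unix. Unlike
// FICLONERANGE, the kernel first verifies both ranges hold identical bytes,
// so it is safe to throw candidate duplicates at it. Paths are hypothetical.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src, err := os.Open("a/libexample.so") // hypothetical duplicate pair
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.OpenFile("b/libexample.so", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	fi, err := src.Stat()
	if err != nil {
		log.Fatal(err)
	}

	dedupe := unix.FileDedupeRange{
		Src_offset: 0,
		Src_length: uint64(fi.Size()),
		Info: []unix.FileDedupeRangeInfo{{
			Dest_fd:     int64(dst.Fd()),
			Dest_offset: 0,
		}},
	}
	if err := unix.IoctlFileDedupeRange(int(src.Fd()), &dedupe); err != nil {
		log.Fatal(err)
	}
	// Status 0 (FILE_DEDUPE_RANGE_SAME) means the extents are now shared.
	log.Printf("status=%d bytes_deduped=%d",
		dedupe.Info[0].Status, dedupe.Info[0].Bytes_deduped)
}
```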

giuseppe avatar Oct 10 '18 13:10 giuseppe

@cyphar what was the argument against doing simply file-level deduplication? I don't claim to know the typology of all docker images, but on our side (NVIDIA) we have a few large libraries (cuDNN, cuBLAS, cuFFT) which are currently duplicated across multiple images we publish:

  • The files are duplicated if you redo a build even if nothing has changed, since it creates a new layer with the same content.
  • The files are duplicated across CUDA images with different distros: the same library is shipped for our CentOS 6/7 and Ubuntu 14.04/16.04/18.04 tags.

@giuseppe @cyphar it is my understanding that when deduplicating files/blocks at the storage level, we decrease storage space but the two files won't be able to share the same page cache entry. Is that accurate? Is that an issue that can be solved at this level too? Or will users still need to layer carefully to achieve this sharing?

flx42 avatar Oct 10 '18 15:10 flx42

@flx42 overlayfs has the best approach for reusing the page cache, since it's the same inode on the same maj/min device

vbatts avatar Oct 10 '18 15:10 vbatts

@vbatts right, and that's what we use today combined with careful layering. I just wanted to clarify if there was a solution at this level, for the cases where you do have the same file but not from the same layer.

flx42 avatar Oct 10 '18 16:10 flx42

The deduplication could also be done only locally (for example on: XFS with reflinks support). So that network deduplication and local storage deduplication could be done separately.

I think we (at least I) have put focus on registry-side storage & network deduplication.

Runtime-side local deduplication is likely to be specific to runtimes and out of scope of OCI Image Spec & Dist Spec?

AkihiroSuda avatar Oct 10 '18 17:10 AkihiroSuda

@AkihiroSuda

A few things:

  1. It depends on how you define "runtime". If you include everything about the machine that pulls the image, extracts the image, and then runs a container as the "runtime", then you're correct that it's a separate concern. But I would argue that most image users need to do both pulling and extraction -- so it's clearly an image-spec concern to at least consider it.

  2. Ignoring (or punting on) storage deduplication (when we have the chance to do it) would likely result in suboptimal storage deduplication -- which is something that people want! I would like OCIv2 images to actually replace OCIv1 and if the storage deduplication properties are worse or no better, then that might not happen.

Given that CDC (and the separation of metadata into a Merkle tree or some similar filesystem representation) already solves both "registry-side storage & network deduplication", I think it's reasonable to consider whether the same features can be taken advantage of for storage deduplication...

cyphar avatar Oct 11 '18 04:10 cyphar

@flx42

what was the argument against doing simply file-level deduplication?

Small modifications of large files, or files that are substantially similar but not identical (think man pages, shared libraries and binaries shipped by multiple distributions, and so on) would be entirely duplicated. So for the image format I think that using file-level deduplication is flawed, for the same reasons that file-level deduplication in backup systems is flawed.

But for storage deduplication this is a different story. My main reason for wanting to use reflinks is to use less disk space. Unfortunately (as I discovered above) this is not possible for variable-size chunks (unless they all happen to be multiples of the filesystem block size).

Using file-based deduplication for storage does make some sense (though it naively doubles your storage requirement out of the gate). My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into a rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files). Of course, it might be necessary (for fast container "boot" times) to pre-generate the rootfs for any given image -- but benchmarks would have to be done to see whether that's needed.
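
A rough sketch of that assembly step (the manifest shape, store layout, and paths are all hypothetical, not umoci's actual API):

```go
// Sketch: build a rootfs by hardlinking files out of a per-file
// content-addressed store. The manifest, store layout and paths are all
// made up. Overlayfs would be mounted on top so the store stays immutable.
package main

import (
	"log"
	"os"
	"path/filepath"
)

// manifest maps rootfs paths to file digests in the store (hypothetical).
var manifest = map[string]string{
	"bin/bash":        "sha256:aaaa...",
	"usr/lib/libc.so": "sha256:bbbb...",
}

func assembleRootfs(storeDir, rootfsDir string) error {
	for path, digest := range manifest {
		target := filepath.Join(rootfsDir, path)
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		// A hardlink shares the inode (and thus the page cache); os.Link
		// could be swapped for a reflink copy, which shares extents but
		// not the inode.
		if err := os.Link(filepath.Join(storeDir, digest), target); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := assembleRootfs("/var/lib/oci/files", "/run/ctr/rootfs"); err != nil {
		log.Fatal(err)
	}
}
```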

My main interest in reflinks was to see whether it was possible to use them to remove the need for the copies for the "file store", but given that you cannot easily map CDC chunks to filesystem chunks (the latter being fixed-size) we are pretty much required to make copies I think. You could play with a FUSE filesystem to do it, but that is still slow (though some recent proposals to use eBPF could make it natively fast).

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

cyphar avatar Oct 11 '18 04:10 cyphar

@flx42

It should be noted that with this proposal there would no longer be a need for layers (because the practical deduplication they provide is effectively zero) though I think that looking into how we can use existing layered filesystems would be very useful -- because they are obviously quite efficient and it makes sense to take advantage of them.

Users having to manually finesse layers is something that doesn't make sense (in my view), because the design of the image format should not be such that it causes problems if you aren't careful about how images are layered. So I would hope that a new design would not repeat that problem.

cyphar avatar Oct 11 '18 04:10 cyphar

@cyphar Thanks for the detailed explanation, I didn't have a clear picture of the full process, especially on how you were planning to assemble the rootfs, but now I understand.

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

I found the following discussion on this topic: https://www.spinics.net/lists/linux-btrfs/msg38800.html I was able to reproduce their results on btrfs/xfs, indicating that the page cache is not shared. As you mentioned, the solution could be to hardlink files when assembling the final rootfs instead of reflinking. You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fine-grained) and instead rely on copy_up, which copies the full file AFAIK.

Not necessarily a big deal, but nevertheless an interesting benefit of layer sharing+overlay that would be nice to keep.

flx42 avatar Oct 11 '18 05:10 flx42

FWIW, I wanted to quantify the difference with block-level vs file-level deduplication on real data, so I wrote a few simple scripts here: https://github.com/flx42/layer-dedup-test

It pulls all the tags from this list (minus the Windows tags that will fail). This was the size of the layer directory after the pull:

+ du -sh /mnt/docker/overlay2
822G    /mnt/docker/overlay2

Using rmlint with hardlinks (file-level deduplication):

+ du -sh /mnt/docker/overlay2
301G    /mnt/docker/overlay2

Using restic with CDC (block-level deduplication):

+ du -sh /tmp/restic
244G    /tmp/restic

This is a quick test, so no guarantee that it worked correctly. But this is a good first approximation. File-level deduplication performed better than I expected, block-level with CDC is indeed better but at the cost of extra complexity and possibly a two-level content store (block then file).

flx42 avatar Oct 16 '18 21:10 flx42

Funnily enough, Go 1.11 has changed the default archive/tar output -- something that having a canonical representation would solve. See #269.

cyphar avatar Nov 07 '18 09:11 cyphar

@flx42

You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fine-grained) and instead rely on copy_up, which copies the full file AFAIK.

Does overlay share the page cache? It was my understanding that it didn't, but that might be an outdated piece of information.

cyphar avatar Nov 07 '18 09:11 cyphar

@cyphar yes it does: https://docs.docker.com/storage/storagedriver/overlayfs-driver/#overlayfs-and-docker-performance

Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS.

Also, a while back I launched two containers, one PyTorch and one TensorFlow, using the same CUDA+cuDNN base layers. Then, using /proc/<pid>/maps in both containers, I was able to verify that they loaded the same copy of one library (the same inode).

flx42 avatar Nov 08 '18 04:11 flx42

My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into a rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files).

This is exactly what libostree is, though today we use a read-only bind mount since we don't want people trying to persist state in /usr. (It's still a mess today how in the Docker ecosystem / is writable by default but best practice is to use Kubernetes PersistentVolumes or equivalent). Though running containers as non-root helps since that will at least deny writes to /usr.

cgwalters avatar Dec 05 '18 15:12 cgwalters

Blog post on the tar issues is up. https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

cyphar avatar Jan 21 '19 13:01 cyphar

And so much conversation on The Twitter


vbatts avatar Jan 21 '19 13:01 vbatts

Interesting discussion on the file-based image proposal. If you want to see a production-grade example, have a look at the Image Packaging System (IPS) from illumos. It was originally designed as a package manager to be used inside a container, but one can easily leave dependencies out of a manifest and thus create an image layer, so to speak. Manifests are also merged ahead of time, so you only need to download what is needed. Additionally, because all metadata is encoded in text files, one can simply encode any attributes needed later in the spec. I was thinking of extending the server with a registry API so that one can download a dynamically generated tarfile and use the file-based storage in the background.

While it has a few pythonisms in it, I made a port of the server and manifest code to Go some time ago. Let me know if any of this is interesting to you; I can give detailed insights into the challenges we stumbled upon in the field over the last 10 years.

Original Python implementation (in use today on OpenIndiana and OmniOS): https://github.com/OpenIndiana/pkg5
Personal port to Go (server side only at the moment): https://git.wegmueller.it/Illumos/pkg6

Toasterson avatar May 07 '20 22:05 Toasterson

So here is my own simplistic parallel casync/desync alternative, written in Rust, which uses fixed-size chunking (which is great for VM images): https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . There you can also see a benchmark comparing my tool to casync, desync and other alternatives; my tool is much faster than all of them (but I cheat by using fixed-size chunking). See the whole issue for context, and especially this comment https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 for a comparison between casync, desync and other CDC-based tools.
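
For anyone wondering why fixed-size chunking is so much faster: the split points don't depend on the data at all, so chunking is just slicing and hashing. A toy Go sketch of the idea (nothing like my actual tool):

```go
// Toy sketch of fixed-size chunking: slice at multiples of the chunk size
// and hash each slice. Fast and trivially parallel, but shifting the data
// by one byte changes every subsequent chunk, unlike CDC.
package main

import (
	"crypto/sha256"
	"fmt"
)

func fixedChunks(data []byte, size int) [][32]byte {
	var digests [][32]byte
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		digests = append(digests, sha256.Sum256(data[off:end]))
	}
	return digests
}

func main() {
	// 1 MiB of zeros split into 64 KiB chunks -> 16 digests.
	fmt.Println(len(fixedChunks(make([]byte, 1<<20), 64<<10)))
}
```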

safinaskar avatar Jul 30 '23 09:07 safinaskar

Okay, so here is a list of GitHub issues I ~spammed~ wrote in the last few days on this topic (i.e. fast fixed-size and CDC-based deduplication). I hope they provide good insight for everyone interested in fast deduplicated storage: https://github.com/borgbackup/borg/issues/7674 https://github.com/systemd/casync/issues/259 https://github.com/folbricht/desync/issues/243 https://github.com/ipfs/specs/issues/227 https://github.com/dpc/rdedup/discussions/222 https://github.com/opencontainers/umoci/issues/256

safinaskar avatar Jul 30 '23 10:07 safinaskar

I'm working on puzzlefs, which shares goals with the OCIv2 design draft. It's written in Rust and uses the FastCDC algorithm to chunk filesystems. Here's a summary of the space saved compared to the traditional OCIv1 format. I will also present it at the upcoming Open Source Summit Europe in September.

ariel-miculas avatar Aug 01 '23 20:08 ariel-miculas

@ariel-miculas, cool! Let me share some thoughts.

  • This thread https://groups.google.com/a/opencontainers.org/g/dev/c/icXssT3zQxE may be of interest. In particular, Sarai said "I am suggesting that a new filesystem would be a good optional way of optimising usage of OCI images" and Greg KH answered "No, never create a new filesystem unless you have 5-10 years to focus exclusivly on it before you can rely on it". I think Greg's words should not be taken too seriously
  • When I did my benchmark, I saw very strange behavior from another dedupper written in Rust: rdedup ( https://github.com/dpc/rdedup ). rdedup uses fastcdc as its default chunking method. My benchmark ( https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 ) shows that, for unknown reasons, rdedup becomes 10x slower (!!!) when the chunk size changes from 4096K to 64K. This may be some special property of my data or of fastcdc, or a bug in Rust's fastcdc implementation. Make sure you don't have this bug
  • FUSE interacts badly with suspend. See this thread https://lore.kernel.org/lkml/CAPnZJGDWUT0D7cT_kWa6W9u8MHwhG8ZbGpn=uY4zYRWJkzZzjA@mail.gmail.com/ . So make sure that the system can be suspended while your fs is mounted. If you see this bug then, I think, it can be fixed by always using timeouts

safinaskar avatar Aug 01 '23 20:08 safinaskar

Thanks for your feedback!

  • Interesting issue with rdedup; puzzlefs is using the fastcdc crate, which I've noticed is not used by rdedup. Some benchmarks with puzzlefs would certainly be useful.
  • I wasn't aware of the FUSE issue with suspend. However, I'm also working on a kernel driver for puzzlefs; see version 1, version 2 and also this GitHub issue.

ariel-miculas avatar Aug 02 '23 10:08 ariel-miculas