Why must each image have a single linear chain of layers, rather than a layer tree with multiple roots?
Is the reason just "Docker did it this way, so we keep it"? Or is there some benefit to not allowing multiple layer roots in the same image...
I'm not understanding the question. Could you phrase it differently, perhaps with an example?
I think they're asking why we have a linear stack of layers, rather than letting you define subtrees.
@XenoAmess
The short version is that overlay filesystems generally work really well when you have a set of directories that get stacked on top of each other (and are designed for exactly that case). Docker based their format on AUFS (which we then inherited in OCI), but even modern overlay filesystems (like overlayfs) still work best with this model -- in fact, some implementations of stacking filesystems on top of snapshotting filesystems would struggle to work with anything else.
I wrote about this issue a long time ago, and I have had prototypes on how to fix it, but it's a fairly tricky problem to solve. tar archives suck for a lot of reasons, but it's quite difficult to change something this fundamental with the OCI image-spec -- distributions are very reticent to accept new features that make their lives harder, even if it would be a general benefit for everyone.
Maybe I'll take another crack at it one day.
> I think they're asking why we have a linear stack of layers, rather than letting you define subtrees.
@cyphar Yes indeed.
> It would be a general benefit for everyone.
Allowing subtrees would save a lot of people 95%+ of the bandwidth spent on app image upgrades.
> I'm not understanding the question. Could you phrase it differently, perhaps with an example?
@sudo-bmitch I thought this was a common question, since practically everybody suffers from the problem, so it wouldn't need much explanation; maybe I was wrong. I will prepare some documentation about the problem, how it troubles me (and others), and why allowing subtrees would help. I will finish it this week and post it here...
There already is a document about it, we wrote it back in 2019. https://github.com/project-machine/puzzlefs is one project trying to solve this problem.
> There already is a document about it, we wrote it back in 2019. https://github.com/project-machine/puzzlefs is one project trying to solve this problem.
It seems to still be quite a prototype... I'm looking forward to seeing it in the kernel and k8s.
> I'm not understanding the question. Could you phrase it differently, perhaps with an example?

> I thought this was a common question, since practically everybody suffers from it, so it wouldn't need much explanation; maybe I was wrong.
It was more that the question didn't quite parse for me. "One layer lines" could refer to the json encoding of layer descriptors. OCI works as a trailing spec, so I would want to see existing implementations of a tree structure. I'm also not sure how existing tooling would migrate to that design.
Could the same file be modified in two separate tree branches, with some kind of merging logic across all the branches, allowing someone to assemble an image from multiple other images? I've seen that suggested many times, and the implementation logic quickly fails when trying to define the merging logic, particularly when the selected images conflict with each other (how do you merge the Debian package manager state from multiple Debian images, each with different packages, and then throw in an Alpine image that replaces lots of utilities in /usr/bin?). While it's easy to define a happy path with an example (Nix being a common reference), OCI would need to define all of the edge cases and unhappy-path resolutions.
If each sub tree is related to the filesystem structure, so one file can only be modified in one branch, that would imply that the current layered logic of building images would change to creating a lot of branches per step in the build process. The implementation I can quickly come up with becomes fractal, with a potentially exponential growth of subtrees for each step in the build process. That would encounter scaling concerns from generic build tooling.
For alternatives to tar, there's been work on that. Some folks are pushing for erofs, squashfs, and dm-verity. But as bad as tar is, there's a lot of tooling in the ecosystem built up around it, making it very portable and interoperable. So replacements need to not only show a single working implementation, but a library ecosystem that enables other tooling to adopt it. Even something like adding support for blake3 as a digest algorithm hits resistance because it's not yet part of the Go standard library.
@sudo-bmitch FWIW, my main issue with dm-verity, squashfs, and erofs is that there is no out-of-kernel parsing library for them, and they are not designed to be parsed (or generated!) by userspace tools. (squashfs somewhat gets points in this department, but it then immediately loses them because you need to shell out to the squashfs binaries to do anything with it.)
I still think a minimal CAS that can be parsed using standard tools, with a well-defined form, and that can eventually be kernel-mounted (like puzzlefs) is preferable, but there is undoubtedly a lot of extra work to do. I don't necessarily agree that tar is somewhat acceptable "because it's standard" -- our usage of tar is quite non-standard, and tar itself is very poorly standardised (yes, there are ustar and PAX, but tar is generally a complete minefield of compatibility mishaps). The current situation mainly works okay because the Go standard library happens to provide a per-entry parsing library for tar -- if container runtimes hadn't used Go, or Go didn't include that, we probably would not still be using tar. And even today there are all sorts of compatibility issues with Go-generated tar archives around ctime/atime and some other PAX extensions...
> OCI works as a trailing spec, so I would want to see existing implementations of a tree structure.
It seems I'd better write some more docs about my design first. I will do it this weekend.
Though I only have experience handling image packaging/repackaging/compressing/transforming, and no knowledge of runtime filesystems, the design I bring may not be quite suitable. In any case, I welcome any feedback/advice.
> if container runtimes hadn't used Go or Go didn't include that, we probably would not still be using tar. And even today there are all sorts of compatibility issues with Go-generated tar archives and ctime/atime and some other PAX extensions...
Well, I don't really think Go matters, or at least not that much. Go is just the implementation language that happened to be picked for some tools (mainly Docker here), and maybe it was actually the wrong choice (IMO)... Designing a protocol based on what happens to be in Go's poor standard library doesn't sound quite good/correct...
> @sudo-bmitch FWIW, my main issue with dm-verity, squashfs, and erofs is that there is no out-of-kernel parsing library for them, and they are not designed to be parsed (or generated!) by userspace tools. (squashfs somewhat gets points in this department, but it then immediately loses them because you need to shell out to the squashfs binaries to do anything with it.)
I'm not sure why you say erofs and squashfs don't have out-of-kernel parsing libraries: erofs-utils has liberofs, and squashfs also has squashfs-tools-ng, which aims to provide a userspace library for squashfs.
Even ext4 has tar2ext4, and dm-verity also has a Go library implementation here.
> I still think a minimal CAS that can be parsed using standard tools with a well-defined form and eventually be kernel mounted (like puzzlefs) is preferable, but there is undoubtedly a lot of extra work to do.
I hope to have an honest conversation, but as I've reiterated in previous threads, EROFS has supported CAS since Linux 6.1 (although all CAS chunks have to land in a limited number of blobs); it just doesn't use the typical CDC or its variant FastCDC. If you really want me to implement FastCDC in erofs-utils, that is doable, and it needs no new kernel changes.
If you'd like to keep each chunk as a separate file, I've already explained in this reply why that's impossible in the kernel (if you don't trust me, you could open a thread on the linux-fsdevel mailing list to discuss breaking files into variable chunks, storing each chunk in its own file, and then how to handle mmap page faults in that scheme): https://github.com/project-machine/puzzlefs/issues/114#issuecomment-2369872133 In short, the puzzlefs kernel implementation has never implemented mmap support at all (so it's impossible to launch real containers), and it's very hard to open files that are not pinned in advance (even a single file, due to the context limitation) in the kernel page-fault context (please see what Dave Chinner said in the comment I mentioned). That is quite different from what composefs does, because composefs only opens files outside of page-fault contexts, which is how overlayfs behaves.
In other words, if there had been a feasible way to implement this as a kernel filesystem, I would have done it a long time ago. It has been quite controversial on my side, and I believe I'm not the only one who thinks this way.
I still think things may not be so complex, but maybe I'm thinking about it too simplistically.
- as we already accept `.wh.` (whiteout) files, I think it would not be bad to add another piece of grammar: `.import.` files, for example `.import.abcde.txt`
- in the `.import.abcde.txt` file, we describe which file/folder is imported from which `block`, for example:
  `/root/data/abcde.txt from sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a538888`
- describe the `blocks` in the image manifest:
```jsonc
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b",
      "size": 16724
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736",
      "size": 73109
    }
  ],
  // here: the new field
  "blocks": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a538888",
      "size": 12345
    }
  ],
  // here: end of the new field
  "subject": {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
    "size": 7682
  },
  "annotations": {
    "com.example.key1": "value1",
    "com.example.key2": "value2"
  }
}
```
- store `block`s in a similar way to the layers (as a block can also be seen as a set of file patches)

In this way, we give image makers the opportunity to move common libs into blocks, and to share those blocks between multiple layers of multiple images.
While an interesting idea, a change like that would require updates to not only every single registry server (to ensure blocks are treated properly as children of the manifest), but also every single runtime client to handle these new files, and the failure modes if they aren't updated before trying to use the new functionality are pretty nasty.
From the registry, the failure mode might be that garbage collection simply deletes the blocks because it doesn't recognize them as part of the manifest (making it impossible for updated runtimes to pull/use those blobs).
The failure mode in clients is that they'll simply not do anything with these new files, which means the resulting image will be "missing" the files specified by these blocks entirely.
> While an interesting idea, a change like that would require updates to not only every single registry server (to ensure blocks are treated properly as children of the manifest), but also every single runtime client to handle these new files, and the failure modes if they aren't updated before trying to use the new functionality are pretty nasty.
> From the registry, the failure mode might be that garbage collection simply deletes the blocks because it doesn't recognize them as part of the manifest (making it impossible for updated runtimes to pull/use those blobs).
> The failure mode in clients is that they'll simply not do anything with these new files, which means the resulting image will be "missing" the files specified by these blocks entirely.
I thought they should implement a JSON schema to reject an image manifest containing an unknown `blocks` field, thus making the image fail noticeably during push/pull, so people would notice they need to upgrade their client/server for these new image additions.
Maybe I'm wrong.
If that's the case, `"schemaVersion": 2` might have to be incremented.
I still don't understand what the issue is you are attempting to solve. And questions asking for clarification above have gone ignored. How would the tgz "blocks" help image distribution over the existing tgz layers?
@tianon It really is quite unfortunate that (as a spec) we added annotations, but we didn't provide a way for extensions to create their own new reference mechanisms -- some kind of weak-pointer-like reference system in the base spec -- so that legacy systems would at least not garbage-collect the wrong data when we want to add extensions.
This is basically the reason we ended up with the hacks necessary for OCI artefacts and the similar hacks that stuff like puzzlefs uses.
Another thing I have to mention: while per-chunk CDC seems efficient for space saving in principle, that saving is an illusion once you treat each variable-sized chunk as an individual file on a real on-disk filesystem.
The typical CDC algorithms cut data into chunks whose sizes are only best-effort bounded (e.g. from 16k to 64k), so in practice the chunks will not align with the EXT4/XFS/... filesystem block size (and with compression it's even worse). The unaligned chunks therefore waste extra space (and the smaller the cut-point size you choose, the more space is wasted), and since the chunks are non-block-aligned, reflink and page-cache sharing are also impossible.
EROFS supports a CDC variant that compresses data into block-aligned chunks (currently from 4k to 1m compressed chunks) and already does compressed-data deduplication, so it never wastes space on the underlying EXT4/XFS, because the chunk size is always predictable.
Since EROFS's deduplicated compressed data is typically block-aligned, you can always use XFS/btrfs reflinks to assemble container images from the block-aligned blobs in a local, global content-addressable-storage chunk pool, and with this approach there is no need to open too many chunk files.
@hsiangkao Yeah, my first design ideas were to use CDC in the storage layer but I very quickly came to the same conclusion that it doesn't actually help for the reasons you outlined. The best you can do with real filesystems on real operating systems is to have extent-boundary-aligned chunks so you can dedup them via reflinks or use whatever filesystem features there (which is a nice partial improvement but it suffers from the common issues that motivated CDC in the first place).
In my view, CDC should probably have been done on the distribution side instead, but I think at this point we are not going to be able to convince registries to adopt format changes that require them to do more computation on image blobs. Ultimately, the main reasons why you care about deduplication are different based on whether you are running the image or pushing it to a registry (when running, you want maximum read performance, but when uploading you want the diffs to be as small as possible).
EDIT: This is the main reason why I haven't worked on any of the OCIv2 proposals in a long time -- the CDC stuff was kind of a critical piece that needed to be solved, and I don't see a practical way forward to solving it. Not that the EROFS stuff isn't a welcome improvement -- it definitely is.
> @hsiangkao Yeah, my first design ideas were to use CDC in the storage layer but I very quickly came to the same conclusion that it doesn't actually help for the reasons you outlined. The best you can do with real filesystems on real operating systems is to have extent-boundary-aligned chunks so you can dedup them via reflinks or use whatever filesystem features there (which is a nice partial improvement but it suffers from the common issues that motivated CDC in the first place).
> In my view, CDC should probably have been done on the distribution side instead, but I think at this point we are not going to be able to convince registries to adopt format changes that require them to do more computation on image blobs. Ultimately, the main reasons why you care about deduplication are different based on whether you are running the image or pushing it to a registry (when running, you want maximum read performance, but when uploading you want the diffs to be as small as possible).
> EDIT: This is the main reason why I haven't worked on any of the OCIv2 proposals in a long time -- the CDC stuff was kind of a critical piece that needed to be solved, and I don't see a practical way forward to solving it. Not that the EROFS stuff isn't a welcome improvement -- it definitely is.
Hi @cyphar, thanks for the reply. I mostly agree with your point; I'd just like to point out that CDC (content-defined chunking) is not limited to the typical approach of the well-known papers (like FastCDC); it's just one way to form a CAS. EROFS also implements a CDC, but it tries to compress data into block-aligned chunks, using a rolling hash over those block-aligned compressed chunks to find cut points. That block-aligned characteristic is something the previously popular CDC algorithms don't care about. I'm not sure whether we will write a formal paper about this CDC approach; we'll see.
Also, in short, I'm not against this general approach (though it's totally incompatible with current OCI/Docker images in any case). However, I'm more concerned about how these chunks are divided or stored locally:
- If it's per-file chunks (as in `/root/data/abcde.txt from sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a538888`), it's doable, since that's what composefs currently does;
- If each chunk corresponds to a single file and we use those files directly at runtime, I don't see how this could work in the kernel, due to numerous kernel deadlock vectors. It's also inefficient because of wasted space (especially considering compression, as I mentioned in the earlier reply);
- If each chunk corresponds to a single file, but each compressed chunk is also block-aligned (so that underlying-filesystem reflinks can be used for local deduplication and referencing), I think it's doable: EROFS already supports this mode, since reflink is transparent to EROFS.
- If multiple chunks correspond to a single file, it's also doable. This is what Nydus currently does; EROFS supports up to 65,536 blobs per container image as external data sources.
> I still don't understand what the issue is you are attempting to solve. And questions asking for clarification above have gone ignored. How would the tgz "blocks" help image distribution over the existing tgz layers?
@sudo-bmitch
OK, I will write a doc for you.
The diagrams in the doc are in PlantUML format, which is not supported by GitHub; if you need the rendered pictures to help understanding, you can grab the PDF I generated...
Issue 1270: request to add an import mechanism to the OCI image format, to reduce the bandwidth cost of image upgrades
1. The problem
For example, suppose you are an architect maintaining a Java application (or a Python one, it doesn't really matter) that depends on 100+ jar dependencies, each with a file size of about 200kb-50mb.
And as you need to distribute this application as a Docker image, you want to make the bandwidth transfer as efficient as possible.
For your long-term customers, you need to make the update package as small as possible, so that they can update the application with the least bandwidth cost.
(Actually, the real situation I'm facing is not network transfer but local USB installer-package transfer (due to security reasons), but the principle is the same.)
2. Ways to handle it
2.1 Easiest approach: layer splitting based on hot/cold isolation
The usual way, which most people can come up with, and which most small companies/groups use, is layer splitting.
Just split the big application image into multiple layers, like the graph below:
| name | content | cost (typical size, from experience) |
|---|---|---|
| application layer | application | 200kb-10mb |
| dependencies layer | dependencies | 10mb-1gb |
| sdk layer | jdk, or nodejs sdk, or python or something | 50mb-200mb |
| opt packages layer | additional packages installed, like newer version of curl, git, vim, font types etc | 0mb-100mb |
| system layer | system | 20mb-200mb |
@startuml
'https://plantuml.com/component-diagram
top to bottom direction
stack "image layers" {
node "application layer" {
file "app.jar"
}
node "dependencies layer" {
file "dependency1.jar"
file "dependency2.jar"
file "dependency3.jar"
file "dependencyxxx.jar"
"dependency1.jar" .[hidden] "dependency2.jar"
"dependency2.jar" .[hidden] "dependency3.jar"
"dependency3.jar" -> "dependencyxxx.jar" : ......
}
node "sdk layer" {
folder "jdk-valhalla-25"
}
node "opt packages layer" {
component "curl 8.15.0"
component "htop 3.4.1"
}
node "system layer" {
component "ubuntu 25.04"
}
"application layer" --> "dependencies layer"
"dependencies layer" --> "sdk layer"
"sdk layer" --> "opt packages layer"
"opt packages layer" --> "system layer"
}
@enduml
Well, depending on the size of the group (and the situation of the application itself), the number of layers can be more or less, but the principle is the same.
The principle is: the upper layers are hotter and the lower layers are colder, as the lower layers are more stable and less frequently changed.
So in most cases (at least people have to assume so, as there is no better solution), an application bugfix or feature update can be done in the application layer, which keeps the bandwidth transfer small.
@startuml
'https://plantuml.com/component-diagram
set namespaceSeparator none
top to bottom direction
stack "image layers, v1.4.5" {
node "application layer, v1.4.5" {
file "app_v1_4_5.jar"
}
node "dependencies layer, v1.4.0" {
}
node "sdk layer, v1.1.0" {
}
node "opt packages layer, v1.0.28" {
}
node "system layer, v1.0.5" {
}
"application layer, v1.4.5" --> "dependencies layer, v1.4.0"
"dependencies layer, v1.4.0" --> "sdk layer, v1.1.0"
"sdk layer, v1.1.0" --> "opt packages layer, v1.0.28"
"opt packages layer, v1.0.28" --> "system layer, v1.0.5"
}
stack "update package, v1.4.6" {
node "application layer, v1.4.6" {
file "app_v1_4_6.jar"
}
}
"application layer, v1.4.6" --> "dependencies layer, v1.4.0"
@enduml
The problem is that things do not always go this way.
A bugfix may need to upgrade a dependency; but wait, dependencies live in the dependencies layer.
You must choose: either add the updated dependency to the new application layer, which breaks the promise of the split;
or upgrade the dependency in the dependencies layer, which means the update's cost is the whole dependencies layer. In other words, you want to upgrade a 200 kb lib, but it actually costs you 500mb+ of bandwidth.
image for choice 1 (a hotfix: fast in the short term, but troublesome in the long term):
@startuml
'https://plantuml.com/component-diagram
set namespaceSeparator none
top to bottom direction
stack "image layers, v1.4.6" {
node "application layer, v1.4.6" {
file "app_v1_4_6.jar"
}
node "dependencies layer, v1.4.0" {
file "dependency1_v1_2_1.jar"
file "dependency2_v_1_25_8.jar"
file "dependency3_v_3_8_0.jar"
file "dependencyxxx.jar"
"dependency1_v1_2_1.jar" .[hidden] "dependency2_v_1_25_8.jar"
"dependency2_v_1_25_8.jar" .[hidden] "dependency3_v_3_8_0.jar"
"dependency3_v_3_8_0.jar" -> "dependencyxxx.jar" : ......
}
node "sdk layer, v1.1.0" {
}
node "opt packages layer, v1.0.28" {
}
node "system layer, v1.0.5" {
}
"application layer, v1.4.6" --> "dependencies layer, v1.4.0"
"dependencies layer, v1.4.0" --> "sdk layer, v1.1.0"
"sdk layer, v1.1.0" --> "opt packages layer, v1.0.28"
"opt packages layer, v1.0.28" --> "system layer, v1.0.5"
}
stack "update package, v1.4.7-hotfix" {
node "application layer, v1.4.7-hotfix" {
file "app_v1_4_7.jar"
file "dependency1_v1_2_2.jar"
file ".wh.dependency1_v1_2_1.jar"
}
}
"application layer, v1.4.7-hotfix" --> "dependencies layer, v1.4.0"
@enduml
image for choice 2 (normal, with a full dependencies-layer update):
@startuml
'https://plantuml.com/component-diagram
set namespaceSeparator none
top to bottom direction
stack "image layers, v1.4.6" {
node "application layer, v1.4.6" {
file "app_v1_4_6.jar"
}
node "dependencies layer, v1.4.0" {
file "dependency1_v1_2_1.jar"
file "dependency2_v_1_25_8.jar"
file "dependency3_v_3_8_0.jar"
file "dependencyxxx.jar"
"dependency1_v1_2_1.jar" .[hidden] "dependency2_v_1_25_8.jar"
"dependency2_v_1_25_8.jar" .[hidden] "dependency3_v_3_8_0.jar"
"dependency3_v_3_8_0.jar" -> "dependencyxxx.jar" : ......
}
node "sdk layer, v1.1.0" {
}
node "opt packages layer, v1.0.28" {
}
node "system layer, v1.0.5" {
}
"application layer, v1.4.6" --> "dependencies layer, v1.4.0"
"dependencies layer, v1.4.0" --> "sdk layer, v1.1.0"
"sdk layer, v1.1.0" --> "opt packages layer, v1.0.28"
"opt packages layer, v1.0.28" --> "system layer, v1.0.5"
}
stack "update package, v1.4.7" {
node "application layer, v1.4.7" {
file "app_v1_4_7.jar"
}
node "dependencies layer, v1.4.7" {
file "dependency1_v1_2_2.jar"
file "dependency2_v_1_25_8.jar"
file "dependency3_v_3_8_0.jar"
file "dependencyxxx.jar"
"dependency1_v1_2_2.jar" .[hidden] "dependency2_v_1_25_8.jar"
"dependency2_v_1_25_8.jar" .[hidden] "dependency3_v_3_8_0.jar"
"dependency3_v_3_8_0.jar" -> "dependencyxxx.jar" : ......
}
"application layer, v1.4.7" --> "dependencies layer, v1.4.7"
}
"dependencies layer, v1.4.7" --> "sdk layer, v1.1.0"
@enduml
For more severe situations, for example when your application needs to upgrade the SDK, an opt package, or even the system, the cost becomes unnecessarily high.
For a real-world example, a third-party OneDrive client application may require you to upgrade your curl...
So using this strategy, people constantly suffer from these choices...
Is this package a hotfix? Is it hot enough to break the promise of the layer split? Should we upgrade the dependencies layer or not?
2.2 More complexity, more sorrow
Now your business grows larger. You have several groups, each taking care of several applications. Of course, you are on a microservice architecture, and each group holds several repos, each repo containing one or several applications.
Notice that your groups each have their own dependencies/dependency-management systems, and they are not meant to use the same version of the same lib
(usually for compatibility reasons, as semver is not always followed correctly...). So a natural way forward seems to be like this:
@startuml
'https://plantuml.com/component-diagram
top to bottom direction
stack "common image layers" {
node "dependencies layer" {
file "dependency1.jar"
file "dependency2.jar"
file "dependency3.jar"
file "dependencyxxx.jar"
"dependency1.jar" .[hidden] "dependency2.jar"
"dependency2.jar" .[hidden] "dependency3.jar"
"dependency3.jar" -> "dependencyxxx.jar" : ......
}
node "sdk layer" {
folder "jdk-valhalla-25"
}
node "opt packages layer" {
component "curl 8.15.0"
component "htop 3.4.1"
}
node "system layer" {
component "ubuntu 25.04"
}
"dependencies layer" --> "sdk layer"
"sdk layer" --> "opt packages layer"
"opt packages layer" --> "system layer"
}
stack "application A layers" {
node "application A layer" {
file "app_A.jar"
file "app_A_dependencies.txt"
}
}
"application A layer" --> "dependencies layer"
stack "application B layers" {
node "application B layer" {
file "app_B.jar"
file "app_B_dependencies.txt"
}
}
"application B layer" --> "dependencies layer"
@enduml
But as you can see, this actually enlarges the problems described in section 2.1...
Now you have to take care of not only the original problems (multiplied across several groups), but also questions like: when to merge the dependencies layer, when to allow business groups to add dependencies in their own application layers, and how to handle the case where one business must upgrade a dependency that the other business groups have no need (or time) to upgrade.
So as your business grows, the problems become more complex, and the cost of updating becomes higher.
And yes, you get stuck, and start to notice that this solution takes an architect/expert to handle and balance all the business groups; it is not a one-time job but long-term work...
As we all know, this kind of person is usually costly, and should be put in a more important position, not wasting time on things like this...
2.3 What some people do in this situation
2.3.1 Extract the files from the image, transform them, then repackage the image.
The problem is mainly the cost of repacking the image, as tar is a very slow file format, and so is gzip.
So your customers may suffer from long installation times due to the image repacking, which is not acceptable for some of them.
2.3.2 Deploy an additional file server, register dependency files, and fetch dependencies at pod start-up.
The problems are:
- This stage may take more time under bad network conditions / on bad disks
- Pod startup → request avalanche. The file server may become a single point of failure.
- Bad for disks (SSDs especially)
- Complexity
3. A far better way to do this
Why doesn't OCI support this kind of thing natively?
3.1 The feature needed on the OCI side
- as we already accept `.wh.` (whiteout) files, I think it would not be bad to add another piece of grammar: `.import.` files, for example `.import.abcde.txt`
- in the `.import.abcde.txt` file, we describe which file/folder is imported from which `block`, for example:
  `/root/data/abcde.txt from sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a538888`
- describe the `blocks` in the image manifest:
```jsonc
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b",
      "size": 16724
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736",
      "size": 73109
    }
  ],
  // here: the new field
  "blocks": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a538888",
      "size": 12345
    }
  ],
  // here: end of the new field
  "subject": {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
    "size": 7682
  },
  "annotations": {
    "com.example.key1": "value1",
    "com.example.key2": "value2"
  }
}
```
- store `block`s in a similar way to the layers (as a block can also be seen as a set of file patches)

In this way, we give image makers the opportunity to move common libs into blocks, and to share those blocks between multiple layers of multiple images.
3.2 The feature needed on the image-maker side
Simply build some toolchains to embed the dependencies into blocks, and distribute them.
Then build some plugins to import the blocks when building the image from the microservice artifact.
@startuml
'https://plantuml.com/component-diagram
top to bottom direction
stack "common image layers" {
node "dependencies layer" {
file ".import.dependency1.jar"
file ".import.dependency2.jar"
file ".import.dependency3.jar"
file ".import.dependencyxxx.jar"
".import.dependency1.jar" .[hidden] ".import.dependency2.jar"
".import.dependency2.jar" .[hidden] ".import.dependency3.jar"
".import.dependency3.jar" -> ".import.dependencyxxx.jar" : ......
}
node "sdk layer" {
folder ".import.jdk-valhalla-25"
}
node "opt packages layer" {
component ".import.curl"
component ".import.htop"
}
node "system layer" {
component "ubuntu 25.04"
}
"dependencies layer" --> "sdk layer"
"sdk layer" --> "opt packages layer"
"opt packages layer" --> "system layer"
}
stack "common image blocks" {
node "block curl" {
}
node "block htop" {
}
node "dependency1.jar" {
}
node "dependency2.jar" {
}
node "dependency3.jar" {
}
node "dependencyxxx.jar" {
}
}
".import.curl" --> "block curl": import
".import.htop" --> "block htop": import
".import.dependency1.jar" --> "dependency1.jar": import
".import.dependency2.jar" --> "dependency2.jar": import
".import.dependency3.jar" --> "dependency3.jar": import
".import.dependencyxxx.jar" --> "dependencyxxx.jar": import
stack "application A layers" {
node "application A layer" {
file "app_A.jar"
file "app_A_dependencies.txt"
}
}
"application A layer" --> "dependencies layer"
stack "application B layers" {
node "application B layer" {
file "app_B.jar"
file "app_B_dependencies.txt"
}
}
"application B layer" --> "dependencies layer"
@enduml
4. Why the original design...?
OCI inherited the format from Docker, and followed Docker's original design...
Well, Docker's original design didn't have such things in mind.
Come to think of it, Go people usually don't have the habit of using dynamic linking or other dynamic dependency mechanisms; they tend to ship a single binary as the compiler output (the GOreat is simple, they said).
That might be why the original design ignores things like this.
Well, I can't stop thinking that if cpp/c/objc/rust/java/python/c#/ts/js people had been in charge of the design, they might have thought about this. But Go people, well.
@XenoAmess as best I can tell, none of what you describe requires a change to OCI. Instead of blocks with an import, you could create small tar layers for each library you want to package separately, and multiple images can include an overlapping set of layers in their manifests.
This does have a downside. The more granular the layers become, particularly if you go all the way to block-level CDC, the more round-trip overhead you create to transfer the content, plus added management overhead. Some will take this to the other extreme, repackaging their image as a single layer. I believe there is a happy medium between these two extremes, but the entire spectrum is possible within the existing OCI spec.
Note that there's an added disadvantage of putting a block-level import inside of a layer: the inter-layer dependencies are an added failure point, since the layers are no longer independent of each other. It's no longer possible to blindly assemble a collection of library layers into an image the way Nix may want to do. Instead, each layer needs to be inspected for cross-layer dependencies, and those need to be recursively included.
the layers are no longer independent of each other.
@sudo-bmitch
I don't really think so; according to the doc I provided, I think the layers remain independent.
Please give more details.
Please notice that only layers import from blocks; layers never import from layers. That is why I think a block should be in a separate field, as it is not a 'layer'. It is just a file/data block.
you could create small tar layers for each library you want to package separately
@sudo-bmitch you would end up with many, many layers then.
Well, I don't think that fits the definition of 'layer'; in other words, this solution uses the layer concept in a strange way. But yes, you're right, it can 'work'.
Though it is far from good/perfect.
Another problem is that you have to import the same lib into exactly the same folder to make the layer tar.gz identical, for the reuse to work. In a small group/company, yes, that is achievable. But I can hardly believe people would come to an agreement across companies on which file shall be put into which path...
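The path-agreement constraint described here is easy to demonstrate: a layer is content-addressed by the digest of its archive, so the same library bytes placed at different paths produce different layer digests and cannot be deduplicated. A small sketch (using uncompressed tars with fixed metadata so the digests are deterministic; file names are made up):

```python
import hashlib
import io
import tarfile

def layer_digest(path, data):
    # Build an uncompressed tar containing one file and digest it.
    # mtime is pinned to 0 so only the path and content matter.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        info.mtime = 0
        tar.addfile(info, io.BytesIO(data))
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()

lib = b"the exact same library bytes"
a = layer_digest("opt/app/libs/foo.jar", lib)
b = layer_digest("usr/share/java/foo.jar", lib)
same = layer_digest("opt/app/libs/foo.jar", lib)

# Same content at different paths -> different digests, no reuse.
# Same content at the same path   -> identical digest, reuse works.
```

So cross-image layer reuse only happens when everyone agrees on the exact in-image paths (and metadata), which is the coordination problem raised above.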
the layers are no longer independent of each other.
@sudo-bmitch
I don't really think so; according to the doc I provided, I think the layers remain independent.
Please give more details.
Please notice that only layers import from blocks; layers never import from layers. That is why I think a block should be in a separate field, as it is not a 'layer'. It is just a file/data block.
A block is another kind of layer to me, so this is likely terminology confusion. What happens if someone defines an import inside of a layer, but does not include that digest in the list of blocks in the manifest? Today, the contents of a layer are independent of the content of the manifest.
the layers are no longer independent of each other.
@sudo-bmitch I don't really think so; according to the doc I provided, I think the layers remain independent. Please give more details. Please notice that only layers import from blocks; layers never import from layers. That is why I think a block should be in a separate field, as it is not a 'layer'. It is just a file/data block.
A block is another kind of layer to me, so this is likely terminology confusion. What happens if someone defines an import inside of a layer, but does not include that digest in the list of blocks in the manifest? Today, the contents of a layer are independent of the content of the manifest.
That is a good point. I suggest raising a runtime error when initializing the pod from the image (if something references a digest not included in the manifest's blocks list). I could also accept a check at push/pull time, though since the blocks might be tar.gz (to suit the current layer format, so people don't find it too strange), it would be time-consuming to search all the files inside and do the checking...
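The push/pull-time check being discussed could be sketched roughly like this. Everything is hypothetical: it assumes the proposed `.import.*` file naming and the `PATH from DIGEST` line format, and scans an (uncompressed, for brevity) layer tar for undeclared digests.

```python
import io
import tarfile

def check_layer_imports(layer_tar_bytes, manifest):
    # Scan a layer archive for .import.* entries and verify that every
    # digest they reference is declared in the manifest's (proposed)
    # blocks list. Returns the missing digests; empty means valid.
    declared = {b["digest"] for b in manifest.get("blocks", [])}
    missing = []
    with tarfile.open(fileobj=io.BytesIO(layer_tar_bytes)) as tar:
        for member in tar.getmembers():
            basename = member.name.rsplit("/", 1)[-1]
            if not basename.startswith(".import."):
                continue
            body = tar.extractfile(member).read().decode()
            for line in body.splitlines():
                if " from " not in line:
                    continue
                digest = line.split(" from ", 1)[1].strip()
                if digest not in declared:
                    missing.append(digest)
    return missing

def make_layer(import_text):
    # Helper: build a one-file layer tar holding an import file.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = import_text.encode()
        info = tarfile.TarInfo(name="root/data/.import.abcde.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

layer = make_layer("/root/data/abcde.txt from sha256:deadbeef")
missing = check_layer_imports(layer, {"blocks": []})
```

As noted above, doing this at push/pull time means decompressing and walking every layer, which is exactly the cost concern raised.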
The primary issues I have with this kind of manually-defined "here are the bits I want to share" approach are:
- Building images this way will require very tight coupling between distributions (who would be the best folks to produce these "blessed components") and build tools that simply doesn't exist today. This means that most image tools will not benefit from these.
  Speaking somewhat selfishly, we could actually do this for (open)SUSE image builds because our build process uses the Open Build Service, which then uses umoci directly. The Open Build Service has information about what packages go into an image, so we could easily construct per-package archives and then use those instead of just tarring up the image and using that as the artefact.
  But most other build systems aren't so lucky, and we shouldn't require such a tight coupling between build tools. docker build, buildah, docker buildx (pick your favourite build tool) would not be able to practically support this -- making such a feature practically useless because the primary benefit of deduplication is if it is shared between most images. This is the main reason I wanted to embed CDC into images (as it would give you this deduplication for free, regardless of whatever build tool you use). I think this kind of coupling is only really justified when generating BoMs or other such artefacts.
- It doesn't really justify spec-level changes -- you can already construct the same kind of manual shared subsets by creating a layer for each package. Nobody does this because build systems don't work that way (see point 1), but there really isn't enough of a new idea here that cannot be implemented with existing methods to justify a change to the spec (which would then need to get buy-in from registries -- and would ultimately probably end up requiring a backward-compatible mechanism to work with older registries, which is what artefacts did and still kind of does).
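For context on the CDC idea mentioned above: content-defined chunking splits a byte stream at positions chosen by a hash of the content itself, so an insertion early in a file only disturbs the chunks around the edit instead of shifting every fixed-size block. A toy sketch, not any real implementation (production systems use a cheap rolling hash such as a Rabin fingerprint, not sha256 per position):

```python
import hashlib

def cdc_chunks(data, mask_bits=6, window=16):
    # Toy content-defined chunker: cut wherever a hash of the trailing
    # `window` bytes has its low `mask_bits` bits all zero.
    mask = (1 << mask_bits) - 1
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:4], "big")
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

blob = bytes(range(256)) * 16   # deterministic sample data
edited = b"X" + blob            # one byte inserted at the front

# Because boundaries depend only on local content, chunking realigns
# after the edit: every chunk of `blob` except (at most) the first
# reappears verbatim among the chunks of `edited`.
shared = set(cdc_chunks(blob)[1:]) <= set(cdc_chunks(edited))
```

This is the sense in which CDC gives deduplication "for free": shared chunks fall out of the data itself, with no agreement needed between build tools about package boundaries or file paths.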
@XenoAmess You've alluded several times to the design being like this because of "Go people" (with what I would read as an unfairly derisive tone) because "Go people" don't care about shared libraries or deduplication. This isn't really true. The current design exists because Docker just needed some kind of basic image format back in 2013 and so they picked the simplest solution (just tar up the AUFS on-disk format -- which is why we have the horrible .wh. wart). Many years ago, I wrote a blog post about why this is an awful design, but it was picked because it was simple. At that point in time, Docker was being open sourced by a SaaS company that was going to go bankrupt (dotCloud); they really didn't have time to design a better format. As soon as this image format became ossified (around 2015-16 with Docker 1.10) and other registries started popping up, it became very hard to change fundamental things about the format.
For OCIv1, we needed to make sure that the format was completely compatible with Docker's layered archive format so that the existing registries could transition to OCIv1 support without needing to modify their incredibly large image registry stores (only high-level metadata could be changed). Changes to that format were basically out of scope, so we could ensure that everyone would switch.
The format is not like this because nobody could think of anything better (to the contrary, we did think about this a lot) but because of a series of historical factors that meant we couldn't improve things at each step. With all due respect, this is not nearly as trivial a problem as you seem to think it is, and this is far from the first time we've heard of proposals about it.
they really didn't have time to design a better format.
Though they really should have. That was the 2010s, not the 1970s; people who design things should have broader vision/knowledge.
The format is not like this because nobody could think of anything better (to the contrary, we did think about this a lot) but because of a series of historical factors that meant we couldn't improve things at each step. With all due respect, this is not nearly as trivial a problem as you seem to think it is, and this is far from the first time we've heard of proposals about it.
Glad to hear it, and if there are any other former attempts, please pin them here, as I don't really mind whether people use my design; I just want the problem solved.
From my current knowledge of this, we need at least:
- a builder to support such grammar & build output; either buildkit or buildah is open source and acceptable IMO
- an image registry to store the image; either Nexus (though I don't know if the Docker part is open source) or Harbor (which is open source)
- a runtime that accepts the built image: containerd or CRI-O.
Yes, it seems all of them would have to be maintained as third-party forks, as I didn't find anything like this in those repos... (not sure if the people there would be willing to merge? likely not.)
If there are people who already have a set of implementations, I would gladly try them out...
- I think this kind of coupling is only really justified when generating BoMs or other such artefacts.
I must assure you that I think the format would benefit more than just this scenario, but yes, I do want to provide a way for producers to produce a BoM image; you got that right. After all, a third-party build is never as good as the original (well, there are exceptions, like Bitnami). Once the rules are set, there would be vendors willing to provide the BoM image themselves (just like they provide CycloneDX SBOMs or BoM POMs).
But most other build systems aren't so lucky, and we shouldn't require such a tight coupling between build tools. docker build, buildah, docker buildx, (pick your favourite build tool) would not be able to practically support this -- making such a feature practically useless because the primary benefit of deduplication is if it is shared between most images.
I think I get what you mean. Only if buildkit/buildah implemented something like this could we know whether build toolchains would practically support it, though perhaps only if a new grammar were accepted by OCI would the buildkit/buildah people consider it useful and add it. Though I worry that if the build toolchains go first, of course they will have their own designs, and then one gets out and becomes the de facto standard, and then OCI has to suit it, just like what happened with OCI v1...
Though I worry that if the build toolchains go first, they will have their own designs, and then one gets out and becomes the de facto standard, and then OCI has to suit it, just like what happened with OCI v1.
OCI is a trailing spec, so this is the normal way for changes to get adopted. For multiple implementations to collaborate, a working group could be created as a sandbox to work out any issues with the design and interoperability.
I created a project to track this better: https://github.com/users/XenoAmess/projects/1
I hope I can finish this in 2 years.
Currently learning buildkit.