distribution-spec

Discussion: "Diff" Pulls?

Open dweekly opened this issue 3 years ago • 3 comments

Should we consider "image diffs" as part of the spec?

Most clients pulling an image already have a recent, slightly older version of that image. Some clients, such as those in "edge" deployments, may have very constrained WAN bandwidth, where it's important to maximize transmission efficiency. In most CI/CD setups, the actual binary differences between one image version and the next may be very small relative to the overall image size.

Consequently it would seem advantageous for:

  1. A client to be able to advertise that it already has a given, older version when requesting a pull.
  2. The server, if in possession of the older AND newer version, to be able to vend a diff to the client.

After the client receives and applies the diff, the client could use the hash from the manifest to verify that the "patched" image is indeed identical to the new image. If a discrepancy is noted (e.g. because the diff was corrupted or something went wrong with the patching process), the client could log the error and fall back to the current method of just doing a full pull of the new image.
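The verify-then-fall-back flow described above could be sketched roughly as follows. This is only an illustration, not part of any spec: `fetch_diff`, `apply_patch`, and `full_pull` are hypothetical callables standing in for whatever the client would actually use (e.g. a bspatch wrapper and the existing blob pull), and the digest format assumes sha256.

```python
import hashlib
import logging
from typing import Callable

log = logging.getLogger("diff-pull")


def sha256_digest(data: bytes) -> str:
    """Return a digest string in the 'sha256:<hex>' form used by manifests."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def pull_with_diff(
    old_blob: bytes,
    expected_digest: str,
    fetch_diff: Callable[[], bytes],            # hypothetical: downloads the diff
    apply_patch: Callable[[bytes, bytes], bytes],  # hypothetical: e.g. wraps bspatch
    full_pull: Callable[[], bytes],             # hypothetical: today's full pull
) -> bytes:
    """Try the diff path first; fall back to a full pull on any failure."""
    try:
        candidate = apply_patch(old_blob, fetch_diff())
    except Exception as exc:  # download or patching failed
        log.error("diff pull failed (%s); falling back to full pull", exc)
        return full_pull()

    if sha256_digest(candidate) == expected_digest:
        return candidate

    # The patched result does not match the manifest: discard it.
    log.error("patched blob digest mismatch; falling back to full pull")
    return full_pull()
```

Because the manifest digest is checked after patching, a corrupted or buggy diff costs extra bandwidth but cannot produce a wrong image.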

Note that a given diff from Version $X to Version $Y is cacheable and immutable.
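One way to picture why such a diff is cacheable: if both versions are identified by content digests, the pair of digests fully determines the diff, so it can be computed once and reused. A minimal sketch, assuming a hypothetical in-memory cache and a hypothetical `compute` callable that produces the diff bytes:

```python
from typing import Callable, Dict


def diff_cache_key(old_digest: str, new_digest: str) -> str:
    """Key a diff by the digests of both endpoints (hypothetical scheme)."""
    return f"{old_digest}..{new_digest}"


class DiffCache:
    """Toy in-memory cache; entries never need invalidation because
    content digests never point at different bytes."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get_or_compute(
        self, old_digest: str, new_digest: str, compute: Callable[[], bytes]
    ) -> bytes:
        key = diff_cache_key(old_digest, new_digest)
        if key not in self._store:
            self._store[key] = compute()  # only runs on the first request
        return self._store[key]
```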

Tools like bsdiff & bspatch or similar could be used to actually perform the image diffing and patching.

Questions:

  1. Does the problem formulation make sense?
  2. Thoughtful/custom packaging of stacked image layers could allow a clever builder & client to effectively perform this function without modifying the repository protocol or implementation, but this would severely constrain the set of containers for which this efficient distribution could be deployed. So the hope would be for a solution that most clients could use with most containers and most repositories to efficiently update their images.

dweekly avatar Oct 28 '22 19:10 dweekly

Thread with some comments from @sudo-bmitch here: https://cloud-native.slack.com/archives/C01GVR8SY4R/p1666743836052589

I agree there is scope here for lowering the bandwidth of image distribution, but I feel there may be more elegant solutions available.

Is this something that has been implemented already and that you are interested in making more widely available? Perhaps showing some real-world numbers here could help.

Jamstah avatar Nov 03 '22 16:11 Jamstah

Definitely agree that an implementation would be useful.

Summing up my comments from Slack: I'd avoid doing this in registries, which tend to avoid complex processing. If this is going to be done, I'd push to do it client-side, with layers stored using estargz so that clients can make range requests for just the parts of a blob they need.
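For illustration, a client-side range request for part of a blob could be built like this. This is a sketch under assumptions: the registry host and repository name are made up, the digest is a placeholder, and whether a given registry honors `Range` on blob GETs would need to be checked. The `/v2/<name>/blobs/<digest>` path is the blob endpoint from the distribution spec. The request is only constructed here, not sent.

```python
import urllib.request


def blob_range_request(
    registry: str, name: str, digest: str, start: int, length: int
) -> urllib.request.Request:
    """Build a GET for bytes [start, start + length) of a blob."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    url = f"https://{registry}/v2/{name}/blobs/{digest}"
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical values, for illustration only.
req = blob_range_request("registry.example.com", "library/app", "sha256:<digest>", 100, 50)
```

A server that supports the range would be expected to answer with `206 Partial Content`; a client should be prepared for a plain `200` with the whole blob instead.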

In practice, I think there are several challenges:

  • How do clients know which blobs to compare? Builds can change the commands being run, reorder or delete steps, insert intermediate steps, or combine multiple steps together.
  • Do clients retain the previous layer blobs after unpacking? Without that, there may not be something from which to start the diff.
  • Can clients reproduce a blob, especially if there are settings in compression algorithms that change between implementations?
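The last point can be demonstrated with nothing more than gzip settings. In this small sketch (the layer contents are made up), the same uncompressed bytes are compressed at two levels; both decompress to identical content, yet the compressed blobs, and therefore their digests, differ:

```python
import gzip
import hashlib

# Made-up stand-in for an uncompressed layer tarball.
layer = b"".join(f"file-{i}: {i * i}\n".encode() for i in range(5000))

# mtime=0 removes the timestamp, so the compression level is the only variable.
fast = gzip.compress(layer, compresslevel=1, mtime=0)
best = gzip.compress(layer, compresslevel=9, mtime=0)

same_content = gzip.decompress(fast) == gzip.decompress(best)
same_blob_digest = hashlib.sha256(fast).digest() == hashlib.sha256(best).digest()
```

A client that patches the uncompressed content and then recompresses it would need to match the original compressor's settings exactly to arrive at the digest listed in the manifest.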

sudo-bmitch avatar Nov 03 '22 16:11 sudo-bmitch

A couple more options:

  1. Mirror more of your images locally.
  2. Since dist-spec returns Location headers, maybe there is an opportunity to make it more CDN-friendly.

rchincha avatar Nov 03 '22 17:11 rchincha