Global S3 Cache
This allows using an S3 bucket as a "global cache". All builds using the same cache settings will share the cache. This is really handy, for example, in a k8s setup where state is hard to maintain.
- refactored the s3 `FinalizeWithChain` function into a `finalizeBlobs` and a `finalizeBlobs`, to reuse them later on
- added an s3 `ListAllOjectsV2Prefix` function that lists everything matching a prefix
- added an s3 exporter that stores links as empty files and mini manifests containing layer chains.
- added an s3 importer that uses s3 list+get functionalities to query cache.
- added tests for exporter
Related: #4295, https://github.com/moby/buildkit/issues/3971
Hello, thanks for checking this out.
Design
The TL;DR of the design: the exporter takes the cache chain from a build, loads an in-memory manifest from it, and writes out a set of files that will be queried later on.
Since what matters to BuildKit for caching is the topology of the tree, for each node it generates a unique ID from the node's parents + its digest:
https://github.com/moby/buildkit/blob/4d0a55ec0068d223f9e97e332418f722d17e010d/cache/remotecache/s3/global_exporter.go#L147-L177
This makes sure that when a build re-runs, the exact same set of files is generated.
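To give a rough idea of what that derivation looks like, here is a minimal, simplified sketch (hypothetical `cacheNode`/`stableID` names, not the actual code linked above): the ID of a node is a digest over its own cache-key digest plus its parents' IDs, so identical chains always map to identical IDs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// cacheNode is an illustrative stand-in for a record in the cache chain.
type cacheNode struct {
	Digest  string       // the node's own cache-key digest
	Parents []*cacheNode // parent nodes in the cache chain
}

// stableID derives a deterministic ID from the node's digest and the IDs of
// its parents, so re-running the same build yields the same IDs.
func stableID(n *cacheNode) string {
	parentIDs := make([]string, 0, len(n.Parents))
	for _, p := range n.Parents {
		parentIDs = append(parentIDs, stableID(p))
	}
	sort.Strings(parentIDs) // order-independent for this sketch

	h := sha256.New()
	h.Write([]byte(n.Digest))
	for _, id := range parentIDs {
		h.Write([]byte(id))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	base := &cacheNode{Digest: "sha256:base"}
	child := &cacheNode{Digest: "sha256:child", Parents: []*cacheNode{base}}
	fmt.Println(stableID(child)) // same inputs always print the same ID
}
```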
Now that we have a stable ID for a node, 3 types of files related to it will be created:
1/ empty link files:
https://github.com/moby/buildkit/blob/4d0a55ec0068d223f9e97e332418f722d17e010d/cache/remotecache/s3/global_exporter.go#L185-L190
that will be used in `globalKeyImporter.WalkLinks` to give us, for a node, the child nodes reachable through a link
https://github.com/moby/buildkit/blob/4d0a55ec0068d223f9e97e332418f722d17e010d/cache/remotecache/s3/global_importer.go#L151-L169
2/ mini manifests (feel free to make me change that name; they are not so mini, they just contain only layer digests)
https://github.com/moby/buildkit/blob/4d0a55ec0068d223f9e97e332418f722d17e010d/cache/remotecache/s3/global_exporter.go#L216-L218
These mini-manifests basically list all the layers for that record (which has a unique ID recursively generated from its parent IDs). They will be used in `globalKeyImporter.WalkResults`
https://github.com/moby/buildkit/blob/4d0a55ec0068d223f9e97e332418f722d17e010d/cache/remotecache/s3/global_importer.go#L171
3/ backlinks: also empty files, very similar to links but pointing backwards, used at the end of the build to ask for the parents of nodes we didn't know about (these could be optional).
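To make the layout more concrete, a hypothetical S3 key scheme for these three file types could look like the sketch below (illustrative names only, not the exact keys the exporter writes): `WalkLinks` then becomes a prefix listing under `links/<parentID>/`, and `WalkResults` a single GET of the mini manifest.

```go
package s3cache

import "fmt"

// linkKey: an empty object whose existence records "parentID --linkDigest--> childID".
func linkKey(prefix, parentID, linkDigest, childID string) string {
	return fmt.Sprintf("%slinks/%s/%s/%s", prefix, parentID, linkDigest, childID)
}

// manifestKey: a small JSON object listing the layer digests for this record.
func manifestKey(prefix, id string) string {
	return fmt.Sprintf("%smanifests/%s", prefix, id)
}

// backlinkKey: an empty object that lets the importer walk from a child back to its parents.
func backlinkKey(prefix, childID, parentID string) string {
	return fmt.Sprintf("%sbacklinks/%s/%s", prefix, childID, parentID)
}
```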
How using JSON with big manifests affects performance.
Okay, taking a step back here. At HuggingFace we have Spaces (the closest thing you can think of would be a GitHub repo, but for ML; I'll be calling them repos for simplicity). These repos are cloned a lot, and every time a user changes something in one, a new BuildKit build is triggered. In some cases the Dockerfile is generated by us, but it has a lot of possible versions and can change often; in other cases the user creates and maintains it.
Currently we store cache manifests in per-repo files on S3, and link the cache from any parent repo to its 'child' repos. But this is not enough. There are cases where it's hard or impossible for us to link two repos together, yet the steps taken to build them are very similar; we would like the first build that runs them to cache the results for everyone. Moreover, some of these steps are expensive and time consuming (ML workloads).
Sharing a cache across all repos would give us a far better cache hit ratio in Docker builds and reduce overall build times; it would also allow us to simplify a lot of things. We are aware of the persistent-node approach with consistent hashing, but that kind of stateful architecture is brittle and hard to maintain on k8s. It is also hard to scale, at least harder than S3.
I'm fairly confident this will be useful for dev teams too, as they won't need to juggle branch imports anymore to cache steps, simplifying CI and making builds generally faster.
Re: The weird big manifest 🤦🏼
That's intentional; I can totally remove/rename this manifest as you wish. I use it in the exporter tests and currently load it to test against. The exporter needs a cache chain, which is why I committed a manifest that I load to get some cache chains and therefore a 'proper' cache setup. What would be a better way to test this? Would manually creating a cache chain in code be a better idea? I took this manifest from an actual complicated build, because it had duplicate records/layers and I wanted to make sure there wasn't any duplication, etc.
Moreover, could you help me figure out a good way to test the importer? Maybe I should mock S3, etc.
Is it ever-growing?
Well, technically yes, but not badly. Let me explain: if two builds do the same things, the exact same set of files is created, because the ID is generated from the digests and their position. But if some steps stop being used for a while, nothing deletes their files. So a vacuum-cleaner cronjob would be nice to implement, or (for smaller caches) deleting the bucket and creating a new one, or giving these files an expiration date, etc.
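As a rough illustration of the vacuum idea (assuming aws-sdk-go-v2, a hypothetical bucket/prefix, and using `LastModified` as a proxy for "still in use", which only holds if active builds keep rewriting their entries; this is not part of the PR):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket, prefix := "my-cache-bucket", "global-cache/" // hypothetical names
	cutoff := time.Now().AddDate(0, 0, -30)              // drop entries untouched for 30 days

	// List everything under the cache prefix and collect stale keys.
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	var stale []types.ObjectIdentifier
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			if obj.LastModified != nil && obj.LastModified.Before(cutoff) {
				stale = append(stale, types.ObjectIdentifier{Key: obj.Key})
			}
		}
	}

	// DeleteObjects accepts at most 1000 keys per call.
	for i := 0; i < len(stale); i += 1000 {
		end := i + 1000
		if end > len(stale) {
			end = len(stale)
		}
		if _, err := client.DeleteObjects(ctx, &s3.DeleteObjectsInput{
			Bucket: aws.String(bucket),
			Delete: &types.Delete{Objects: stale[i:end]},
		}); err != nil {
			log.Fatal(err)
		}
	}
	log.Printf("deleted %d stale cache objects", len(stale))
}
```

An S3 lifecycle expiration rule on the cache prefix could achieve roughly the same thing without running any job.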
Currently this is not solved; I'm not entirely sure how big of an issue it will be. But if we need something here, I'd like to try to open source that too.
Why is this S3 specific?
Nothing here is S3-specific; this can be ported to Azure and GH. We just don't use those.
Okay, again, thanks for your time. I know this is a lot to read. Feel free to reach out to me on the Docker Slack if you'd like to set up a call; I'm in there and DM'ed you once already (sorry if that wasn't the procedure).
If your goal is to avoid multiple separate cache manifests, then would it make sense to start from a tool that can just merge 2 manifests together into one? This doesn't need any knowledge of build semantics btw.
I see this changes the semantics from the remote cache model, where metadata is pulled before the build and layers on demand, to a model where metadata is also requested on demand.
Some issues with this:
- The `Query()` function from the scheduler event loop is called synchronously atm. This means it can't be long-running and the whole scheduler is stopped while it runs. This can (and should) be updated, but requires some (not so simple) changes in the scheduler.
- Without extra optimizations, I'm not sure this model is very performant. When you have 100 build steps, the cache lookup will need to be checked somewhere around 2x for each step at least, I think. The worst case should be a fully matching build, which ideally should be instantaneous. If every such query makes a remote request then that latency will become huge. I think the remote request needs to somehow carry more information for future changes as well to get the number of requests down, and that probably means a much more complicated DB backend.
- PTAL https://github.com/moby/buildkit/pull/4447 as well where there seems to be some issue causing lots of requests to this lookup function.
> Since what matters to BuildKit for caching is the topology of the tree, for each node it generates a unique ID from the node's parents + its digest:
How is this different from regular remote cache manifest format? All the chains are also unique there.
> if two builds do the same things, the exact same set of files is created, because the ID is generated from the digests and their position.
Note that same cache chain does not mean identical content (layer bytes) but content that is considered equal for cache imports. Eg. 3 containers with same command and --no-cache all create same cache checksum but likely different bytes.
> That's intentional; I can totally remove/rename this manifest as you wish. I use it in the exporter tests and currently load it to test against.
This wasn't embedded in a test file, so I didn't notice that. If it's only used in the test then we can look at it again at code-review time and don't need to remove it for now.
> If your goal is to avoid multiple separate cache manifests, then would it make sense to start from a tool that can just merge 2 manifests together into one? This doesn't need any knowledge of build semantics btw.
This is not exactly our goal. It would work, but not everywhere: we have cases where we cannot possibly link two repos, and some cases where it would be possible but a bit complicated. In those cases we'd still like the caches to work together.
> The `Query()` function from the scheduler event loop is called synchronously atm. This means it can't be long-running and the whole scheduler is stopped while it runs. This can (and should) be updated, but requires some (not so simple) changes in the scheduler.
I see your concern here. We run BuildKit in k8s, not in a StatefulSet but with a separate BuildKit instance per build, and in our case this is fine(-ish, could be better). I do see BuildKit sort of hang for ~3 seconds while downloading/listing ~170 files for our average Dockerfile size (big_manifest.json with ~20 steps). This sucks, but it is very small compared to what would happen without the cache hits, and I would like to improve on it. (Edit: I followed that hunch https://github.com/moby/buildkit/pull/4429#discussion_r1405832849 (GET vs LIST) and the time went down a lot; this is a feeling and I need to add a way to measure it.)
Can this freeze break a build?
What sorts of changes would you add here? Would you like me to take a stab at this?
> Without extra optimizations, I'm not sure this model is very performant. When you have 100 build steps, the cache lookup will need to be checked somewhere around 2x for each step at least, I think. The worst case should be a fully matching build, which ideally should be instantaneous. If every such query makes a remote request then that latency will become huge. I think the remote request needs to somehow carry more information for future changes as well to get the number of requests down, and that probably means a much more complicated DB backend.
- Some of my previous answers respond to this part, and the cache gains here still feel positive, at least in the cases I saw. 100-step Dockerfiles are rather rare (in my own limited experience), and in such cases I would maybe recommend splitting them a little more to leverage cache.
- This triggers a question: if I have more than 1 `FROM`, would the listing still be synchronous? My assumption here was that things would run in parallel.
- I'm curious what sorts of changes you would add here to carry more info per request. Can you elaborate a bit on that? Always happy to change things, etc. I'm trying to treat S3 like a local folder, so it behaves like the local cache would, but with S3 being a bit farther away. So it's not 100% as snappy as a local thing, but still positive in the bigger picture.
> How is this different from regular remote cache manifest format? All the chains are also unique there.
I tried to stay very similar, but the cache manager needs an ID per cache key. So instead of generating a random new string like the current manifest path does, I made these IDs static by deriving them from their parent IDs. This still differentiates them, but if two exactly similar builds run at the same time, the same IDs will be generated. This way we won't have 'clutter'.
> Note that same cache chain does not mean identical content (layer bytes) but content that is considered equal for cache imports. Eg. 3 containers with same command and --no-cache all create same cache checksum but likely different bytes.
Yup, the way I see it, it's like a web request: if the URL (the Docker command) is the same, it's fine to return the same page (layer chain). Then it's up to the user to deal with that, e.g. by adding dates to commands to make sure the apt-get layers don't get too old.
I moved the manifest to a better-suited location, with a better name too.
Thanks for your time !!! 😊
Edit: lots of tweaks before start of your day on your SF TZ
This is really cool to see. Shared cache would also be very valuable to us, for similar reasons. I'm curious to see what the performance looks like, as I share the concerns about latency when looking up build steps in a remote cache. I wonder if there's a good way to use heuristics to avoid checking remote cache for steps that can be reproduced fast and aren't worth the overhead of accessing a remote cache.
> we have cases where we cannot possibly link two repos, and some cases where it would be possible but a bit complicated.
I'm not sure I fully understand this. The merge could happen today between any two manifests generated by buildkit. They don't need shared parts or know anything about what happened during build time.
> Can this freeze break a build?
No, but it blocks the whole daemon from progressing forward with any build request/step.
> What sorts of changes would you add here? Would you like me to take a stab at this?
https://github.com/moby/buildkit/blob/7b462d437f09e0c22541875a8cf177e77ac840d5/solver/edge.go#L229 — the call in here can't happen directly but needs to set up a new async request with `f.NewFuncRequest`, like it is done for the cachemap/exec functions.
> 100-step Dockerfiles are rather rare (in my own limited experience), and in such cases I would maybe recommend splitting them a little more to leverage cache.
I'm not sure about the splitting comment. The problem atm is that the highest latency would be in a build that is split into more steps and that also matches the most cache. In BuildKit's design we want to promote build graphs that are more precisely defined, with the rough guideline being that if you split a task into more steps, you get similar exec speed (unless distributed) but better cache.
> This triggers a question: if I have more than 1 `FROM`, would the listing still be synchronous? My assumption here was that things would run in parallel.
All the `Query()` requests are synchronous atm. It works because atm no implementation of `Query()` makes remote requests, not even the remote cache backends.
> I'm curious what sorts of changes you would add here to carry more info per request. Can you elaborate a bit on that?
When you are doing a request like "are there matches for this checksum?", in addition to just saying true/false, a smart database could respond with a bool, a list of results that match, and a (limited) list of future checksum chains that are children of this checksum. Then no new requests need to be made for pulling the result or checking the child checksums. Obviously the server can't just respond with all children, as something common like `RUN apt-get update` will have tens of thousands of children at cloud scale. In that case we would need an extra request, but this would be the exception.
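To illustrate the shape of such a richer lookup response (purely hypothetical types, not an existing BuildKit or PR API), it could look roughly like:

```go
package cachelookup

// LookupResponse is a hypothetical richer answer to "are there matches for this checksum?".
type LookupResponse struct {
	Found bool // does any record match the queried checksum?

	// Results usable directly, so no follow-up "get result" request is needed.
	Results []CacheResult

	// A limited set of child checksum chains, so the next few lookups can be
	// answered locally instead of going back to the remote store.
	Children []ChecksumChain
}

type CacheResult struct {
	ID     string   // stable record ID
	Layers []string // layer digests for this record
}

type ChecksumChain struct {
	Checksum string           // child cache-key checksum
	Children []*ChecksumChain // further descendants, truncated server-side
}
```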
> we have cases where we cannot possibly link two repos, and some cases where it would be possible but a bit complicated.
> I'm not sure I fully understand this. The merge could happen today between any two manifests generated by buildkit. They don't need shared parts or know anything about what happened during build time.
That wouldn't be as simple in our case, because we have a lot of repos, as in 6 digits (and growing fast), and merging everything would probably not be optimal. They could be grouped into repos that do very similar things, but it is hard to know beforehand who does something similar to whom, which makes that option harder for me than this global cache.
For example, some of these repos download gigabytes of data at build time to run stuff on, which would turn into duplicate (big) cache blobs and, later on, long CPU usage.
> 100-step Dockerfiles are rather rare (in my own limited experience), and in such cases I would maybe recommend splitting them a little more to leverage cache.
> I'm not sure about the splitting comment.
Totally agree with you. When I said splitting, I was more referring to trying to have more `FROM` steps that don't need any cache; and generally it is probably rare to change all the lines of a Dockerfile often, with the ones towards the bottom usually changing more often, etc.
> The problem atm is that the highest latency would be in a build that is split into more steps and that also matches the most cache. In BuildKit's design we want to promote build graphs that are more precisely defined, with the rough guideline being that if you split a task into more steps, you get similar exec speed (unless distributed) but better cache.
Totally true. I understand why you'd like to avoid having the highest latencies in the builds that are split into multiple steps; otherwise Dockerfiles would all end up crammed into big lines, with many more issues. I think we can work on performance; I'll work on that aspect a little and come back with more concrete numbers.
> When you are doing a request like "are there matches for this checksum?", in addition to just saying true/false, a smart database could respond with a bool, a list of results that match, and a (limited) list of future checksum chains that are children of this checksum.
Super cool idea, and this is very doable on S3. Moreover, because the IDs of the nodes are unique to their 'past', this would sort of balance the number of possible children they have. Here's an idea: add the following options to the global cache:
- `predict_max_count`: int (max number of items to try to prefetch)
- `predict_max_depth`: int (if we have found some potential children, but fewer than `predict_max_count`, keep going down the tree to prefetch predictions until the quota is reached, breadth first)
In each `WalkLinks` call, we could try to list the possible children of a node while respecting these limits.
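As a rough sketch of what that prefetch could look like (hypothetical signature, with a `listChildren` callback standing in for an S3 prefix listing; not code from this PR):

```go
package s3cache

import "context"

// prefetchChildren walks descendants of rootID breadth-first, bounded by
// maxCount (total prefetched IDs) and maxDepth (levels below rootID).
// listChildren stands in for an S3 ListObjectsV2 call on the "links/<id>/" prefix.
// Deduplication of shared descendants is omitted to keep the sketch short.
func prefetchChildren(ctx context.Context,
	listChildren func(context.Context, string) ([]string, error),
	rootID string, maxCount, maxDepth int) ([]string, error) {

	var prefetched []string
	frontier := []string{rootID}

	for depth := 0; depth < maxDepth && len(frontier) > 0 && len(prefetched) < maxCount; depth++ {
		var next []string
		for _, id := range frontier {
			children, err := listChildren(ctx, id)
			if err != nil {
				return nil, err
			}
			for _, c := range children {
				if len(prefetched) >= maxCount {
					return prefetched, nil // quota reached
				}
				prefetched = append(prefetched, c)
				next = append(next, c)
			}
		}
		frontier = next
	}
	return prefetched, nil
}
```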
Lastly, I did some light tweaks:
- parallelised the s3 blob upload: this took the cache export time of my 20-step build from ~20s to ~3s, using 5 goroutines (a somewhat handpicked number; I can make this a parameter). This would also improve things for the already existing s3 exporter. A sketch of the pattern follows below.
- added a `context.Context` as the first parameter to each cache call; this will allow us to sort of know the time spent in there with traces, because currently it is a black box.
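For reference, a minimal sketch of the bounded-parallel upload pattern (assuming golang.org/x/sync/errgroup and a hypothetical `Blob` type plus `uploadBlob` callback; not the exact code in this PR):

```go
package s3cache

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Blob is a hypothetical stand-in for a cache blob to upload.
type Blob struct {
	Key  string
	Data []byte
}

// uploadBlobs uploads blobs with at most 5 concurrent goroutines;
// the first error cancels the shared context and is returned by Wait.
func uploadBlobs(ctx context.Context, blobs []Blob, uploadBlob func(context.Context, Blob) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(5) // handpicked concurrency; could be made a parameter
	for _, b := range blobs {
		b := b // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return uploadBlob(ctx, b)
		})
	}
	return g.Wait()
}
```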
TBC 😊
Edit:
- currently that 20-step build takes ~7s fully cached, vs ~1s with a manifest
- I also looked at `NewFuncRequest`; it looks like a great improvement, yeah, though non-trivial, and I think it should be done in another PR
- some of this code is already in prod for a subset of users, and doing great!
Hello there. After a bit of a trial period we realised that this creates a lot of duplicate cache entries. Some sort of GC will need to be implemented; also, in the cache manager at write time we currently don't know whether an entry came from the cache or not, so the manager rewrites an entry every single time.
TBC