Proposal: Allow setting cache expiration time for build steps
There are certain build steps that do not generate a very stable cache key. E.g. container runs are mainly cached by the command arguments. For commands that access the network, their meaning may completely change over time without the arguments changing. A simple example is installing a package and running something like `apt-get update`. The cache generated for this step never expires unless it is invalidated with a build-arg or `--no-cache[-filter]` is used.
Ideally, we would want to avoid these commands and define package installs more precisely so BuildKit can track the dependencies and their expiration correctly. This can be done with custom frontends but is hard in something like Dockerfile.
A possible solution would be to allow a build step to define the expiration time for the cache that it generates.
The simplest way to implement it would be to save the creation and expiration time when the cache record is created (the creation time is already saved) and then delete the records when they expire. There are multiple reasons why this doesn't work with the current cache model, though.
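For concreteness, the naive version would be roughly the following sketch (the types and the TTL field are illustrative only, not BuildKit's actual storage); the reasons it falls short follow below.

```go
package cache

import "time"

// naiveRecord is an illustrative stand-in for a cache record; BuildKit's real
// records only track the creation time today.
type naiveRecord struct {
	ID        string
	CreatedAt time.Time
	TTL       time.Duration // hypothetical per-record expiration
}

// expired reports whether the record should be deleted at time now.
func (r naiveRecord) expired(now time.Time) bool {
	return r.TTL > 0 && now.After(r.CreatedAt.Add(r.TTL))
}

// pruneExpired implements the "delete the records when they expire" idea.
func pruneExpired(records []naiveRecord, now time.Time) []naiveRecord {
	kept := records[:0]
	for _, r := range records {
		if !r.expired(now) {
			kept = append(kept, r)
		}
	}
	return kept
}
```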
First of all, BuildKit cache chains do not require cache records to be available for intermediate steps (this is how BuildKit cache can step over stages in multi-stage builds, for example). This means that it is unlikely that the cache for `apt-get update` will ever be pulled, and the question is more about pulling the cache for any step that depended on the `apt-get update` result later on in the build. But the cache records for these steps could be created much later.
Secondly, if `apt-get update` does run again (either by a no-cache match or because it had expired), its result is saved under the same cache key as the previous run, meaning that the cache for the next steps would automatically be usable on top of it.
Thirdly, removing content doesn't work with remote cache backends, where we can't do any data modifications after the build.
An alternative approach would be to make changes to how the cache chains are stored in https://github.com/moby/buildkit/blob/master/solver/cachestorage.go#L16 .
Currently, when a cache record is saved, it calls the `AddResult()` function to register a snapshot with a specific cache location. When a process from the same cache location runs multiple times, these snapshots are added under the same record. On snapshot load, the client can query all records and usually picks the newest one.
For the expiration to work, snapshots from different execution times can't be saved under the same cache location but need to create a separate branch point in the cache graph. First of all, we would need to distinguish that the cache key is for an LLB Exec operation, as expiration only makes sense for these. Then, for such keys, instead of calling `AddResult()` with the previous ID value, a completely new ID needs to be generated, and a creation time needs to be associated with this key. The creation time for this key can never change in the future, so no new records are allowed to be added under the same key. The same creation time is also added to `AddLink` https://github.com/moby/buildkit/blob/master/solver/cachestorage.go#L26 to add extra time constraints to cache key links.
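Roughly, as a sketch of the storage-side change (the type and method names below are illustrative assumptions layered on top of the interface in solver/cachestorage.go, not its actual definition):

```go
package cache

import (
	"time"

	digest "github.com/opencontainers/go-digest"
)

// result is an illustrative stand-in for solver.CacheResult.
type result struct {
	ID        string
	CreatedAt time.Time
}

// link loosely mirrors solver.CacheInfoLink; CreatedAt is the extra field
// this proposal would record so that links carry a time constraint.
type link struct {
	Input, Output int
	Digest        digest.Digest
	Selector      digest.Digest
	CreatedAt     time.Time
}

// execCacheStorage sketches only the proposed additions; the existing
// CacheKeyStorage methods are left out, and the method names are assumptions.
type execCacheStorage interface {
	// For LLB Exec keys, register the result under a freshly generated key ID
	// together with an immutable creation time, instead of appending another
	// snapshot to the existing record.
	AddExecResult(newKeyID string, createdAt time.Time, res result) error

	// AddLink stores the link together with its creation time (carried in the
	// link struct above) so later cache queries can apply time constraints.
	AddLink(keyID string, l link, targetID string) error
}
```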
The `WalkLinks` method https://github.com/moby/buildkit/blob/master/solver/cachestorage.go#L27 gets called when a cache match query is happening. This method needs to be extended with a `maxAge time.Duration` property. Every key associated with a creation time (from `AddLink`) is compared against this `maxAge`, and links are only returned if they are new enough. This makes sure that any future matches for other steps also need to be newer than this record, and even if they are newer, they can't be based on any older cache in their parent chain from that point.
This method should also work for remote cache backends as long as the creation times for cache locations are exported.
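A minimal sketch of that filtering step (the stored-link layout and function below are illustrative, not the actual implementation):

```go
package cache

import "time"

// storedLink pairs a link target with the creation time recorded by AddLink.
type storedLink struct {
	TargetID  string
	CreatedAt time.Time
}

// walkLinks visits only link targets that are new enough. maxAge == 0 means
// "no constraint", so callers without a cache limit keep today's behavior.
func walkLinks(links []storedLink, maxAge time.Duration, fn func(targetID string) error) error {
	now := time.Now()
	for _, l := range links {
		if maxAge > 0 && now.Sub(l.CreatedAt) > maxAge {
			continue // too old for this query; don't offer it as a match
		}
		if err := fn(l.TargetID); err != nil {
			return err
		}
	}
	return nil
}
```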
In a Dockerfile setting, cache expiration could look something like:
```dockerfile
FROM debian
# cache=12h
RUN apt-get update
RUN go build ...
```
It could also make sense to think about some categorization of the cache records. This proposal is already separating the cache for LLB Exec from other cache that guarantees better uniqueness. Another idea could be to distinguish steps with `--network=none` (or if we could detect that a step did not use the network when it ran) from steps that make network requests and are therefore less trustworthy. A similar separation could be imagined for cache from CI vs. local cache (or, in the future, between signed and unsigned cache).
```dockerfile
# cache=12h
RUN apt-get update
RUN go build ...
```
Does the cache expiry in this case only apply to the first `RUN` instruction?
@dvdksn It only applies to the `apt-get` command, but in this example, if `apt-get` runs again, it would mean that `go build` would also run again, just because the dependencies of `go build` have changed (the same is not true for multi-stage build links).
Not commenting on the proposal yet, but I think a syntax of `RUN --cache=12h` is probably more appropriate considering some of the other syntax.
Not commenting on the proposal yet, but I think a syntax of `RUN --cache=12h` is probably more appropriate considering some of the other syntax.
Agree with @jsternberg but would like a more meaningful flag like `--cache-ttl`.
```dockerfile
FROM debian
# cache=12h
RUN apt-get update
RUN go build ...
```
If we want to use comments in Dockerfile, I think we should have a prefix like:
```dockerfile
FROM debian
# org.mobyproject.buildkit.cache-ttl=12h
RUN apt-get update
RUN go build ...
```
I like this idea :smile:
Secondly, if `apt-get update` does run again (either by a no-cache match or because it had expired), its result is saved under the same cache key as the previous run, meaning that the cache for the next steps would automatically be usable on top of it.
I'm not super familiar; why is this the case? Is there a legitimate reason for the behavior? I can't quite work out why an `apt-get update` result would get saved but not update the cache key. If you have any code links, I would be interested in reading more.
Every key associated with a creation time (from `AddLink`) is compared against this `maxAge`, and links are only returned if they are new enough
So the control of expiry is down to the consumer then, as opposed to the creator of the cache record? I think that makes sense, but it does mean that if user A creates a cache for an `apt update`, user B can use it - even if it's older than how long user A intended it to be around for.
I don't think a `--cache-ttl` flag would be right then - users would associate that with DNS-like semantics, essentially how long the cache lives, which isn't quite right. It's about how long the cache can be consumed for.
I'm not super familiar; why is this the case? Is there a legitimate reason for the behavior? I can't quite work out why an `apt-get update` result would get saved but not update the cache key. If you have any code links, I would be interested in reading more.
The digest computed in `CacheMap()` will always be the same for `apt-get update` (https://github.com/moby/buildkit/blob/b9fcd7688a5d7d4461f0c1706216b6d4b3e584e6/solver/llbsolver/ops/exec.go#L153). If you run it 3 times you will get 3 snapshots, but they all have the same cache checksum. The condition for matching cache is to find the chain of checksums from past executions that matches the current execution.
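As a simplified illustration (this is not the actual `CacheMap()` code, which digests the marshalled operation definition), the cache checksum depends only on the step's definition, so repeated runs of the same command hash identically:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// execCacheKey is a simplified illustration: the key depends only on the
// command definition, never on when or how often the command ran.
func execCacheKey(args []string) string {
	h := sha256.New()
	for _, a := range args {
		h.Write([]byte(a))
		h.Write([]byte{0}) // separator so ["ab","c"] != ["a","bc"]
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// Running "apt-get update" three times produces three snapshots, but the
	// cache checksum is the same every time.
	k1 := execCacheKey([]string{"apt-get", "update"})
	k2 := execCacheKey([]string{"apt-get", "update"})
	fmt.Println(k1 == k2) // true
}
```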
So the control of expiry is down to the consumer then, as opposed to the creator of the cache record?
Yes, correct. Under this proposal, `# cache=12h` on a specific `RUN` statement does not do anything if there isn't a cache match. The creation time is saved for any `RUN`, and the cache expiry condition is only evaluated when deciding whether the previous cache should be loaded or not ("is there any cache record available that is not older than 12h?").
If user A creates a cache for `apt-get update` and had a `# cache=12h` condition in the Dockerfile, and user B now has another Dockerfile with `apt-get update` without any cache restriction, B will still get the cache after 12h has passed.
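As a toy sketch of this consumer-side check (the record layout and helper name are made up for illustration), the same stored records answer differently depending on the maxAge the consumer queries with:

```go
package cache

import "time"

// record is an illustrative cache record; only its creation time matters here.
type record struct {
	ID        string
	CreatedAt time.Time
}

// newestWithin returns the newest record not older than maxAge, or false if
// none qualifies. maxAge == 0 means the consumer set no restriction.
func newestWithin(records []record, maxAge time.Duration, now time.Time) (record, bool) {
	var best record
	found := false
	for _, r := range records {
		if maxAge > 0 && now.Sub(r.CreatedAt) > maxAge {
			continue // too old for this consumer's "# cache=12h"
		}
		if !found || r.CreatedAt.After(best.CreatedAt) {
			best, found = r, true
		}
	}
	return best, found
}

// User A (maxAge = 12h) gets no match from a 20h-old record and re-runs the
// step; user B (maxAge = 0, no restriction) still loads the same old record.
```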
This one is probably very related; it's worth going through that thread to see if there are use-cases there that can help with the design:
- https://github.com/moby/moby/issues/1996
If you run it 3 times you will get 3 snapshots, but they all have the same cache checksum. The condition for matching cache is to find the chain of checksums from past executions that matches the current execution.
Ahhh right, yes, thanks for the clarification! I was getting confused, thinking that maybe `ComputeDigestFunc` would be used here - but that's only used to calculate the slow cache path; it doesn't invalidate the existing fast cache path.
Also interested in @tianon and @yosifkit's input on this proposal (I know @tianon had feedback / ideas on https://github.com/moby/moby/issues/1996, and both have dealt with many Dockerfiles for the official images, so may have requirements / thoughts on UX / feature)
/cc @dgageot as well; perhaps you can add your use-case (`ADD` from remote and slow network connections).
Sorry for the delay in getting to looking at this! :bow:
I'm actually struggling to understand real-world use cases here. :see_no_evil:
In the case of `apt-get update`, it's really almost always only necessary if you're going to `apt-get install`, for which you always need fresh package lists, hence `apt-get update`, so I really don't see a compelling reason to have `RUN apt-get update` all by itself in the first place (even running `apt-get update` explicitly at all is only because APT doesn't have something like `apk --no-cache`), so I'd love to hear more use cases. :sweat_smile:
@tianon The example above meant that any command depending on the new files from `apt update` would need to run again as well, e.g. install. There are no special rules for one `RUN` step skipping cache and then the one after it picking up cache again. Yes, in this case the ideal version would be if the `apt update` result could figure out whether there was anything in that updated run that actually impacted the next `apt install`. This can theoretically be done with some multi-stage construction where, after `apt update`, a file is created that defines uniqueness and is then copied to another stage. Then the cache reuse for commands in that other stage would be defined based on the checksum of that "uniqueness" file. This will probably be quite complex without some code reuse, though, and a custom frontend like https://github.com/tonistiigi/buildkit-alpine would work better.
Ideally, we would want to avoid these commands and define package installs more precisely so BuildKit can track the dependencies and their expiration correctly. This can be done with custom frontends but is hard in something like Dockerfile.
Sorry for the delay in getting to looking at this! 🙇
I'm actually struggling to understand real-world use cases here. 🙈
In the case of `apt-get update`, it's really almost always only necessary if you're going to `apt-get install`, for which you always need fresh package lists, hence `apt-get update`, so I really don't see a compelling reason to have `RUN apt-get update` all by itself in the first place (even running `apt-get update` explicitly at all is only because APT doesn't have something like `apk --no-cache`), so I'd love to hear more use cases. 😅
@tianon, @tonistiigi referred me here after a conversation in Slack, so I just wanted to weigh in: it's very important for a few steps in our pipeline to get rerun every day -- apt is one, because we want e.g. security updates, and there are also internal packages that we consume as unversioned dependencies. At the same time, those installs are expensive and we build these images on every commit, so I don't want to throw out caching for them entirely. The ideal configuration would be a cache that we could configure to expire e.g. every 24h.
UPDATE: I ultimately got the behavior I wanted by
- adding a built-at annotation to all my images
- pulling the build time of the most recent deps build by checking the annotation with the registry API
- injecting a `RUN echo "{BUILD_TIME_OF_DEPS}" > /deps-built-at` into whatever Dockerfile my script is building
It works, but it is clunky and it took time -- a cache expiry feature would have totally eliminated the need for this.