buildkit icon indicating copy to clipboard operation
buildkit copied to clipboard

Fix cache cannot reuse lazy layers

Open ktock opened this issue 3 years ago • 3 comments

Fixes https://github.com/docker/buildx/issues/1306

Currently, cache cannot reuse lazy layers that returns cache.NeedsRemoteProviderError on cacheresult.cacheResultStorage.Exist(). This causes the issue that cache doesn't reuse lazy layers as reported in https://github.com/docker/buildx/issues/1306 . This commit tries to fix this issue by allowing cache.NeedsRemoteProviderError during cache lookup.

ktock avatar Sep 12 '22 21:09 ktock

cc @sipsma @tonistiigi @gabrieldemarmiesse

ktock avatar Sep 12 '22 22:09 ktock

Is this specific to stargz or for unpulled layers as well. Eg. cache match for merge(image(), ..).

Thinking out loud: not sure if a cache match on unpulled non-stargz layers is really that useful. If you need to pull the layers anyways then that's as good as a cache miss. I'm guessing that for stargz this matters because you can have a "half-pulled" layer where some of the data is present locally but you still need a remote provider for the unpulled data, is that correct @ktock?

For the change itself, it seems like the main code path where Exists gets called is CacheManager.Records, which I think only gets called in solver/edge.go when e.cacheMap is available. That in turn probably means that we could plumb the cache opts from cacheMap through ctx to Exists and verify that the caller actually has a remote provider hooked up? Maybe? It's obviously kind of convoluted, so not sure if this actually adds up. If it does make sense though then maybe that'd help guarantee the result is actually going to be loadable.

The other place Exists gets called is ReleaseUnreferenced, which we'd need to handle too.

If there's a different way of guaranteeing the record will be loadable that's simpler then I'm all for it, the above is just my first thought looking through the code paths.

sipsma avatar Sep 17 '22 22:09 sipsma

@tonistiigi @sipsma

Thank you for taking a look at this.

I'm guessing that for stargz this matters because you can have a "half-pulled" layer where some of the data is present locally but you still need a remote provider for the unpulled data, is that correct @ktock?

Thank you for elaborating this. Yes, it's correct.

For the change itself, it seems like the main code path where Exists gets called is CacheManager.Records, which I think only gets called in solver/edge.go when e.cacheMap is available. That in turn probably means that we could plumb the cache opts from cacheMap through ctx to Exists and verify that the caller actually has a remote provider hooked up? Maybe? It's obviously kind of convoluted, so not sure if this actually adds up. If it does make sense though then maybe that'd help guarantee the result is actually going to be loadable. The other place Exists gets called is ReleaseUnreferenced, which we'd need to handle too.

Thank you for the suggestion. Fixed the patch to pass cache opts to CacheManager.Records through ctx (as similar as what is done by sharedOp.LoadCache calling CacheManager.Load).

ktock avatar Sep 19 '22 10:09 ktock

@ktock Could you also take a look at https://github.com/docker/buildx/issues/1325#issuecomment-1253129431 . Lmk if you think that is completely different.

tonistiigi avatar Sep 30 '22 19:09 tonistiigi

@ktock Could you also take a look at https://github.com/docker/buildx/issues/1325#issuecomment-1253129431 . Lmk if you think that is completely different.

@tonistiigi @sipsma It seems that https://github.com/docker/buildx/issues/1325#issuecomment-1253129431 can be addressed separately from this PR. (Please see also https://github.com/moby/buildkit/pull/3229). Could we move this PR forward if the patch looks fine?

ktock avatar Oct 27 '22 07:10 ktock

@tonistiigi It seems that the call of Records API can reach to (worker/base).Worker.LoadRef() which calls cache.cacheManager.Get() API. Though the current implementation of the Get API checks the existence of descHandlers but doesn't look like cause slow operation like unlazying of the layer. It calls Stat API of the snapshotter but it's just a metadata lookup on the bbot db.

ktock avatar Nov 04 '22 08:11 ktock

CI passed, can we move this patch forward? @tonistiigi

ktock avatar Jan 25 '23 01:01 ktock