Image caching logic doesn't take into account dependencies from intermediate stages
There is a scenario where the image caching logic incorrectly skips building a Dockerfile when that Dockerfile has an intermediate stage based on an image that has a cache miss. An example of this scenario is the this build (internal link). That build was triggered from this PR: https://github.com/dotnet/dotnet-buildtools-prereqs-docker/pull/1207. In that PR, a change was made only to the crossdeps-builder Dockerfile. The build only ended up publishing that crossdeps-builder image and none of the images that are dependent on it.
Here is an example Dockerfile with such a dependency: https://github.com/dotnet/dotnet-buildtools-prereqs-docker/blob/642ddfa47486760a15781758c7e8c0709c4c96c0/src/azurelinux/3.0/net9.0/cross/amd64/Dockerfile#L3. The image caching logic only accounts for the dependency of the final image. In this case, the final image is based on crossdeps-llvm, which has not changed. So the cross Dockerfile never gets built. But it needs to be built because the content it contains is copied from crossdeps-builder, which has changed.
Instead of adding lots of complexity to our custom caching mechanism, maybe it's time to evaluate using Docker's built-in caching. There have been lots of changes since BuildKit became the default. If we store the cache on the registry, Docker should cache most or all layers using the normal cache invalidation rules, which includes base image changes and individual Dockerfile instructions. If we use the "max" cache option, then it'll also store intermediate stages/layers on the registry.
Instead of adding lots of complexity to our custom caching mechanism, maybe it's time to evaluate using Docker's built-in caching.
Maybe. But that defeats the whole purpose of https://github.com/dotnet/docker-tools/pull/1449. 🤷♂️
[Triage] One way to fix this would be to include all FROM images in the Dockerfile in the image-info file on https://github.com/dotnet/versions. Then we could re-build when any of those images change. However, this is not a small amount of work for this one scenario that really only affects the buildtools-prereqs repo. Therefore, we should close this as not planned for now, and consider re-opening in the future if it becomes a bigger issue.
[Triage] This would still be high-cost to fix, still only applies to the buildtools prereqs repo, and still has valid workarounds. A lower-cost solution would be to write a guide/workflow on the buildtools-prereqs repo for handling these scenarios.
To validate these scenarios in PRs, it would also be good to add a no-cache build leg to that repo. Special care should be taken to avoid spending too much compute time with this.
This has been on the sprint column for 5 months. I am moving it back to current release. @mthalman I see this is on two boards at once. Should it be removed from the .NET Docker board?
Yeah, I think it's fine to remove it since it only applies to the buildtools repo.
This continues to be a problem - the latest images produced from https://github.com/dotnet/dotnet-buildtools-prereqs-docker/pull/1516 don't have clang 21.1.3. (The crossdeps-builder images do, but the dependent images like crossdeps-llvm-amd64, cross-amd64, etc don't as described in the issue above.) I don't think it's an acceptable solution to require every version update in crossdeps-builder to touch the dependent dockerfiles. Could we reconsider addressing this? @richlander
If I were going to fix this today, I'd probably add a new FromImages property to PlatformData alongside what already exists.
https://github.com/dotnet/docker-tools/blob/46e34e7be617899ad8e243a66ccf60c03786adac/src/ImageBuilder/Models/Image/PlatformData.cs#L35-L36
You would need to add more logic to BuildCommand.cs to fill out the FromImages property in PlatformData. I would probably add it to SetPlatformDataBaseDigest since it already does something similar:
https://github.com/dotnet/docker-tools/blob/46e34e7be617899ad8e243a66ccf60c03786adac/src/ImageBuilder/Commands/BuildCommand.cs#L257-L285
Then, we would need to update ImageCacheService.cs to invalidate the cache when any of the FROM images were changed, not just the base image digest.
This would also probably get us some rudimentary automated re-builds for FROM scratch images. Related: https://github.com/dotnet/dotnet-docker/issues/1455