contenthash: implement proper Linux symlink semantics
This series fixes the symlink following implementation of cache/contenthash. The primary issue with the previous implementation is that it assumed that path.Join is a reasonable way of implementing trailing symlink following -- this is not correct on Linux. path.Clean is a Plan9-ism, and while it is useful, on Linux symlinks are resolved left-to-right and thus lexically transforming a/b/../c to a/c is incorrect if b is a symlink. The paper describing Plan9's path_clean makes reference to this Unix-ism they fixed by removing symlinks from Plan9.
As the trailing symlink logic was written in quite a few places, all of the implementations needed to be fixed. The headline changes are:
- All of the symlink following implementations are now iterative, based on https://github.com/cyphar/filepath-securejoin. Because of the requirement to apply each component in a symlink individually, recursive implementations end up with a lot of wasted work.
- The
followflag now meansfollowTrailing(O_NOFOLLOW) everywhere. Previously,getFollowLinksandcc.checksumused it to mean "don't follow any links" (a-laRESOLVE_NO_SYMLINKS) whilecc.Checksumandcc.checksumFollowused it to meanO_FOLLOW. There was only one user of theRESOLVE_NO_SYMLINKSimplementation (cc.includedPaths) and it appears it was only used as an optimisation (when in fact, since we are iterating over the keys in the cache, the key definitely exists,cc.checksumwill just do a simple lookup regardless ofRESOLVE_NO_SYMLINKS). cc.checksumFollowis gone (now thatchecksumsupportsfollowTrailingdirectly), andcc.checksumNoFollowandcc.lazyChecksumhave been renamed and reworked slightly. Removing the loop inchecksumFollowalso means we will detect symlink loops 255 times faster :smile_cat:, not to mention that we now don't do a bunch of unnecessary scanning and checksumming when dealing with trailing symlinks.needsScannow just usesgetFollowLinksCallbackto track the status of a lookup to see if there is a non-symlink path component that has already been scanned in the cache. This removes one extra implementation of symlink lookups.scanPathnow does a scan ofpath.Dirof the resolved path, rather than resolving the lexical directory of the requested path (this meansscanPath,needsScan, and thusrootPathnow also takefollowTrailing). I suspect that always fully-resolving the path would be okay (because we fill in symlinks during resolution) but in the case offollowTrailing=falseit seems unclear to me which directory "should" be scanned. The old implementation would scan all of them (becausechecksumFollowwould re-apply the trailing link and re-checksum everything).
Signed-off-by: Aleksa Sarai [email protected]
For reference, here is a Dockerfile which reliably triggered this issue:
FROM alpine:3.19 AS base
USER root
RUN mkdir -p /target/dir /links1 /links2 ; \
touch /target/dir/foo ; \
ln -s ../target/dir /links1/dir_rel ; \
ln -s /target/dir /links1/dir_abs ; \
ln -s ../target /links1/target_rel ; \
ln -s /target /links1/target_abs ; \
ln -s ../links1/ /links2/links1_rel ; \
ln -s /links1/ /links2/links1_abs
FROM scratch as test_0
COPY --from=base /target/dir /data/0
COPY --from=base /links1/dir_rel /data/1
COPY --from=base /links1/dir_abs /data/2
COPY --from=base /links1/target_rel/dir /data/3
COPY --from=base /links1/target_abs/dir /data/4
COPY --from=base /links2/links1_rel/dir_abs /data/5
COPY --from=base /links2/links1_abs/dir_abs /data/6
COPY --from=base /links2/links1_rel/target_abs/dir /data/7
COPY --from=base /links2/links1_abs/target_abs/dir /data/8
# The following COPY commands fail.
FROM scratch as test_1
COPY --from=base /links2/links1_rel/dir_rel /data/9
COPY --from=base /links2/links1_abs/dir_rel /data/a
COPY --from=base /links2/links1_rel/target_rel/dir /data/b
COPY --from=base /links2/links1_abs/target_rel/dir /data/c
The getFollowParentLinks thing that exists in one patch and is removed in another is a bit unfortunate but I don't see a nice way of splitting the fixes to getFollowLinks and the followTrailing logic without making one mega patch. If you'd prefer that, let me know and I'll squash things down.
Maybe I should get a dummy PR merged so you don't need to re-trigger the CI :sweat_smile:.
Not gotten a chance to dive into this yet, but just a quick comment - I wonder if we need any changes in https://github.com/tonistiigi/fsutil as well? There's some logic there for following symlinks as well, and other path resolution-y things.
Yeah, I looked at fsutil earlier and it does have bugs with symlink resolution (in short, anything that does filepath.Join(foo, bar.Linkname) or filepath.Clean(bar.Linkname) probably has bugs).
However, it wasn't clear to me how it's used by buildkit/docker/containerd. I can write some patches, but if they have subtle semantics (like getFollowLinks does in buildkit) it might be difficult to figure out whether or not I'm breaking something.
(It's a little frustrating that I spent so much time getting RESOLVE_IN_ROOT merged but we can't use it. I really need to pick up the leftover stuff in https://github.com/openSUSE/libpathrs...)
@tonistiigi wdyt? Also, did you want me to work on a PR for fsutil?
Windows CI failure in new test
=== Failed
=== FAIL: cache/contenthash TestRootPathSymlinks/resolve("link3/target_abs",followTrailing=true) (0.00s)
path_test.go:115:
Error Trace: D:/a/buildkit/buildkit/cache/contenthash/path_test.go:115
Error: Not equal:
expected: "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestRootPathSymlinks412891216\\001\\target"
actual : "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestRootPathSymlinks412891216\\001\\link2\\final"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-C:\Users\RUNNER~1\AppData\Local\Temp\TestRootPathSymlinks412891216\001\target
+C:\Users\RUNNER~1\AppData\Local\Temp\TestRootPathSymlinks412891216\001\link2\final
Test: TestRootPathSymlinks/resolve("link3/target_abs",followTrailing=true)
--- FAIL: TestRootPathSymlinks/resolve("link3/target_abs",followTrailing=true) (0.00s)
=== FAIL: cache/contenthash TestRootPathSymlinks/resolve("link2/target_abs",followTrailing=true) (0.00s)
path_test.go:115:
Error Trace: D:/a/buildkit/buildkit/cache/contenthash/path_test.go:115
Error: Not equal:
expected: "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestRootPathSymlinks412891216\\001\\target"
actual : "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\TestRootPathSymlinks412891216\\001\\link2\\final"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-C:\Users\RUNNER~1\AppData\Local\Temp\TestRootPathSymlinks412891216\001\target
+C:\Users\RUNNER~1\AppData\Local\Temp\TestRootPathSymlinks412891216\001\link2\final
Test: TestRootPathSymlinks/resolve("link2/target_abs",followTrailing=true)
--- FAIL: TestRootPathSymlinks/resolve("link2/target_abs",followTrailing=true) (0.00s)
Okay, it seems that Windows's CreateSymlink syscall changes the path we provide for the link (Z:\link2\link1sub\..\final becomes Z:\link2\final when we create the symlink), making the tests fail. Since we can't even create these kinds of symlinks on Windows, I've added a check to skip the test if the symlink we get is not the one we expected on Windows.