zig icon indicating copy to clipboard operation
zig copied to clipboard

zig prereq races as of 0.11.0-dev.1782+b52be973d

Open motiejus opened this issue 1 year ago • 3 comments

Dear maintainers,

We still observe some very-hard-to-reproduce races when building libc/rt prerequisites.

Some anecdotal evidence:

error: FileNotFound

When executed on a fresh installation, zig build-exe toolchain/launcher.zig (exact command) sometimes fails with:

error: FileNotFound

This happens only on a fresh $ZIG_GLOBAL_CACHE_DIR (which we keep in /tmp/bazel-zig-cc). We have seen this happen on Darwin x86_64 and Darwin M1. We may saw it on Linux, but I no longer have the logs to verify. My memory is poor.

I tried to reproduce this on my MacOS machine overnight, without success. But we do receive a couple of complaints a week consistently over the last few weeks. Note that the sample size is quite large.

libcompiler_rt.a: No such file or directory

This happened on our CI yesterday:

/tmp/bazel-zig-cc/o/4421bb1adcf01feee7185ccb98640027/libcompiler_rt.a: No such file or directory

Unfortunately, I can no longer access the build host nor access it's global cache dir. It may be related.

Summary

I understand this is very little information to troubleshoot effectively. Here are the steps I am trying to do:

  1. Capture the first error (FileNotFound) on any Linux machine and instruct the engineer to re-run the command under strace, to see which file they are missing. However, this was not reported on Linux for the last week or so: either it did not happen, or people learned to remove the cache directory and move on. Since this happens more on OSX, it would make sense to debug it here. However, our engineers cannot run dtruss for compliance reasons.
  2. I will try to repro this on my MacOS machine again, but slightly differently.

Food for thought: is it time to reconsider how error context is propagated during the build phase, so errors could be augmented with additional context?

motiejus avatar Mar 06 '23 07:03 motiejus

I do have $ZIG_GLOBAL_CACHE_DIR archive from MacOS x86_64 right after when this happened. However, I wasn't able to plug in the archive to my machine and reproduce the issue.

If that helps, I could send over the archive privately.

motiejus avatar Mar 06 '23 07:03 motiejus

I have CI a machine in production where I can reproduce this.

<...>:~/.cache/bazel/_bazel_jenkins/d216a97b1920f6547de128d379fd31b3/external/zig_sdk/tools$ env ZIG_GLOBAL_CACHE_DIR=/tmp/bazel-zig-cc ZIG_LOCAL_CACHE_DIR=/tmp/bazel-zig-cc ../zig 'build-exe' '-OReleaseSafe' 'launcher.zig' '-s
tatic'
error: ld.lld: cannot open /tmp/bazel-zig-cc/o/8867000f2fcde52678126003825526ac/libc.a: No such file or directory
error: ld.lld: cannot open /tmp/bazel-zig-cc/o/116b7d9400e39e5a8d3fc65084a9ce7f/libcompiler_rt.a: No such file or directory
<...>:~/.cache/bazel/_bazel_jenkins/d216a97b1920f6547de128d379fd31b3/external/zig_sdk/tools$ 

I will do my best for it to keep it's state until tomorrow. If there is a command I can run there, I am all ears.

motiejus avatar Mar 06 '23 18:03 motiejus

@motiejus and I have been discussing this issue on IRC, and he shared with me a tarball of the zig-cache directory from the affected system. We have made the following observations:

  • the issue has occurred on both macOS and on Linux.
  • in the zig-cache directory on the Linux computer, the h/ directory is very small, but the o/ directory has 3726 empty directories in it. The h/ directory is reproduced here:
h/0bb411f073f43ac89ac94f4702ec0974.txt
h/b97a8a3a3a84ebf83ea6cd9d984e1fd0.txt
h/timestamp

The only artifacts found in o/ apart from 3726 empty directories are:

o/8e8f0d3efc02827d94d9e35c6989e340/builtin.zig
o/a07523923b2b200807e54b2be152a486/builtin.zig
o/369045f44d4564feae17c4e32f314866/builtin.zig

What I find curious about this is that zig only creates a directory inside o/ if it is about to write a build artifact there. I can't fathom why there would be 3726 empty directories here, apart from a third party going in and deleting files.

As for a plan of attack here, I'd like to finish my work on #14647. I'm hoping to land this branch within 1 week. There is a high chance, that if there is a race condition in zig's caching system which is causing this issue, that it will be reproduced by either my local computer, or the CI, when testing this branch. If it does not get reproduced, that suggests for us to look into a different possible cause. Perhaps something more straightforward, hiding under our very noses.

Meanwhile, we have seen the merge of #14821 yesterday, addressing what could be a related issue to this one. This issue was indeed being triggered reliably by my work in the branch as mentioned above. Another avenue of attack that I will pursue is to examine the caching system and look for an equivalent bug to this one that was recently solved.

andrewrk avatar Mar 08 '23 21:03 andrewrk

I have landed #14647, and as hoped, I have seen some sporadic issues that look related to this one. For example, in https://github.com/ziglang/zig/actions/runs/4444638053/jobs/7803016844:

/Users/runner/work/zig/zig/test/behavior/translate_c_macros.zig:7:11: error: C import failed: FileNotFound

Likewise, in https://github.com/ziglang/zig/actions/runs/4444638053/jobs/7803016681:

zig build-exe test ReleaseFast native: error: error: unable to open 'zig-local-cache/o/6b2e769a5b7088fc2f30d8d9c03c4e44': FileNotFound

This is likely a related issue, and has a completely straightforward solution: #14978

So, I will continue by addressing the tail end of issues that my build-parallel branch has exposed. I think it is very likely that, in a few weeks time, as these follow-up issues are resolved, it will also solve this issue (the one I am commenting on).

andrewrk avatar Mar 17 '23 20:03 andrewrk

To capture/troubleshoot such cases I am considering to run a ReleaseSafe, unstripped zig in production. Will run some benchmarks tomorrow on my end.

Also, #14923 would have been resolved much quicker with ReleaseSafe, since zld had an assertion that was skipped due to a ReleaseFast build.

In this specific case we would see the stack trace immediately where the error: FileNotFound is coming from.

Specific questions to maintainers (sorry for previous pings):

  • ~~thoughts on doing this on CI?~~ just realized this is happening already.
  • should we consider publishing unstripped ReleaseSafe nightlies in ziglang.org/download on top of the existing ReleaseSafe/strip ones?

Use case for publishing ReleaseSafe/unstripped builds:

  • whoever wants to run it in production
  • whoever is investigating an issue with a compiler will have one readily available (e.g. I still haven't learned how to build zig with system's llvm on my Mac).
  • integration tests of 3rd party systems. Long story short, I will, once I fix the remaining Go unit tests with CC="zig cc", will propose an official CI runner to the Go team which does just that. For this use case we want a ReleaseSafe+debug build, so the Zig's stack traces that Go devs will be looking at will be actionable much quicker. No guarantees it will be accepted, but there is a chance given Uber's relationship with the Go team and our long-term interest to keep those two working.

motiejus avatar Mar 28 '23 20:03 motiejus

Sporadic failure observed on a CI run mac-debug (https://github.com/ziglang/zig/pull/15285#issuecomment-1513290585). Not sure, if fixed by #15307 in 40e1fca.

2023-04-18T08:30:11.3938460Z Install the project...
2023-04-18T08:30:11.4063650Z -- Install configuration: "Debug"
2023-04-18T08:57:06.4523620Z ++ pwd
2023-04-18T08:57:06.4799110Z + stage3/bin/zig build test docs --zig-lib-dir /Users/runner/work/zig/zig/build/../lib -Denable-macos-sdk -Dstatic-llvm -Dskip-non-native --search-prefix /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af
2023-04-18T09:23:00.6724990Z zig test ReleaseFast native: error: thread 183881 panic: reached unreachable code
2023-04-18T09:23:00.6854820Z /Users/runner/work/zig/zig/lib/std/os.zig:962:23: 0x100e24071 in ftruncate (zig)
2023-04-18T09:23:00.6969080Z             .INVAL => unreachable, // Handle not open for writing
2023-04-18T09:23:00.7074050Z                       ^
2023-04-18T09:23:00.7194090Z /Users/runner/work/zig/zig/lib/std/fs/file.zig:254:30: 0x100c90736 in setEndPos (zig)
2023-04-18T09:23:00.7302450Z         try os.ftruncate(self.handle, length);
2023-04-18T09:23:00.7403820Z                              ^
2023-04-18T09:23:00.7916250Z /Users/runner/work/zig/zig/lib/std/Build/Cache.zig:870:55: 0x100c903c3 in writeManifest (zig)
2023-04-18T09:23:00.8029440Z             try manifest_file.setEndPos(contents.items.len);
2023-04-18T09:23:00.8029780Z                                                       ^
2023-04-18T09:23:00.8030070Z /Users/runner/work/zig/zig/src/Compilation.zig:2122:26: 0x100cff3d0 in update (zig)
2023-04-18T09:23:00.8030380Z         man.writeManifest() catch |err| {
2023-04-18T09:23:00.8030600Z                          ^
2023-04-18T09:23:00.8031030Z /Users/runner/work/zig/zig/src/main.zig:3384:36: 0x100d2f1c6 in serve (zig)
2023-04-18T09:23:00.8031350Z                     try comp.update(main_progress_node);
2023-04-18T09:23:00.8031600Z                                    ^
2023-04-18T09:23:00.8031880Z /Users/runner/work/zig/zig/src/main.zig:3202:31: 0x100b98d1e in buildOutputType (zig)
2023-04-18T09:23:00.8032790Z                 test_exec_args.items,
2023-04-18T09:23:00.8033040Z                               ^
2023-04-18T09:23:00.8038770Z /Users/runner/work/zig/zig/src/main.zig:273:31: 0x100b6b62f in mainArgs (zig)
2023-04-18T09:23:00.8039320Z         return buildOutputType(gpa, arena, args, .zig_test);
2023-04-18T09:23:00.8039580Z                               ^
2023-04-18T09:23:00.8039840Z /Users/runner/work/zig/zig/src/main.zig:211:20: 0x100b6a887 in main (zig)
2023-04-18T09:23:00.8060930Z     return mainArgs(gpa, arena, args);
2023-04-18T09:23:00.8061200Z                    ^
2023-04-18T09:23:00.8061470Z /Users/runner/work/zig/zig/lib/std/start.zig:609:37: 0x100b6d0b3 in main (zig)
2023-04-18T09:23:00.8061760Z             const result = root.main() catch |err| {
2023-04-18T09:23:00.8061990Z                                     ^
2023-04-18T09:23:00.8062190Z ???:?:?: 0x7fff203e7f3c in ??? (???)
2023-04-18T09:23:00.8062590Z ???:?:?: 0x12 in ??? (???)
2023-04-18T09:23:00.8062770Z 
2023-04-18T09:23:00.8062960Z zig test ReleaseFast native: error: the following command terminated unexpectedly:
2023-04-18T09:23:26.2318440Z /Users/runner/work/zig/zig/build/stage3/bin/zig test /Users/runner/work/zig/zig/test/behavior.zig -OReleaseFast --cache-dir /Users/runner/work/zig/zig/build/zig-local-cache --global-cache-dir /Users/runner/work/zig/zig/build/zig-global-cache --name test -I /Users/runner/work/zig/zig/test -L /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/runner/work/zig/zig/lib --listen=- 
2023-04-18T09:23:26.2319480Z zig test ReleaseFast native: error: thread 183882 panic: reached unreachable code
2023-04-18T09:23:26.2424740Z /Users/runner/work/zig/zig/lib/std/os.zig:962:23: 0x107649071 in ftruncate (zig)
2023-04-18T09:23:26.2526060Z             .INVAL => unreachable, // Handle not open for writing
2023-04-18T09:23:26.2640040Z                       ^
2023-04-18T09:23:26.2743480Z /Users/runner/work/zig/zig/lib/std/fs/file.zig:254:30: 0x1074b5736 in setEndPos (zig)
2023-04-18T09:23:26.2852980Z         try os.ftruncate(self.handle, length);
2023-04-18T09:23:26.2964870Z                              ^
2023-04-18T09:23:26.3068300Z /Users/runner/work/zig/zig/lib/std/Build/Cache.zig:870:55: 0x1074b53c3 in writeManifest (zig)
2023-04-18T09:23:26.3177990Z             try manifest_file.setEndPos(contents.items.len);
2023-04-18T09:23:26.3279350Z                                                       ^
2023-04-18T09:23:26.3381060Z /Users/runner/work/zig/zig/src/Compilation.zig:2122:26: 0x1075243d0 in update (zig)
2023-04-18T09:23:26.3482580Z         man.writeManifest() catch |err| {
2023-04-18T09:23:26.3583880Z                          ^
2023-04-18T09:23:26.3685600Z /Users/runner/work/zig/zig/src/main.zig:3384:36: 0x1075541c6 in serve (zig)
2023-04-18T09:23:26.3787030Z                     try comp.update(main_progress_node);
2023-04-18T09:23:26.3889390Z                                    ^
2023-04-18T09:23:26.3990850Z /Users/runner/work/zig/zig/src/main.zig:3202:31: 0x1073bdd1e in buildOutputType (zig)
2023-04-18T09:23:26.4042130Z                 test_exec_args.items,
2023-04-18T09:23:26.4042810Z                               ^
2023-04-18T09:23:26.4043350Z /Users/runner/work/zig/zig/src/main.zig:273:31: 0x10739062f in mainArgs (zig)
2023-04-18T09:23:26.4043910Z         return buildOutputType(gpa, arena, args, .zig_test);
2023-04-18T09:23:26.4044370Z                               ^
2023-04-18T09:23:26.4044840Z /Users/runner/work/zig/zig/src/main.zig:211:20: 0x10738f887 in main (zig)
2023-04-18T09:23:26.4045340Z     return mainArgs(gpa, arena, args);
2023-04-18T09:23:26.4045760Z                    ^
2023-04-18T09:23:26.4046240Z /Users/runner/work/zig/zig/lib/std/start.zig:609:37: 0x1073920b3 in main (zig)
2023-04-18T09:23:26.4046760Z             const result = root.main() catch |err| {
2023-04-18T09:23:26.4047210Z                                     ^
2023-04-18T09:23:26.4047650Z ???:?:?: 0x7fff203e7f3c in ??? (???)
2023-04-18T09:23:26.4048070Z ???:?:?: 0x13 in ??? (???)
2023-04-18T09:23:26.4048390Z 
2023-04-18T09:23:26.4050030Z zig test ReleaseFast native: error: the following command terminated unexpectedly:
2023-04-18T09:23:26.4053880Z /Users/runner/work/zig/zig/build/stage3/bin/zig test /Users/runner/work/zig/zig/test/behavior.zig -lc -OReleaseFast --cache-dir /Users/runner/work/zig/zig/build/zig-local-cache --global-cache-dir /Users/runner/work/zig/zig/build/zig-global-cache --name test -I /Users/runner/work/zig/zig/test -L /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/runner/work/zig/zig/lib --listen=- 

matu3ba avatar Apr 18 '23 21:04 matu3ba

Sporadic failure observed on a CI run mac-debug (#15285 (comment)).

Fixed by #15351. However there are still more issues left

andrewrk avatar Apr 18 '23 21:04 andrewrk

I spent 2 hours today troubleshooting this failure:

+ stage3-release/bin/zig build test docs --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib -Denable-macos-sdk -Dstatic-llvm -Dskip-non-native --search-prefix /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af
zig build-exe passing_u0_to_function Debug x86_64-linux: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug x86_64-linux: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target x86_64-linux -mcpu x86_64 -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug arm-linux: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug arm-linux: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target arm-linux -mcpu baseline -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug x86_64-macos: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug x86_64-macos: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target x86_64-macos -mcpu x86_64 -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug aarch64-linux: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug aarch64-linux: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target aarch64-linux -mcpu generic -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug aarch64-macos: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug aarch64-macos: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target aarch64-macos -mcpu generic -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug x86_64-windows: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug x86_64-windows: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target x86_64-windows -mcpu x86_64 -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 
zig build-exe passing_u0_to_function Debug wasm32-wasi: error: error: unable to open 'zig-local-cache/o/72128730f11303794fecca406df163cf': FileNotFound

zig build-exe passing_u0_to_function Debug wasm32-wasi: error: the following command exited with error code 1:
/Users/m1/actions-runner/_work/zig/zig/build/stage3-release/bin/zig build-exe -fno-LLVM -fno-LLD /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig --cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache --global-cache-dir /Users/m1/actions-runner/_work/zig/zig/build/zig-global-cache --name passing_u0_to_function -target wasm32-wasi -mcpu generic -L /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/m1/zig+llvm+lld+clang-aarch64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/m1/actions-runner/_work/zig/zig/build/../lib --listen=- 

The FileNotFound error is coming directly from the input file given to zig build-exe in each command.

The relevant code is here:

https://github.com/ziglang/zig/blob/a1aa55ebe52636f28335fd7ab30321fc52f48775/lib/std/Build/Cache.zig#L395-L435

We have 7 processes racing to populate /Users/m1/actions-runner/_work/zig/zig/build/zig-local-cache/o/72128730f11303794fecca406df163cf/tmp.zig with the same contents:

pub fn main() void {
    doNothing(0);
}
fn doNothing(arg: u0) void {
    _ = arg;
}

// run
//

They all report "cached", which prompts the question, where did the file get written from? I almost want to suspect hash collision, since it would explain this error output, if the colliding hash used a different filename than "tmp.zig".

|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug x86_64-linux failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug arm-linux failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug aarch64-linux failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug x86_64-macos failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug aarch64-macos failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug x86_64-windows failure
|  |     +- WriteFile tmp.zig cached
|  +- run passing_u0_to_function transitive failure
|  |  +- zig build-exe passing_u0_to_function Debug wasm32-wasi failure
|  |     +- WriteFile tmp.zig cached

When I run this locally on my linux host, zig build test-cases succeeds, and the above output is slightly different in that the first WriteFile says "success" instead of "cached" and then the other 6 say "cached".

Perhaps there is a false positive cache hit happening here? I don't see how that could be happening based on the logic in Cache.zig however.

The only thing I can think of right now is the possibility that file system writes on macOS are asynchronous in the sense that file system operations on one file done after another file are not guaranteed to be observed in the same order by another process.

Pseudocode example:

  1. Process A writes data.txt.
  2. Process A creates lock.txt.
  3. Process B waits until open() returns OK for lock.txt.
  4. Process B attempts to open data.txt and receives ENOENT.

If this sequence of operations is expected to be possible on macOS then there is a design flaw in Zig's cache system.

Perhaps @jamii has some insight on this last point?

andrewrk avatar Apr 20 '23 05:04 andrewrk

This is probably not correct even with many linux filesystems, depending on the settings. Operations on directories (ie creating files) can be reordered arbitrarily except before/after an fsync on the directory. See eg https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf.

I think at minimum you need:

  • Write data.txt
  • Fsync data.txt (to ensure the full write is visible to other processes)
  • Fsync the parent directory (to ensure that the creation of data.txt is visible to other processes)
  • Write lock.txt

It's roughly the same situation as reads/writes between threads, where fsync is a combined read+write barrier.

On mac the fsync might have to be fcntl(F_FULLFSYNC) if it's important that rebooting doesn't leave the directory in an weird state.

jamii avatar Apr 20 '23 23:04 jamii

I'd recommend writing the caching code so that it can be run against a mocked filesystem which reorders operations at random unless prevented by an fsync. Or even just throws an error if one process tries to read from a file that another process has modified and not fsynced.

jamii avatar Apr 20 '23 23:04 jamii

Thank you @jamii! Appreciate your insight here.

andrewrk avatar Apr 21 '23 19:04 andrewrk

They all report "cached"

This should be fixed by #15641

The previously described behavior was possible because:

  1. Process A on line 398 tries to open a manifest file with the intent to get an exclusive lock, which fails because the file does not exist.
  2. Process A on line 414 tries to create a manifest file with the intent to get an exclusive lock, which successfully creates the file.
  3. Process B on line 398 opens the manifest file with the intent to get an exclusive lock, which succeeds because the file exists.
  4. Processes A and B race for the lock, and Process B wins.
  5. Process A on line 429 runs the error.WouldBlock prong because the lock was already taken.
  6. Process B, having opened an existing file and obtained the exclusive lock, decides that it was created by another process, and so must be ready to receive a hit (and there are no (missing) inputs in the manifest file to contradict this and force a miss).
  7. Process B eventually downgrades to a shared lock to handle the hit.
  8. Process A on line 408 unblocks from taking a shared lock and proceeds as Process B did in step 6.
  9. This generalizes to N processes receiving a hit, and no process ever receiving an initial miss.

jacobly0 avatar May 11 '23 16:05 jacobly0

Thank you for your fix, @jacobly0 !

@andrewrk mentioned there is still a race in Windows:

Finally, we have observed one last cache race condition that only affects Windows. It is diagnosed to be related to Windows-specific DLL API usage but not yet solved.

Do we have a tracking issue for this? I need a reference to put into the hermetic_cc_toolchain workaround.

motiejus avatar May 17 '23 07:05 motiejus

Do we have a tracking issue for this?

#14978

Edit: @motiejus to be clear, this unrelated issue is not affected by a cache clear, and can be resolved by simply rerunning the build (and not getting very unlucky again).

jacobly0 avatar May 17 '23 07:05 jacobly0

Do we have a tracking issue for this?

#14978

Edit: @motiejus to be clear, this unrelated issue is not affected by a cache clear, and can be resolved by simply rerunning the build (and not getting very unlucky again).

Thanks! Closing this one and simplifying my workaround in https://github.com/uber/hermetic_cc_toolchain/pull/74 now.

motiejus avatar May 17 '23 07:05 motiejus