# Zig cache issue on concurrent builds
The zig toolchain occasionally produces corrupted output. This is more likely during short, concurrent builds. One way to reproduce it reliably is a configure-style build with transitions:
```starlark
load("@bazel_skylib//rules:common_settings.bzl", "bool_setting")
load("//:defs.bzl", "enable_a")

bool_setting(
    name = "a",
    build_setting_default = False,
)

alias(
    name = "jemalloc",
    actual = "@jemalloc//:libjemalloc",
)

enable_a(
    name = "jemalloc-a",
    actual = "@jemalloc//:libjemalloc",
)
```
> [!NOTE]
> The full repro is here. Check out that commit and run `podman build .` or `docker build .`. This will trigger multiple concurrent jemalloc builds and should reproduce the issue. If `--jobs=1` is added to the last command of the `Dockerfile`, the issue disappears.
Building this will result in errors like the following:
```
...
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in ...:
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
```
(Sometimes a different step in the configure script fails instead, for instance with "unsupported pointer size: 0".)
The theory is that the build runs into this zig bug, where the cache is not consistent. When the configure script in the example above runs, it issues many short, sequential CC commands to probe compiler features. If Bazel concurrently schedules another, similar build under a different configuration, the zig toolchain appears to mix up their outputs: when the configuration has no actual impact on the compile command, the zig cache reuses the same entry for both builds.
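The suspected collision can be modeled with a toy sketch (Python, illustrative only; zig's real cache hashes far more inputs than this). If the cache key is derived solely from the compile command, two configurations that differ only in a setting with no effect on that command map to the same cache entry, and concurrent builds then race on it:

```python
import hashlib


def cache_key(cc_command: str) -> str:
    # Toy model: key the cache entry on the compile command alone.
    # (An assumption for illustration -- the real zig cache also hashes
    # file contents, flags, and more.)
    return hashlib.sha256(cc_command.encode()).hexdigest()


# Two Bazel configurations that differ only in a bool_setting with no
# effect on the actual CC invocation produce identical commands ...
cmd_config_default = "cc -O2 -c conftest.c -o conftest.o"
cmd_config_a = "cc -O2 -c conftest.c -o conftest.o"  # //:a flipped, command unchanged

# ... and therefore collide on the same cache entry.
assert cache_key(cmd_config_default) == cache_key(cmd_config_a)
```

With both builds reading and writing that one entry concurrently, one build can observe the other's half-written or mismatched output, which is consistent with the configure failures above.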
One very hacky way to work around the issue (implemented here) is to tweak `zig-wrapper.zig` to use a different cache location depending on the Bazel configuration:
```diff
@@ -161,6 +161,71 @@ fn execUnix(arena: mem.Allocator, params: ExecParams) u8 {
     return 1;
 }

+fn makeSuffix(allocator: std.mem.Allocator, pwd: []const u8) ![]const u8 {
+    var it = std.mem.tokenize(u8, pwd, "/");
+
+    while (it.next()) |segment| {
+        if (std.mem.startsWith(u8, segment, "k8-opt-")) {
+            var hasher = std.hash.Wyhash.init(0);
+            hasher.update(segment);
+            const hash_value = hasher.final();
+            return std.fmt.allocPrint(allocator, "config-{x}", .{hash_value});
+        }
+    }
+
+    // no "k8-opt-" found
+    return std.fmt.allocPrint(allocator, "config-catchall", .{});
+}
@@ -217,9 +282,19 @@ fn parseArgs(
     var env = process.getEnvMap(arena) catch |err|
         return parseFatal(arena, "error getting env: {s}", .{@errorName(err)});

+    // Get the current working directory (PWD)
+    const allocator = std.heap.page_allocator;
+    const pwd = std.fs.cwd().realpathAlloc(allocator, ".") catch {
+        std.process.exit(1);
+    };
+    defer allocator.free(pwd);
+
+    const suffix = try makeSuffix(arena, pwd);
+    const cache_dir = try std.fmt.allocPrint(arena, "{s}/{s}", .{ CACHE_DIR, suffix });
+
     try env.put("ZIG_LIB_DIR", zig_lib_dir);
-    try env.put("ZIG_LOCAL_CACHE_DIR", CACHE_DIR);
-    try env.put("ZIG_GLOBAL_CACHE_DIR", CACHE_DIR);
+    try env.put("ZIG_LOCAL_CACHE_DIR", cache_dir);
+    try env.put("ZIG_GLOBAL_CACHE_DIR", cache_dir);
```
This works by inferring the configuration from `$PWD`. It relies on Bazel internals (the layout of the output path) and only matches `k8` output directories, but this was enough to unblock us.
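The suffix derivation can be sketched in Python as well (a toy port of `makeSuffix` above, not the shipped code; BLAKE2 stands in for Wyhash, which is not in Python's standard library, so the hash values differ):

```python
import hashlib


def make_suffix(pwd: str) -> str:
    # Walk the path segments looking for a Bazel output directory for the
    # k8 CPU (e.g. "k8-opt-ST-<hash>"), which encodes the configuration.
    for segment in pwd.split("/"):
        if segment.startswith("k8-opt-"):
            # BLAKE2 here is a stand-in for Wyhash in the Zig original.
            digest = hashlib.blake2b(segment.encode(), digest_size=8).hexdigest()
            return f"config-{digest}"
    # No "k8-opt-" segment found: fall back to a shared catch-all bucket.
    return "config-catchall"


# Two output paths for different configurations land in different cache
# directories; a path outside bazel-out falls into the catch-all.
# (The example paths are made up for illustration.)
print(make_suffix("/build/execroot/_main/bazel-out/k8-opt-ST-deadbeef/bin/pkg"))
print(make_suffix("/tmp/somewhere/else"))
```

Note that anything not matching the `k8-opt-` prefix shares one `config-catchall` directory, so builds on other platforms or compilation modes can still collide there; this is part of why the workaround is hacky.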