zig 10% regression in ReleaseFast performance introduced in 0.12.0-dev.876+aaf46187a

Zig Version

0.12.0-dev.876+aaf46187a

Steps to Reproduce and Observed Behavior

I noticed a regression on my project's benchmark and bisected it via nightlies - zig-macos-aarch64-0.12.0-dev.866+3a47bc715 produces code that runs ~10% faster than zig-macos-aarch64-0.12.0-dev.876+aaf46187a:

$ git clone https://github.com/pkmn/engine.git
<snip>
$ cd engine
$ git reset --hard 12e7292c9df3ad73c9681205fb3acfee1536c425
$ zig version
0.12.0-dev.866+3a47bc715
$ zig build benchmark --  1 1000/10000 281483566841860
122294722,1232901,1471351722788935846

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build benchmark --  1 1000/10000 281483566841860
137394552,1232901,1471351722788935846

The benchmark tool uses std.time.Timer to perform its own internal timing which it prints out as the first number there (second two numbers are for confirming the benchmarks are computing the same results).

Alternatively, since the regression is large enough you can literally just use time (though this is measuring a different thing than the internal benchmark timer, but thats kind of unimportant from Zig's POV):

$ zig version
0.12.0-dev.866+3a47bc715
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/7cb294da313404bb8f51f4c304d47793/benchmark 1 1000/10000 281483566841860
124231503,1232901,1471351722788935846

real	0m0.178s
user	0m0.168s
sys	0m0.005s

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/30082df782561e9cee72e116248d663a/benchmark 1 1000/10000 281483566841860
136831126,1232901,1471351722788935846

real	0m0.198s
user	0m0.182s
sys	0m0.006s

Expected Behavior

There not to be a regression :)

From https://github.com/ziglang/zig/compare/3a47bc715...aaf46187a I'm guessing #17391 is a likely suspect?

Oct 28 '23 17:10 scheibo

mmm interresting.

looks like reverting the change does indeed restore the perf.

--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -10528,7 +10528,8 @@ pub const FuncGen = struct {
             if (isByRef(elem_ty, mod)) {
                 return self.loadByRef(ptr, elem_ty, ptr_alignment, access_kind);
             }
-            return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            //return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            return self.wip.load(access_kind, try o.lowerType(elem_ty), ptr, ptr_alignment, "");
         }
 
         const containing_int_ty = try o.builder.intType(@intCast(info.packed_offset.host_size * 8));

a quick look with perf report shows the main difference comes from gen1.data.Battle(common.rng.PRNG(1)).choice that uses a u4 loop index.

            var slot: u4 = 2;
            while (slot <= 6) : (slot += 1) {
                const id = side.order[slot - 1];
                if (id == 0 or side.pokemon[id - 1].hp == 0) continue;
                out[n] = .{ .type = .Switch, .data = slot };
                n += 1;
            }

The loop (and others in the function) was previously unrolled by llvm, and no longer is.

Oct 29 '23 07:10 xxxbxxx

I've tried to poke at it a little bit, but no idea how to fix this.

Short of doing some kind of range propagation pass or something, I'm not quite sure how it is possible distinguish between "this is a nice clean local variable" and "this may contain some uninitialized left over bits" (as in #14200)

(but then, I know nothing about llvm or zig internals, so maybe there's a way...)

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

Oct 29 '23 18:10 xxxbxxx

Thanks for digging in, @xxxbxxx! Its nice to know that I can work around this with @intCast and different loop counter types throughout the code, though I'm a little worried in general about the performance cliff/footgun so I don't know what Zig wants to do here

Oct 29 '23 18:10 scheibo

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

Changing all non-power-of-2 loop counters to u8 or usize and then adding @intCast()s solved half of the problem (i.e. this is now a 5% regression instead of a 10% regression), but im still stuck with a sizable regression.

~~I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.~~

Nov 25 '23 20:11 scheibo

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

that's a misunderstanding: the "change that didn't cause any regression" is indeed the change triggering the performance issue...

Nov 26 '23 21:11 xxxbxxx