
Poor code generation for std.mem.eql

Open ManDeJan opened this issue 3 years ago • 4 comments

I discovered this while benchmarking some parser code I was working on. As the title states, the current implementation of std.mem.eql is very naive:

https://github.com/ziglang/zig/blob/bf67a3fdc9efea7126058b21e687c24868bc1268/lib/std/mem.zig#L485-L493

It iterates over slices byte by byte which causes the compiler to generate a bunch of compare and jump instructions: https://godbolt.org/z/Mobabsq8a
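For reference, a hedged C sketch of the byte-by-byte comparison the linked implementation lowers to (the function name is illustrative, not the actual Zig source):

```c
#include <stdbool.h>
#include <stddef.h>

// Rough C equivalent of the naive std.mem.eql loop: length check,
// pointer-equality fast path, then one compare-and-branch per byte.
bool eql_naive(const unsigned char *a, size_t a_len,
               const unsigned char *b, size_t b_len) {
    if (a_len != b_len) return false;
    if (a == b) return true; // the fast path discussed in the first bullet below
    for (size_t i = 0; i < a_len; i++) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```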

One of the most frequent use cases (95%+ in the standard library) for std.mem.eql is comparing a string with a compile-time-known string. Seeing as comparing memory is so common, especially in high-performance code like parsers (and in a bunch of other standard library functions), I think it is worth having a less naive, more optimized implementation of std.mem.eql. I identified a few points that could improve the implementation:

  • Remove the if (a.ptr == b.ptr) return true; line. (I think this is a premature optimization.) It causes an extra comparison and branch in all compilations of the eql function, but I can't think of any scenario where this would actually be a major time save.
  • Convert every slice to a slice of the native int type if the alignment allows for it; this causes fewer loads and compares. Though you will need a clever way to deal with the remainder, by using smaller int types and/or comparing some bytes twice.
  • If possible/desirable, do the above but with SIMD registers.
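The word-at-a-time idea from the second bullet can be sketched as follows. This is a hedged illustration, not a proposed std.mem.eql patch: it uses unaligned 64-bit loads via memcpy (avoiding alignment/aliasing pitfalls) and handles the remainder with one overlapping load that re-compares some bytes, as described above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Compare 8 bytes at a time; assumes lengths are already known to be equal.
bool eql_wordwise(const unsigned char *a, const unsigned char *b, size_t len) {
    size_t i = 0;
    while (i + sizeof(uint64_t) <= len) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa); // memcpy lets the compiler emit a plain load
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb) return false;
        i += sizeof(uint64_t);
    }
    if (len >= sizeof(uint64_t)) {
        // Remainder: overlapping load of the final 8 bytes (re-compares a few bytes).
        uint64_t wa, wb;
        memcpy(&wa, a + len - sizeof wa, sizeof wa);
        memcpy(&wb, b + len - sizeof wb, sizeof wb);
        return wa == wb;
    }
    // Inputs shorter than one word: fall back to the byte loop.
    for (; i < len; i++) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```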

I was working on an implementation that did some of the above but haven't been able to get it done yet. There are so many cases to account for, and some of the optimizations can only be made if you know something about the length of the comparison at compile time. I think having a dedicated @memcmp akin to @memcpy and @memset might be a solution, though I'm not sure how much work adding that would be :-)

ManDeJan avatar May 04 '21 20:05 ManDeJan

If you decide to work on this, I think you can leverage this repo to measure any performance improvements you may gain: https://github.com/ziglang/gotta-go-fast

Also, if there's one thing I've learned, it's that modern processors are very unpredictable when it comes to performance. Things you think might make something faster may actually make it slower. For example, a simple implementation, even if it does more work, may run faster if it's encoded in fewer instructions due to the icache. One of the slowest things a computer does nowadays is fetch memory from RAM, so any extra instructions and/or data you need to access, especially if it's not within the same cache line, can have drastic effects on performance.

In any case, that's why it's good for us to use benchmarks where we can measure whether something results in a performance improvement. In your case here, the compiler test suite might be a good benchmark for this.

marler8997 avatar May 05 '21 01:05 marler8997

Relevant snippet from @fengb: https://github.com/fengb/fundude/blob/ad1783281625f2b7b05fbced8ca358d79ff51b7e/src/util.zig#L105-L110
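The linked snippet compares a slice against a comptime-known string by reinterpreting it as an integer, so the whole comparison becomes a single load and compare. A hedged C sketch of the same idea, using a hypothetical 4-byte literal "true":

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Compare a run-time string against the known literal "true" as one
// 32-bit compare instead of four byte compares. Illustrative only.
bool eql_true_literal(const char *s, size_t len) {
    if (len != 4) return false;
    uint32_t w, lit;
    memcpy(&w, s, sizeof w);
    memcpy(&lit, "true", sizeof lit); // the compiler folds this to a constant
    return w == lit;
}
```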

Jarred-Sumner avatar May 05 '21 08:05 Jarred-Sumner

I just ran into this and had to call out to __builtin_memcmp to get performance on par with the ported C code (after noticing the std.mem.eql code standing out prominently in the profiler).

In short, swapping out std.StringHashMap with std.HashMap with a context pointing to a __builtin_memcmp wrapper for the eql implementation increased performance by 20%.
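For illustration, the memcmp-wrapper approach looks roughly like this in C (the function name is hypothetical; the point is that libc's memcmp is typically already well vectorized):

```c
#include <stdbool.h>
#include <string.h>

// Equality via memcmp: compare lengths first, then defer the byte
// comparison to the (usually SIMD-optimized) libc implementation.
bool eql_memcmp(const char *a, size_t a_len, const char *b, size_t b_len) {
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```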

cryptocode avatar Dec 30 '21 15:12 cryptocode

FWIW there's a memcmp.zig (https://github.com/ziglang/zig/blob/master/lib/compiler_rt/memcmp.zig) that seems to be completely unreferenced in the entire codebase (dead file) and it seems like it could be used for a @memcmp builtin.

wooster0 avatar Oct 22 '22 10:10 wooster0