Jamie Brandon
IPC doesn't seem terrible (~309B instructions over ~176B cycles, so roughly 1.76 IPC):

```
175,850,971,567      cpu_core/cycles/                  (35.11%)
309,221,910,919      cpu_core/instructions/            (42.03%)
  3,567,518,372      cpu_core/cache-references/        (48.93%)
  1,930,730,082      cpu_core/cache-misses/            (55.69%)
  1,804,591,591      cpu_core/L1-dcache-load-misses/   (62.61%)
 61,552,296,620      cpu_core/L1-dcache-loads/         (62.46%)
 28,322,845,875      cpu_core/L1-dcache-stores/        (62.46%)
    324,262,031      cpu_core/L1-icache-load-misses/   (55.56%)
                     cpu_core/L1-icache-loads/...
```
Here are some perf samples - https://gist.github.com/jamii/c6cbde8b172380ba974ccd02933da3e5

Big offenders for various misses are:

* various hashmap functions
* tree.lookup_from_memory/table_immutable.get (binary search)
* memset (zeroing blocks in grid)
* sort
* ...
We spend about a third of cpu time in blake3. Building with release-fast gives ~10% throughput improvement. Building with -Dcpu=alderlake gives none. I checked that the output of vsr.checksum is using the...
https://github.com/ziglang/zig/blob/6d44a6222d6eba600deb7f16c124bfa30628fb60/lib/std/crypto/benchmark.zig#L404 reports 90 MB/s for blake3. https://raw.githubusercontent.com/BLAKE3-team/BLAKE3-specs/master/blake3.pdf reports 0.5 cycles/byte for single-threaded blake3 on an older cpu. At a ~3-4GHz clock, 0.5 cycles/byte works out to about 6-8 GB/s on my cpu, which roughly matches...
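As a local sanity check of those numbers, a one-shot microbenchmark along these lines is enough (a sketch: the buffer size and iteration count are arbitrary, and it times Zig's std Blake3 directly rather than vsr.checksum):

```zig
const std = @import("std");

pub fn main() !void {
    const Blake3 = std.crypto.hash.Blake3;

    // Hash a 64KiB buffer repeatedly and report throughput.
    var buf: [64 * 1024]u8 = undefined;
    @memset(&buf, 0xab);

    var out: [Blake3.digest_length]u8 = undefined;
    const iterations: usize = 10_000;

    var timer = try std.time.Timer.start();
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        Blake3.hash(&buf, &out, .{});
        std.mem.doNotOptimizeAway(&out);
    }
    const elapsed_ns = timer.read();

    // Bytes per nanosecond is numerically the same as GB/s.
    const total_bytes: f64 = @floatFromInt(buf.len * iterations);
    const gb_per_s = total_bytes / @as(f64, @floatFromInt(elapsed_ns));
    std.debug.print("blake3: {d:.2} GB/s\n", .{gb_per_s});
}
```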
As further evidence that we're cpu-bound, building in release-fast gives me a 17% throughput increase. If someone can figure out a way to diff the two profiles, maybe we'd get...
> Also, out of interest, what is the effect of making our checksum function a memset(0) function that always returns true when validated?

Hashes are used as unique identifiers, so...
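For concreteness, the proposed experiment amounts to something like this (a hypothetical stub; not the actual vsr.checksum signature):

```zig
// No-op checksum for benchmarking only: ignores the input and returns a
// constant, so every block appears to validate. Unsafe outside an
// experiment, because these hashes double as unique identifiers, and two
// distinct blocks would now share the identifier 0.
fn checksum_noop(source: []const u8) u128 {
    _ = source;
    return 0;
}
```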
> We may be able to amortize more of the hash function setup cost in this way.

It might be that this is the weak spot in the Zig code...
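For reference, the setup-vs-per-byte split in Zig's std Blake3 streaming interface looks roughly like this (a sketch; whether init dominates for our message sizes is exactly the open question):

```zig
const std = @import("std");
const Blake3 = std.crypto.hash.Blake3;

fn checksum_streaming(parts: []const []const u8) [Blake3.digest_length]u8 {
    var hasher = Blake3.init(.{}); // setup cost, paid once per message.
    for (parts) |part| hasher.update(part); // per-byte cost.
    var out: [Blake3.digest_length]u8 = undefined;
    hasher.final(&out);
    return out;
}
```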
I've added spans for io_callback (one yield-free region) and io_flush (multiple io_callbacks with no io_uring submission in between). On my laptop the longest io_callbacks are ~200-250ms for table mutable sorts...
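The spans themselves are just wall-clock timings of yield-free regions, roughly like this (a sketch with a hypothetical 200ms reporting threshold, not the actual instrumentation):

```zig
const std = @import("std");

// Time one yield-free region (an io_callback) and flag the long ones.
fn timed_io_callback(callback: *const fn () void) !void {
    var timer = try std.time.Timer.start();
    callback();
    const elapsed_ms = timer.read() / std.time.ns_per_ms;
    if (elapsed_ms >= 200) {
        std.log.warn("io_callback blocked the event loop for {d}ms", .{elapsed_ms});
    }
}
```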
> we shouldn't try to bypass the cache during compaction, to save the cache miss (but still pay the I/O)

The idea was not to save the cache miss, but...
Maybe another way to think about this: If a block was not in the cache, being touched by compaction does not indicate that it is likely to be read again...
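That suggests an admission rule along these lines (hypothetical names, not the actual grid cache API): compaction may refresh a block that is already resident, but never admits a cold one on its own behalf.

```zig
const std = @import("std");

// Toy resident-set check standing in for the grid's block cache.
const BlockCache = struct {
    resident: std.AutoHashMap(u64, void),

    // Called when compaction reads a block: re-touch it only if it was
    // already hot; a cold block touched by compaction is not admitted.
    fn on_compaction_read(cache: *BlockCache, address: u64) bool {
        return cache.resident.contains(address);
    }
};
```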