zig icon indicating copy to clipboard operation
zig copied to clipboard

std.mem: adapt SIMD code from indexOfScalar for other scalar functions

Open Vexu opened this issue 8 months ago • 0 comments

Continued from #23889

All benchmarks compiled in ReleaseFast on -target x86_64-linux-gnu -mcpu=haswell.

echo 1 > data.bin
head -c 1G /dev/zero >> data.bin
echo 2 >> data.bin

head -c 1G /dev/urandom > data.bin
const std = @import("std");
pub fn main() !void {
    const file = try std.fs.cwd().openFile("data.bin", .{ .mode = .read_only });
    const buffer = try file.readToEndAlloc(std.heap.page_allocator, 1 << 48);

    for (0..20) |_| {
        std.mem.doNotOptimizeAway(std.mem.indexOfScalar(u8, buffer, '2'));

        // std.mem.doNotOptimizeAway(std.mem.containsAtLeastScalar(u8, buffer, 1 << 32, 0));

        // std.mem.doNotOptimizeAway(std.mem.count(u8, buffer, "\x00"));
        // std.mem.doNotOptimizeAway(std.mem.countScalar(u8, buffer, 0));

        // std.mem.doNotOptimizeAway(std.mem.lastIndexOfScalar(u8, buffer, '1'));
        // std.mem.doNotOptimizeAway(std.mem.lastIndexOf(u8, buffer, "1"));
    }
}

indexOfScalar:

$ hyperfine -w 2 -r 10 ./index-old ./index-new
Benchmark 1: ./index-old
  Time (mean ± σ):      1.914 s ±  0.031 s    [User: 1.551 s, System: 0.339 s]
  Range (min … max):    1.863 s …  1.970 s    10 runs
 
Benchmark 2: ./index-new
  Time (mean ± σ):      1.900 s ±  0.025 s    [User: 1.538 s, System: 0.338 s]
  Range (min … max):    1.876 s …  1.950 s    10 runs
 
Summary
  ./index-new ran
    1.01 ± 0.02 times faster than ./index-old

containsAtLeastScalar:

$ hyperfine -w 2 -r 10 ./contains-old ./contains-new
Benchmark 1: ./contains-old
  Time (mean ± σ):     17.224 s ±  0.050 s    [User: 16.916 s, System: 0.231 s]
  Range (min … max):   17.164 s … 17.321 s    10 runs
 
Benchmark 2: ./contains-new
  Time (mean ± σ):      1.826 s ±  0.018 s    [User: 1.584 s, System: 0.227 s]
  Range (min … max):    1.808 s …  1.853 s    10 runs
 
Summary
  ./contains-new ran
    9.43 ± 0.10 times faster than ./contains-old

count/countScalar:

$ hyperfine -w 2 -r 10 ./count-old ./count-new
Benchmark 1: ./count-old
 ⠸ Performing warmup runs         ██████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:03:35
^C
$ hyperfine -w 2 -r 10 ./count-new
Benchmark 1: ./count-new
  Time (mean ± σ):      1.775 s ±  0.016 s    [User: 1.533 s, System: 0.225 s]
  Range (min … max):    1.755 s …  1.806 s    10 runs

lastIndexOf/lastIndexOfScalar:

$ hyperfine -w 2 -r 10 ./last-index-scalar-old ./last-index-old ./last-index-new
Benchmark 1: ./last-index-scalar-old
  Time (mean ± σ):      8.408 s ±  0.038 s    [User: 8.119 s, System: 0.231 s]
  Range (min … max):    8.354 s …  8.477 s    10 runs
 
Benchmark 2: ./last-index-old
  Time (mean ± σ):     10.867 s ±  0.036 s    [User: 10.583 s, System: 0.226 s]
  Range (min … max):   10.817 s … 10.918 s    10 runs
 
Benchmark 3: ./last-index-new
  Time (mean ± σ):      1.696 s ±  0.013 s    [User: 1.452 s, System: 0.227 s]
  Range (min … max):    1.680 s …  1.716 s    10 runs
 
Summary
  ./last-index-new ran
    4.96 ± 0.04 times faster than ./last-index-scalar-old
    6.41 ± 0.05 times faster than ./last-index-old

Vexu avatar May 19 '25 19:05 Vexu