Different performance results of huge_memset()

Open rilysh opened this issue 2 years ago • 0 comments

Hello,

Currently, this implementation of huge_memset() seems to perform differently on different hardware. To check the provided benchmarks, I ran three implementations of memset() (glibc, musl, and this one) on my machine, and the results were dramatically different from the provided ones. The musl implementation was the fastest, glibc was second, and this implementation was third.

For more context: I used glibc v2.38 and musl v1.2.4, compiled with both gcc 12.2.0 and clang 15.0.6, and benchmarked with hyperfine. Note that for glibc I didn't call memset() from my host machine's glibc; instead I extracted the implementation from the glibc source tree.

You haven't mentioned which versions of glibc and musl you used for the benchmark. For musl, it seems to be the latest one, since its memset() hasn't been modified in the source tree in the last 6 years. For glibc, however, it's hard for me to guess.

Besides this, I've also found something interesting. For some reason, when compiled with GCC at -O2, huge_memset() boils down to a long run of mov instructions with no vectorization. This doesn't happen with clang. Passing -march=native to GCC produces the vectorized version. The glibc and musl implementations are unaffected by this GCC behavior.
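One way to look at this kind of codegen difference in isolation is to compile a small fill loop with and without -march=native and compare the assembly. The sketch below is only a hypothetical stand-in: fill_words() and broadcast_byte() are made-up names, and this is not the actual huge_memset() from this repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a word-at-a-time fill loop; NOT the real
 * huge_memset(). Compare the generated assembly with, for example:
 *
 *   gcc -O2 -S fill.c
 *   gcc -O2 -march=native -S fill.c
 *   gcc -O2 -fopt-info-vec -c fill.c    (GCC reports loops it vectorized)
 */

/* Replicate one byte into all eight bytes of a 64-bit word. */
uint64_t broadcast_byte(unsigned char c)
{
    return (uint64_t)c * 0x0101010101010101ULL;
}

/* Store `pattern` into `nwords` consecutive 64-bit words starting at dst. */
void fill_words(uint64_t *dst, uint64_t pattern, size_t nwords)
{
    assert(dst != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = pattern;
}
```

Whether the loop ends up as scalar or vector stores depends on the compiler version, the flags, and the target, so the assembly should be checked rather than assumed.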

I ran all these benchmarks on an Intel i3-3210 (Ivy Bridge) CPU.

Also, the test program simply fills a huge array (size 100000) with characters using memset().
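For reference, a minimal sketch of what such a test program could look like. The exact program isn't included here, so the names (run_fill, memset_fn, BUF_SIZE), the repeat count, and the checksum are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 100000  /* array size mentioned above */

/* Signature shared by every memset() implementation under test. */
typedef void *(*memset_fn)(void *dst, int c, size_t n);

static unsigned char buf[BUF_SIZE];

/*
 * Fill the whole buffer `rounds` times with a changing byte value and
 * return a checksum of one byte per round. The call goes through a
 * volatile function pointer so the compiler cannot inline or drop it.
 */
unsigned long run_fill(memset_fn fn, unsigned long rounds)
{
    assert(fn != NULL);
    memset_fn volatile call = fn;
    unsigned long sum = 0;

    for (unsigned long i = 0; i < rounds; i++) {
        call(buf, (int)(i & 0xff), sizeof buf);
        sum += buf[i % BUF_SIZE];  /* read back one byte from this round */
    }
    return sum;
}
```

A small main() that calls run_fill() with the implementation under test (for example run_fill(memset, rounds) with a round count chosen so one run takes a few hundred milliseconds) can then be built once per implementation and timed with hyperfine.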

Here's the log:

Benchmark (using GCC)

Benchmark 1: ./huge
  Time (mean ± σ):     351.0 ms ±  24.4 ms    [User: 349.7 ms, System: 0.4 ms]
  Range (min … max):   337.0 ms … 423.7 ms    20 runs

Benchmark 2: ./glibc
  Time (mean ± σ):     445.3 ms ± 227.1 ms    [User: 444.6 ms, System: 0.4 ms]
  Range (min … max):   294.2 ms … 786.9 ms    20 runs

Benchmark 3: ./musl
  Time (mean ± σ):     309.4 ms ±   6.0 ms    [User: 308.2 ms, System: 0.9 ms]
  Range (min … max):   301.4 ms … 323.6 ms    20 runs

Benchmark (using Clang)

Benchmark 1: ./huge
  Time (mean ± σ):     296.8 ms ±   6.0 ms    [User: 296.2 ms, System: 0.4 ms]
  Range (min … max):   286.4 ms … 316.1 ms    20 runs

Benchmark 2: ./glibc
  Time (mean ± σ):     639.3 ms ± 226.9 ms    [User: 638.3 ms, System: 0.6 ms]
  Range (min … max):   294.1 ms … 792.3 ms    20 runs

Benchmark 3: ./musl
  Time (mean ± σ):     262.3 ms ±   5.2 ms    [User: 261.4 ms, System: 0.5 ms]
  Range (min … max):   280.3 ms … 308.4 ms    20 runs

I've only tried huge_memset() and haven't tested the small one. These benchmarks are tightly bound to the CPU generation, so it would be nice to have more benchmarks available to show how the results differ across hardware. All these binaries were compiled with -O2 -march=native.

Nov 25 '23 19:11 rilysh