Strong performance regression with target-cpu=native
On my setup, ahash shows a significant performance regression when built with `target-cpu=native`.
This may be a Rust/LLVM issue, but I'm opening an issue here first.
Repro: https://github.com/fr0staman/rust-ahash-target-native-performance-issue
My setup
Rust:
```
rustc 1.74.1 (a28077b28 2023-12-04)
binary: rustc
commit-hash: a28077b28a02b92985b3a3faecf92813155f1ea1
commit-date: 2023-12-04
host: x86_64-unknown-linux-gnu
release: 1.74.1
LLVM version: 17.0.4
```
System:
CPU: AMD Ryzen 5 4500U
OS: Ubuntu 22.04.3 LTS
Results
Standard target
```
fr0staman@kotobook:~/source/repos/rust/rust-ahash-target-native-performance-issue$ cargo bench
Finished bench [optimized] target(s) in 36.18s
Running unittests src/main.rs (target/release/deps/rust_ahash_target_native_performance_issue-4df22a78d1110619)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/issue.rs (target/release/deps/ahash-3b7ae86a7bc7bacb)
Gnuplot not found, using plotters backend
Performance/ahash/(32, 128)
time: [21.672 µs 21.698 µs 21.727 µs]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
3 (3.00%) high mild
Performance/ahash/(256, 1024)
time: [983.01 µs 983.94 µs 984.92 µs]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
Performance/ahash/(1024, 4096)
time: [15.256 ms 15.298 ms 15.341 ms]
```
target-cpu=native
```
fr0staman@kotobook:~/source/repos/rust/rust-ahash-target-native-performance-issue$ RUSTFLAGS='-C target-cpu=native' cargo bench
Finished bench [optimized] target(s) in 46.42s
Running unittests src/main.rs (target/release/deps/rust_ahash_target_native_performance_issue-4df22a78d1110619)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/issue.rs (target/release/deps/ahash-3b7ae86a7bc7bacb)
Gnuplot not found, using plotters backend
Performance/ahash/(32, 128)
time: [37.734 µs 37.761 µs 37.789 µs]
change: [+73.336% +73.657% +73.980%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low severe
1 (1.00%) low mild
1 (1.00%) high severe
Performance/ahash/(256, 1024)
time: [2.4681 ms 2.4698 ms 2.4717 ms]
change: [+150.51% +150.90% +151.29%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Performance/ahash/(1024, 4096)
time: [38.308 ms 38.369 ms 38.433 ms]
change: [+149.98% +150.82% +151.60%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
```
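For scale, dividing the middle (point) estimates of the native run by those of the standard run gives roughly 1.74× slower for `(32, 128)` and about 2.5× slower for the two larger inputs, which lines up with Criterion's reported `change`. A small sketch of that arithmetic; the `slowdown` helper is purely illustrative, and the timings are copied from the two runs above and converted to µs.

```rust
/// Slowdown factor of `candidate` relative to `baseline` (same unit for both).
fn slowdown(baseline: f64, candidate: f64) -> f64 {
    candidate / baseline
}

fn main() {
    // (benchmark, standard target, target-cpu=native): Criterion point estimates in µs.
    let runs = [
        ("(32, 128)", 21.698, 37.761),
        ("(256, 1024)", 983.94, 2_469.8),
        ("(1024, 4096)", 15_298.0, 38_369.0),
    ];
    for (name, base, native) in runs {
        let s = slowdown(base, native);
        // Prints the factor and the equivalent percentage change.
        println!("{name:>13}: {s:.2}x slower ({:+.1}%)", (s - 1.0) * 100.0);
    }
}
```

Criterion computes `change` against the previously saved baseline (here, the standard-target run), so small differences from these hand-computed ratios are expected.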
The example can be reduced to `-C target-feature=+aes`: enabling only AES is enough to reproduce the regression.
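Since `+aes` alone is enough, it helps to confirm which target features a build was actually compiled with. `rustc --print cfg -C target-cpu=native` lists the `target_feature` values the compiler will assume, and the same check can be made from inside a binary with `cfg!`. As far as I understand, aHash picks its AES-based implementation at compile time based on `target_feature = "aes"`, which would explain why that single flag switches code paths. A minimal sketch; the `compiled_features` helper and the chosen feature names are illustrative only.

```rust
/// Target features enabled when *this binary* was compiled.
/// `cfg!(target_feature = ...)` is resolved at compile time, so the result
/// reflects RUSTFLAGS (e.g. `-C target-cpu=native` or `-C target-feature=+aes`),
/// not the CPU the binary happens to run on.
fn compiled_features() -> [(&'static str, bool); 4] {
    [
        ("sse2", cfg!(target_feature = "sse2")),
        ("ssse3", cfg!(target_feature = "ssse3")),
        ("aes", cfg!(target_feature = "aes")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    for (name, enabled) in compiled_features() {
        println!("{name:>5}: {}", if enabled { "enabled" } else { "disabled" });
    }
}
```

Comparing `cargo run --release` with `RUSTFLAGS='-C target-feature=+aes' cargo run --release` should show `aes` flipping from disabled to enabled.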
It looks like this bench only hashes `char`, which SHOULD be specialized in both cases (ideally to identical instructions). I'll take a look at this.
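For context on the specialization point: in the standard library, `char`'s `Hash` impl forwards to `Hasher::write_u32`, so a hasher can give `char` its own fast path by overriding that method rather than going through the byte-slice `write`. A minimal sketch with a hypothetical `RecordingHasher` (not aHash's code) that only records which entry point was called:

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical hasher that records which `Hasher` method each value used.
#[derive(Default)]
struct RecordingHasher {
    calls: Vec<&'static str>,
}

impl Hasher for RecordingHasher {
    fn finish(&self) -> u64 {
        // Not a real hash; just a value derived from what was recorded.
        self.calls.len() as u64
    }
    fn write(&mut self, _bytes: &[u8]) {
        self.calls.push("write");
    }
    // Overriding this is where a char-specific fast path would live.
    fn write_u32(&mut self, _i: u32) {
        self.calls.push("write_u32");
    }
}

fn main() {
    let mut h = RecordingHasher::default();
    'x'.hash(&mut h);
    println!("{:?}", h.calls);
}
```

Because the whole bench goes through this one `write_u32` path, the difference between the two builds should be confined to how that method and `finish` are compiled.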
This does not appear to happen on my Intel i9, so there must be something odd in the assembly generated for the Ryzen. If `+aes` gives the same performance as `native`, it is possible it's not picking up the SSE2 instructions for some reason.
@fr0staman If you run `rustc --print=target-cpus`, what does it report as the detected CPU target?
This might be related https://github.com/rust-lang/rust/issues/80633
```
rustc --print=target-cpus
Available CPUs for this target:
native - Select the CPU of the current host (currently znver4).
alderlake
amdfam10
athlon
athlon-4
athlon-fx
athlon-mp
athlon-tbird
athlon-xp
athlon64
athlon64-sse3
atom
atom_sse4_2
atom_sse4_2_movbe
barcelona
bdver1
bdver2
bdver3
bdver4
bonnell
broadwell
btver1
btver2
c3
c3-2
cannonlake
cascadelake
cooperlake
core-avx-i
core-avx2
core2
core_2_duo_sse4_1
core_2_duo_ssse3
core_2nd_gen_avx
core_3rd_gen_avx
core_4th_gen_avx
core_4th_gen_avx_tsx
core_5th_gen_avx
core_5th_gen_avx_tsx
core_aes_pclmulqdq
core_i7_sse4_2
corei7
corei7-avx
emeraldrapids
generic
geode
goldmont
goldmont-plus
goldmont_plus
grandridge
graniterapids
graniterapids-d
graniterapids_d
haswell
i386
i486
i586
i686
icelake-client
icelake-server
icelake_client
icelake_server
ivybridge
k6
k6-2
k6-3
k8
k8-sse3
knl
knm
lakemont
meteorlake
mic_avx512
nehalem
nocona
opteron
opteron-sse3
penryn
pentium
pentium-m
pentium-mmx
pentium2
pentium3
pentium3m
pentium4
pentium4m
pentium_4
pentium_4_sse3
pentium_ii
pentium_iii
pentium_iii_no_xmm_regs
pentium_m
pentium_mmx
pentium_pro
pentiumpro
prescott
raptorlake
rocketlake
sandybridge
sapphirerapids
sierraforest
silvermont
skx
skylake
skylake-avx512
skylake_avx512
slm
tigerlake
tremont
westmere
winchip-c6
winchip2
x86-64 - This is the default target CPU for the current build target (currently x86_64-unknown-linux-gnu).
x86-64-v2
x86-64-v3
x86-64-v4
yonah
znver1
znver2
znver3
znver4
```
The regression also shows up on my machine:
```
rustc --print=target-cpus
Available CPUs for this target:
native - Select the CPU of the current host (currently znver1).
...
```
@tkaitchuck I think this thread might be relevant: https://internals.rust-lang.org/t/slower-code-with-c-target-cpu-native/17315/7
Firefox Profiler profiles: https://share.firefox.dev/3RWEHk5 (without the aes flag), https://share.firefox.dev/48D3E9Y (with the aes flag)
The AES feature is indeed detected.
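One caveat worth keeping in mind: "detected" at runtime and "enabled" at compile time are different things. `is_x86_feature_detected!` reports what the CPU supports, while `cfg!(target_feature = "aes")` reports what the binary was built to assume; a library that selects its implementation via `cfg` only changes behavior with the latter. A minimal sketch contrasting the two (the `runtime_features` helper is illustrative):

```rust
/// What the CPU running this binary supports (runtime detection).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", std::is_x86_feature_detected!("sse2")),
        ("aes", std::is_x86_feature_detected!("aes")),
        ("avx2", std::is_x86_feature_detected!("avx2")),
    ]
}

/// `is_x86_feature_detected!` only exists on x86/x86_64.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

fn main() {
    // Compile-time view: fixed by RUSTFLAGS when the binary was built.
    println!("compiled with aes: {}", cfg!(target_feature = "aes"));
    // Runtime view: what this particular CPU reports.
    for (name, present) in runtime_features() {
        println!("CPU supports {name}: {present}");
    }
}
```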
@fr0staman Can you check whether this is fixed on the 0.9 prerelease branch?
Certainly!
Unfortunately, nothing has changed:
```
fr0staman@kotobook:~/source/repos/rust/rust-ahash-target-native-performance-issue$ RUSTFLAGS='-C target-cpu=native' cargo bench
...
Compiling ahash v0.9.0 (https://github.com/tkaitchuck/aHash?branch=0.9-prerelease#af37d79e)
...
Finished bench [optimized] target(s) in 43.16s
Running unittests src/main.rs (target/release/deps/rust_ahash_target_native_performance_issue-a98c230d15dcf9ae)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/issue.rs (target/release/deps/issue-a3d835f7ef64d9be)
Gnuplot not found, using plotters backend
Performance/ahash/(32, 128)
time: [37.539 µs 37.543 µs 37.546 µs]
change: [+97.437% +97.897% +98.305%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Performance/ahash/(256, 1024)
time: [2.3726 ms 2.3733 ms 2.3740 ms]
change: [+156.12% +156.46% +156.76%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Performance/ahash/(1024, 4096)
time: [38.066 ms 38.109 ms 38.153 ms]
change: [+154.20% +155.09% +155.95%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
```