feat: Runtime detection, take 2
What type of PR is this?
feat: A new feature
Check the PR title.
- [x] This PR title matches the format: `<type>(optional scope): <description>`
- [x] The description of this PR title is user-oriented and clear enough for others to understand.
- [ ] Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo
(Optional) More detailed description for this PR (en: English / zh: Chinese).
en:
This PR adds runtime detection of SIMD features, but unlike #55 it does not dispatch at the level of individual SIMD instructions. Instead, it implements enum dispatch over multiple inner parsers, each of which uses AVX2, SSE2, NEON, or the scalar fallback.
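Rough sketch of the dispatch shape (illustrative names only, not the actual sonic-rs types):

```rust
// Sketch only: hypothetical parser types, just to show the enum-dispatch shape.
struct Avx2Parser;
struct Sse2Parser;
struct NeonParser;
struct ScalarParser;

enum ParserDispatch {
    #[cfg(target_arch = "x86_64")]
    Avx2(Avx2Parser),
    #[cfg(target_arch = "x86_64")]
    Sse2(Sse2Parser),
    #[cfg(target_arch = "aarch64")]
    Neon(NeonParser),
    Scalar(ScalarParser),
}

impl ParserDispatch {
    /// Chosen once at runtime; every parsing call then matches on the variant
    /// and stays inside a single backend.
    fn new() -> Self {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                return Self::Avx2(Avx2Parser);
            }
            if std::arch::is_x86_feature_detected!("sse2") {
                return Self::Sse2(Sse2Parser);
            }
        }
        #[cfg(target_arch = "aarch64")]
        {
            if std::arch::is_aarch64_feature_detected!("neon") {
                return Self::Neon(NeonParser);
            }
        }
        Self::Scalar(ScalarParser)
    }
}
```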
(Optional) Which issue(s) this PR fixes:
Closes #14
(optional) The PR that updates user documentation:
This isn't conditionally doing the improvements for the NEON backend via the NeonBits struct yet. Thinking about also adding support for that.
Thanks a lot, I need time to review this.
That commit should get rid of a bunch of compile issues related to adding generic types to structs that don't take them.
I was testing this on an x86 system, not passing any compiler options that would enable architecture-specific optimizations, meaning it only compiled with SSE2 support. I didn't even get to see the errors.
I'm still trying to figure out the best way forward to make to_bitmask64 work on NEON CPUs.
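For reference, the usual NEON substitute for movemask is the narrowing-shift trick below (sketch only, not code from this branch). It yields a nibble per input byte for a 16-byte vector instead of one bit per byte, which is exactly what makes a drop-in to_bitmask64 awkward:

```rust
// Common NEON workaround for the missing movemask (illustrative sketch, not
// sonic-rs code): a narrowing shift right by 4 turns a 16-byte 0x00/0xFF
// comparison result into a u64 holding one nibble per input byte.
#[cfg(target_arch = "aarch64")]
fn neon_nibble_mask(cmp: std::arch::aarch64::uint8x16_t) -> u64 {
    use std::arch::aarch64::*;
    // SAFETY: NEON is part of the AArch64 baseline, so these intrinsics are
    // always available on this target.
    unsafe {
        // View the 16 bytes as eight u16 lanes, shift each right by 4 and
        // narrow to 8 bits: every output byte packs two input bytes, one nibble each.
        let nibbles = vshrn_n_u16::<4>(vreinterpretq_u16_u8(cmp));
        // Read the eight result bytes out as one u64 "nibble mask".
        vget_lane_u64::<0>(vreinterpret_u64_u8(nibbles))
    }
}
```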
Thanks a lot, I will review the PR this week. I will benchmark the performance first~
@liuq19 Would you be up to benching the speed of an implementation for NEON that runtime dispatches the bitmask creation? We could technically cache the result whether NEON (or any other feature, really) is supported in globals. That way the performance loss shouldn't be too bad.
Because hacking in runtime dispatch for bitmask creation otherwise is really tricky.
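Roughly what I mean by caching the detection result in a global (hypothetical names, not the actual code; std's is_*_feature_detected! macros already cache internally, so realistically this only saves some plumbing on the hot path):

```rust
use std::sync::OnceLock;

// Sketch only: cache the feature check in a global so per-call dispatch in the
// bitmask path is a single load plus a branch.
static HAS_NEON: OnceLock<bool> = OnceLock::new();

#[cfg(target_arch = "aarch64")]
fn detect_neon() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_neon() -> bool {
    false
}

fn has_neon() -> bool {
    *HAS_NEON.get_or_init(detect_neon)
}

// Illustrative bitmask helper that branches on the cached flag on every call.
fn bitmask64(chunk: &[u8; 64], needle: u8) -> u64 {
    if has_neon() {
        // A NEON implementation would go here; this sketch falls back to scalar.
        scalar_bitmask64(chunk, needle)
    } else {
        scalar_bitmask64(chunk, needle)
    }
}

fn scalar_bitmask64(chunk: &[u8; 64], needle: u8) -> u64 {
    chunk
        .iter()
        .enumerate()
        .fold(0, |m, (i, &b)| m | (u64::from(b == needle) << i))
}
```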
Hacked in the version that dispatches on each bitmask call. Maybe the performance hit is too severe to justify..
Okay, I'm not sure why this is broken? It's on ARM64, right? I guess I'll have to whip out the cross-compilation for now. I don't own a suitable ARM machine for testing.
Now I just need to find a way to properly express this in trait form, preferably very generic.
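Something in this direction, maybe (made-up names, just to show the shape I have in mind):

```rust
// Hypothetical "trait form": each backend implements one trait, and the
// bitmask representation is an associated type, so NEON could use a
// NeonBits-like nibble mask while x86 keeps a plain u64.
trait Simd {
    /// How many input bytes one step of this backend consumes.
    const LANES: usize;
    /// Backend-specific bitmask representation.
    type Bitmask: Copy;

    fn eq_mask(chunk: &[u8], needle: u8) -> Self::Bitmask;
    fn first_set(mask: Self::Bitmask) -> Option<usize>;
}

// A scalar reference backend to show the trait is implementable.
struct Scalar;

impl Simd for Scalar {
    const LANES: usize = 64;
    type Bitmask = u64;

    fn eq_mask(chunk: &[u8], needle: u8) -> u64 {
        chunk
            .iter()
            .take(Self::LANES)
            .enumerate()
            .fold(0, |m, (i, &b)| m | (u64::from(b == needle) << i))
    }

    fn first_set(mask: u64) -> Option<usize> {
        (mask != 0).then(|| mask.trailing_zeros() as usize)
    }
}

// Generic code is written once against the trait and monomorphized per backend.
fn find_byte<S: Simd>(chunk: &[u8], needle: u8) -> Option<usize> {
    S::first_set(S::eq_mask(chunk, needle))
}
```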
I benchmarked on x86, and maybe the SIMD is not working with runtime detection. There may be some functions that were missed and are not running with SIMD.
Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-016910da72a3ea13)
twitter/sonic_rs::from_slice_unchecked
time: [861.50 µs 862.23 µs 863.02 µs]
change: [+76.254% +76.499% +76.735%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
twitter/sonic_rs::from_slice
time: [885.21 µs 885.86 µs 886.62 µs]
change: [+73.328% +73.607% +73.868%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
citm_catalog/sonic_rs::from_slice_unchecked
time: [1.6269 ms 1.6298 ms 1.6327 ms]
change: [+71.170% +71.493% +71.800%] (p = 0.00 < 0.05)
Performance has regressed.
citm_catalog/sonic_rs::from_slice
time: [1.6559 ms 1.6572 ms 1.6587 ms]
change: [+67.728% +67.938% +68.127%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
canada/sonic_rs::from_slice_unchecked
time: [4.5318 ms 4.5332 ms 4.5349 ms]
change: [+20.602% +20.659% +20.716%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
canada/sonic_rs::from_slice
time: [4.5934 ms 4.5951 ms 4.5970 ms]
change: [+20.546% +20.602% +20.659%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
That's weird. On my local machine, the change is in the ballpark of ~3-4%, which is acceptable. (I'd need to profile it to get a closer idea of what's going on and where the performance is lost; maybe there are optimization opportunities that are too opaque for the compiler with all the generics.)
Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-b15a6d4de21d32b1)
Gnuplot not found, using plotters backend
twitter/sonic_rs::from_slice_unchecked
time: [438.48 µs 440.11 µs 442.07 µs]
change: [+4.0017% +4.6062% +5.2032%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) high mild
10 (10.00%) high severe
twitter/sonic_rs::from_slice
time: [445.74 µs 447.52 µs 449.94 µs]
change: [+3.0931% +3.6056% +4.2171%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
citm_catalog/sonic_rs::from_slice_unchecked
time: [856.18 µs 856.77 µs 857.50 µs]
change: [+1.2295% +1.3723% +1.5158%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
Could you remove or comment out the config in .cargo/config.toml?
It already wasn't active because of my global Cargo config, but for the benches above I did set -C target-cpu=native. I can somewhat reproduce your findings when I enable -C target-cpu=native for the main branch and disable it for the runtime detection branch. But that comparison ignores all the optimizations the Rust compiler skips by default when it doesn't know the target CPU model.
I added debug statements and the runtime correctly detects that my CPU supports AVX2, with and without target-cpu=native.
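The check is essentially this (illustrative, comparing the compile-time feature state with what runtime detection reports):

```rust
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // What the binary was compiled with vs. what the CPU actually reports.
        println!("compile-time avx2: {}", cfg!(target_feature = "avx2"));
        println!("runtime avx2:      {}", std::arch::is_x86_feature_detected!("avx2"));
    }
}
```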
So it is much slower without target-cpu=native, but that is to be expected given all the CPU-specific optimizations LLVM can otherwise do. The runtime detection itself only sets performance back by under 5%, which IMO is acceptable for an opt-in feature.
Never mind, I get what you mean. Let me look into it.
Maybe we can try comparing more benchmarks.
any updates?
Sorry, I've been busy the last two weeks, but I can hopefully do some work today at the airport
Thanks for your contribution. The code has too many conflicts, and we will refactor the parser.