sonic-rs feat: Runtime detection, take 2

What type of PR is this?

feat: A new feature

Check the PR title.

[x] This PR title match the format: <type>(optional scope): <description>
[x] The description of this PR title is user-oriented and clear enough for others to understand.
[ ] Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:

This PR adds runtime detection of SIMD features but, unlike in #55, not on the level of SIMD instructions, but instead implements enum dispatch over multiple inner parsers that each either use AVX2, SSE2, or NEON (or the scalar fallback).

(Optional) Which issue(s) this PR fixes:

Closes #14

(optional) The PR that updates user documentation:

May 25 '24 15:05 aumetra

This isn't conditionally doing the improvements for the NEON backend via the NeonBits struct yet. Thinking about also adding support for that.

May 25 '24 19:05 aumetra

Thanks a lot, i need time to review this

May 30 '24 05:05 liuq19

That commit should get rid of a bunch of compile issues related to adding generic types to structs that don't take them.
I was testing this on an x86 system, not passing any compiler options that would enable architecture-specific optimizations, meaning it only compiled with SSE2 support. I didn't even get to see the errors.

Jun 01 '24 14:06 aumetra

I'm still trying to figure out what the best way forward is to make to_bitmask64 on NEON CPUs work.

Jun 03 '24 13:06 aumetra

Thanks a lot, I will review the PR this week. I will benchmark the performance at first~

Jun 25 '24 11:06 liuq19

@liuq19 Would you be up to benching the speed of an implementation for NEON that runtime dispatches the bitmask creation? We could technically cache the result whether NEON (or any other feature, really) is supported in globals. That way the performance loss shouldn't be too bad.

Because hacking in runtime dispatch for bitmask creation otherwise is really tricky.

Aug 20 '24 14:08 aumetra

Hacked in the version that dispatches on each bitmask call. Maybe the performance hit is too severe to justify..

Aug 20 '24 14:08 aumetra

Okay, I'm not sure why this is broken? It's on ARM64, right? I guess I'll have to whip out the cross-compilation for now. I don't own a suitable ARM machine for testing.

Aug 21 '24 10:08 aumetra

Now I just need to find a way to properly express this in trait form, preferrably very generic.

Aug 21 '24 12:08 aumetra

I benched in x86 and maybe the simd is not work in runtime-detect. There maybe some function missed and not working in simd

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-016910da72a3ea13)
twitter/sonic_rs::from_slice_unchecked
                        time:   [861.50 µs 862.23 µs 863.02 µs]
                        change: [+76.254% +76.499% +76.735%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [885.21 µs 885.86 µs 886.62 µs]
                        change: [+73.328% +73.607% +73.868%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [1.6269 ms 1.6298 ms 1.6327 ms]
                        change: [+71.170% +71.493% +71.800%] (p = 0.00 < 0.05)
                        Performance has regressed.
citm_catalog/sonic_rs::from_slice
                        time:   [1.6559 ms 1.6572 ms 1.6587 ms]
                        change: [+67.728% +67.938% +68.127%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

canada/sonic_rs::from_slice_unchecked
                        time:   [4.5318 ms 4.5332 ms 4.5349 ms]
                        change: [+20.602% +20.659% +20.716%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
canada/sonic_rs::from_slice
                        time:   [4.5934 ms 4.5951 ms 4.5970 ms]
                        change: [+20.546% +20.602% +20.659%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

Aug 21 '24 14:08 liuq19

That's weird. On my local machine, the change is somewhat in the ballpark of ~3-4%, which is acceptable (I'd need to profile it to get a closer idea of what's going on; where the performance is lost. Maybe some optimization opportunities that are too opaque for the compiler with all the generics)

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-b15a6d4de21d32b1)
Gnuplot not found, using plotters backend
twitter/sonic_rs::from_slice_unchecked
                        time:   [438.48 µs 440.11 µs 442.07 µs]
                        change: [+4.0017% +4.6062% +5.2032%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [445.74 µs 447.52 µs 449.94 µs]
                        change: [+3.0931% +3.6056% +4.2171%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [856.18 µs 856.77 µs 857.50 µs]
                        change: [+1.2295% +1.3723% +1.5158%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

Aug 21 '24 14:08 aumetra

could you remove or comment the config in .cargo/config.toml

Aug 21 '24 14:08 liuq19

Already wasn't active due to my global Cargo config. But for the benches above I set -C target-cpu=native. I can somewhat reproduce your findings when I toggle -C target-cpu=native for the main branch, and disable it for the runtime detection branch. But that ignores that all the optimizations the Rust compiler by default doesn't do if it isn't aware of the target CPU model.

I added debug statements and the runtime correctly detects that my CPU supports AVX2, with and without target-cpu=native.

Aug 21 '24 15:08 aumetra

So it is much slower without target-cpu=native but that is to be expected with all the CPU-specific optimizations LLVM can do. But the runtime detection only sets performance back by under 5%, which is IMO acceptable as an opt-in feature.

Aug 21 '24 15:08 aumetra

Never mind, I get what you mean. Let me look into it.

Aug 21 '24 15:08 aumetra

maybe we can try to compare more benchmarks

Aug 22 '24 03:08 liuq19

any updates?

Sep 03 '24 05:09 liuq19

Sorry, I've been busy the last two weeks, but I can hopefully do some work today at the airport

Sep 03 '24 05:09 aumetra

thanks for your contrib. the code has too many conflicts and we will refactor the parser.

Jan 17 '25 03:01 liuq19