
Introduce ARM Neon SIMD.

Open samyron opened this issue 10 months ago • 10 comments

Version 2 of the introduction of ARM Neon SIMD.

There are currently two implementations:

  1. "Rules" based.
  2. Lookup-table based. This is effectively a SIMD-accelerated version of the scalar implementation.

Benchmarks (Lookup table)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    62.000 i/100ms
          json_coder    67.000 i/100ms
                  oj    30.000 i/100ms
Calculating -------------------------------------
                json    628.035 (±12.7%) i/s    (1.59 ms/i) -      3.162k in   5.118636s
          json_coder    626.843 (±15.8%) i/s    (1.60 ms/i) -      3.082k in   5.079836s
                  oj    352.174 (± 9.4%) i/s    (2.84 ms/i) -      1.740k in   5.005929s

Comparison:
                json:      628.0 i/s
          json_coder:      626.8 i/s - same-ish: difference falls within error
                  oj:      352.2 i/s - 1.78x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    50.000 i/100ms
          json_coder    56.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    632.784 (±27.0%) i/s    (1.58 ms/i) -      3.000k in   5.063991s
          json_coder    628.328 (±16.7%) i/s    (1.59 ms/i) -      3.080k in   5.034271s
                  oj    351.466 (± 9.7%) i/s    (2.85 ms/i) -      1.728k in   5.003977s

Comparison:
                json:      632.8 i/s
          json_coder:      628.3 i/s - same-ish: difference falls within error
                  oj:      351.5 i/s - 1.80x  slower

Benchmarks (Rules based)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    69.000 i/100ms
          json_coder    78.000 i/100ms
                  oj    33.000 i/100ms
Calculating -------------------------------------
                json    758.135 (±22.7%) i/s    (1.32 ms/i) -      3.657k in   5.114664s
          json_coder    800.957 (±11.5%) i/s    (1.25 ms/i) -      3.978k in   5.044465s
                  oj    343.750 (±11.9%) i/s    (2.91 ms/i) -      1.683k in   5.004571s

Comparison:
                json:      758.1 i/s
          json_coder:      801.0 i/s - same-ish: difference falls within error
                  oj:      343.7 i/s - 2.21x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    59.000 i/100ms
          json_coder    53.000 i/100ms
                  oj    37.000 i/100ms
Calculating -------------------------------------
                json    828.807 (±15.1%) i/s    (1.21 ms/i) -      4.071k in   5.060739s
          json_coder    799.688 (±20.1%) i/s    (1.25 ms/i) -      3.816k in   5.019480s
                  oj    364.514 (± 7.1%) i/s    (2.74 ms/i) -      1.850k in   5.100773s

Comparison:
                json:      828.8 i/s
          json_coder:      799.7 i/s - same-ish: difference falls within error
                  oj:      364.5 i/s - 2.27x  slower

I am still working on this but I wanted to share progress.

Edit: Looks like I missed one commit so I'll have to resolve some merge conflicts.

samyron avatar Feb 03 '25 03:02 samyron

The gains seem to be about 7% on real-world benchmarks:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.438k i/100ms
Calculating -------------------------------------
               after     24.763k (± 0.8%) i/s   (40.38 μs/i) -    124.338k in   5.021560s

Comparison:
              before:    23166.2 i/s
               after:    24762.5 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   254.000 i/100ms
Calculating -------------------------------------
               after      2.600k (± 1.3%) i/s  (384.61 μs/i) -     13.208k in   5.080852s

Comparison:
              before:     2439.5 i/s
               after:     2600.0 i/s - 1.07x  faster

Also note that I did one more refactoring to make the introduction of SIMD easier, so you still have a conflict.

byroot avatar Feb 03 '25 09:02 byroot

Can you just include the implementation for the regular escaping? I'm not sure the script safe version is quite worth it.

byroot avatar Feb 03 '25 09:02 byroot

Comparison between master and this branch in real world benchmarks. This is for the lookup table implementation.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.027k i/100ms
Calculating -------------------------------------
               after     21.413k (± 1.6%) i/s   (46.70 μs/i) -    107.431k in   5.018339s

Comparison:
              before:    14448.8 i/s
               after:    21412.9 i/s - 1.48x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   110.000 i/100ms
Calculating -------------------------------------
               after      1.098k (± 1.2%) i/s  (910.41 μs/i) -      5.500k in   5.007977s

Comparison:
              before:      993.9 i/s
               after:     1098.4 i/s - 1.11x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   216.000 i/100ms
Calculating -------------------------------------
               after      2.086k (± 8.9%) i/s  (479.31 μs/i) -     10.368k in   5.034983s

Comparison:
              before:     1642.1 i/s
               after:     2086.3 i/s - 1.27x  faster

Running it a second time:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.042k i/100ms
Calculating -------------------------------------
               after     21.400k (± 1.7%) i/s   (46.73 μs/i) -    108.226k in   5.058877s

Comparison:
              before:    15039.4 i/s
               after:    21399.7 i/s - 1.42x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   109.000 i/100ms
Calculating -------------------------------------
               after      1.094k (± 1.2%) i/s  (913.67 μs/i) -      5.559k in   5.079778s

Comparison:
              before:     1005.4 i/s
               after:     1094.5 i/s - 1.09x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   215.000 i/100ms
Calculating -------------------------------------
               after      2.137k (± 5.5%) i/s  (467.91 μs/i) -     10.750k in   5.050467s

Comparison:
              before:     1639.0 i/s
               after:     2137.1 i/s - 1.30x  faster

samyron avatar Feb 06 '25 02:02 samyron

Not sure why but it's way more modest on my machine (Air M3):

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.603k i/100ms
Calculating -------------------------------------
               after     26.544k (± 1.8%) i/s   (37.67 μs/i) -    132.753k in   5.002890s

Comparison:
              before:    23370.1 i/s
               after:    26543.7 i/s - 1.14x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   136.000 i/100ms
Calculating -------------------------------------
               after      1.368k (± 0.7%) i/s  (730.98 μs/i) -      6.936k in   5.070329s

Comparison:
              before:     1369.9 i/s
               after:     1368.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.702k (± 0.3%) i/s  (370.11 μs/i) -     13.719k in   5.077550s

Comparison:
              before:     2475.0 i/s
               after:     2701.9 i/s - 1.09x  faster

byroot avatar Feb 06 '25 07:02 byroot

Apologies for going dark for a while. I've been trying to make incremental improvements on a different branch (found here). My hope was that using a move mask would be faster than vmaxvq_u8 for determining whether any byte needs to be escaped. It also has the benefit of not needing to store all of the candidate matches, since all that's needed is a uint64_t indicating which bytes need to be escaped. Unfortunately, on my machine it didn't seem to make much of a difference.

Feel free to try it out though.

samyron avatar Feb 10 '25 03:02 samyron

Apologies for going dark for a while

That's no worries at all. I want to release a 2.10.0 with the current change on master, but I'm pairing with Étienne on making sure we have no blind spots on JSON::Coder. So probably gonna happen this week.

After that I think I can start merging some SIMD stuff. I'd like to go with the smallest possible useful SIMD acceleration to ensure it doesn't cause issues for people. If it works well, we can then go further. So yeah, no rush.

byroot avatar Feb 10 '25 08:02 byroot

@byroot if you have a few minutes, would you be able to check out this branch and benchmark it against master? You'll have to tweak your compare script a bit to compile this branch with cmd("bundle", "exec", "rake", "clean", "compile", "--", "--disable-generator-use-simd"). I want to see how your M3 compares with my M1.

This branch uses the bit-twiddling, platform-agnostic SIMD-style code when SIMD is disabled via an extconf.rb flag.

The results on my M1:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.944k i/100ms
Calculating -------------------------------------
               after     19.671k (± 2.5%) i/s   (50.84 μs/i) -     99.144k in   5.043309s

Comparison:
              before:    15135.7 i/s
               after:    19670.9 i/s - 1.30x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   113.000 i/100ms
Calculating -------------------------------------
               after      1.109k (± 2.1%) i/s  (901.49 μs/i) -      5.650k in   5.095561s

Comparison:
              before:     1040.1 i/s
               after:     1109.3 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   204.000 i/100ms
Calculating -------------------------------------
               after      2.006k (± 3.8%) i/s  (498.51 μs/i) -     10.200k in   5.092718s

Comparison:
              before:     1687.4 i/s
               after:     2006.0 i/s - 1.19x  faster

samyron avatar Feb 11 '25 13:02 samyron

With that compilation flag and compared to master:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.326k i/100ms
Calculating -------------------------------------
               after     23.218k (± 1.6%) i/s   (43.07 μs/i) -    116.300k in   5.010271s

Comparison:
              before:    22460.3 i/s
               after:    23218.0 i/s - 1.03x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   132.000 i/100ms
Calculating -------------------------------------
               after      1.290k (± 1.4%) i/s  (775.38 μs/i) -      6.468k in   5.016121s

Comparison:
              before:     1323.6 i/s
               after:     1289.7 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   242.000 i/100ms
Calculating -------------------------------------
               after      2.495k (± 0.6%) i/s  (400.84 μs/i) -     12.584k in   5.044306s

Comparison:
              before:     2449.6 i/s
               after:     2494.8 i/s - 1.02x  faster

byroot avatar Feb 12 '25 12:02 byroot

From a co-worker with an M4 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after     2.876k i/100ms
Calculating -------------------------------------
               after     28.251k (± 3.0%) i/s   (35.40 μs/i) -    143.800k in   5.095128s

Comparison:
              before:    24938.2 i/s
               after:    28251.0 i/s - 1.13x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   154.000 i/100ms
Calculating -------------------------------------
               after      1.516k (± 2.9%) i/s  (659.57 μs/i) -      7.700k in   5.083078s

Comparison:
              before:     1575.4 i/s
               after:     1516.1 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   295.000 i/100ms
Calculating -------------------------------------
               after      2.933k (± 3.3%) i/s  (340.94 μs/i) -     14.750k in   5.034796s

Comparison:
              before:     2678.2 i/s
               after:     2933.0 i/s - 1.10x  faster

samyron avatar Feb 25 '25 02:02 samyron

From another co-worker with an M1 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.166k i/100ms
Calculating -------------------------------------
               after     21.521k (± 1.2%) i/s   (46.47 μs/i) -    108.300k in   5.032957s

Comparison:
              before:    15231.1 i/s
               after:    21521.3 i/s - 1.41x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   108.000 i/100ms
Calculating -------------------------------------
               after      1.062k (± 5.5%) i/s  (941.69 μs/i) -      5.400k in   5.103989s

Comparison:
              before:     1013.4 i/s
               after:     1061.9 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   219.000 i/100ms
Calculating -------------------------------------
               after      2.061k (±12.8%) i/s  (485.22 μs/i) -     10.074k in   5.040974s

Comparison:
              before:     1677.4 i/s
               after:     2060.9 i/s - 1.23x  faster

samyron avatar Feb 26 '25 15:02 samyron

@samyron

I just pushed a PR #769 to this repo which also employs SIMD to speed up string escapes. I am really sorry that we both worked in that area at the same time; after I started my work I didn't check back with this repo for a while (and I should have done that).

I believe the main difference between my PR and yours is that mine supports x86 as well. It does this by using a cross-platform shim, simd.h, from Postgres, which comes with implementations for AVX, Neon, and plain C. Still, on Neon I see somewhat higher gains than those reported here; however, I don't understand where that difference comes from.

I want to suggest to collaborate on getting SIMD support in one way or another. :wave:

radiospiel avatar Mar 16 '25 21:03 radiospiel

Hi @radiospiel, I'll take a look at #769. I originally started working on https://github.com/ruby/json/pull/730 which supports Neon, SSE 4.2 and AVX2 with runtime detection support. The PR got a bit big so I decided to close it and implement each instruction set individually.

Additionally, @byroot refactored the code quite a bit to make the SIMD implementation easier. There are two implementations in this PR: one uses a lookup table and the other is rules-based. Both seem to have similar performance on my machine.

On my machine I see an 11%-48% improvement depending on the benchmark. A few of my co-workers saw various speedups depending on their machines.

I should probably mark this PR as "Ready for Review". However, I'm happy to collaborate either on this or your PR.

Edit: oh yeah, there is an old-school bit-twiddling SIMD approach in pure C: https://github.com/ruby/json/pull/738

samyron avatar Mar 18 '25 01:03 samyron

Thank you, @samyron .

I became painfully aware of the work you did when I tried to merge master into my branch, because the interfaces of the escape functions had changed; my implementation relies on an "escape me a uchar[] array into an fbuffer" entry point, which is no longer available with what's in master today :)

The main difference between your approach and mine is that you switch out the search functionality depending on the availability of SIMD, while I switch out the SIMD primitives instead. This gives me working implementations for x86, ARM, and bit-twiddling, but only a handful of primitives are available because NEON and AVX are different, so your approach should allow for per-hardware optimal implementations.

I have a busy week ahead of me, but I will definitely take a look at the end of the week. I will also benchmark on Graviton instances; most ARM server workloads are probably not on an Apple Silicon CPU after all :) Happy to benchmark this PR as well.

Can you share a benchmark script that produces the most useful output for you? I would be especially interested in understanding how you get the "before" and "after" entries in the benchmark output :)

Speaking of benchmarks:

On my machine I see a 11x-48x improvement depending on the benchmark.

This is magnitudes more than the numbers posted here. I have seen a 48% posted above (on the activitypub testcase), so is the "x" a typo for "%"? The activitypub testcase apparently lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1)

radiospiel avatar Mar 18 '25 08:03 radiospiel

This is magnitudes more than the numbers posted here. I have seen a 48% posted above (on the activitypub testcase), so is the "x" a typo for "%"? The activitypub testcase apparently lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1)

Apologies, yes, that was a typo. I'll fix it in the comment above.

samyron avatar Mar 18 '25 16:03 samyron

@samyron I reran benchmarks (link). Both our PRs show a substantial improvement over the baseline; the only significant difference is on short strings.

Encoding Type      json 2.10.2       samyron           radiospiel
strings.ascii      13.046k (± 1.6%)  29.681k (± 1.9%)  33.583k (± 3.0%)
strings.escapes    4.608k (± 1.9%)   10.765k (± 2.2%)  9.681k (± 2.5%)
strings.mixed      32.971k (± 1.4%)  88.580k (± 2.1%)  90.133k (± 3.2%)
strings.multibyte  32.836k (± 2.0%)  89.385k (± 3.0%)  89.475k (± 2.1%)
strings.short      91.819k (± 9.8%)  95.388k (± 2.5%)  133.008k (± 2.6%)
strings.tests      21.350k (± 4.1%)  22.538k (± 2.7%)  22.600k (± 2.5%)

strings.short is a test on a 13-byte string ("b" * 5) + "€" + ("a" * 5), which is shorter than the size of the SIMD buffer (which in my case is 16 bytes).

I believe such short strings are relevant, because JSON object keys are probably quite often shorter than 16 bytes; my PR applies SIMD for strings of 8 bytes and more (link). (The value of 8 seemed beneficial and looked nice, but I should probably retest this with smaller values.)

Maybe you could support that as well?

radiospiel avatar Mar 23 '25 19:03 radiospiel

@byroot we have two competing implementations of the same approach. While mine is probably more beneficial in the short term (because it also supports x86), I believe that @samyron's approach has more future potential, because it allows handcrafted SIMD implementations that are fundamentally different between NEON and SSE2 (and it can certainly be extended to also support shorter strings, see the comment above).

Also, transplanting a x86 implementation from my PR into @samyron 's shouldn't be too hard to achieve.

I see the following alternatives:

  • we scrap mine, @samyron adds support for shorter strings, and, in a follow-up, we transplant SSE2 into @samyron's;
  • we merge mine, with the understanding that @samyron's will be merged in at a later point, with SSE2 support right out of the box; mine will be removed again

What do you all think about that? ☝️

radiospiel avatar Mar 23 '25 19:03 radiospiel

* we scrap mine, @samyron adds support for shorter strings, and, in a follow-up, we transplant SSE2 into @samyron's;

I'm not saying we have to scrap your PR, but I did incorporate your idea of using SIMD when fewer than 16 characters remain. I tried this earlier using temporary storage and scrapped the idea, as it made things slower. However, your idea of using the output FBuffer as the temporary storage seems to be the trick. If it turns out that nothing in the chunk needs to be escaped, it's a considerable speedup.

Additionally, at the moment the "rules-based" approach seems to be faster than the "lookup table" based approach.

samyron avatar Mar 24 '25 01:03 samyron

What do you all think about that? ☝️

I think I like this PR's architecture a bit better.

That being said, the big decider for me is x86 support with runtime feature detection. For now, ARM in production is rare enough that doing SIMD with only ARM support isn't worth the extra complexity.

byroot avatar Mar 24 '25 08:03 byroot

Also we should have a way to disable SIMD with a compile flag, so that all codepaths can be exercised on CI.

byroot avatar Mar 24 '25 08:03 byroot

x86 support with runtime support detection.

There are no 64-bit x86 CPUs without SSE2, so I think runtime support detection is not necessary. (link)

For now ARM in production is rare enough that doing SIMD with only ARM support isn't worth the extra complexity.

We are near exclusively running our servers on Graviton; ARM support makes a large difference to me.

radiospiel avatar Mar 24 '25 12:03 radiospiel

There are no 64-bit x86 CPUs without SSE2

Ah, interesting. I definitely expected most CPUs made in the last 20 years to have it, but was worried some low-power stuff like Atom might not.

That definitely simplifies things. I guess we'll only need a runtime check for newer stuff like AVX or SVE2.

We are near exclusively running our servers on Graviton

Yeah, I know it's a possibility, just saying I have to arbitrate between added complexity and benefits to the majority of users.

So if you're OK with consolidating your PR with this one, let's do that. I'd just like to reiterate that I'd like to take things slow: go with the simplest useful SIMD acceleration first, make sure it doesn't cause any issues, and then we can iterate and optimize more routines.

byroot avatar Mar 24 '25 12:03 byroot

Also we should have a way to disable SIMD with a compile flag, so that all codepaths can be exercised on CI.

This is already supported by this PR. Running rake -- --disable-generator-use-simd will disable all SIMD and fall back to the scalar code path. I'm happy to rename this and/or make the SIMD code path opt-in instead of opt-out.

samyron avatar Mar 24 '25 14:03 samyron

and/or make the SIMD code path opt-in instead of opt-out.

Opt-out is fine; my only concern is the non-SIMD codepath being tested on CI. On GitHub Actions we only have x86 and ARM, so we'll always end up on a SIMD path, but ruby-core CI has many other archs, and I don't want to discover bugs there.

byroot avatar Mar 24 '25 15:03 byroot

So to recap what Jean is saying, here is a checklist to finish this PR:

    1. support for x86/SSE2
    2. runtime detection for SSE2: not necessary
    3. runtime detection for NEON: I am not convinced that this is really necessary
    4. make sure SIMD is only enabled on 64-bit (because x86/ARM 32-bit would require runtime detection, but do we really care?)
    5. extend CI to also run the pure C code path.

Did I miss something from that list?

I can take up 1. as soon as I find a couple of hours to do so; this should be possible in the next 10 days. @samyron can you take up 3. and 4.? And 5. probably lies with @byroot ?

Thanks folks!

radiospiel avatar Mar 25 '25 07:03 radiospiel

* 3. runtime detection for NEON: I am not convinced that this is really necessary

* 4. make sure SIMD is only enabled on 64-bit (because x86/ARM 32-bit would require runtime detection, but do we really care?)

I can take up 1. as soon as I find a couple of hours to do so; this should be possible in the next 10 days. @samyron can you take up 3. and 4.? And 5. probably lies with @byroot ?

Apologies for the delay, I was traveling, and I'm traveling again next week. I should have some time to work on this, though.

With respect to runtime detection of Neon, it looks like this may be a good reference. On Linux, at least, it looks like we must read from /proc/cpuinfo or /proc/self/auxv.

I'll need to do some investigation to figure out how to do runtime detection on macOS and/or whether it's necessary at all.

With respect to runtime detection on x86, GCC and Clang both support __builtin_cpu_supports to determine whether a target ISA is supported. See this closed PR. I'm happy to take that TODO. I'm also happy to take the action of adding the SSE2 support too. I believe I already have that code working (again, from the previous PR). That previous code may now target SSE4.2, but at one point it did target SSE2.

samyron avatar Mar 28 '25 02:03 samyron

I'm happy to take that TODO. I'm also happy to take the action of adding the SSE2 support too. I believe I already have that code working (again from the previous PR). That previous code may now target SSE4.2 but at one point it did target SSE2.

Thanks, Scott, that sounds amazing. I am happy to assist you with benchmarking or code review, ping me if I can be of any help. :wave:

radiospiel avatar Mar 28 '25 07:03 radiospiel

On Linux, at least, it looks like we must read from /proc/cpuinfo or /proc/self/auxv.

As we discussed previously, we can probably assume NEON support is there. We'd only need runtime detection on ARM if we try to use SVE2.

byroot avatar Mar 28 '25 08:03 byroot

@byroot @radiospiel

I worked on this quite a bit this weekend. The current status:

  • ARM Neon Implementation
  • x86-64 SSE2 implementation with processor support detected at runtime. This is supported on Clang and GCC.
  • Compile flag to completely disable SIMD. Example: rake -- --disable-generator-use-simd.
  • The default ARM Neon implementation is the "rules-based" or "direct comparison" implementation. The lookup-table based approach can be enabled via --enable-generator-use-neon-lut
  • Updated both the Neon and SSE2 implementations a bit to use a matches mask to detect the positions of matching characters.
  • Quite a bit of refactoring.

Would love your thoughts and feedback. I'm pretty happy with it in its current state. There are probably some cleanups and/or naming-consistency issues to address.

samyron avatar Apr 07 '25 02:04 samyron

Comparison between master and this branch in real-world benchmarks as of the latest commit, on my M1 MacBook Air.

== Encoding activitypub.json (52595 bytes)
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]
Warming up --------------------------------------
               after     2.132k i/100ms
Calculating -------------------------------------
               after     21.819k (± 3.1%) i/s   (45.83 μs/i) -    110.864k in   5.086121s

Comparison:
              before:    14380.4 i/s
               after:    21818.9 i/s - 1.52x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]
Warming up --------------------------------------
               after   103.000 i/100ms
Calculating -------------------------------------
               after      1.070k (± 2.2%) i/s  (934.17 μs/i) -      5.356k in   5.005904s

Comparison:
              before:      966.4 i/s
               after:     1070.5 i/s - 1.11x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]
Warming up --------------------------------------
               after   217.000 i/100ms
Calculating -------------------------------------
               after      2.275k (± 3.0%) i/s  (439.49 μs/i) -     11.501k in   5.059238s

Comparison:
              before:     1624.7 i/s
               after:     2275.4 i/s - 1.40x  faster

samyron avatar Apr 07 '25 02:04 samyron

Surprisingly, the gains are much more modest on my M3:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.761k i/100ms
Calculating -------------------------------------
               after     28.272k (± 1.3%) i/s   (35.37 μs/i) -    143.572k in   5.079079s

Comparison:
              before:    23347.1 i/s
               after:    28272.1 i/s - 1.21x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   137.000 i/100ms
Calculating -------------------------------------
               after      1.392k (± 0.7%) i/s  (718.59 μs/i) -      6.987k in   5.021027s

Comparison:
              before:     1456.2 i/s
               after:     1391.6 i/s - 1.05x  slower


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   276.000 i/100ms
Calculating -------------------------------------
               after      2.803k (± 1.6%) i/s  (356.71 μs/i) -     14.076k in   5.022315s

Comparison:
              before:     2515.3 i/s
               after:     2803.4 i/s - 1.11x  faster

Still nice though, and x86 is where I'd expect the most gain (I don't have one to test though).

byroot avatar Apr 07 '25 09:04 byroot