oj icon indicating copy to clipboard operation
oj copied to clipboard

Optimize `oj_dump_cstr` using SSE4.2 and SSSE3.

Open samyron opened this issue 6 months ago • 6 comments

Hello! Me again.. this time optimizing oj_dump_cstr on x86-64 platforms using SSE4.2 and SSSE3.

This PR is not yet ready for a thorough review but I wanted to get the discussion started which direction to the implementation.

There is still a bit of work to be done, particularly around reducing the duplication between the X86-64/AMD64 and ARM64 SIMD implementations. I haven't spent too much time trying to reduce this duplication. Additionally this only supports the :compat mode but :rails should be quick to support.

I noticed in the oj/ext/extconf.rb there is an option --with-sse42 to compile SSE4.2 support. It looks like this is currently only used in the parser. This PR, however, currently uses runtime instruction set detection to determine if it can use the SSE4.2/SSSE3 functions. I use the __builtin_cpu_supports function which is supported by GCC and clang to determine if the CPU supports the necessary instructions. I'm happy to continue down the path to support runtime CPU detection or switch to compile time support. The benefit of runtime CPU detection is anyone receiving a binary distribution of this library gets the SSE4.2 support so long as the platform compiling the library can compile SSE4.2/SSSE3 instructions (and currently using either GCC or clang). Additionally, consumers of the library don't need to provide any configure options to get the support. However, I'm not set on this approach. If you'd rather me switch this to an explicit compiler flag and/or use the existing --with-sse42 flag, I'm happy to do so.

Additionally, even with runtime instruction set detection, I can provide a compiler flag to disable the feature (or enable.. depends if we want opt-in or opt-out).

Update 2025/06/25

As of commit 37dc86450aa833bf8c2ab768b573b0e28f0b1f95 the SSE4.2/SSSE3 code is behind the --with-sse42 flag. The benchmarks below have been updated too.

Benchmarks

CPU: Intel(R) Core(TM) i7-8850H

Real world benchmarks

develop commit: 2e95f15d9207c18a4ee1eccfb1a2259ccda9c3a8 optimize-oj_dump_cstr-sse4 commit: 37dc86450aa833bf8c2ab768b573b0e28f0b1f95

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.257k i/100ms
Calculating -------------------------------------
               after     12.603k (± 5.0%) i/s   (79.35 μs/i) -     62.850k in   5.001904s

Comparison:
              before:     9014.7 i/s
               after:    12602.6 i/s - 1.40x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after    72.000 i/100ms
Calculating -------------------------------------
               after    652.720 (± 8.9%) i/s    (1.53 ms/i) -      3.312k in   5.119064s

Comparison:
              before:      703.9 i/s
               after:      652.7 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after   137.000 i/100ms
Calculating -------------------------------------
               after      1.387k (± 6.1%) i/s  (721.05 μs/i) -      6.987k in   5.058261s

Comparison:
              before:     1224.0 i/s
               after:     1386.9 i/s - same-ish: difference falls within error


== Encoding ohai.json (20145 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.864k i/100ms
Calculating -------------------------------------
               after     20.398k (± 6.4%) i/s   (49.02 μs/i) -    102.520k in   5.050563s

Comparison:
              before:    20667.6 i/s
               after:    20398.2 i/s - same-ish: difference falls within error

Synthetic (happiest-of-happy paths - but still needs a single escape)

# benchmark_encoding "bytes.128.single-escape-at-end", ([(('a' * 127 + '\n'))] * 10000)

== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after    66.000 i/100ms
Calculating -------------------------------------
               after    756.543 (± 3.8%) i/s    (1.32 ms/i) -      3.828k in   5.067620s

Comparison:
              before:      232.3 i/s
               after:      756.5 i/s - 3.26x  faster

samyron avatar Jun 11 '25 02:06 samyron

Is this still a WIP?

ohler55 avatar Jun 17 '25 16:06 ohler55

Is this still a WIP?

Yes.. I at least need to clean up the warnings. Two questions for you though:

  1. Are you okay you with runtime CPU detection to determine which method to use (on x86-64)? Or would you prefer using a compiler flag to enable/disable the feature?
  2. If using runtime CPU detection, would you like me to use that for the SSE4.2 code in the parser? If you would prefer the compile-time flag, would you like me to use the same --with-sse42 flag for this code too?

samyron avatar Jun 18 '25 03:06 samyron

If the overhead of runtime is trivial then that would be fine but compile time is best if possible.

ohler55 avatar Jun 18 '25 13:06 ohler55

Apologies for the delays... it's been a busy week. I should be able to wrap this up within the next few days.

samyron avatar Jun 22 '25 03:06 samyron

No worries. Same has happens to me on more than one occasion.

ohler55 avatar Jun 22 '25 10:06 ohler55

This should be ready for review. There is still a bit of duplication between the NEON and SSE4.2 code which can probably be made a bit more generic. I'm not entirely sure it's worth it but I'm happy to do so if you'd like.

samyron avatar Jun 26 '25 02:06 samyron