Use bit twiddling to speed up JSON generation.
This was created separately from the SIMD branch.
This effectively inlines memchr(ptr, '"', len) and memchr(ptr, '\\', len), along with a check for whether any byte in the chunk is less than 0x20.
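As a rough illustration of the idea (not the code in this PR), the three checks can be done on 8 bytes at a time using the classic "has zero byte" / "has byte less than n" word tricks. All names below (`chunk_needs_escape`, `find_escape_candidate`, and the helpers) are hypothetical, and the sketch only answers "does this chunk contain a byte that needs escaping?", falling back to a byte-by-byte loop to locate it:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if any byte of v is 0x00. */
static inline uint64_t has_zero_byte(uint64_t v)
{
    return (v - ONES) & ~v & HIGHS;
}

/* Nonzero if any byte of v is less than n. Only valid for n <= 128. */
static inline uint64_t has_byte_less_than(uint64_t v, uint8_t n)
{
    return (v - ONES * n) & ~v & HIGHS;
}

/* Nonzero if any byte of v equals c (XOR turns matches into zero bytes). */
static inline uint64_t has_byte_equal_to(uint64_t v, uint8_t c)
{
    return has_zero_byte(v ^ (ONES * c));
}

/* True if the 8-byte chunk contains '"', '\\', or a byte below 0x20. */
static inline int chunk_needs_escape(uint64_t chunk)
{
    return (has_byte_less_than(chunk, 0x20)
            | has_byte_equal_to(chunk, '"')
            | has_byte_equal_to(chunk, '\\')) != 0;
}

/* Returns the index of the first byte that needs escaping, or len if none. */
static size_t find_escape_candidate(const unsigned char *ptr, size_t len)
{
    size_t pos = 0;

    /* Skip clean 8-byte chunks; memcpy avoids unaligned-load problems. */
    while (pos + 8 <= len) {
        uint64_t chunk;
        memcpy(&chunk, ptr + pos, sizeof(chunk));
        if (chunk_needs_escape(chunk)) break;
        pos += 8;
    }

    /* Locate the exact byte in the flagged chunk, or handle the tail. */
    while (pos < len) {
        unsigned char ch = ptr[pos];
        if (ch < 0x20 || ch == '"' || ch == '\\') break;
        pos++;
    }
    return pos;
}
```

Because only the yes/no answer per chunk is used, the result does not depend on byte order, and bytes at or above 0x80 (UTF-8 continuation and lead bytes) are never flagged.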
Benchmarks
MacBook Air M1
This Branch
== Encoding small mixed (34 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 442.754k i/100ms
json_coder 474.965k i/100ms
oj 419.655k i/100ms
Calculating -------------------------------------
json 4.380M (± 5.3%) i/s (228.29 ns/i) - 22.138M in 5.071138s
json_coder 4.721M (± 3.2%) i/s (211.84 ns/i) - 23.748M in 5.036884s
oj 4.223M (± 1.6%) i/s (236.80 ns/i) - 21.402M in 5.069275s
Comparison:
json: 4380401.0 i/s
json_coder: 4720500.3 i/s - same-ish: difference falls within error
oj: 4223023.2 i/s - same-ish: difference falls within error
== Encoding small nested array (121 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 214.704k i/100ms
json_coder 217.984k i/100ms
oj 172.417k i/100ms
Calculating -------------------------------------
json 2.134M (± 2.8%) i/s (468.71 ns/i) - 10.735M in 5.036328s
json_coder 2.119M (±12.8%) i/s (471.89 ns/i) - 10.463M in 5.074812s
oj 1.728M (± 0.4%) i/s (578.56 ns/i) - 8.793M in 5.087548s
Comparison:
json: 2133536.6 i/s
json_coder: 2119147.6 i/s - same-ish: difference falls within error
oj: 1728423.1 i/s - 1.23x slower
== Encoding small hash (65 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 448.998k i/100ms
json_coder 480.577k i/100ms
oj 485.067k i/100ms
Calculating -------------------------------------
json 4.485M (± 0.4%) i/s (222.96 ns/i) - 22.450M in 5.005564s
json_coder 4.729M (± 2.0%) i/s (211.45 ns/i) - 24.029M in 5.082987s
oj 4.847M (± 0.5%) i/s (206.30 ns/i) - 24.253M in 5.003676s
Comparison:
json: 4485075.4 i/s
oj: 4847239.0 i/s - 1.08x faster
json_coder: 4729330.3 i/s - 1.05x faster
== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 60.000 i/100ms
json_coder 59.000 i/100ms
oj 34.000 i/100ms
Calculating -------------------------------------
json 538.363 (±14.9%) i/s (1.86 ms/i) - 2.640k in 5.008175s
json_coder 544.629 (±12.3%) i/s (1.84 ms/i) - 2.714k in 5.059221s
oj 357.057 (± 3.6%) i/s (2.80 ms/i) - 1.802k in 5.053765s
Comparison:
json: 538.4 i/s
json_coder: 544.6 i/s - same-ish: difference falls within error
oj: 357.1 i/s - 1.51x slower
== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 53.000 i/100ms
json_coder 46.000 i/100ms
oj 34.000 i/100ms
Calculating -------------------------------------
json 524.858 (± 7.4%) i/s (1.91 ms/i) - 2.650k in 5.077503s
json_coder 543.170 (± 7.0%) i/s (1.84 ms/i) - 2.714k in 5.020620s
oj 351.649 (± 3.7%) i/s (2.84 ms/i) - 1.768k in 5.034501s
Comparison:
json: 524.9 i/s
json_coder: 543.2 i/s - same-ish: difference falls within error
oj: 351.6 i/s - 1.49x slower
== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 8.123k i/100ms
json_coder 7.976k i/100ms
oj 7.332k i/100ms
Calculating -------------------------------------
json 80.441k (± 1.1%) i/s (12.43 μs/i) - 406.150k in 5.049727s
json_coder 80.854k (± 1.3%) i/s (12.37 μs/i) - 406.776k in 5.031830s
oj 73.209k (± 0.8%) i/s (13.66 μs/i) - 366.600k in 5.007896s
Comparison:
json: 80440.5 i/s
json_coder: 80853.6 i/s - same-ish: difference falls within error
oj: 73208.8 i/s - 1.10x slower
== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 2.045k i/100ms
json_coder 1.999k i/100ms
oj 1.582k i/100ms
Calculating -------------------------------------
json 20.584k (± 5.4%) i/s (48.58 μs/i) - 104.295k in 5.082094s
json_coder 21.065k (± 3.3%) i/s (47.47 μs/i) - 105.947k in 5.035064s
oj 15.678k (± 2.5%) i/s (63.78 μs/i) - 79.100k in 5.048520s
Comparison:
json: 20584.0 i/s
json_coder: 21065.5 i/s - same-ish: difference falls within error
oj: 15678.2 i/s - 1.31x slower
== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 109.000 i/100ms
json_coder 109.000 i/100ms
oj 92.000 i/100ms
Calculating -------------------------------------
json 1.108k (± 2.2%) i/s (902.53 μs/i) - 5.559k in 5.019493s
json_coder 1.105k (± 2.7%) i/s (904.61 μs/i) - 5.559k in 5.032499s
oj 914.626 (± 1.9%) i/s (1.09 ms/i) - 4.600k in 5.031155s
Comparison:
json: 1108.0 i/s
json_coder: 1105.5 i/s - same-ish: difference falls within error
oj: 914.6 i/s - 1.21x slower
== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 201.000 i/100ms
json_coder 210.000 i/100ms
oj 188.000 i/100ms
Calculating -------------------------------------
json 2.117k (± 2.6%) i/s (472.28 μs/i) - 10.653k in 5.034844s
json_coder 2.169k (± 3.0%) i/s (460.95 μs/i) - 10.920k in 5.038295s
oj 1.915k (± 2.9%) i/s (522.32 μs/i) - 9.588k in 5.012425s
Comparison:
json: 2117.4 i/s
json_coder: 2169.4 i/s - same-ish: difference falls within error
oj: 1914.5 i/s - 1.11x slower
== Encoding canada.json (2090234 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 1.000 i/100ms
json_coder 1.000 i/100ms
oj 1.000 i/100ms
Calculating -------------------------------------
json 10.820 (± 9.2%) i/s (92.42 ms/i) - 54.000 in 5.017790s
json_coder 10.958 (± 0.0%) i/s (91.26 ms/i) - 55.000 in 5.019486s
oj 10.684 (± 0.0%) i/s (93.60 ms/i) - 54.000 in 5.054718s
Comparison:
json: 10.8 i/s
json_coder: 11.0 i/s - same-ish: difference falls within error
oj: 10.7 i/s - same-ish: difference falls within error
== Encoding many #to_json calls (2701 bytes)
json_coder unsupported (Object not allowed in JSON)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 2.369k i/100ms
oj 1.970k i/100ms
Calculating -------------------------------------
json 22.647k (±11.0%) i/s (44.16 μs/i) - 111.343k in 5.007278s
oj 19.705k (± 0.8%) i/s (50.75 μs/i) - 100.470k in 5.099096s
Comparison:
json: 22646.9 i/s
oj: 19704.8 i/s - 1.15x slower
Master
== Encoding small mixed (34 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 408.721k i/100ms
json_coder 463.041k i/100ms
oj 427.971k i/100ms
Calculating -------------------------------------
json 4.374M (± 1.1%) i/s (228.63 ns/i) - 22.071M in 5.046598s
json_coder 4.594M (± 3.9%) i/s (217.68 ns/i) - 23.152M in 5.048782s
oj 4.207M (± 1.8%) i/s (237.71 ns/i) - 21.399M in 5.088352s
Comparison:
json: 4373951.5 i/s
json_coder: 4593891.7 i/s - same-ish: difference falls within error
oj: 4206829.7 i/s - 1.04x slower
== Encoding small nested array (121 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 204.636k i/100ms
json_coder 208.879k i/100ms
oj 164.519k i/100ms
Calculating -------------------------------------
json 2.025M (± 1.5%) i/s (493.93 ns/i) - 10.232M in 5.054997s
json_coder 2.079M (± 1.7%) i/s (480.97 ns/i) - 10.444M in 5.024722s
oj 1.728M (± 1.0%) i/s (578.84 ns/i) - 8.720M in 5.047656s
Comparison:
json: 2024578.5 i/s
json_coder: 2079136.1 i/s - same-ish: difference falls within error
oj: 1727606.7 i/s - 1.17x slower
== Encoding small hash (65 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 449.534k i/100ms
json_coder 480.253k i/100ms
oj 480.245k i/100ms
Calculating -------------------------------------
json 4.464M (± 0.7%) i/s (224.02 ns/i) - 22.477M in 5.035527s
json_coder 4.748M (± 1.2%) i/s (210.61 ns/i) - 24.013M in 5.058006s
oj 4.633M (± 3.5%) i/s (215.83 ns/i) - 23.532M in 5.085593s
Comparison:
json: 4463831.3 i/s
json_coder: 4748140.8 i/s - 1.06x faster
oj: 4633342.9 i/s - same-ish: difference falls within error
== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 34.000 i/100ms
json_coder 35.000 i/100ms
oj 35.000 i/100ms
Calculating -------------------------------------
json 348.297 (± 8.0%) i/s (2.87 ms/i) - 1.734k in 5.013098s
json_coder 362.582 (± 7.4%) i/s (2.76 ms/i) - 1.820k in 5.049010s
oj 352.399 (± 3.7%) i/s (2.84 ms/i) - 1.785k in 5.072121s
Comparison:
json: 348.3 i/s
json_coder: 362.6 i/s - same-ish: difference falls within error
oj: 352.4 i/s - same-ish: difference falls within error
== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 37.000 i/100ms
json_coder 34.000 i/100ms
oj 35.000 i/100ms
Calculating -------------------------------------
json 356.095 (± 5.3%) i/s (2.81 ms/i) - 1.776k in 5.002047s
json_coder 352.925 (± 6.2%) i/s (2.83 ms/i) - 1.768k in 5.029325s
oj 354.508 (± 3.4%) i/s (2.82 ms/i) - 1.785k in 5.040838s
Comparison:
json: 356.1 i/s
oj: 354.5 i/s - same-ish: difference falls within error
json_coder: 352.9 i/s - same-ish: difference falls within error
== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 7.849k i/100ms
json_coder 7.886k i/100ms
oj 7.325k i/100ms
Calculating -------------------------------------
json 78.319k (± 1.4%) i/s (12.77 μs/i) - 392.450k in 5.011962s
json_coder 78.569k (± 1.1%) i/s (12.73 μs/i) - 394.300k in 5.019102s
oj 72.923k (± 0.8%) i/s (13.71 μs/i) - 366.250k in 5.022750s
Comparison:
json: 78319.0 i/s
json_coder: 78569.2 i/s - same-ish: difference falls within error
oj: 72922.6 i/s - 1.07x slower
== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 1.718k i/100ms
json_coder 1.748k i/100ms
oj 1.545k i/100ms
Calculating -------------------------------------
json 17.558k (± 3.1%) i/s (56.95 μs/i) - 89.336k in 5.093146s
json_coder 17.814k (± 3.3%) i/s (56.13 μs/i) - 89.148k in 5.009813s
oj 15.292k (± 3.6%) i/s (65.40 μs/i) - 77.250k in 5.058386s
Comparison:
json: 17558.0 i/s
json_coder: 17814.3 i/s - same-ish: difference falls within error
oj: 15291.5 i/s - 1.15x slower
== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 103.000 i/100ms
json_coder 107.000 i/100ms
oj 88.000 i/100ms
Calculating -------------------------------------
json 1.023k (±10.7%) i/s (977.19 μs/i) - 5.047k in 5.046689s
json_coder 1.068k (± 3.4%) i/s (936.16 μs/i) - 5.350k in 5.014254s
oj 895.747 (± 3.0%) i/s (1.12 ms/i) - 4.488k in 5.014907s
Comparison:
json: 1023.3 i/s
json_coder: 1068.2 i/s - same-ish: difference falls within error
oj: 895.7 i/s - same-ish: difference falls within error
== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 189.000 i/100ms
json_coder 185.000 i/100ms
oj 178.000 i/100ms
Calculating -------------------------------------
json 1.952k (± 2.9%) i/s (512.35 μs/i) - 9.828k in 5.039805s
json_coder 1.975k (± 2.3%) i/s (506.32 μs/i) - 9.990k in 5.060905s
oj 1.929k (± 2.4%) i/s (518.51 μs/i) - 9.790k in 5.079165s
Comparison:
json: 1951.8 i/s
json_coder: 1975.0 i/s - same-ish: difference falls within error
oj: 1928.6 i/s - same-ish: difference falls within error
== Encoding canada.json (2090234 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 1.000 i/100ms
json_coder 1.000 i/100ms
oj 1.000 i/100ms
Calculating -------------------------------------
json 10.785 (± 0.0%) i/s (92.72 ms/i) - 55.000 in 5.111775s
json_coder 10.845 (± 0.0%) i/s (92.21 ms/i) - 55.000 in 5.072654s
oj 10.705 (± 0.0%) i/s (93.41 ms/i) - 54.000 in 5.044590s
Comparison:
json: 10.8 i/s
json_coder: 10.8 i/s - 1.01x faster
oj: 10.7 i/s - 1.01x slower
== Encoding many #to_json calls (2701 bytes)
json_coder unsupported (Object not allowed in JSON)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 2.356k i/100ms
oj 1.967k i/100ms
Calculating -------------------------------------
json 22.902k (± 7.6%) i/s (43.66 μs/i) - 115.444k in 5.081692s
oj 19.756k (± 1.1%) i/s (50.62 μs/i) - 100.317k in 5.078382s
Comparison:
json: 22902.2 i/s
oj: 19756.4 i/s - 1.16x slower
Relative gains on M3 compared to master on the macro benchmarks:
== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
after 2.529k i/100ms
Calculating -------------------------------------
after 25.594k (± 0.9%) i/s (39.07 μs/i) - 128.979k in 5.039874s
Comparison:
before: 22137.7 i/s
after: 25594.0 i/s - 1.16x faster
== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
after 135.000 i/100ms
Calculating -------------------------------------
after 1.365k (± 0.4%) i/s (732.82 μs/i) - 6.885k in 5.045549s
Comparison:
before: 1371.5 i/s
after: 1364.6 i/s - same-ish: difference falls within error
== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
after 266.000 i/100ms
Calculating -------------------------------------
after 2.679k (± 0.6%) i/s (373.22 μs/i) - 13.566k in 5.063340s
Comparison:
before: 2379.6 i/s
after: 2679.4 i/s - 1.13x faster
This one is interesting, as it doesn't require any of the annoying feature detection that SIMD imposes.
But if we end up going with SIMD anyway, might as well not bother with this, right?
That is a judgement call. It's nice to have a pure C implementation that doesn't require any special instructions.
If we do go the SIMD route, and assuming ARM Neon (Apple M-series chips and, according to Wikipedia, AWS Graviton) and x86-64 account for the vast majority of CPUs running ruby/json, this is probably unnecessary. However, it's nice to have alternatives.
Edit: This assumes this code is faster on other architectures as well. I have not tested on anything other than my M1 and an Intel-based laptop.
It's nice to have a pure C implementation that doesn't require any special instructions.
True. I guess my only real reservation with this PR (and also with the SIMD ones) is the huge PROCESS_BYTE macro.
I haven't looked into it too much, but I'd really like it if such a huge macro weren't necessary. So I need to take some time to experiment with some refactoring.
I have not tested on anything other than my M1 and an Intel-based laptop.
That's likely enough. x86 alone is likely 95% of Ruby usage, if not more, and we're probably very close to 100% once you add ARM. For other platforms, correctness is sufficient.
It's nice to have a pure C implementation that doesn't require any special instructions.
True. I guess my only real reservation with this PR (and also with the SIMD ones) is the huge PROCESS_BYTE macro.
I haven't looked into it too much, but I'd really like it if such a huge macro weren't necessary. So I need to take some time to experiment with some refactoring.
It's not necessary. It's the existing conditional. I just didn't want to copy and paste it multiple times.
    if (RB_UNLIKELY(ch_len)) {
        switch (ch_len) {
            ...
        }
    } else {
        pos++;
    }
Yes, I mean not having that big macro without copy-pasting either.
What I have in mind right now, though I don't know if it's really possible, would be to move the "search" part into another function, and have it keep some state in a stack-allocated struct so it can resume. Something very much like https://lemire.me/blog/2024/07/20/scan-html-even-faster-with-simd-instructions-c-and-c/
So the pseudo-code would look like:
    scanner_state state = {0};
    while ((ptr = scan(&state, ptr))) {
        // process one byte
        ptr++;
    }
This way all the alignment considerations and such are moved into that scan function, and it would become the natural place to use SIMD etc.
NB: I'm not asking you to do this. If you wish to, feel free, but otherwise I want to find some time to try it before I merge this PR.
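To make the shape of that loop concrete, here is a minimal scalar sketch. Only the names `scanner_state` and `scan` come from the pseudo-code above; the `end` field, the return convention (NULL when nothing is left), and the `count_escapes` caller are assumptions made for illustration. A chunked or SIMD version would presumably keep more in the struct, for example a cached bitmask of matches for the current block:

```c
#include <stddef.h>

/* Hypothetical state: this scalar version only needs the end of the input. */
typedef struct {
    const unsigned char *end;   /* one past the last byte of the input */
} scanner_state;

/*
 * Returns a pointer to the next byte at or after ptr that needs escaping
 * ('"', '\\', or anything below 0x20), or NULL when the input is exhausted.
 * This is the one place where chunking, alignment, or SIMD would live.
 */
static const unsigned char *scan(scanner_state *state, const unsigned char *ptr)
{
    while (ptr < state->end) {
        unsigned char ch = *ptr;
        if (ch < 0x20 || ch == '"' || ch == '\\') return ptr;
        ptr++;
    }
    return NULL;
}

/* Example caller: counts the bytes that would need escaping. */
static size_t count_escapes(const unsigned char *str, size_t len)
{
    scanner_state state = { str + len };
    const unsigned char *ptr = str;
    size_t count = 0;

    while ((ptr = scan(&state, ptr)) != NULL) {
        count++;    /* process one byte */
        ptr++;      /* resume right after it */
    }
    return count;
}
```

The caller never sees chunk boundaries; it only handles one interesting byte per iteration, which is what would make the large macro unnecessary.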
Is this really worth it now that we merged SIMD code? I suspect not?
I'll reopen if you think it is.
I think this is fine to close. I need to figure out how to build Ruby on Windows with Visual Studio so I can ensure the SIMD code builds with that configuration. This PR would benefit those users.
As you linked in the other PR, this implementation uses a similar technique without the macros.