truffleruby
truffleruby copied to clipboard
Fast blank is slower than Slow blank on Truffleruby
https://github.com/SamSaffron/fast_blank
================== Test String Length: 0 ==================
Warming up --------------------------------------
Fast Blank 93.707k i/100ms
Fast ActiveSupport 510.726k i/100ms
Slow Blank 76.211k i/100ms
New Slow Blank 22.244M i/100ms
Calculating -------------------------------------
Fast Blank 5.361M (± 5.0%) i/s - 26.706M in 4.997583s
Fast ActiveSupport 5.451M (± 2.9%) i/s - 27.579M in 5.063809s
Slow Blank 1.297M (±66.5%) i/s - 3.125M in 5.072733s
New Slow Blank 220.498M (± 8.1%) i/s - 1.090B in 5.007588s
Comparison:
New Slow Blank: 220497717.1 i/s
Fast ActiveSupport: 5451195.9 i/s - 40.45x (± 0.00) slower
Fast Blank: 5360771.1 i/s - 41.13x (± 0.00) slower
Slow Blank: 1296628.6 i/s - 170.05x (± 0.00) slower
================== Test String Length: 6 ==================
Warming up --------------------------------------
Fast Blank 168.425k i/100ms
Fast ActiveSupport 183.225k i/100ms
Slow Blank 78.492k i/100ms
New Slow Blank 2.231M i/100ms
Calculating -------------------------------------
Fast Blank 1.866M (± 3.8%) i/s - 9.432M in 5.063119s
Fast ActiveSupport 1.847M (± 3.5%) i/s - 9.344M in 5.066771s
Slow Blank 1.283M (±66.5%) i/s - 3.061M in 5.141647s
New Slow Blank 20.021M (±12.9%) i/s - 98.156M in 5.011524s
Comparison:
New Slow Blank: 20021393.6 i/s
Fast Blank: 1865661.2 i/s - 10.73x (± 0.00) slower
Fast ActiveSupport: 1846637.6 i/s - 10.84x (± 0.00) slower
Slow Blank: 1282619.8 i/s - 15.61x (± 0.00) slower
================== Test String Length: 14 ==================
Warming up --------------------------------------
Fast Blank 383.840k i/100ms
Fast ActiveSupport 396.259k i/100ms
Slow Blank 2.476M i/100ms
New Slow Blank 83.883k i/100ms
Calculating -------------------------------------
Fast Blank 3.711M (± 6.8%) i/s - 18.808M in 5.097805s
Fast ActiveSupport 3.893M (± 4.5%) i/s - 19.813M in 5.100463s
Slow Blank 24.881M (± 5.0%) i/s - 126.293M in 5.090137s
New Slow Blank 1.221M (±68.5%) i/s - 3.188M in 5.196091s
Comparison:
Slow Blank: 24880544.8 i/s
Fast ActiveSupport: 3892783.7 i/s - 6.39x (± 0.00) slower
Fast Blank: 3711304.1 i/s - 6.70x (± 0.00) slower
New Slow Blank: 1220939.0 i/s - 20.38x (± 0.00) slower
================== Test String Length: 24 ==================
Warming up --------------------------------------
Fast Blank 127.566k i/100ms
Fast ActiveSupport 211.474k i/100ms
Slow Blank 2.129M i/100ms
New Slow Blank 55.351k i/100ms
Calculating -------------------------------------
Fast Blank 2.145M (± 8.9%) i/s - 10.588M in 5.000647s
Fast ActiveSupport 2.077M (± 6.3%) i/s - 10.362M in 5.010825s
Slow Blank 19.853M (± 5.5%) i/s - 100.044M in 5.057535s
New Slow Blank 1.337M (±61.4%) i/s - 3.100M in 5.086929s
Comparison:
Slow Blank: 19852906.7 i/s
Fast Blank: 2144579.7 i/s - 9.26x (± 0.00) slower
Fast ActiveSupport: 2076706.4 i/s - 9.56x (± 0.00) slower
New Slow Blank: 1337390.2 i/s - 14.84x (± 0.00) slower
================== Test String Length: 136 ==================
Warming up --------------------------------------
Fast Blank 68.189k i/100ms
Fast ActiveSupport 111.426k i/100ms
Slow Blank 1.927M i/100ms
New Slow Blank 72.075k i/100ms
Calculating -------------------------------------
Fast Blank 1.280M (± 8.8%) i/s - 6.342M in 5.019817s
Fast ActiveSupport 1.113M (± 3.3%) i/s - 5.571M in 5.011350s
Slow Blank 19.025M (± 2.4%) i/s - 96.352M in 5.067432s
New Slow Blank 1.098M (±59.3%) i/s - 3.027M in 5.149704s
Comparison:
Slow Blank: 19024883.4 i/s
Fast Blank: 1279946.1 i/s - 14.86x (± 0.00) slower
Fast ActiveSupport: 1112982.8 i/s - 17.09x (± 0.00) slower
New Slow Blank: 1098161.1 i/s - 17.32x (± 0.00) slower
@SamSaffron
A minimal thing we can do with the gem @eregon is have a special switch of "if truffle, do no harm" happy for a PR in that department if the overhead here can not be removed.
The benchmark is from https://github.com/SamSaffron/fast_blank/blob/master/benchmark The first 2 (lengths 0 and 6) are actually blank, the last 3 are not.
For length 0 & 6 (actually blank), New Slow Blank is fastest (expected) and Slow Blank is slowest (unexpected).
For length 14-136 (not blank), Slow Blank seems fastest, and New Slow Blank is actually slowest (unexpected).
Note that latest ActiveSupport actually does something different a mix of Slow Blank and New Slow Blank + match?:
https://github.com/rails/rails/blob/78e402077d4d69b08db7daf8944a32eef0cf6ac4/activesupport/lib/active_support/core_ext/object/blank.rb#L104-L127
BLANK_RE = /\A[[:space:]]*\z/
BLANK_RE.match?(self)
(TruffleRuby doesn't optimize match? specifically currently, maybe it should)
It would be worth measuring with what ActiveSupport does.
In general it's kind of expected the C extension is slower here as it does RSTRING_PTR (forces the string to native memory) + rb_enc_codepoint_len (likely not super fast to iterate the string), and OTOH Regexp are now just-in-time compiled on TruffleRuby with TRegex.
So probably we should indeed skip the C extension.
But I think we should figure out if what is used in ActiveSupport is faster for all cases, and ensure it's as fast or faster than CRuby in all cases.
this is what I've got:
================== Test String Length: 0 ==================
Warming up --------------------------------------
Fast Blank 60.278k i/100ms
Fast ActiveSupport 487.533k i/100ms
Slow Blank 69.425k i/100ms
New Slow Blank 23.118M i/100ms
Latest AR Blank 207.506M i/100ms
Calculating -------------------------------------
Fast Blank 5.409M (± 8.0%) i/s - 26.824M in 4.999122s
Fast ActiveSupport 5.410M (± 4.0%) i/s - 27.302M in 5.054813s
Slow Blank 1.279M (±61.7%) i/s - 2.846M in 5.077555s
New Slow Blank 224.580M (± 9.7%) i/s - 1.110B in 5.025978s
Latest AR Blank 2.053B (± 5.0%) i/s - 10.375B in 5.067722s
Comparison:
Latest AR Blank: 2053059733.4 i/s
New Slow Blank: 224580213.4 i/s - 9.14x (± 0.00) slower
Fast ActiveSupport: 5410117.6 i/s - 379.49x (± 0.00) slower
Fast Blank: 5408563.2 i/s - 379.59x (± 0.00) slower
Slow Blank: 1279073.0 i/s - 1605.12x (± 0.00) slower
================== Test String Length: 6 ==================
Warming up --------------------------------------
Fast Blank 171.133k i/100ms
Fast ActiveSupport 190.379k i/100ms
Slow Blank 72.840k i/100ms
New Slow Blank 2.293M i/100ms
Latest AR Blank 64.434k i/100ms
Calculating -------------------------------------
Fast Blank 1.739M (±15.9%) i/s - 8.386M in 5.046484s
Fast ActiveSupport 1.925M (± 3.9%) i/s - 9.709M in 5.051623s
Slow Blank 1.226M (±59.6%) i/s - 3.059M in 5.227664s
New Slow Blank 18.898M (±14.1%) i/s - 94.021M in 5.091155s
Latest AR Blank 1.560M (±58.7%) i/s - 3.351M in 5.010152s
Comparison:
New Slow Blank: 18898243.6 i/s
Fast ActiveSupport: 1924981.5 i/s - 9.82x (± 0.00) slower
Fast Blank: 1738948.3 i/s - 10.87x (± 0.00) slower
Latest AR Blank: 1559954.3 i/s - 12.11x (± 0.00) slower
Slow Blank: 1226034.3 i/s - 15.41x (± 0.00) slower
================== Test String Length: 14 ==================
Warming up --------------------------------------
Fast Blank 170.897k i/100ms
Fast ActiveSupport 368.944k i/100ms
Slow Blank 2.523M i/100ms
New Slow Blank 75.363k i/100ms
Latest AR Blank 2.609M i/100ms
Calculating -------------------------------------
Fast Blank 3.805M (± 7.2%) i/s - 18.970M in 5.014839s
Fast ActiveSupport 3.765M (± 5.3%) i/s - 18.816M in 5.013121s
Slow Blank 23.542M (± 3.2%) i/s - 118.585M in 5.043153s
New Slow Blank 1.217M (±68.0%) i/s - 2.939M in 5.006626s
Latest AR Blank 22.963M (±13.8%) i/s - 112.194M in 5.013917s
Comparison:
Slow Blank: 23542338.1 i/s
Latest AR Blank: 22963001.3 i/s - same-ish: difference falls within error
Fast Blank: 3804589.0 i/s - 6.19x (± 0.00) slower
Fast ActiveSupport: 3764823.5 i/s - 6.25x (± 0.00) slower
New Slow Blank: 1217249.4 i/s - 19.34x (± 0.00) slower
================== Test String Length: 24 ==================
Warming up --------------------------------------
Fast Blank 199.905k i/100ms
Fast ActiveSupport 212.155k i/100ms
Slow Blank 2.228M i/100ms
New Slow Blank 54.231k i/100ms
Latest AR Blank 2.375M i/100ms
Calculating -------------------------------------
Fast Blank 2.141M (± 4.9%) i/s - 10.795M in 5.054385s
Fast ActiveSupport 2.148M (± 4.2%) i/s - 10.820M in 5.046501s
Slow Blank 19.710M (± 3.4%) i/s - 100.279M in 5.093634s
New Slow Blank 1.271M (±58.1%) i/s - 2.983M in 5.021893s
Latest AR Blank 21.529M (±10.2%) i/s - 106.857M in 5.043917s
Comparison:
Latest AR Blank: 21528759.2 i/s
Slow Blank: 19710353.1 i/s - same-ish: difference falls within error
Fast ActiveSupport: 2147989.0 i/s - 10.02x (± 0.00) slower
Fast Blank: 2141145.7 i/s - 10.05x (± 0.00) slower
New Slow Blank: 1270581.8 i/s - 16.94x (± 0.00) slower
================== Test String Length: 136 ==================
Warming up --------------------------------------
Fast Blank 131.289k i/100ms
Fast ActiveSupport 127.899k i/100ms
Slow Blank 1.882M i/100ms
New Slow Blank 87.353k i/100ms
Latest AR Blank 1.958M i/100ms
Calculating -------------------------------------
Fast Blank 1.308M (± 3.3%) i/s - 6.564M in 5.024930s
Fast ActiveSupport 1.251M (± 5.4%) i/s - 6.267M in 5.024276s
Slow Blank 18.412M (± 4.8%) i/s - 92.209M in 5.020928s
New Slow Blank 1.021M (±67.0%) i/s - 2.795M in 5.207333s
Latest AR Blank 19.142M (±10.9%) i/s - 95.918M in 5.091316s
Comparison:
Latest AR Blank: 19142242.3 i/s
Slow Blank: 18411568.1 i/s - same-ish: difference falls within error
Fast Blank: 1307847.7 i/s - 14.64x (± 0.00) slower
Fast ActiveSupport: 1251278.8 i/s - 15.30x (± 0.00) slower
New Slow Blank: 1021134.0 i/s - 18.75x (± 0.00) slower
benchmark
#!/usr/bin/env ruby
$: << File.dirname(__FILE__)+'/lib'
require 'benchmark/ips'
require 'fast_blank'
class String
# active support implementation
BLANK_RE = /\A[[:space:]]*\z/
def slow_blank?
/\A[[:space:]]*\z/ === self
end
def new_slow_blank?
empty? || !(/[[:^space:]]/ === self)
end
def latest_ar_blank?
empty? || BLANK_RE.match?(self)
end
end
test_strings = [
"",
"\r\n\r\n ",
"this is a test",
" this is a longer test",
" this is a longer test
this is a longer test
this is a longer test
this is a longer test
this is a longer test"
]
test_strings.each do |s|
raise "failed on #{s.inspect}" if s.blank? != s.slow_blank?
raise "failed on #{s.inspect}" if s.blank? != s.new_slow_blank?
end
test_strings.each do |s|
puts "\n================== Test String Length: #{s.length} =================="
Benchmark.ips do |x|
x.report("Fast Blank") do |times|
i = 0
while i < times
s.blank?
i += 1
end
end
x.report("Fast ActiveSupport") do |times|
i = 0
while i < times
s.blank_as?
i += 1
end
end
x.report("Slow Blank") do |times|
i = 0
while i < times
s.slow_blank?
i += 1
end
end
x.report("New Slow Blank") do |times|
i = 0
while i < times
s.new_slow_blank?
i += 1
end
end
x.report("Latest AR Blank") do |times|
i = 0
while i < times
s.latest_ar_blank?
i += 1
end
end
x.compare!
end
end
Thanks, I'll take a look.
Self-note: the while i < times in the benchmark might actually make things less stable, I'll try without it to check.
Might also want to run without the C ext cases, or by .dup the string for each impl in case those affect the performance of other methods.
That is likely actually, the C ext will RSTRING_PTR which converts the String to native, and if later we need it managed (to match a Regexp), we'll need to convert back and include that extra logic in the JITed code.
I'm looking at String#blank? from ActiveSupport independently of this issue as part of an in-depth look at optimizing regexps in TruffleRuby. I had planned on looking at fast_blank, too. I'm happy to collaborate with anyone on it.
I've begun looking at this a bit. As Benoit noted, the new_slow_blank? method does not match what ActiveSupport does. I dug into the history and it looks like the definition used to match, but changed five years ago. I'll open up a PR to fix that in fast_blank. While that doesn't have any bearing on TruffleRuby's fast_blank extension performance, it does help ensure we're comparing the right things.
With that change, the difference between slow_blank? and new_slow_blank? basically disappears, except for the empty input case. By way of explanation, the TRegex engine is able to optimize regex patterns that use anchors considerably better than those just looking for a pattern at an arbitrary position in the match string.
Oddly, my latest run doesn't align with the one from @gogainda in https://github.com/oracle/truffleruby/issues/2439#issuecomment-913655560. In particular, I'm seeing the "Fast ActiveSupport" method win out with the 6 character long string. I'm going to spend more time investigating why the pure Ruby implementations are slower on smaller strings. As Benoit noted, they shouldn't be. From there, I'll look into the C extension methods.
================== Test String Length: 0 ==================
Warming up --------------------------------------
Fast Blank 120.423k i/100ms
Fast ActiveSupport 260.257k i/100ms
Slow Blank 199.128k i/100ms
New Slow Blank 31.024M i/100ms
Calculating -------------------------------------
Fast Blank 9.307M (± 7.2%) i/s - 46.242M in 5.002693s
Fast ActiveSupport 8.180M (± 7.1%) i/s - 40.860M in 5.029876s
Slow Blank 2.454M (±58.2%) i/s - 8.164M in 5.053282s
New Slow Blank 307.158M (± 3.5%) i/s - 1.551B in 5.056869s
Comparison:
New Slow Blank: 307157704.4 i/s
Fast Blank: 9306826.4 i/s - 33.00x (± 0.00) slower
Fast ActiveSupport: 8179769.4 i/s - 37.55x (± 0.00) slower
Slow Blank: 2454167.7 i/s - 125.16x (± 0.00) slower
================== Test String Length: 6 ==================
Warming up --------------------------------------
Fast Blank 275.636k i/100ms
Fast ActiveSupport 67.318k i/100ms
Slow Blank 192.713k i/100ms
New Slow Blank 100.374k i/100ms
Calculating -------------------------------------
Fast Blank 3.226M (±13.7%) i/s - 15.987M in 5.074911s
Fast ActiveSupport 3.516M (± 6.1%) i/s - 17.503M in 5.002017s
Slow Blank 1.743M (±75.4%) i/s - 6.360M in 5.124090s
New Slow Blank 2.040M (±57.7%) i/s - 6.324M in 5.035279s
Comparison:
Fast ActiveSupport: 3515607.1 i/s
Fast Blank: 3226493.1 i/s - same-ish: difference falls within error
New Slow Blank: 2039591.2 i/s - 1.72x (± 0.00) slower
Slow Blank: 1743440.2 i/s - 2.02x (± 0.00) slower
================== Test String Length: 14 ==================
Warming up --------------------------------------
Fast Blank 473.156k i/100ms
Fast ActiveSupport 165.767k i/100ms
Slow Blank 5.907M i/100ms
New Slow Blank 5.047M i/100ms
Calculating -------------------------------------
Fast Blank 5.719M (± 3.2%) i/s - 28.863M in 5.052826s
Fast ActiveSupport 5.596M (± 2.1%) i/s - 28.015M in 5.008551s
Slow Blank 57.355M (± 6.5%) i/s - 289.433M in 5.071272s
New Slow Blank 51.431M (± 7.6%) i/s - 257.373M in 5.039350s
Comparison:
Slow Blank: 57354724.3 i/s
New Slow Blank: 51430894.9 i/s - same-ish: difference falls within error
Fast Blank: 5718579.8 i/s - 10.03x (± 0.00) slower
Fast ActiveSupport: 5596006.4 i/s - 10.25x (± 0.00) slower
================== Test String Length: 24 ==================
Warming up --------------------------------------
Fast Blank 320.204k i/100ms
Fast ActiveSupport 314.719k i/100ms
Slow Blank 3.323M i/100ms
New Slow Blank 4.490M i/100ms
Calculating -------------------------------------
Fast Blank 3.518M (± 3.7%) i/s - 17.611M in 5.015159s
Fast ActiveSupport 3.340M (± 0.7%) i/s - 16.995M in 5.088683s
Slow Blank 53.065M (± 1.7%) i/s - 265.839M in 5.011134s
New Slow Blank 45.636M (± 9.4%) i/s - 229.013M in 5.074606s
Comparison:
Slow Blank: 53064631.2 i/s
New Slow Blank: 45636432.1 i/s - 1.16x (± 0.00) slower
Fast Blank: 3517504.2 i/s - 15.09x (± 0.00) slower
Fast ActiveSupport: 3339912.3 i/s - 15.89x (± 0.00) slower
================== Test String Length: 136 ==================
Warming up --------------------------------------
Fast Blank 176.592k i/100ms
Fast ActiveSupport 159.082k i/100ms
Slow Blank 3.347M i/100ms
New Slow Blank 2.845M i/100ms
Calculating -------------------------------------
Fast Blank 1.887M (± 4.9%) i/s - 9.536M in 5.067579s
Fast ActiveSupport 1.879M (± 6.7%) i/s - 9.386M in 5.021470s
Slow Blank 36.291M (± 8.4%) i/s - 180.758M in 5.020799s
New Slow Blank 33.766M (± 7.9%) i/s - 167.879M in 5.010004s
Comparison:
Slow Blank: 36291313.4 i/s
New Slow Blank: 33766400.9 i/s - same-ish: difference falls within error
Fast Blank: 1887041.7 i/s - 19.23x (± 0.00) slower
Fast ActiveSupport: 1879048.6 i/s - 19.31x (± 0.00) slower
As a follow-up, I did extract the code out into simpler benchmarks without the embedded while loops and used the benchmark-ips hold! method to ensure different benchmarks weren't adversely affecting each other. The slow_blank? and new_slow_blank? methods see a fairly large improvement:
String length: 136
Warming up --------------------------------------
slow_blank? 10.381M i/100ms
Calculating -------------------------------------
slow_blank? 103.737M (± 2.5%) i/s - 1.038B in 10.013468s
Pausing here -- run Ruby again to measure the next benchmark...
String length: 136
Warming up --------------------------------------
new_slow_blank? 9.502M i/100ms
Calculating -------------------------------------
new_slow_blank? 94.746M (± 2.4%) i/s - 950.232M in 10.035413s
Comparison:
slow_blank?: 103736848.6 i/s
new_slow_blank?: 94745893.2 i/s - 1.09x (± 0.00) slower
Using frozen strings, the IPS value jumps even higher:
String length: 136
Warming up --------------------------------------
slow_blank? 12.221M i/100ms
Calculating -------------------------------------
slow_blank? 124.859M (± 3.0%) i/s - 1.259B in 10.091475s
Pausing here -- run Ruby again to measure the next benchmark...
String length: 136
Warming up --------------------------------------
new_slow_blank? 13.120M i/100ms
Calculating -------------------------------------
new_slow_blank? 131.366M (± 3.8%) i/s - 1.325B in 10.104700s
Comparison:
new_slow_blank?: 131365974.9 i/s
slow_blank?: 124858962.0 i/s - same-ish: difference falls within error
I'm reticent to go changing the way benchmarks are run while trying to improve that benchmark performance. Especially where this block form of report is supposed to reduce overhead. But, it looks like that form of the call is adding overhead that's impacting the benchmark execution. I'll look at that in more detail later.
IMHO benchmark-ips is most reliable in the regular block form (no while inside), so if it doesn't make much difference for CRuby I think it's best to use that.
I think it's fair and good to measure both blanks and non-blanks strings in the same process, but ideally there should be a way to run with a single blank implementation in the process, since otherwise they can have side effects like mutating the string.
Just s = s.dup before each report block might be a way to avoid the interference.
# frozen_string_literal: true has the effect to make the strings ImmutableRubyString and those always remain managed (good for Regexp-based impls), at the cost of a map lookup when needing the native char* for it. That's not ideal for measuring the C ext performance.
We could add an inline cache in StringToNativeNode in single-context mode for ImmutableRubyString though.
I think s = s.dup is the easiest fix to still run everything in the same process and run a fair comparison.
IMHO benchmark-ips is most reliable in the regular block form (no
whileinside), so if it doesn't make much difference for CRuby I think it's best to use that.
If we can make a case for it, then maybe it's worth getting it changed in the benchmark-ips README. The docs have an example showing the inner while loop with the comment:
# To reduce overhead, the number of iterations is passed in
# and the block must run the code the specific number of times.
# Used for when the workload is very small and any overhead
# introduces incorrectable errors
I think it's fair and good to measure both blanks and non-blanks strings in the same process, but ideally there should be a way to run with a single blank implementation in the process, since otherwise they can have side effects like mutating the string. Just
s = s.dupbefore each report block might be a way to avoid the interference.
Agreed. I'll give that a try.
# frozen_string_literal: truehas the effect to make the strings ImmutableRubyString and those always remain managed (good for Regexp-based impls), at the cost of a map lookup when needing the nativechar*for it. That's not ideal for measuring the C ext performance. We could add an inline cache in StringToNativeNode in single-context mode for ImmutableRubyString though.I think
s = s.dupis the easiest fix to still run everything in the same process and run a fair comparison.
Thanks for the explanation. I forgot about the extra look-up.
I'm just circling back to this. We made some big improvements to regex processing that should ship in TruffleRuby 22.0.0. Running benchmarks on a native Ryzen 3700X system with Ubuntu 21.04, TruffleRuby's execution of ActiveSupport's String#blank? was ~5x the speed of _fast_blank's native extension with MRI 3.0.3. We should still make this extension run faster, but as Sam suggested, it might just make sense to have _fast_blank use Ruby rather than C on TruffleRuby.
In any event, with some of the regex stuff that we knew was problematic out of the way, it'll be easier to look at just the native extension execution in TruffleRuby now.