fast-ruby
fast-ruby copied to clipboard
Some benchmarks are measured with wrong approach
tl;dr When we measure small things we have to remove all overheads which have impact. Wrapping with a method and block call is quite big overhead for most of benchmarks in this repository.
Suppose you want compare performance of "2 + 3" vs "2 * 3" calls. If you do it with a approach used in this repository you will get this results:
require "benchmark/ips"
def slow
2 * 2
end
def fast
2 + 2
end
Benchmark.ips do |x|
x.report("2 * 2") { slow }
x.report("2 + 2") { fast }
x.compare!
end
(simplified output)
Comparison:
2 + 2: 8304760.0 i/s
2 * 2: 7535516.6 i/s - 1.10x slower
But there is one problem. Calling a method + surrounding block ({ slow }) has bigger overhead than calling Fixnum#+ or Fixnum#* itself. Thea easiest way to observe it is to repeat benchmarked operations in one call. Like this:
require "benchmark/ips"
def slow
2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2;
end
def fast
2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2;
end
Benchmark.ips do |x|
x.report("2 * 2") { slow }
x.report("2 + 2") { fast }
x.compare!
end
Comparison:
2 + 2: 4680545.3 i/s
2 * 2: 3468681.3 i/s - 1.35x slower
See how results changed? Writing our benchmarks in this way would be quite problematic. Fortunately benchmark-ips gem has answer for that. Benchmark::IPS::Job#report method allows to pass string which will be compiled before benchmark is run. Passing right string allows to measure it properly:
require "benchmark/ips"
Benchmark.ips do |x|
x.report("2 * 2", "2 * 2;" * 1_000)
x.report("2 + 2", "2 + 2;" * 1_000)
x.compare!
end
Comparison:
2 + 2: 91567.5 i/s
2 * 2: 57994.4 i/s - 1.58x slower
This is greatly explained here: https://docs.omniref.com/ruby/2.2.1/symbols/Benchmark/bm#annotation=4095926&line=182
How does this affect fast-ruby benchmarks? All benchmarks that call small things are flawed. One of them is Array#length vs Array#size vs Array#count benchmark. Here is the original code and result obtained on my computer:
require 'benchmark/ips'
ARRAY = [*1..100]
Benchmark.ips do |x|
x.report("Array#length") { ARRAY.length }
x.report("Array#size") { ARRAY.size }
x.report("Array#count") { ARRAY.count }
x.compare!
end
Comparison:
Array#size: 8679483.2 i/s
Array#length: 8664450.7 i/s - 1.00x slower
Array#count: 7237299.5 i/s - 1.20x slower
The same benchmark measure with described approach gives different numbers:
require 'benchmark/ips'
ARRAY = [*1..100]
Benchmark.ips do |x|
x.report("Array#length", "ARRAY.length;" * 1_000)
x.report("Array#size", "ARRAY.size;" * 1_000)
x.report("Array#count", "ARRAY.count;" * 1_000)
x.compare!
end
Comparison:
Array#size: 113902.4 i/s
Array#length: 113655.9 i/s - 1.00x slower
Array#count: 28753.4 i/s - 3.96x slower
Difference: 1.20x slower vs 3.96x slower.
My guess is that it affects most of benchmarks.
tl;dr Some benchmarks have other overhead. For example call vs send vs method_missing claims that send is only 1.14x slower and method_missing 1.42x slower than direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower.
Another thing is that some benchmarks adds even more overhead. Let's consider call vs send vs method_missing. Current version is following:
require "benchmark/ips"
class MethodCall
def method
end
def method_missing(_method,*args)
method
end
end
def fastest
method = MethodCall.new
method.method
end
def slow
method = MethodCall.new
method.send(:method)
end
def slowest
method = MethodCall.new
method.not_exist
end
Benchmark.ips do |x|
x.report("call") { fastest }
x.report("send") { slow }
x.report("method_missing") { slowest }
x.compare!
end
And result on my computer is:
Comparison:
call: 3998871.4 i/s
send: 3497710.2 i/s - 1.14x slower
method_missing: 2807780.4 i/s - 1.42x slower
First problem is that every call creates new object. Let's remove it. The easiest way is to add class methods instead of instance.
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
def fastest
MethodCall.method
end
def slow
MethodCall.send(:method)
end
def slowest
MethodCall.not_exist
end
Benchmark.ips do |x|
x.report("call") { fastest }
x.report("send") { slow }
x.report("method_missing") { slowest }
x.compare!
end
Comparison:
call: 7291953.3 i/s
send: 5741528.9 i/s - 1.27x slower
method_missing: 4081021.3 i/s - 1.79x slower
Next step is to use trick described in first comment (pass code as string to report method).
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
Benchmark.ips do |x|
x.report("call", "MethodCall.method;" * 1_000)
x.report("send", "MethodCall.send(:method);" * 1_000)
x.report("method_missing", "MethodCall.not_exist;" * 1_000)
x.compare!
end
Comparison:
call: 32605.5 i/s
send: 14935.5 i/s - 2.18x slower
method_missing: 7816.2 i/s - 4.17x slower
We can go even further. Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup:
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
$MethodCall = MethodCall
Benchmark.ips do |x|
x.report("call", "$MethodCall.method;" * 1_000)
x.report("send", "$MethodCall.send(:method);" * 1_000)
x.report("method_missing", "$MethodCall.not_exist;" * 1_000)
x.compare!
end
Comparison:
call: 38591.5 i/s
send: 15539.0 i/s - 2.48x slower
method_missing: 8165.6 i/s - 4.73x slower
We started with original benchmark claiming that send is only 1.14x slower and method_missing 1.42x slower than direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower. Big difference, right?
:clap: This is great work! I'll wait for the other collaborators to weigh in here, but I think this is great research. PRs welcome with these changes.
Something you might also consider is CPU instruction cache. Although I'm not an expert in the black art of optimizing code for the CPU cache, I do know cached instructions can skew benchmarks. I typically sidestep this issue by calling the method in question on multiple different pieces of data.
To further your first example:
require "benchmark/ips"
NUMS = (1..1000)
FAST = NUMS.map{|n| "#{n} + #{n}" }.join(';')
SLOW = NUMS.map{|n| "#{n} * 2" }.join(';')
Benchmark.ips do |x|
x.report("n + n", FAST)
x.report("n * 2", SLOW)
x.compare!
end
This results in:
Calculating -------------------------------------
n + n 68.816k (± 0.3%) i/s - 350.829k in 5.098103s
n * 2 57.513k (± 0.2%) i/s - 290.680k in 5.054154s
Comparison:
n + n: 68816.4 i/s
n * 2: 57513.3 i/s - 1.20x slower
:clap: :bow: Great work!
@radarek Sounds like a good way to reduce overhead. Of course, overhead for Fixnum#+ will always be whatever is around (or it should be billions i/s) :)
$MethodCall = MethodCall Why using a global variable here instead of the constant?
I think a constant would be both more practical and convey the fact it does not change.
Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup
@radarek I missed that, sorry. This is interesting. I would not count on it for other implementations or future improvements though, reading constants ought to be as good as reading global variables. Global variables are also not the right design tool here.
reading constants ought to be as good as reading global variables.
I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.
I agree that in normal practice, constants would make more sense, but in the case of benchmarking, we're trying to eliminate as much overhead as possible.
I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.
AFAIK this is completely backwards for JRuby+Truffle. I think this was an intentional design choice, constants are supposed to be constant (eg redefining a const emits a warning, you can't lexically do X=1 in normal methods etc.) There are no comparable constraints on globals, therefore JRuby+Truffle emits code that expects globals to change.
Re nesting, I would expect there to be no perf difference between A::B::C::D::F::G.foo and X.foo on JRuby+Truffle.
(This is of course only true if the code in question isn't continually changing constants.)
@nateberkopec
I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.
This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check. JIT implementations with deoptimization can actually fold Ruby constants as actual constants in the machine code. Nesting is just a different way to do the lookup, but it should not affect the caching (it just makes the invalidation more complex).
Actually, I am most worried that the benchmarks would use totally unidiomatic code.
@thedarkone Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.
@thedarkone TIL. Paging @chrisseaton on JRuby+Truffle. I thought the way constant lookup worked in Ruby was a language-level limitation, but it looks like it may be wrong!
In any case, if constant lookup and global lookup are equally fast in some implementations, and constant lookup is slower in some implementations, we should still use globals.
JRuby+Truffle's constant lookup and caching uses almost exactly the same mechanism and code as method calls, and technically it's perfect as far as I'm aware - it leaves no guards, no checks, no indirection, in the generated code.
JRuby+Truffle's globals however are fields on a hidden globals object. We read them like an instance variable and, as @eregon says, we don't do anything to turn them into constants at the moment. It doesn't happen for free as the globals object is, unsurprisingly, globally visible. We should fix that now and make globals constant. It would be a good beginner task if anyone is interested.
To comment more generally on this thread, the problem you're trying to solve is a really tricky one. I've submitted a few PRs to benchmark-ips about how deep the rabbit hole goes https://github.com/evanphx/benchmark-ips/pull/58 when you start to involve a sophisticated JIT.
@chrisseaton I vaguely remember @pitr-ch changing globals into direct volatiles (as per proposed mem model). I though my reasoning above is why JRubyTruffle is not speculating on "constant-ness" of globals.
Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.
@eregon I would propose treating globals as constants (compilation constant), until proven otherwise (have a mod counter, after 2 or 10 changes switching to repeated volatile reads).
Yes the modification counter is probably the key.
We're also missing the feature in Ruby that lets you hook into globals to observe reads and writes - so we should really dig into the globals implementation in general.
@thedarkone For the memory model, speculating them as constants would still be fine as invalidation is thread-safe (it waits for every thread to throw the compiled code before continuing).
This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.
It seems I was wrong about MRI :) There is an inline cache for constants as one would expect:
ruby --dump=insns -e 'A=42; loop {A}'
...
0000 trace 256 ( 1)
0002 trace 1
0004 getinlinecache 11, <is:0>
0007 getconstant :A
0009 setinlinecache <is:0>
0011 trace 512
0013 leave
Instruction 0004 is the inline cache (but getconstant itself has no cache), checking a couple things and jumping to 0011 if it succeeds. This might be a bit more expensive than getglobal which only calls a function pointer which itself reads the global's value.
The difference between the two seems fairly small on my machine, less than 5% on 40M i/s.
Another type of overhead is unintentional allocations, like this
# each {} is the new object - {}.object_id == {}.object_id => false
h.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)
# freeze is to make sure we only read from object, but not write to it
EMPTY_OBJECT = {}.freeze
# same code without unintentional allocations
h.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil)
Those objects can trigger GC, which will make benchmark slower
I guess this is a bit stale for me, in terms of topics, but I didn't want to create a new one. This example IMO is also a bit misleading: https://github.com/JuanitoFatas/fast-ruby#normal-way-to-apply-method-vs-method-code
The difference claimed (4x) goes down quite a when you remove the method call. (for me it went from 2.58x to 2.17x) for:
def do_something(n)
4*n + 2
end
def fast
[1, 2, 3].map { |n| do_something(n) }
end
METHOD_HANDLE = method(:do_something)
def slow
[1, 2, 3].map(&METHOD_HANDLE)
end
Benchmark.ips do |x|
x.report("normal") { fast }
x.report("&method") { slow }
x.compare!
end
I guess the above suffers from the issue described originally in this thread, but I didn't want to compare the method-calling block-using version with method call caching string-using version.
Also when I did something like this:
def fast
[1, 2, 3].map { |n| n.even? }
end
def slow
[1, 2, 3].map(&:even?)
end
Benchmark.ips do |x|
x.report("normal") { fast }
x.report("&method") { slow }
x.compare!
end
the diff went down to 1.25x (on ruby 2.7.4)