fast-ruby Some benchmarks are measured with wrong approach

tl;dr When we measure small things we have to remove all overheads which have impact. Wrapping with a method and block call is quite big overhead for most of benchmarks in this repository.

Suppose you want compare performance of "2 + 3" vs "2 * 3" calls. If you do it with a approach used in this repository you will get this results:

require "benchmark/ips"

def slow
  2 * 2
end

def fast
  2 + 2
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end

(simplified output)

Comparison:
               2 + 2:  8304760.0 i/s
               2 * 2:  7535516.6 i/s - 1.10x slower

But there is one problem. Calling a method + surrounding block ({ slow }) has bigger overhead than calling Fixnum#+ or Fixnum#* itself. Thea easiest way to observe it is to repeat benchmarked operations in one call. Like this:

require "benchmark/ips"

def slow
  2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2;
end

def fast
  2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2;
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end

Comparison:
               2 + 2:  4680545.3 i/s
               2 * 2:  3468681.3 i/s - 1.35x slower

See how results changed? Writing our benchmarks in this way would be quite problematic. Fortunately benchmark-ips gem has answer for that. Benchmark::IPS::Job#report method allows to pass string which will be compiled before benchmark is run. Passing right string allows to measure it properly:

require "benchmark/ips"

Benchmark.ips do |x|
  x.report("2 * 2", "2 * 2;" * 1_000)
  x.report("2 + 2", "2 + 2;" * 1_000)
  x.compare!
end

Comparison:
               2 + 2:    91567.5 i/s
               2 * 2:    57994.4 i/s - 1.58x slower

This is greatly explained here: https://docs.omniref.com/ruby/2.2.1/symbols/Benchmark/bm#annotation=4095926&line=182

How does this affect fast-ruby benchmarks? All benchmarks that call small things are flawed. One of them is Array#length vs Array#size vs Array#count benchmark. Here is the original code and result obtained on my computer:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length") { ARRAY.length }
  x.report("Array#size") { ARRAY.size }
  x.report("Array#count") { ARRAY.count }
  x.compare!
end

Comparison:
          Array#size:  8679483.2 i/s
        Array#length:  8664450.7 i/s - 1.00x slower
         Array#count:  7237299.5 i/s - 1.20x slower

The same benchmark measure with described approach gives different numbers:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length", "ARRAY.length;" * 1_000)
  x.report("Array#size",   "ARRAY.size;"   * 1_000)
  x.report("Array#count",  "ARRAY.count;"  * 1_000)
  x.compare!
end

Comparison:
          Array#size:   113902.4 i/s
        Array#length:   113655.9 i/s - 1.00x slower
         Array#count:    28753.4 i/s - 3.96x slower

Difference: 1.20x slower vs 3.96x slower.

My guess is that it affects most of benchmarks.

Jan 25 '16 10:01 radarek

tl;dr Some benchmarks have other overhead. For example call vs send vs method_missing claims that send is only 1.14x slower and method_missing 1.42x slower than direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower.

Another thing is that some benchmarks adds even more overhead. Let's consider call vs send vs method_missing. Current version is following:

require "benchmark/ips"

class MethodCall
  def method
  end

  def method_missing(_method,*args)
    method
  end
end

def fastest
  method = MethodCall.new
  method.method
end

def slow
  method = MethodCall.new
  method.send(:method)
end

def slowest
  method = MethodCall.new
  method.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end

And result on my computer is:

Comparison:
                call:  3998871.4 i/s
                send:  3497710.2 i/s - 1.14x slower
      method_missing:  2807780.4 i/s - 1.42x slower

First problem is that every call creates new object. Let's remove it. The easiest way is to add class methods instead of instance.

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

def fastest
  MethodCall.method
end

def slow
  MethodCall.send(:method)
end

def slowest
  MethodCall.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end

Comparison:
                call:  7291953.3 i/s
                send:  5741528.9 i/s - 1.27x slower
      method_missing:  4081021.3 i/s - 1.79x slower

Next step is to use trick described in first comment (pass code as string to report method).

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

Benchmark.ips do |x|
  x.report("call",           "MethodCall.method;"        * 1_000)
  x.report("send",           "MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "MethodCall.not_exist;"     * 1_000)
  x.compare!
end

Comparison:
                call:    32605.5 i/s
                send:    14935.5 i/s - 2.18x slower
      method_missing:     7816.2 i/s - 4.17x slower

We can go even further. Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup:

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

$MethodCall = MethodCall

Benchmark.ips do |x|
  x.report("call",           "$MethodCall.method;"        * 1_000)
  x.report("send",           "$MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "$MethodCall.not_exist;"     * 1_000)
  x.compare!
end

Comparison:
                call:    38591.5 i/s
                send:    15539.0 i/s - 2.48x slower
      method_missing:     8165.6 i/s - 4.73x slower

We started with original benchmark claiming that send is only 1.14x slower and method_missing 1.42x slower than direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower. Big difference, right?

Jan 25 '16 11:01 radarek

:clap: This is great work! I'll wait for the other collaborators to weigh in here, but I think this is great research. PRs welcome with these changes.

Jan 25 '16 12:01 nateberkopec

Something you might also consider is CPU instruction cache. Although I'm not an expert in the black art of optimizing code for the CPU cache, I do know cached instructions can skew benchmarks. I typically sidestep this issue by calling the method in question on multiple different pieces of data.

To further your first example:

require "benchmark/ips"

NUMS = (1..1000)
FAST = NUMS.map{|n| "#{n} + #{n}" }.join(';')
SLOW = NUMS.map{|n| "#{n} * 2" }.join(';')

Benchmark.ips do |x|
  x.report("n + n", FAST)
  x.report("n * 2", SLOW)
  x.compare!
end

This results in:

Calculating -------------------------------------
               n + n     68.816k (± 0.3%) i/s -    350.829k in   5.098103s
               n * 2     57.513k (± 0.2%) i/s -    290.680k in   5.054154s

Comparison:
               n + n:    68816.4 i/s
               n * 2:    57513.3 i/s - 1.20x slower

May 25 '16 00:05 jonathanhefner

:clap: :bow: Great work!

May 25 '16 00:05 ixti

@radarek Sounds like a good way to reduce overhead. Of course, overhead for Fixnum#+ will always be whatever is around (or it should be billions i/s) :)

$MethodCall = MethodCall Why using a global variable here instead of the constant? I think a constant would be both more practical and convey the fact it does not change.

Jun 21 '16 07:06 eregon

Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup

@radarek I missed that, sorry. This is interesting. I would not count on it for other implementations or future improvements though, reading constants ought to be as good as reading global variables. Global variables are also not the right design tool here.

Jun 22 '16 12:06 eregon

reading constants ought to be as good as reading global variables.

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

I agree that in normal practice, constants would make more sense, but in the case of benchmarking, we're trying to eliminate as much overhead as possible.

Jun 22 '16 18:06 nateberkopec

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

AFAIK this is completely backwards for JRuby+Truffle. I think this was an intentional design choice, constants are supposed to be constant (eg redefining a const emits a warning, you can't lexically do X=1 in normal methods etc.) There are no comparable constraints on globals, therefore JRuby+Truffle emits code that expects globals to change.

Re nesting, I would expect there to be no perf difference between A::B::C::D::F::G.foo and X.foo on JRuby+Truffle.

(This is of course only true if the code in question isn't continually changing constants.)

Jun 22 '16 19:06 thedarkone

@nateberkopec

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check. JIT implementations with deoptimization can actually fold Ruby constants as actual constants in the machine code. Nesting is just a different way to do the lookup, but it should not affect the caching (it just makes the invalidation more complex).

Actually, I am most worried that the benchmarks would use totally unidiomatic code.

@thedarkone Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

Jun 22 '16 20:06 eregon

@thedarkone TIL. Paging @chrisseaton on JRuby+Truffle. I thought the way constant lookup worked in Ruby was a language-level limitation, but it looks like it may be wrong!

In any case, if constant lookup and global lookup are equally fast in some implementations, and constant lookup is slower in some implementations, we should still use globals.

Jun 22 '16 20:06 nateberkopec

JRuby+Truffle's constant lookup and caching uses almost exactly the same mechanism and code as method calls, and technically it's perfect as far as I'm aware - it leaves no guards, no checks, no indirection, in the generated code.

JRuby+Truffle's globals however are fields on a hidden globals object. We read them like an instance variable and, as @eregon says, we don't do anything to turn them into constants at the moment. It doesn't happen for free as the globals object is, unsurprisingly, globally visible. We should fix that now and make globals constant. It would be a good beginner task if anyone is interested.

To comment more generally on this thread, the problem you're trying to solve is a really tricky one. I've submitted a few PRs to benchmark-ips about how deep the rabbit hole goes https://github.com/evanphx/benchmark-ips/pull/58 when you start to involve a sophisticated JIT.

Jun 22 '16 20:06 chrisseaton

@chrisseaton I vaguely remember @pitr-ch changing globals into direct volatiles (as per proposed mem model). I though my reasoning above is why JRubyTruffle is not speculating on "constant-ness" of globals.

Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

@eregon I would propose treating globals as constants (compilation constant), until proven otherwise (have a mod counter, after 2 or 10 changes switching to repeated volatile reads).

Jun 22 '16 20:06 thedarkone

Yes the modification counter is probably the key.

We're also missing the feature in Ruby that lets you hook into globals to observe reads and writes - so we should really dig into the globals implementation in general.

Jun 22 '16 20:06 chrisseaton

@thedarkone For the memory model, speculating them as constants would still be fine as invalidation is thread-safe (it waits for every thread to throw the compiled code before continuing).

Jun 22 '16 21:06 eregon

This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.

It seems I was wrong about MRI :) There is an inline cache for constants as one would expect:

 ruby --dump=insns -e 'A=42; loop {A}'
...
0000 trace            256                                             (   1)
0002 trace            1
0004 getinlinecache   11, <is:0>
0007 getconstant      :A
0009 setinlinecache   <is:0>
0011 trace            512
0013 leave

Instruction 0004 is the inline cache (but getconstant itself has no cache), checking a couple things and jumping to 0011 if it succeeds. This might be a bit more expensive than getglobal which only calls a function pointer which itself reads the global's value. The difference between the two seems fairly small on my machine, less than 5% on 40M i/s.

Jun 22 '16 22:06 eregon

Another type of overhead is unintentional allocations, like this

# each {} is the new object - {}.object_id == {}.object_id => false
h.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)

# freeze is to make sure we only read from object, but not write to it
EMPTY_OBJECT = {}.freeze

# same code without unintentional allocations
h.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil)

Those objects can trigger GC, which will make benchmark slower

Oct 20 '17 18:10 stereobooster

I guess this is a bit stale for me, in terms of topics, but I didn't want to create a new one. This example IMO is also a bit misleading: https://github.com/JuanitoFatas/fast-ruby#normal-way-to-apply-method-vs-method-code

The difference claimed (4x) goes down quite a when you remove the method call. (for me it went from 2.58x to 2.17x) for:

def do_something(n)
  4*n + 2
end

def fast
  [1, 2, 3].map { |n| do_something(n) }
end

METHOD_HANDLE = method(:do_something)
def slow
  [1, 2, 3].map(&METHOD_HANDLE)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

I guess the above suffers from the issue described originally in this thread, but I didn't want to compare the method-calling block-using version with method call caching string-using version.

Also when I did something like this:

def fast
  [1, 2, 3].map { |n| n.even? }
end

def slow
  [1, 2, 3].map(&:even?)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

the diff went down to 1.25x (on ruby 2.7.4)

Dec 31 '21 15:12 scooler