inlined-generic-function

Benchmark behaviour on defined class is slow.

Open bon opened this issue 8 years ago • 8 comments

The benchmarks provided are for methods on the built-in Lisp types number, fixnum and double-float. To test the behaviour on defined classes we added a simple boxing class and found that performance degraded when the inlined-generic-functions were actually inlined. We measured the following processor cycle counts for the four methods in playground.lisp, respectively:

     333,033
     331,839
   2,144,814
     585,272

Experiment run on SBCL 1.3.5.24.

See https://github.com/bon/inlined-generic-function/commit/8b6e4d5b10cace47de4343e6dde8455f21dfd579

So my question is: does this indicate that inlined-generic-functions only give a speedup on built-in types and not on defined classes?

bon, Sep 01 '16 14:09

It seems normal-plus is running without boxing, right?

guicho271828, Sep 01 '16 17:09

Correct! Fixed in https://github.com/bon/inlined-generic-function/commit/76d1eb6e77ebc5433465b9afb2cdb84b6c4c3e4d

Processor cycles are now

    588,650
    586,253
  1,889,394
    550,351

bon, Sep 01 '16 17:09

phew.

guicho271828, Sep 01 '16 20:09

I just tested your version. On my machine, the result is still in favor of the inlined version.

Evaluation took:
  0.001 seconds of real time
  0.004000 seconds of total run time (0.004000 user, 0.000000 system)
  400.00% CPU
  638,640 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  608,634 processor cycles
  163,808 bytes consed

Evaluation took:
  0.003 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  4,543,020 processor cycles
  655,184 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  389,169 processor cycles
  163,808 bytes consed

Where does this difference come from? In your results the inlined-generic-function performs better, but not by much. I use SBCL 1.3.8 via Roswell on:

$ uname -a
Linux guicho-x61 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
...
model name  : Intel(R) Core(TM)2 Duo CPU     T7100  @ 1.80GHz
...

guicho271828, Sep 02 '16 22:09

For me the cycle counts vary wildly from run to run. Sometimes the IGF version is a little quicker, sometimes slower. One example is shown below.

But the more interesting question is why the IGF showed a 10x speedup on numbers but hardly any difference on defined classes. Of course I would be very happy to see a 10x speedup on defined classes too!

$ cat /proc/cpuinfo  | ag 'model name' | head -1
model name  : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
$ uname -a
Linux tie 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 x86_64 GNU/Linux
$ ros use sbcl
$ ~/.roswell/impls/x86-64/linux/sbcl/1.3.9/bin/sbcl --version
SBCL 1.3.9
$ ros run
$ rlwrap ros run
* (ql:quickload :inlined-generic-function)

...

* (load "benchmark.lisp")

...

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  424,334 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  362,358 processor cycles
  163,792 bytes consed

Evaluation took:
  0.001 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  2,060,160 processor cycles
  655,200 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.003333 seconds of total run time (0.003333 user, 0.000000 system)
  100.00% CPU
  493,287 processor cycles
  163,792 bytes consed

bon, Sep 03 '16 15:09

The 10x speedup is not achieved because of missing type information and the cost of slot access.

  1. The contents slot of box is not typed, so the (+ (contents a) b) part always calls a generic-+ rather than the optimized machine arithmetic. You should check the disassembly output.
  2. The accessor contents is a normal generic function, so the slot access is slow.
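
To make the two points above concrete, here is a minimal sketch in plain Common Lisp. It does not use the library or the box class from the linked commit: untyped-box, typed-box, and the three -plus functions are hypothetical names chosen for illustration. It contrasts an untyped slot read through a generic accessor with two ways of giving the compiler type information.

```lisp
;; All names below are hypothetical; this is not the code from playground.lisp.

;; Variant 1: a standard class with an untyped slot. The accessor
;; UNTYPED-CONTENTS is an ordinary generic function, and the compiler
;; knows nothing about the type of the value it returns.
(defclass untyped-box ()
  ((contents :initarg :contents :accessor untyped-contents)))

(defun untyped-plus (a b)
  (+ (untyped-contents a) b))

;; Variant 2: same class, but the call site asserts that the slot value
;; and B are fixnums. The accessor is still a generic function call, but
;; the addition no longer has to handle arbitrary numeric types.
(defun declared-plus (a b)
  (declare (type fixnum b))
  (+ (the fixnum (untyped-contents a)) b))

;; Variant 3: a structure with a fixnum-typed slot. The accessor
;; TYPED-CONTENTS is an ordinary (non-generic) function and its return
;; type follows from the slot declaration.
(defstruct (typed-box (:conc-name typed-))
  (contents 0 :type fixnum))

(defun typed-plus (a b)
  (declare (type typed-box a) (type fixnum b))
  (+ (typed-contents a) b))
```

Comparing (disassemble 'untyped-plus) with (disassemble 'typed-plus) is one way to check which kind of addition the compiler actually emitted. The exact output depends on the implementation and optimization settings.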

Imagine the dispatch cost is 10X for a normal GF and X for an IGF. The two factors above add overheads A and B to both, resulting in 10X+A+B vs X+A+B. A 10x speedup is then clearly not achievable, since A+B could be very large compared to X.
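
The cost model in the previous paragraph can be worked through with a few made-up numbers. In this sketch the function speedup and its arguments are hypothetical; overhead stands for A+B, and the values are illustrative, not measurements.

```lisp
(defun speedup (x overhead)
  "Ratio (10X + A + B) / (X + A + B), with OVERHEAD standing for A + B."
  (/ (+ (* 10 x) overhead) (+ x overhead)))

(speedup 1 0)   ; => 10      no shared overhead: the full 10x
(speedup 1 9)   ; => 19/10   overhead 9 times X: less than 2x
(speedup 1 90)  ; => 100/91  overhead 90 times X: barely above 1x
```

The larger the shared overhead is relative to the dispatch cost, the closer the ratio gets to 1, which matches the small differences seen in the boxed benchmarks.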

guicho271828, Sep 03 '16 19:09

I updated the environment and noticed that the examples in playground.lisp are getting slower. It looks like the functions are being prevented from being inlined.

guicho271828, Sep 16 '16 21:09

(push :inline-generic-function *features*) still successfully forces the functions to be inlined, but I don't like this solution...

guicho271828, Sep 16 '16 21:09