inlined-generic-function
Benchmark behaviour on defined class is slow.
The benchmarks provided are for methods on the built-in Lisp types number, fixnum and double-float. To test the behaviour on defined classes we added a simple boxing class and found that performance degraded when the inlined-generic-functions were inlined. We found the following numbers of processor cycles for the four methods in playground.lisp, respectively:
333,033
331,839
2,144,814
585,272
Experiment on sbcl 1.3.5.24
See https://github.com/bon/inlined-generic-function/commit/8b6e4d5b10cace47de4343e6dde8455f21dfd579
So my question is whether this indicates that inlined-generic-functions only speed up on built-in types and not on defined classes?
It seems normal-plus is running without boxing, right?
Correct! Fixed in https://github.com/bon/inlined-generic-function/commit/76d1eb6e77ebc5433465b9afb2cdb84b6c4c3e4d
Processor cycles are now
588,650
586,253
1,889,394
550,351
phew.
I just tested your version. On my machine, the result is still in favor of the inlined version.
Evaluation took:
0.001 seconds of real time
0.004000 seconds of total run time (0.004000 user, 0.000000 system)
400.00% CPU
638,640 processor cycles
131,024 bytes consed
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
608,634 processor cycles
163,808 bytes consed
Evaluation took:
0.003 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
0.00% CPU
4,543,020 processor cycles
655,184 bytes consed
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
389,169 processor cycles
163,808 bytes consed
Where does this difference come from? In your result the inlined generic function performs better, but not by much. I use SBCL 1.3.8 on roswell on
$ uname -a
Linux guicho-x61 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
...
model name : Intel(R) Core(TM)2 Duo CPU T7100 @ 1.80GHz
...
For me the numbers of cycles vary wildly from run to run. Sometimes the igf gets a little quicker, sometimes slower. One example is shown below.
But the more interesting question is why the igf showed a 10x speedup on numbers but hardly any difference on defined classes? Of course I would be very happy to see a 10x speedup on defined classes too!
$ cat /proc/cpuinfo | ag 'model name' | head -1
model name : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
$ uname -a
Linux tie 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 x86_64 GNU/Linux
$ ros use sbcl
$ ~/.roswell/impls/x86-64/linux/sbcl/1.3.9/bin/sbcl --version
SBCL 1.3.9
$ ros run
$ rlwrap ros run
* (ql:quickload :inlined-generic-function)
...
* (load "benchmark.lisp")
...
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
424,334 processor cycles
131,024 bytes consed
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
362,358 processor cycles
163,792 bytes consed
Evaluation took:
0.001 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
0.00% CPU
2,060,160 processor cycles
655,200 bytes consed
Evaluation took:
0.000 seconds of real time
0.003333 seconds of total run time (0.003333 user, 0.000000 system)
100.00% CPU
493,287 processor cycles
163,792 bytes consed
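Since the cycle counts vary from run to run, one way to get steadier numbers is to repeat each measurement and keep the minimum. The following is only a sketch using portable Common Lisp; min-run-time and its keyword arguments are hypothetical names, not part of benchmark.lisp or the library.

```lisp
;; Hypothetical helper: run THUNK in a loop several times and return the
;; smallest elapsed run time, in internal time units. Taking the minimum
;; filters out runs disturbed by GC or other processes.
(defun min-run-time (thunk &key (runs 10) (iterations 100000))
  (loop repeat runs
        minimize (let ((start (get-internal-run-time)))
                   (dotimes (i iterations)
                     (funcall thunk))
                   (- (get-internal-run-time) start))))

;; Usage (the lambda body stands in for one of the benchmarked calls):
;; (min-run-time (lambda () (+ 1 2)) :runs 5 :iterations 1000)
```

The result is in units of internal-time-units-per-second, so it is coarser than processor cycles; raising iterations compensates for that.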
The reason for not achieving a 10x speedup is the type information and the cost of slot access.
- The contents slot of box is not typed, so the (+ (contents a) b) part always calls a generic-+, not the optimized machine code. You should check the disassembly result.
- The accessor contents is a normal generic function, so the slot access is slow.
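One way to test the two points above is to compare against a box whose slot is typed and whose reader is not a generic function. The sketch below uses a structure for that; fbox, fbox-contents and add-fbox are hypothetical names chosen for illustration and do not appear in playground.lisp. It assumes the boxed values are double-floats.

```lisp
;; Hypothetical typed box: the slot type is declared, and structure
;; readers are ordinary functions rather than generic functions.
(defstruct fbox
  (contents 0d0 :type double-float))

;; Declare the argument and result types so the compiler can, in
;; principle, pick float arithmetic instead of a generic addition.
(declaim (ftype (function (fbox double-float) double-float) add-fbox))
(defun add-fbox (a b)
  (+ (fbox-contents a) b))

;; Inspect the generated code at the REPL and look for a call to a
;; generic addition routine versus a direct float add:
;; (disassemble 'add-fbox)
```

Whether the generic addition actually disappears depends on the implementation and optimization settings, so the disassembly is the thing to check.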
Imagine the total cost is 10X for a normal GF and X for an IGF. The two factors above add two overheads, A and B, giving 10X+A+B vs. X+A+B. A 10x speedup is then clearly not achievable, since A+B could be very large.
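To make the arithmetic concrete, here is a small sketch of that cost model. The function name speedup and the sample values are made up for illustration; overhead stands for A+B.

```lisp
;; Hypothetical cost model: a normal GF costs 10X + overhead,
;; an IGF costs X + overhead. Returns the ratio of the two.
(defun speedup (x overhead)
  (/ (+ (* 10 x) overhead)
     (+ x overhead)))

(speedup 1 0)   ; => 10     no shared overhead: the full 10x
(speedup 1 9)   ; => 19/10  overhead of 9X: only 1.9x
```

As the shared overhead grows relative to X, the ratio approaches 1, which matches the small differences seen on the boxed class.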
I updated the environment and noticed that the examples in playground.lisp are getting slow. It looks like the function is prevented from inlining.
(push :inline-generic-function *features*)
still successfully forces the functions to be inlined, but I don't like this solution...