The fastest languages are cheating
When compiling Nim to C++, this is the simplified result:
#include <stdint.h>
#include <stdio.h>

uint64_t fib(uint64_t n) {
    if (n > 1) {
        uint64_t a = fib(n - 1);
        uint64_t b = fib(n - 2);
        return a + b;
    }
    return 1;
}

extern "C" void foo() {
    uint64_t result = fib(46);
    printf("%lu\n", result);
}

int main() {
    foo();
}
Nothing wrong here yet. But when compiled with g++, the disassembled program will look similar to this:
; Run like so:
; nasm -felf64 fib.asm && gcc -no-pie fib.o -o fib && time ./fib
global main
extern printf
section .text
fib:
cmp rdi, 1
mov eax, 1
ja L15
ret
L15:
push r15
push r14
mov r14, -1
push r13
push r12
lea r12, [rdi-2]
push rbp
push rbx
mov rbp, rdi
xor ebx, ebx
sub rsp, 8
L3:
cmp rbp, 2
jne L9
cmp r12, 1
ja L7
mov r13d, 1
L5:
lea rax, [r13+1+rbx]
L1:
add rsp, 8
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
L7:
mov rdi, r14
sub r12, 4
mov rbp, -2
call fib
lea rbx, [rax+1+rbx]
L9:
lea r15, [rbp-3]
mov rdi, r12
call fib
mov rdi, r15
mov r13, rax
call fib
add r13, rax
cmp r12, 1
jbe L5
mov rdi, r15
sub rbp, 4
sub r12, 4
call fib
add r13, rax
add rbx, r13
cmp rbp, 1
ja L3
lea rax, [rbx+1]
jmp L1
foo:
push r12
push rbp
mov edi, 40
push rbx
call fib
mov edi, 39
mov rdx, rax
call fib
mov edi, 38
mov rbp, rax
call fib
mov edi, 37
lea r11, [rax+rbp]
mov rcx, rax
call fib
mov edi, 36
add rcx, rax
mov rsi, rax
lea r10, [r11+rcx]
call fib
add rsi, rax
mov rbx, rax
mov edi, 35
add rcx, rsi
lea r9, [r10+rcx]
call fib
add rbx, rax
mov edi, 34
mov r12, rax
add rsi, rbx
add rcx, rsi
lea r8, [r9+rcx]
call fib
add rdx, rbp
mov edi, format
add rdx, r12
add rax, rdx
add rax, r11
add rax, rbx
add rax, r10
add rax, rsi
pop rbx
add rax, r9
pop rbp
pop r12
add rax, rcx
lea rsi, [rax+r8*2]
xor eax, eax
jmp printf
main:
sub rsp, 8
call foo
xor eax, eax
add rsp, 8
ret
format: db "%lu", 10, 0
Apparently, fib is only called for the parameters 34, 35, 36, 37, 38, 39 and 40, and the results are then added up. This is not a fair comparison, because Nim is not doing the same computations as the other languages. I believe that Nim should be moved to the Optimized section. Otherwise, one would have to argue how much unrolling is acceptable and how much is too much, and there is no good answer to that.
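To see that the arithmetic behind this transformation checks out, here is a small Python sketch (not part of the benchmark; the coefficients 13 and 8 come from unrolling the recurrence six levels by hand). With this benchmark's definition fib(0) = fib(1) = 1, fib(46) can be expressed as a weighted sum of smaller calls, which is the kind of reassociated addition the disassembly performs:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Same definition as the benchmark: fib(0) = fib(1) = 1.
    return 1 if n < 2 else fib(n - 1) + fib(n - 2)

# Unrolling fib(46) six levels gives fib(46) = 13*fib(40) + 8*fib(39);
# the coefficients themselves follow the Fibonacci recurrence.
assert fib(46) == 13 * fib(40) + 8 * fib(39)
print(fib(46))
```

The memoization is only there so the check runs quickly; it does not change the values being compared.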
Likewise, C and C++ partially unroll the fib function calls, although not as far. I have not checked the Fortran and Cython binaries, but they are probably doing the same thing.
It was suggested in another issue that the parameter n should be configurable to prevent optimization. However, that is insufficient, since it still allows the compiler to unroll the fib function for small values of n. The only solution seems to be to inspect the generated assembly code.
Here is a fair assembly implementation to compare against. If a program is faster, it is likely cheating in some way or another.
; Run like so:
; nasm -felf64 fib.asm && gcc -no-pie fib.o -o fib && time ./fib
global main
extern printf
section .text
fib:
mov rax, 1
dec rdi
jle done
push rdi
call fib
pop rdi
push rax
dec rdi
call fib
pop rdx
add rax, rdx
done:
ret
main:
mov rdi, 46
call fib
mov rdi, format
mov rsi, rax
xor rax, rax
call printf
xor rax, rax
ret
format: db "%lu", 10, 0
@983 amazing investigative work on this. It's interesting to see how the compiler is optimizing the recursive calls. I am open to suggestions on how to improve the fairness of the benchmark.
I can see the following options:
- Look at the generated assembly code and, if there is something fishy in there, move the language from the Natively compiled, statically typed table to the Optimized table.
- Specify the -fno-inline-small-functions flag when compiling with gcc.
- Prefix functions with __attribute__((noinline)).
The last two options do not completely fix the problem, because gcc will still inline the last recursive call. To get rid of that, also specify -fno-optimize-sibling-calls.
The Nim, Fortran and Cython benchmarks can be fixed the same way:

- nim cpp -d:release --passC:-fno-inline-small-functions --passC:-fno-optimize-sibling-calls fib.nim
- gfortran -fno-inline-small-functions -fno-optimize-sibling-calls -O3 -o fib fib.f03
- cython --embed -o fib.pyx.c fib.pyx && gcc -fno-inline-functions -fno-optimize-sibling-calls -O3 -o fib fib.pyx.c
In case this behavior changes in the future, here is my process for determining which flags were responsible:

- Compile with decreasing levels of optimization until the inlining no longer happens (https://gcc.godbolt.org/z/9AWZMN is helpful for that).
- Find the responsible -f<flag> by trial and error until gcc optimizes too much. See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for the list of compiler flags per optimization level.
- Finally, compile with -O3 -fno-<flag>.
Nevertheless, I think a compiler which is very good at inlining should still be honored in some way. Maybe the running time with all optimizations enabled could be included in another column or table?
The average time ten developers need to find the compiler flags to go from "regularly compiled code" to "optimized code" could be included in the table. This would make the results for the optimized Nim code shine.
I think the best solution is to implement #85 and change the examples to take a parameter. This should prevent some of the optimizations made here, but still allow languages with better inlining to shine. For now, I have added the flag -fno-inline-small-functions.
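A minimal sketch of the parameterized variant (shown here in Python for brevity; the default value is a placeholder, not the benchmark's actual code) would read n from the command line, so an ahead-of-time compiler cannot constant-fold or fully unroll the call:

```python
import sys

def fib(n: int) -> int:
    # Same definition as the benchmark: fib(0) = fib(1) = 1.
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    # Taking n from argv keeps it unknown at compile time, so the
    # compiler cannot precompute the result or specialize fib(n)
    # for one fixed argument.
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 30
    print(fib(n))
```

As noted above, this only prevents whole-result precomputation; partial unrolling of the recursion for small n remains possible.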
In a way I both agree and disagree with the comment: compiler optimization should not be restricted; if a language allows it, so much the better. However, I think the optimized folder contains code that is "human"-optimized (see the memoized Lisp function in the issue I opened), and that may be cheating. Maybe the Fibonacci function is not the best thing to benchmark, for that specific reason?
I have moved back to allowing compiler optimizations. However, I will only allow compiler flags that are deemed safe and of "release" quality, i.e. -O3.
Maybe fibonacci func is not the best to benchmark for that specific reason?
Naive recursive Fibonacci, O(Fib(N)), is definitely not representative of most code. The fact that there is an O(N) algorithm (iterative x+=y; y+=x; or x+=y; swap(x,y)) means that there is room for sufficiently clever optimizers to find big improvements.
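For reference, the O(N) iterative version mentioned above can be sketched like this (a Python illustration of the x+=y; y+=x idea, using the same fib(0) = fib(1) = 1 convention as the benchmark):

```python
def fib_naive(n: int) -> int:
    # The doubly-recursive version being benchmarked: O(Fib(N)) calls.
    return 1 if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

def fib_iter(n: int) -> int:
    # The O(N) loop a sufficiently clever optimizer could reduce it to:
    # each iteration advances the pair (fib(k), fib(k+1)) by one step.
    x, y = 1, 1
    for _ in range(n):
        x, y = y, x + y
    return x

# Both compute the same sequence.
assert all(fib_iter(k) == fib_naive(k) for k in range(20))
```

The gap between the two is exactly what makes this benchmark so sensitive to optimizer cleverness.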
If you choose naive recursive Fibonacci as your test function in the first place, surely the point is to test how well compilers convert inefficient double recursion into something like a loop containing a single recursion, or even something better. It also tests function-call overhead, since a single addition is a tiny amount of work for a single function call.
Doubly-recursive functions for traversing binary trees are common in real programs, but they differ in not having redundancy / common subexpressions between the two calls. So that recursion pattern is an important one for compilers to know how to optimize.
If you want to know how a language performs in general for a variety of computational problems, naive Fibonacci isn't a good choice, but it is an interesting and well-known problem to benchmark. (https://stackoverflow.com/questions/76611371/execution-time-of-recursive-fibonacci-function-is-slower-in-c-than-equivalent-ni is an example of someone being puzzled by Nim being faster than C on this benchmark.)
Perhaps some text somewhere pointing out that apparently trivial differences can lead to major compiler optimizations could avoid confusing people?