languages icon indicating copy to clipboard operation
languages copied to clipboard

Performance in the Fibonacci benchmark is almost entirely determined by factors other than function call overhead

Open dlnnlsn opened this issue 11 months ago • 3 comments

In the readme for the Fibonacci benchmark, we see the following requirements for different implementations of the benchmark:

ALL IMPLEMENTAITONS MUST...
Have a function that recursively compute a fibonacci number with this naive algorithm
Base case for input 0
Base case for input 1
Must make two recursive calls for each non-base invocation
No result caching, conversion to tail recursion, or iterative solutions.

This is because the benchmark Emphasizes function call overhead, stack pushing / popping, and recursion.

This is misleading. The recursive calls in the reference implementation are tail calls, and gcc at optimization levels higher than -O1 does optimize this by eliminating one of function calls. Thus the compiler produces assembly that is roughly equivalent to the following c code:

int32_t fibonacci(int32_t n) {
  int32_t result = 0;
  while (n > 1) {
    result += fibonacci(n - 1);
    n -= 2;
  }
  if (n == 1) result += 1;
  return result;
}

The Fortran compiler gfortran goes even further. It "unrolls" the recursive calls, and produces assembly that is roughly equivalent to the following c code:

int32_t fibonacci(int32_t n) {
  if (n == 0) return 0;
  if (n == 1) return 1;
  if (n == 2) return 1;
  if (n == 3) return 2;
  int32_t a = fibonacci(n - 3);
  int32_t b = fibonacci(n - 4);
  if (n == 4) return a + 2;
  a += b;
  int32_t c = fibonacci(n - 5);
  b += c;
  a += b;
  if (n == 5) return b + 1;
  return a + b + c + fibonacci(n - 6);
}

(See the output on Compiler Explorer)

If tail calls are not optimized, then the reference implementation requires 331160280 function calls to calculate fibonaaci(40), while the fortran version only requires 313883 function calls. This is calculated as follows:

  • If T_n is the number of function calls that a naive recursive solution requires to calculate the nth Fibonacci number, then we have that T_0 = T_1 = 0, and T_{n + 2} = 2 + T_{n + 1} + T_n.
  • If U_n is the number of function calls that the fortran optimization requires to calculate the nth Fibonacci number, then we have that U_0 = U_1 = U_2 = U_3 = 0, U_4 = 2, U_5 = 3, and U_{n + 6} = 4 + U_n + U_{n + 1} + U_{n + 2} + U_{n + 3}.

Thus, contrary to its stated goal, using this benchmark to compare c and fortran does not give us an accurate comparison of the overhead that each language introduces when making recursive function calls. Instead the difference in performance is almost entirely a result of the compilers' ability to remove the function calls altogether.

At the very least, the readme should clarify that while the source code for each implementation is not allowed to eliminate one of the recursive calls, and must make exactly two recursive calls on each iteration, and that these recursive calls must calculate the (n - 1)th and (n - 2)th Fibonacci numbers, the compiler may still make whatever optimizations it is capable of.

dlnnlsn avatar Jan 13 '25 21:01 dlnnlsn

Actually the fortran compiler eliminates the final fibonacci(n - 6) function call and turns it into a loop instead, resulting in even fewer function calls being necessary. And as mentioned, gcc also eliminates one of the function calls, so a compiled c version also uses fewer than 331160280 function calls to calculate fibonacci(40), but it still uses multiple orders of magnitude more function calls than the fortran version does.

dlnnlsn avatar Jan 13 '25 21:01 dlnnlsn

Thanks for this analysis, @dlnnlsn! I had sort of figured this out for the Fortran program, but for the C program I was a bit confused about why gcc would generate code that was so dramatically faster than what clang produced.

I think it's fine that some compilers find these optimizations and others don't. The benchmark rules are more for the program code, as I see it.

PEZ avatar Jan 13 '25 21:01 PEZ

Adding on to (and slightly contradicting) ChatGPT's analysis in your linked issue, I will mention that it is true that the fortran version is still exponential, but ChatGPT is incorrect when it claims that it will be slower than the c version for large values of n. The compiled fortran version I described above has time complexity O(1.3803...^n), while the naive version and the compiled c version are both O(1.6180...^n). The partial "unrolling" of the recursive calls gives an exponential speedup. I am confused about why gcc doesn't make this optimization since I assume that it uses the same or a similar backend to gfortran, but of course that's an issue with the relevant compilers, and not with the code in this repository.

dlnnlsn avatar Jan 13 '25 22:01 dlnnlsn