lean4
lean4 copied to clipboard
Do notation inhibits tail call optimization
The use of return
inside do
notation causes this tail-recursive loop to be compiled into two mutually recursive functions, which breaks TCO. The version using if-else
instead (bar
below) optimizes correctly.
set_option trace.compiler.ir.result true
partial def foo (n : Nat) : Nat := do
if n == 10 then return 0
foo (n+1)
-- def foo._lambda_1 (x_1 : @& obj) (x_2 : @& obj) : obj :=
-- let x_3 : obj := 1;
-- let x_4 : obj := Nat.add x_1 x_3;
-- let x_5 : obj := foo x_4;
-- dec x_4;
-- ret x_5
-- def foo (x_1 : @& obj) : obj :=
-- let x_2 : obj := 10;
-- let x_3 : u8 := Nat.decEq x_1 x_2;
-- case x_3 : u8 of
-- Bool.false →
-- let x_4 : obj := ctor_0[PUnit.unit];
-- let x_5 : obj := foo._lambda_1 x_1 x_4;
-- ret x_5
-- Bool.true →
-- let x_6 : obj := 0;
-- ret x_6
partial def bar (n : Nat) : Nat := do
if n == 10 then 0 else
bar (n+1)
-- def bar (x_1 : obj) : obj :=
-- let x_2 : obj := 10;
-- let x_3 : u8 := Nat.decEq x_1 x_2;
-- case x_3 : u8 of
-- Bool.false →
-- let x_4 : obj := 1;
-- let x_5 : obj := Nat.add x_1 x_4;
-- dec x_1;
-- let x_6 : obj := bar x_5;
-- ret x_6
-- Bool.true →
-- dec x_1;
-- let x_7 : obj := 0;
-- ret x_7
For reference, the do
desugaring produces a term that looks roughly like this:
let _do_jp : PUnit → Id Nat := fun (y : PUnit) => foo (n + 1);
if n == 10 = true then pure 0 else
do
let y ← pure PUnit.unit
_do_jp y
The foo._lambda_1
code corresponds to _do_jp
here.
To add on to this, if-else
with return
does compile the tail call correctly:
partial def foobar (n : Nat) : Nat := do
if n == 10 then return 0 else return foobar (n+1)
/-
def foobar (x_1 : obj) : obj :=
let x_2 : obj := 10;
let x_3 : u8 := Nat.decEq x_1 x_2;
case x_3 : u8 of
Bool.false →
let x_4 : obj := 1;
let x_5 : obj := Nat.add x_1 x_4;
dec x_1;
let x_6 : obj := foobar x_5;
ret x_6
Bool.true →
dec x_1;
let x_7 : obj := 0;
ret x_7
-/
As such, I would say this has less to do with the do
notation and return
, and more to do with if
and return
. It looks to me like Lean is not optimizing an if
statement ending with a return
(i.e., a terminal if
) into an if-else
. That is, ideally, the following optimization would exist:
if bool then
...
return x
...
return y
becomes
if bool then
...
return x
else
...
return y
Keep in mind that return
in do-notation is an early exit, it is not pure
. So it is reasonable to me that using return
, especially in if
conditions, will produce a join-point based desugaring. That said, it would probably be good to detect patterns like if cond then return x; y
and if cond then return x else return y
that can be modeled using just plain if-else and don't need the extra power of early exit, because the plain if-else version is generating better code even if we ignore the TCO issue.
But to really solve the TCO issue in general, including all the do sugar (the usual definition of "tail position" would not exclude these examples), what is needed is to support compilation of mutual tail calls, possibly using the mustcall
attribute as indicated by @Kha .
Keep in mind that
return
in do-notation is an early exit,
My apologies, but I am not sure what you are getting at here. I am aware that return
is an early exit, but I fail to see how it applies to the points I was making.
Ah, I misinterpreted your point. Ignore that first sentence.
But to really solve the TCO issue in general, including all the do sugar (the usual definition of "tail position" would not exclude these examples), what is needed is to support compilation of mutual tail calls, possibly using the
mustcall
attribute as indicated by @Kha.
This is also an interesting point. Even with the mutual recursive functions, aren't both functions preforming tail calls (i.e. foo
calls the lambda in tail position and lambda calls foo
in the tail position), so shouldn't they be optimized as such? It doesn't seem like the addition of the lambda should break that. So does the tail call optimization problems really have anything to do with syntax and the result IR? Doesn't it instead have to do with the later TCO?
OOPS! I just realized I made a major error in my previous foobar
and pure_bar
definitions. They are both using bar
rather than recursing on themselves. With the proper definitions my previous observations about the boxed/unboxed versions disappears (as the correct versions also only produce boxed versions). I have also updated/deleted my previous posts to help clean the thread and remove the noise. Sorry about this!
But to really solve the TCO issue in general, including all the do sugar (the usual definition of "tail position" would not exclude these examples), what is needed is to support compilation of mutual tail calls, possibly using the
mustcall
attribute as indicated by @Kha.This is also an interesting point. Even with the mutual recursive functions, aren't both functions preforming tail calls (i.e.
foo
calls the lambda in tail position and lambda callsfoo
in the tail position), so shouldn't they be optimized as such? It doesn't seem like the addition of the lambda should break that. So does the tail call optimization problems really have anything to do with syntax and the result IR? Doesn't it instead have to do with the later TCO?
Right, these are orthogonal issues. do
notation produces suboptimal desugaring involving lambdas when they aren't necessary, and mutual TCO doesn't work. The original example would be fixed if either of these is addressed, but I think addressing the latter problem is more likely to solve more general examples where eliding the lambda isn't possible.
Right, these are orthogonal issues.
do
notation produces suboptimal desugaring involving lambdas when they aren't necessary, and mutual TCO doesn't work.
I still find it interesting that you are laying the desugaring problem at the foot of the do
notation in general. I would blame, more specifically, the imperative if
. The if
produces the joint point lambda that leads to this problem. Many other uses of do
would not have this problem. I am also curious as to why the lambda isn't being inlined considering its small size (and, in this case, single usage) -- such inlining would also solve the TCO problem (for this example).
I would blame, more specifically, the imperative if .
Sure. The do notation desugaring is a big and complicated thing in lean 4, and there are several possible ways for join points to be introduced. At least unless
would also need to be optimized similarly (that was what I used in the real world example from which this MWE was extracted), but looking at the do-IR directly would probably yield a more general optimization pattern.
I am also curious as to why the lambda isn't being inlined considering its small size (and, in this case, single usage) -- such inlining would also solve the TCO problem (for this example).
I imagine that (mutual) recursive calls might be inhibiting inlining in this case. It can be tricky to determine when it's actually a win. But I don't really know anything about how the inliner works.