Add new Thunk-based runtime
:construction: This PR currently only contains updates to the runtime system. The macros still need to be updated in order for it to work properly. :construction:
This branch is an attempt to address #3 and #20 with a new Thunk-based trampoline runtime. This solution has been explored in libraries like trampoline-rs; however, there it is implemented in a way that requires heap allocation.
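For contrast, the heap-allocating approach can be sketched as a minimal boxed trampoline (the names here are illustrative, not trampoline-rs's actual API): each bounce boxes the next step on the heap, which is exactly the cost this runtime tries to avoid.

```rust
// A minimal heap-allocating trampoline in the style described above.
// Every recursive step allocates the next thunk with `Box::new`.
enum Action<T> {
    Done(T),
    Call(Box<dyn FnOnce() -> Action<T>>),
}

fn run<T>(mut action: Action<T>) -> T {
    loop {
        match action {
            Action::Done(value) => return value,
            Action::Call(thunk) => action = thunk(),
        }
    }
}

fn gcd(a: u64, b: u64) -> Action<u64> {
    if b == 0 {
        Action::Done(a)
    } else {
        // One heap allocation per recursive step.
        Action::Call(Box::new(move || gcd(b, a % b)))
    }
}

fn main() {
    assert_eq!(run(gcd(48, 18)), 6);
}
```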
In this runtime, the Thunk lives in a Slot which is allocated on the stack and re-used for each subsequent call. This avoids the need for heap allocation, but comes with a few downsides:
- Since Rust does not yet support unsized locals (DSTs by value), the size (and alignment) of the `Slot` must be conservative (i.e. large enough to accommodate the data of the biggest `Thunk`).
- This introduces some `unsafe` code into the runtime of the crate (it may be possible to extract this part out into its own crate, but I have yet to find a "stack box" crate that tracks lifetimes in the way that we need for tail recursion while allowing reuse of the slot).
- The usage of the crate will likely need to change to two macros rather than one (although this can be seen as a feature in light of #14).
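For illustration, a conservative `Slot` along these lines might look like the following sketch; the size and alignment constants are placeholder assumptions, not the crate's real values.

```rust
use core::mem::MaybeUninit;

// Hypothetical sketch: the concrete closure types are unknown here, so the
// slot reserves a conservative amount of suitably-aligned uninitialized space.
const SLOT_SIZE: usize = 64; // assumed upper bound on thunk data

#[repr(align(16))] // assumed worst-case alignment
struct Slot {
    bytes: [MaybeUninit<u8>; SLOT_SIZE],
}

impl Slot {
    fn new() -> Self {
        Slot { bytes: [MaybeUninit::uninit(); SLOT_SIZE] }
    }
}

fn main() {
    let slot = Slot::new();
    // Large and aligned enough for any thunk up to the chosen limits.
    assert_eq!(core::mem::size_of_val(&slot), SLOT_SIZE);
    assert_eq!(core::mem::align_of::<Slot>(), 16);
}
```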
As mentioned above, the macros in the crate are not yet updated. To see an example of the runtime in action (what the macros would need to emit), see the `factorial_in_new_runtime` test case.
At a high level, usage of this crate under the new runtime would change to two macros: one attribute to mark a function as available for TCO, and another at each call site where one would want to use/guarantee TCO:
```rust
#[tailcall]
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        tailcall::call! { gcd(b, a % b) }
    }
}
```
Macro Expansion Mockup
```rust
use tailcall::{slot, trampoline};

fn gcd(a: u64, b: u64) -> u64 {
    trampoline::run(move |slot| build_gcd_action(slot, a, b))
}

#[doc(hidden)]
#[inline(always)]
fn build_gcd_action<'slot>(slot: &'slot mut slot::Slot, a: u64, b: u64) -> trampoline::Action<'slot, u64> {
    trampoline::call(slot, move |slot| {
        if b == 0 {
            trampoline::done(slot, a)
        } else {
            build_gcd_action(slot, b, a % b)
        }
    })
}
```
This is something that I am working on in my free time, so it may take me a while to dig into rewriting the macros. In the meantime, I'm open to feedback, suggestions, and contributions from others who are interested in this problem.
To help me understand what is going on with this slot-thunk trick, let me expand factorial and inline all the calls:
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        trampoline::call(slot, move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        trampoline::run(move |slot| call_factorial_inner(slot, accumulator, input))
    }

    factorial_inner(1, input)
}
```
Intermediate inlining process.
⇒
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        trampoline::call(slot, move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        })
        // ⇒
        let fn_once = move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
                // ⇒
                return Action::Done(accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        };
        trampoline::call(slot, fn_once);
        // ⇒
        Action::Call(Thunk::new_in(slot, fn_once));
        // ⇒
        let ptr = slot.put(fn_once);
        // ⇒
        let ptr = slot.cast().write(fn_once);
        // ⇒
        let slot_bytes: &mut MaybeUninit<_> =
            unsafe { &mut *slot.bytes.as_mut_ptr().cast() };
        let ptr = slot_bytes.write(fn_once);
        Action::Call(Thunk { ptr })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        trampoline::run(move |slot| call_factorial_inner(slot, accumulator, input))
        // ⇒
        let slot = &mut Slot::new();
        let mut action = (move |slot| call_factorial_inner(slot, accumulator, input))(slot);
        // ⇒
        let mut action = call_factorial_inner(slot, accumulator, input);
        loop {
            match action {
                Action::Done(value) => return value,
                Action::Call(thunk) => action = thunk.call(),
                // ⇒
                Action::Call(thunk) => {
                    let ptr: *mut dyn ThunkFn<'_, _> = thunk.ptr;
                    core::mem::forget(thunk);
                    action = unsafe { (*ptr).call_once_in_slot() };
                    // ⇒
                    action = unsafe {
                        let (fn_once, slot) = Slot::take(*ptr);
                        // ⇒
                        let in_slot: *mut _ = *ptr;
                        let slot: &mut Slot = &mut *in_slot.cast();
                        let fn_once = slot.cast().assume_init_read();
                        // ⇒
                        let slot_bytes_mut: &mut MaybeUninit<_> =
                            &mut *slot.bytes.as_mut_ptr().cast();
                        let fn_once = slot_bytes_mut.assume_init_read();
                        fn_once(slot)
                    };
                },
            }
        }
    }

    factorial_inner(1, input)
}
```
⇒
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        let fn_once = move |slot| {
            if input == 0 {
                return Action::Done(accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        };
        let slot_bytes: &mut MaybeUninit<_> =
            unsafe { &mut *slot.bytes.as_mut_ptr().cast() };
        let ptr = slot_bytes.write(fn_once);
        Action::Call(Thunk { ptr })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        let slot = &mut Slot::new();
        let mut action = call_factorial_inner(slot, accumulator, input);
        loop {
            match action {
                Action::Done(value) => return value,
                Action::Call(thunk) => {
                    let ptr: *mut dyn ThunkFn<'_, _> = thunk.ptr;
                    core::mem::forget(thunk);
                    action = unsafe {
                        // This part does not really work because the type of
                        // `fn_once` cannot be written out.
                        // In the implementation, `dyn ThunkFn` is used to
                        // dynamically dispatch the call.
                        let in_slot: *mut _ = *ptr;
                        let slot: &mut Slot = &mut *in_slot.cast();
                        let slot_bytes_mut: &mut MaybeUninit<_> =
                            &mut *slot.bytes.as_mut_ptr().cast();
                        let fn_once = slot_bytes_mut.assume_init_read();
                        fn_once(slot)
                    };
                },
            }
        }
    }

    factorial_inner(1, input)
}
```
Code that compiles: https://github.com/SichangHe/alecdotninja--tailcall/commit/d8570c432684fc70968093d06f24214675a2a284
If I try to explain this:
- Allocate a `Slot` on the stack to store the `fn_once` closure.
    - `slot`'s size needs to be the size of `fn_once` (Edit: I checked by printing it out. It's 16. I guess it's the closure size plus the vtable pointer size)? …which is pointer size in this case because `fn_once` captures `&mut slot`.
    - `slot` is both mutably borrowed by `fn_once` and used to write it, then how is the borrow checker happy 😵 (yes, it compiles)?
- Write `fn_once` into `slot`, and then store the resulting pointer as a `dyn ThunkFn` in a `Thunk` so that `fn_once` can be read out and called later.
    - In this example, the same closure (although re-created each time) is written into `slot` again and again (`MaybeUninit::write`). Can this be optimized?
    - `ThunkFn` is the key that lets `fn_once` be read out and called later, but it also introduces dynamic dispatch.
        - This means the same `Thunk` can store various `fn_once`s, but how to use that for mutual tail calls?
    - The point of `slot` is to store `fn_once` on the stack, but couldn't we borrow `fn_once` and store it on the caller's stack?
- Iterate on `action`, grab out the inner `fn_once` using `slot`, and call it if needed.
    - Does `assume_init_read` unnecessarily copy the content of `slot`?
@SichangHe I've refactored a little and added some safety comments which should hopefully make this a little more clear.
I need to add more documentation to the types but here is a brief summary:
- `Slot` - a place where values can be written. Importantly, it is safe to transmute between `&mut T` and `&mut Slot` (provided that a `T` has been written into the `Slot`).
- `Thunk<'slot, T>` - an owned closure returning a `T` that is stored in a `Slot` (basically `Box<dyn FnOnce() -> T>` but allocated on the stack, with a `'slot` lifetime for the data instead of `'static`).
slot is both mutably borrowed by fn_once and used to write it, then how is the borrow checker happy 😵 (yes, it compiles)?
Well, unsafe code is used, so this isn't fully borrow-checked. :sweat_smile: That said, I am fairly confident it is correct. Notice how in the code, there are never two &mut pointers to the same bytes at the same time. The &mut Slot pointer is reinterpreted as different concrete function types, but there is no aliasing of mutable values.
This is why the closures are passed a &mut Slot. That unique value moves through the trampoline and acts as a type-level guard that only one function is using the Slot at a time.
In this example, the same closure (although re-created each time) is written into slot again and again (MaybeUninit::write). Can this be optimized?
I don't think so, but I might not be fully understanding what you are saying.
The information about the state of the trampoline has to be stored somewhere. In the current enum-based implementation, we write to `input` "again and again". Keep in mind that the data being written into the `Slot` is not machine-code, it is just the data required to call the closure (which in my testing is the equivalent of the arguments to the function).
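A quick way to see this: the size of such a closure is just the size of its captured data. This is a hypothetical sketch; the exact layout of closures is not guaranteed by the compiler, but in practice rustc reports 16 bytes for two captured `u64`s.

```rust
// The bytes stored for a closure are its captures, not machine code.
fn main() {
    let accumulator: u64 = 1;
    let input: u64 = 5;
    // Captures two u64s by value; layout is unspecified but is 16 bytes
    // in practice on rustc.
    let fn_once = move || accumulator * input;
    assert_eq!(core::mem::size_of_val(&fn_once), 16);
}
```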
ThunkFn is the key that the fn_once can be read out and called later, but it also introduces dynamic dispatch.
This means a same Thunk can store various fn_once, but how to use it for mutual tail calls?
The dynamic dispatch is what allows for mutual tail calls! By performing type-erasure on the closures, we are able to make a trampoline that can run multiple functions without having to know the full list in advance. I think this will work even across different crates (provided they are using the same version of tailcall).
Mutual Recursion Example
```rust
#[tailcall]
fn is_even(x: u128) -> bool {
    if x > 0 {
        tailcall::call! { is_odd(x - 1) }
    } else {
        true
    }
}

#[tailcall]
fn is_odd(x: u128) -> bool {
    if x > 0 {
        tailcall::call! { is_even(x - 1) }
    } else {
        false
    }
}
```
Mutual Recursion Example (Expanded)
```rust
use tailcall::{slot, trampoline};

fn is_even(x: u128) -> bool {
    trampoline::run(move |slot| build_is_even_action(slot, x))
}

#[doc(hidden)]
#[inline(always)]
fn build_is_even_action<'slot>(slot: &'slot mut slot::Slot, x: u128) -> trampoline::Action<'slot, bool> {
    trampoline::call(slot, move |slot| {
        if x > 0 {
            build_is_odd_action(slot, x - 1)
        } else {
            trampoline::done(slot, true)
        }
    })
}

fn is_odd(x: u128) -> bool {
    trampoline::run(move |slot| build_is_odd_action(slot, x))
}

#[doc(hidden)]
#[inline(always)]
fn build_is_odd_action<'slot>(slot: &'slot mut slot::Slot, x: u128) -> trampoline::Action<'slot, bool> {
    trampoline::call(slot, move |slot| {
        if x > 0 {
            build_is_even_action(slot, x - 1)
        } else {
            trampoline::done(slot, false)
        }
    })
}
```
Does assume_init_read unnecessarily copy the content of slot?
I don't think so. We need to move the value out of the Slot in order to be able to re-use it for the next iteration anyway. The only alternative I can think of would be to alternate between two slots.
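The move-out-then-reuse pattern can be seen in miniature with `MaybeUninit` alone; this simplified sketch uses a plain array in place of a closure.

```rust
use core::mem::MaybeUninit;

fn main() {
    let mut storage: MaybeUninit<[u64; 2]> = MaybeUninit::uninit();

    storage.write([3, 4]);
    // `assume_init_read` bitwise-copies the value out, leaving the
    // storage free to be written again for the next iteration.
    let first = unsafe { storage.assume_init_read() };

    storage.write([5, 6]); // reuse the same storage
    let second = unsafe { storage.assume_init_read() };

    assert_eq!(first, [3, 4]);
    assert_eq!(second, [5, 6]);
}
```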
Well, unsafe code is used, so this isn't fully borrow-checked. 😅
Okay… Learning about the borrow checker all the time.
That said, I am fairly confident it is correct[…]
Yes, I also think it follows ownership rules.
The information about the state of the trampoline has to be stored somewhere. In the current enum-based implementation, we write to `input` "again and again". Keep in mind that the data being written into the `Slot` is not machine-code, it is just the data required to call the closure (which in my testing is the equivalent of the arguments to the function).
Okay, now I get it.
- The `slot` stores a closure (before we are done).
- On each iteration, the closure is read out from `slot` without being moved (`MaybeUninit::assume_init_read`), then called with `slot` as its argument; finally, the closure call's output (which can again be a closure, even of the same type) is written back to `slot`. All copies are necessary.
    - Since a closure is an anonymous struct implementing `FnOnce` and other function traits, and the closure's fields are the values it "captures" (which could be the arguments), calling the closure should stack-copy all its fields onto a new call stack, and stack-copy the returned output back to `slot`.
    - So, the closure calls should introduce the creation and destruction of a new call stack per recursive call, and stack-copying of all the closure's captured values twice (enum-based implementations would also do this once). Maybe validate this with Godbolt.
    - Intuitively, this is more overhead than the current single-tail-call implementation, which only writes the necessary value changes onto the same stack. But the overhead is small, because stack copies are cheap and the function call jumps usually hit the cache.
    - The closure call might be inlined by rustc, but that is unlikely due to the dynamic dispatch. Enum-based implementations are more likely to get inlined.
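For comparison, the enum-based style discussed here can be written as a small safe sketch (hypothetical, not the crate's actual implementation): the trampoline state is an enum variant holding the arguments, rather than a closure written into a slot, so there is no `unsafe` and no vtable.

```rust
// Enum-based trampoline state for factorial: each iteration rewrites the
// "arguments" in place instead of writing a closure into a Slot.
enum Step {
    Next { accumulator: u64, input: u64 },
    Done(u64),
}

fn factorial(input: u64) -> u64 {
    let mut step = Step::Next { accumulator: 1, input };
    loop {
        step = match step {
            Step::Done(value) => return value,
            Step::Next { accumulator, input: 0 } => Step::Done(accumulator),
            Step::Next { accumulator, input } => Step::Next {
                accumulator: accumulator * input,
                input: input - 1,
            },
        };
    }
}

fn main() {
    assert_eq!(factorial(5), 120);
    assert_eq!(factorial(0), 1);
}
```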
The dynamic dispatch is what allows for mutual tail calls! By performing type-erasure on the closures, we are able to make a trampoline that can run multiple functions without having to know the full list in advance. I think this will work even across different crates (provided they are using the same version of `tailcall`).
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
I think the dynamic dispatch implementation is brilliant in its ability to perform tail call optimizations without knowing all the tail-called functions in advance. Though, I do not yet know what that would look like. Maybe we can find examples in loop-less languages like Erlang or OCaml.
The more important question is how the user would leverage this mechanism. I think most people would be scared away from the slot trick and some sort of macro is again needed.
On each iteration, the closure is read out from `slot` without being moved (`MaybeUninit::assume_init_read`)
The closure is copied out of the Slot by MaybeUninit::assume_init_read. I think this is required in order to make it safe to reuse the Slot on the next iteration (otherwise the closure would still be executing while the next closure is being written into the Slot).
the closure's fields are the values it "captures"
That is correct! I shouldn't have said "arguments" in my explanation. Note that for these closures though, the captures are always exactly the arguments of the function it is representing.
So, the closure calls should introduce the creation and destruction of a new call stack per recursive call, and stack-copying of all the closure's captured values twice (enum-based implementations would also do this once).
The closure call might be inlined by rustc, but that is unlikely due to the dynamic dispatch. Enum-based implementations are more likely to get inlined.
Yes, this is a good point. Another downside of using the closures is that they will probably introduce one new stack frame above the trampoline loop itself. I would be very impressed if the compiler was able to inline the virtual calls.
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
This is another fair point. I like this virtual dispatch solution because it feels more flexible to me (cross crate mutual recursion), but that flexibility comes at a -- I think very small -- cost. That said, if no one needs that flexibility in practice, why bother paying it at all? (Especially when it requires adding unsafe code to the crate.)
On the other hand, an enum-based implementation would have much more complex macros, and the current enum-based solution for a single function already has a few bugs (see #13 , #18, and #19).
It's not easy to take a function's arguments and represent them as enum variants in Rust, especially when generics or a receiver are involved (and remember all references are generic in at least their lifetime). :dizzy_face:
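To illustrate the pain point, here is a hypothetical sketch of encoding two functions' arguments as enum variants; a single reference argument forces a lifetime parameter onto the entire enum, and every new function means a new variant plus a new match arm.

```rust
// Each function's arguments become one variant of a shared state enum.
enum Args<'a> {
    Gcd { a: u64, b: u64 },
    CountChars { text: &'a str }, // one reference argument infects the enum
}

// One "step" of the trampoline: Ok(result) terminates, Err carries the
// next set of arguments to process.
fn step(args: Args<'_>) -> Result<u64, Args<'_>> {
    match args {
        Args::Gcd { a, b: 0 } => Ok(a),
        Args::Gcd { a, b } => Err(Args::Gcd { a: b, b: a % b }),
        Args::CountChars { text } => Ok(text.chars().count() as u64),
    }
}

fn main() {
    // Drive the gcd case through the step function by hand.
    let mut state = Args::Gcd { a: 48, b: 18 };
    let result = loop {
        match step(state) {
            Ok(value) => break value,
            Err(next) => state = next,
        }
    };
    assert_eq!(result, 6);

    match step(Args::CountChars { text: "abc" }) {
        Ok(n) => assert_eq!(n, 3),
        Err(_) => unreachable!(),
    }
}
```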
The more important question is how the user would leverage this mechanism. I think most people would be scared away from the slot trick and some sort of macro is again needed.
I've gone over the code a few times, and the tests now pass on Miri, so I am relatively confident that the runtime is sound. My next step will be to rewrite the macros. :smile:
The closure is copied out of the `Slot` by `MaybeUninit::assume_init_read`. I think this is required in order to make it safe to reuse the `Slot` on the next iteration (otherwise the closure would still be executing while the next closure is being written into the `Slot`).
I got this one wrong. It is again a stack-copy. If the closure creates a new call stack, then this stack copy is unnecessary, otherwise it is.
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
This is another fair point. I like this virtual dispatch solution because it feels more flexible to me (cross crate mutual recursion), but that flexibility comes at a -- I think very small -- cost. That said, if no one needs that flexibility in practice, why bother paying it at all? (Especially when it requires adding `unsafe` code to the crate.)
This is up to you. Though, I am fairly sure someone will use it if it exists 😆.
[…] and the current enum-based solution for a single function already has a few bugs (see #13 , #18, and #19).
The current implementation is not really "enum-based", right? It is just a tuple. And, those are bugs in the macro 🤦? This is the bigger problem. An implementation that does not work is not as good as one that is slower. Well, maybe I should try to learn proc macros and have a real look, although last time I failed to find any "guide" for that.
I've gone over the code a few times, and the tests now pass on Miri, so I am relatively confident that the runtime is sound. My next step will be to rewrite the macros. 😄
✌️