Add new Thunk-based runtime
:construction: This PR currently only contains updates to the runtime system. The macros still need to be updated in order for it to work properly. :construction:
This branch is an attempt to address #3 and #20 with a new Thunk-based trampoline runtime. This solution has been explored in libraries like trampoline-rs; however, there it is implemented in a way that requires heap allocation.
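For contrast, the heap-allocating approach can be sketched as a minimal boxed trampoline (the names here are illustrative, not trampoline-rs's actual API): each bounce boxes the next step on the heap, which is exactly the cost this runtime tries to avoid.

```rust
// A minimal heap-allocating trampoline in the style described above.
// Every recursive step allocates the next thunk with `Box::new`.
enum Action<T> {
    Done(T),
    Call(Box<dyn FnOnce() -> Action<T>>),
}

fn run<T>(mut action: Action<T>) -> T {
    loop {
        match action {
            Action::Done(value) => return value,
            Action::Call(thunk) => action = thunk(),
        }
    }
}

fn gcd(a: u64, b: u64) -> Action<u64> {
    if b == 0 {
        Action::Done(a)
    } else {
        // One heap allocation per recursive step.
        Action::Call(Box::new(move || gcd(b, a % b)))
    }
}

fn main() {
    assert_eq!(run(gcd(48, 18)), 6);
}
```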
In this runtime, the Thunk lives in a Slot which is allocated on the stack and re-used for each subsequent call. This avoids the need for heap allocation, but comes with a few downsides:
- Since Rust does not yet support unsized locals (DSTs by value), the size (and alignment) of the `Slot` must be conservative (i.e. large enough to accommodate the data of the biggest `Thunk`).
- This introduces some `unsafe` code into the runtime of the crate (it may be possible to extract this part out into its own crate, but I have yet to find a "stack box" crate that tracks lifetimes in the way that we need for tail recursion while allowing reuse of the slot).
- The usage of the crate will likely need to change to two macros rather than one (although this can be seen as a feature in light of #14).
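For illustration, a conservative `Slot` along these lines might look like the following sketch; the size and alignment constants are placeholder assumptions, not the crate's real values.

```rust
use core::mem::MaybeUninit;

// Hypothetical sketch: the concrete closure types are unknown here, so the
// slot reserves a conservative amount of suitably-aligned uninitialized space.
const SLOT_SIZE: usize = 64; // assumed upper bound on thunk data

#[repr(align(16))] // assumed worst-case alignment
struct Slot {
    bytes: [MaybeUninit<u8>; SLOT_SIZE],
}

impl Slot {
    fn new() -> Self {
        Slot { bytes: [MaybeUninit::uninit(); SLOT_SIZE] }
    }
}

fn main() {
    let slot = Slot::new();
    // Large and aligned enough for any thunk up to the chosen limits.
    assert_eq!(core::mem::size_of_val(&slot), SLOT_SIZE);
    assert_eq!(core::mem::align_of::<Slot>(), 16);
}
```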
As mentioned above, the macros in the crate are not yet updated. To see an example of the runtime in action (what the macros would need to emit), see the `factorial_in_new_runtime` test case.
At a high level, usage of this crate under the new runtime would change to two macros: one attribute to mark a function as available for TCO, and another at each call site where one would want to use/guarantee TCO:
```rust
#[tailcall]
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        tailcall::call! { gcd(b, a % b) }
    }
}
```
Macro Expansion Mockup
```rust
use tailcall::{slot, trampoline};

fn gcd(a: u64, b: u64) -> u64 {
    trampoline::run(move |slot| build_gcd_action(slot, a, b))
}

#[doc(hidden)]
#[inline(always)]
fn build_gcd_action<'slot>(slot: &'slot mut slot::Slot, a: u64, b: u64) -> trampoline::Action<'slot, u64> {
    trampoline::call(slot, move |slot| {
        if b == 0 {
            trampoline::done(slot, a)
        } else {
            build_gcd_action(slot, b, a % b)
        }
    })
}
```
This is something that I am working on in my free time, so it may take me a while to dig into rewriting the macros. In the meantime, I'm open to feedback, suggestions, and contributions from others who are interested in this problem.
To help me understand what is going on with this slot-thunk trick, let me expand factorial and inline all the calls:
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        trampoline::call(slot, move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        trampoline::run(move |slot| call_factorial_inner(slot, accumulator, input))
    }

    factorial_inner(1, input)
}
```
Intermediate inlining process.
⇒
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        trampoline::call(slot, move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        })
        // ⇒
        let fn_once = move |slot| {
            if input == 0 {
                return trampoline::done(slot, accumulator);
                // ⇒
                return Action::Done(accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        };
        trampoline::call(slot, fn_once);
        // ⇒
        Action::Call(Thunk::new_in(slot, fn_once));
        // ⇒
        let ptr = slot.put(fn_once);
        // ⇒
        let ptr = slot.cast().write(fn_once);
        // ⇒
        let slot_bytes: &mut MaybeUninit<_> =
            unsafe { &mut *slot.bytes.as_mut_ptr().cast() };
        let ptr = slot_bytes.write(fn_once);
        Action::Call(Thunk { ptr })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        trampoline::run(move |slot| call_factorial_inner(slot, accumulator, input))
        // ⇒
        let slot = &mut Slot::new();
        let mut action = (move |slot| call_factorial_inner(slot, accumulator, input))(slot);
        // ⇒
        let mut action = call_factorial_inner(slot, accumulator, input);
        loop {
            match action {
                Action::Done(value) => return value,
                Action::Call(thunk) => action = thunk.call(),
                // ⇒
                Action::Call(thunk) => {
                    let ptr: *mut dyn ThunkFn<'_, _> = thunk.ptr;
                    core::mem::forget(thunk);
                    action = unsafe { (*ptr).call_once_in_slot() };
                    // ⇒
                    action = unsafe {
                        let (fn_once, slot) = Slot::take(*ptr);
                        // ⇒
                        let in_slot: *mut _ = *ptr;
                        let slot: &mut Slot = &mut *in_slot.cast();
                        let fn_once = slot.cast().assume_init_read();
                        // ⇒
                        let slot_bytes_mut: &mut MaybeUninit<_> =
                            &mut *slot.bytes.as_mut_ptr().cast();
                        let fn_once = slot_bytes_mut.assume_init_read();
                        fn_once(slot)
                    };
                },
            }
        }
    }

    factorial_inner(1, input)
}
```
⇒
```rust
fn factorial(input: u64) -> u64 {
    fn call_factorial_inner<'slot>(slot: &'slot mut slot::Slot, accumulator: u64, input: u64) -> trampoline::Action<'slot, u64> {
        let fn_once = move |slot| {
            if input == 0 {
                return Action::Done(accumulator);
            }
            return call_factorial_inner(slot, accumulator * input, input - 1);
        };
        let slot_bytes: &mut MaybeUninit<_> =
            unsafe { &mut *slot.bytes.as_mut_ptr().cast() };
        let ptr = slot_bytes.write(fn_once);
        Action::Call(Thunk { ptr })
    }

    fn factorial_inner(accumulator: u64, input: u64) -> u64 {
        let slot = &mut Slot::new();
        let mut action = call_factorial_inner(slot, accumulator, input);
        loop {
            match action {
                Action::Done(value) => return value,
                Action::Call(thunk) => {
                    let ptr: *mut dyn ThunkFn<'_, _> = thunk.ptr;
                    core::mem::forget(thunk);
                    action = unsafe {
                        // This part does not really work because the type of
                        // `fn_once` cannot be written out.
                        // In the implementation, `dyn ThunkFn` is used to
                        // dynamically dispatch the call.
                        let in_slot: *mut _ = *ptr;
                        let slot: &mut Slot = &mut *in_slot.cast();
                        let slot_bytes_mut: &mut MaybeUninit<_> =
                            &mut *slot.bytes.as_mut_ptr().cast();
                        let fn_once = slot_bytes_mut.assume_init_read();
                        fn_once(slot)
                    };
                },
            }
        }
    }

    factorial_inner(1, input)
}
```
Code that compiles: https://github.com/SichangHe/alecdotninja--tailcall/commit/d8570c432684fc70968093d06f24214675a2a284
If I try to explain this:
- Allocate a `Slot` on the stack to store the `fn_once` closure.
    - `slot`'s size needs to be the size of `fn_once` (Edit: I checked by printing it out. It's 16. I guess it's the closure size plus the vtable pointer size)? …which is pointer size in this case because `fn_once` captures `&mut slot`.
    - `slot` is both mutably borrowed by `fn_once` and used to write it, then how is the borrow checker happy 😵 (yes, it compiles)?
- Write `fn_once` into `slot`, and then store the resulting pointer as a `dyn ThunkFn` in a `Thunk` so that `fn_once` can be read out and called later.
    - In this example, the same closure (although re-created each time) is written into `slot` again and again (`MaybeUninit::write`). Can this be optimized?
    - `ThunkFn` is the key that lets `fn_once` be read out and called later, but it also introduces dynamic dispatch.
        - This means the same `Thunk` can store various `fn_once`s, but how to use that for mutual tail calls?
    - The point of `slot` is to store `fn_once` on the stack, but couldn't we borrow `fn_once` and store it on the caller's stack?
- Iterate on `action`, grab out the inner `fn_once` using `slot`, and call it if needed.
    - Does `assume_init_read` unnecessarily copy the content of `slot`?
@SichangHe I've refactored a little and added some safety comments which should hopefully make this a little more clear.
I need to add more documentation to the types but here is a brief summary:
- `Slot` - a place where values can be written. Importantly, it is safe to transmute between `&mut T` and `&mut Slot` (provided that a `T` has been written into the `Slot`).
- `Thunk<'slot, T>` - an owned closure returning a `T` that is stored in a `Slot` (basically `Box<dyn FnOnce() -> T>` but allocated on the stack, with a `'slot` lifetime for the data instead of `'static`).
slot is both mutably borrowed by fn_once and used to write it, then how is the borrow checker happy 😵 (yes, it compiles)?
Well, unsafe code is used, so this isn't fully borrow-checked. :sweat_smile: That said, I am fairly confident it is correct. Notice how in the code, there are never two &mut pointers to the same bytes at the same time. The &mut Slot pointer is reinterpreted as different concrete function types, but there is no aliasing of mutable values.
This is why the closures are passed a &mut Slot. That unique value moves through the trampoline and acts as a type-level guard that only one function is using the Slot at a time.
In this example, the same closure (although re-created each time) is written into slot again and again (MaybeUninit::write). Can this be optimized?
I don't think so, but I might not be fully understanding what you are saying.
The information about the state of the trampoline has to be stored somewhere. In the current enum-based implementation, we write to `input` "again and again". Keep in mind that the data being written into the `Slot` is not machine-code, it is just the data required to call the closure (which in my testing is the equivalent of the arguments to the function).
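A quick way to see this: the size of such a closure is just the size of its captured data. This is a hypothetical sketch; the exact layout of closures is not guaranteed by the compiler, but in practice rustc reports 16 bytes for two captured `u64`s.

```rust
// The bytes stored for a closure are its captures, not machine code.
fn main() {
    let accumulator: u64 = 1;
    let input: u64 = 5;
    // Captures two u64s by value; layout is unspecified but is 16 bytes
    // in practice on rustc.
    let fn_once = move || accumulator * input;
    assert_eq!(core::mem::size_of_val(&fn_once), 16);
}
```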
ThunkFn is the key that the fn_once can be read out and called later, but it also introduces dynamic dispatch.
This means a same Thunk can store various fn_once, but how to use it for mutual tail calls?
The dynamic dispatch is what allows for mutual tail calls! By performing type-erasure on the closures, we are able to make a trampoline that can run multiple functions without having to know the full list in advance. I think this will work even across different crates (provided they are using the same version of tailcall).
Mutual Recursion Example
```rust
#[tailcall]
fn is_even(x: u128) -> bool {
    if x > 0 {
        tailcall::call! { is_odd(x - 1) }
    } else {
        true
    }
}

#[tailcall]
fn is_odd(x: u128) -> bool {
    if x > 0 {
        tailcall::call! { is_even(x - 1) }
    } else {
        false
    }
}
```
Mutual Recursion Example (Expanded)
```rust
use tailcall::{slot, trampoline};

fn is_even(x: u128) -> bool {
    trampoline::run(move |slot| build_is_even_action(slot, x))
}

#[doc(hidden)]
#[inline(always)]
fn build_is_even_action<'slot>(slot: &'slot mut slot::Slot, x: u128) -> trampoline::Action<'slot, bool> {
    trampoline::call(slot, move |slot| {
        if x > 0 {
            build_is_odd_action(slot, x - 1)
        } else {
            trampoline::done(slot, true)
        }
    })
}

fn is_odd(x: u128) -> bool {
    trampoline::run(move |slot| build_is_odd_action(slot, x))
}

#[doc(hidden)]
#[inline(always)]
fn build_is_odd_action<'slot>(slot: &'slot mut slot::Slot, x: u128) -> trampoline::Action<'slot, bool> {
    trampoline::call(slot, move |slot| {
        if x > 0 {
            build_is_even_action(slot, x - 1)
        } else {
            trampoline::done(slot, false)
        }
    })
}
```
Does assume_init_read unnecessarily copy the content of slot?
I don't think so. We need to move the value out of the Slot in order to be able to re-use it for the next iteration anyway. The only alternative I can think of would be to alternate between two slots.
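The move-out-then-reuse pattern can be seen in miniature with `MaybeUninit` alone; this simplified sketch uses a plain array in place of a closure.

```rust
use core::mem::MaybeUninit;

fn main() {
    let mut storage: MaybeUninit<[u64; 2]> = MaybeUninit::uninit();

    storage.write([3, 4]);
    // `assume_init_read` bitwise-copies the value out, leaving the
    // storage free to be written again for the next iteration.
    let first = unsafe { storage.assume_init_read() };

    storage.write([5, 6]); // reuse the same storage
    let second = unsafe { storage.assume_init_read() };

    assert_eq!(first, [3, 4]);
    assert_eq!(second, [5, 6]);
}
```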
Well, unsafe code is used, so this isn't fully borrow-checked. 😅
Okay… Learning about the borrow checker all the time.
That said, I am fairly confident it is correct[…]
Yes, I also think it follows ownership rules.
The information about the state of the trampoline has to be stored somewhere. In the current enum-based implementation, we write to `input` "again and again". Keep in mind that the data being written into the `Slot` is not machine-code, it is just the data required to call the closure (which in my testing is the equivalent of the arguments to the function).
Okay, now I get it.
- The `slot` stores a closure (before we are done).
- On each iteration, the closure is read out from `slot` without being moved (`MaybeUninit::assume_init_read`), then called with `slot` as its argument; finally, the closure call's output (which can again be a closure, even of the same type) is written back to `slot`. All copies are necessary.
    - Since a closure is an anonymous struct implementing `FnOnce` and other function traits, and the closure's fields are the values it "captures" (which could be the arguments), calling the closure should stack-copy all its fields onto a new call stack, and stack-copy the returned output back to `slot`.
    - So, the closure calls should introduce the creation and destruction of a new call stack per recursive call, and stack-copying of all the closure's captured values twice (enum-based implementations would also do this once). Maybe validate this with Godbolt.
    - Intuitively, this is more overhead than the current single-tail-call implementation, which only writes the necessary value changes onto the same stack. But the overhead is small, because stack copies are cheap and the function call jumps usually hit the cache.
    - The closure call might be inlined by rustc, but that is unlikely due to the dynamic dispatch. Enum-based implementations are more likely to get inlined.
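For comparison, the enum-based style discussed here can be written as a small safe sketch (hypothetical, not the crate's actual implementation): the trampoline state is an enum variant holding the arguments, rather than a closure written into a slot, so there is no `unsafe` and no vtable.

```rust
// Enum-based trampoline state for factorial: each iteration rewrites the
// "arguments" in place instead of writing a closure into a Slot.
enum Step {
    Next { accumulator: u64, input: u64 },
    Done(u64),
}

fn factorial(input: u64) -> u64 {
    let mut step = Step::Next { accumulator: 1, input };
    loop {
        step = match step {
            Step::Done(value) => return value,
            Step::Next { accumulator, input: 0 } => Step::Done(accumulator),
            Step::Next { accumulator, input } => Step::Next {
                accumulator: accumulator * input,
                input: input - 1,
            },
        };
    }
}

fn main() {
    assert_eq!(factorial(5), 120);
    assert_eq!(factorial(0), 1);
}
```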
The dynamic dispatch is what allows for mutual tail calls! By performing type-erasure on the closures, we are able to make a trampoline that can run multiple functions without having to know the full list in advance. I think this will work even across different crates (provided they are using the same version of `tailcall`).
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
I think the dynamic dispatch implementation is brilliant in its ability to perform tail call optimizations without knowing all the tail-called functions in advance. Though, I do not yet know what that would look like. Maybe we can find examples in loop-less languages like Erlang or OCaml.
The more important question is how the user would leverage this mechanism. I think most people would be scared away from the slot trick and some sort of macro is again needed.
On each iteration, the closure is read out from `slot` without being moved (`MaybeUninit::assume_init_read`)
The closure is copied out of the Slot by MaybeUninit::assume_init_read. I think this is required in order to make it safe to reuse the Slot on the next iteration (otherwise the closure would still be executing while the next closure is being written into the Slot).
the closure's fields are the values it "captures"
That is correct! I shouldn't have said "arguments" in my explanation. Note that for these closures though, the captures are always exactly the arguments of the function it is representing.
So, the closure calls should introduce the creation and destruction of a new call stack per recursive call, and stack-copying of all the closure's captured values twice (enum-based implementations would also do this once).
The closure call might be inlined by rustc, but that is unlikely due to the dynamic dispatch. Enum-based implementations are more likely to get inlined.
Yes, this is a good point. Another downside of using the closures is that they will probably introduce one new stack frame above the trampoline loop itself. I would be very impressed if the compiler was able to inline the virtual calls.
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
This is another fair point. I like this virtual dispatch solution because it feels more flexible to me (cross crate mutual recursion), but that flexibility comes at a -- I think very small -- cost. That said, if no one needs that flexibility in practice, why bother paying it at all? (Especially when it requires adding unsafe code to the crate.)
On the other hand, an enum-based implementation would have much more complex macros, and the current enum-based solution for a single function already has a few bugs (see #13 , #18, and #19).
It's not easy to take a function's arguments and represent them as enum variants in Rust, especially when generics or a receiver are involved (and remember all references are generic in at least their lifetime). :dizzy_face:
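To illustrate the pain point, here is a hypothetical sketch of encoding two functions' arguments as enum variants; a single reference argument forces a lifetime parameter onto the entire enum, and every new function means a new variant plus a new match arm.

```rust
// Each function's arguments become one variant of a shared state enum.
enum Args<'a> {
    Gcd { a: u64, b: u64 },
    CountChars { text: &'a str }, // one reference argument infects the enum
}

// One "step" of the trampoline: Ok(result) terminates, Err carries the
// next set of arguments to process.
fn step(args: Args<'_>) -> Result<u64, Args<'_>> {
    match args {
        Args::Gcd { a, b: 0 } => Ok(a),
        Args::Gcd { a, b } => Err(Args::Gcd { a: b, b: a % b }),
        Args::CountChars { text } => Ok(text.chars().count() as u64),
    }
}

fn main() {
    // Drive the gcd case through the step function by hand.
    let mut state = Args::Gcd { a: 48, b: 18 };
    let result = loop {
        match step(state) {
            Ok(value) => break value,
            Err(next) => state = next,
        }
    };
    assert_eq!(result, 6);

    match step(Args::CountChars { text: "abc" }) {
        Ok(n) => assert_eq!(n, 3),
        Err(_) => unreachable!(),
    }
}
```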
The more important question is how the user would leverage this mechanism. I think most people would be scared away from the slot trick and some sort of macro is again needed.
I've gone over the code a few times, and the tests now pass on Miri, so I am relatively confident that the runtime is sound. My next step will be to rewrite the macros. :smile:
The closure is copied out of the `Slot` by `MaybeUninit::assume_init_read`. I think this is required in order to make it safe to reuse the `Slot` on the next iteration (otherwise the closure would still be executing while the next closure is being written into the `Slot`).
I got this one wrong. It is again a stack-copy. If the closure creates a new call stack, then this stack copy is unnecessary, otherwise it is.
Most of my use cases of TCO in Rust have been text parsing or tree walking where I know all the cases, so the dynamic dispatch would be unnecessary pointer redirections there compared to enum-based implementations.
This is another fair point. I like this virtual dispatch solution because it feels more flexible to me (cross crate mutual recursion), but that flexibility comes at a -- I think very small -- cost. That said, if no one needs that flexibility in practice, why bother paying it at all? (Especially when it requires adding `unsafe` code to the crate.)
This is up to you. Though, I am fairly sure someone will use it if it exists 😆.
[…] and the current enum-based solution for a single function already has a few bugs (see #13 , #18, and #19).
The current implementation is not really "enum-based", right? It is just a tuple. And, those are bugs in the macro 🤦? This is the bigger problem. An implementation that does not work is not as good as one that is slower. Well, maybe I should try to learn proc macros and have a real look, although last time I failed to find any "guide" for that.
I've gone over the code a few times, and the tests now pass on Miri, so I am relatively confident that the runtime is sound. My next step will be to rewrite the macros. 😄
✌️