llvm-project Clang fails to apply HALO when coroutine is destroyed from await

Here is a small example of a lazy-start coroutine promise, with a coroutine Bar that calls another coroutine (Baz) that can be inlined, where the inlined coroutine calls yet another one Qux that cannot.

#include <coroutine>

struct MyTask{
  struct promise_type {
    MyTask get_return_object() { return {std::coroutine_handle<promise_type>::from_promise(*this)}; }
    std::suspend_always initial_suspend() { return {}; }

    void unhandled_exception();

    void return_void() {} 

    auto await_transform(MyTask task) {
      struct Awaiter {
        bool await_ready() { return false; }
        std::coroutine_handle<promise_type> await_suspend(std::coroutine_handle<promise_type> h) {
          caller.resume_when_done = h;
          return std::coroutine_handle<promise_type>::from_promise(callee);
        }

        void await_resume() {
          std::coroutine_handle<promise_type>::from_promise(callee).destroy();
        }

        promise_type& caller;
        promise_type& callee;
      };

      return Awaiter{*this, task.handle.promise()};
    }
    
    auto final_suspend() noexcept {
      struct Awaiter {
        bool await_ready() noexcept { return false; }
        std::coroutine_handle<promise_type> await_suspend(std::coroutine_handle<promise_type> h) noexcept {
          return to_resume;
        }

        void await_resume() noexcept;

        std::coroutine_handle<promise_type> to_resume;
      };

      return Awaiter{resume_when_done};
    }

    // The coroutine to resume when we're done.
    std::coroutine_handle<promise_type> resume_when_done;
  };


  // A handle for the coroutine that returned this task.
  std::coroutine_handle<promise_type> handle;
};

MyTask __attribute__((noinline)) Qux() { co_return; }

MyTask Baz() { co_await Qux(); }

MyTask __attribute__((noinline)) Bar() { co_await Baz(); }

The awaited task's coroutine handle is destroyed immediately upon resumption in await_resume, so it should be possible to apply HALO and elide the allocation of the coroutine frame for Baz in Bar: the frame is both created and destroyed within Bar. But when compiled with -std=c++20 -O2 -fno-exceptions (compiler explorer), clang fails to do this:

Bar() [clone .resume]:                         # @Bar() [clone .resume]
        push    r14
        push    rbx
        push    rax
        mov     rbx, rdi
        cmp     byte ptr [rdi + 48], 0
        je      .LBB7_1
        mov     rdi, qword ptr [rbx + 32]
        call    operator delete(void*)@PLT
        mov     rdi, qword ptr [rbx + 16]
        mov     qword ptr [rbx + 40], rdi
        mov     qword ptr [rbx + 24], rdi
        mov     qword ptr [rbx], 0
        add     rsp, 8
        pop     rbx
        pop     r14
        jmp     qword ptr [rdi]                 # TAILCALL
.LBB7_1:
        mov     edi, 56
        call    operator new(unsigned long)@PLT
        mov     r14, rax
        mov     qword ptr [rbx + 32], rax
        lea     rax, [rip + Baz() [clone .resume]]
        mov     qword ptr [r14], rax
        lea     rax, [rip + Baz() [clone .destroy]]
        mov     qword ptr [r14 + 8], rax
        mov     qword ptr [r14 + 16], 0
        mov     byte ptr [r14 + 48], 0
        mov     byte ptr [rbx + 48], 1
        mov     qword ptr [rbx + 16], rbx
        call    Qux()
        mov     qword ptr [r14 + 32], rax
        mov     byte ptr [r14 + 48], 1
        mov     qword ptr [r14 + 16], r14
        mov     rdi, rax
        add     rsp, 8
        pop     rbx
        pop     r14
        jmp     qword ptr [rax]                 # TAILCALL

If on the other hand we make the minimal change so that the callee's handle is destroyed in ~MyTask (compiler explorer) HALO is applied and there are no allocations within Bar.resume.

Can clang be taught to apply HALO when the awaited coroutine's frame is destroyed in the awaiter's await_resume method?

If you are interested in why I don't want to destroy the handle in ~MyTask, it's because it causes much worse code to be generated for the frame destroy function. For example, here's what Bar.destroy looks like in my original example where I destroy the handle in await_resume:

jmp     operator delete(void*)@PLT                      # TAILCALL

But when the the handle is destroyed in ~MyTask there is a ton of code generated:

        push    rbx
        mov     rbx, rdi
        cmp     qword ptr [rdi], 0
        je      .LBB9_5
        cmp     byte ptr [rbx + 96], 0
        je      .LBB9_5
        cmp     qword ptr [rbx + 24], 0
        je      .LBB9_5
        cmp     byte ptr [rbx + 72], 0
        je      .LBB9_5
        mov     rdi, qword ptr [rbx + 56]
        call    qword ptr [rdi + 8]
.LBB9_5:
        mov     rdi, rbx
        pop     rbx
        jmp     operator delete(void*)@PLT                      # TAILCALL

I believe clang is forced to do this, since it must do different cleanup depending on where Bar is suspended when it is destroyed. But my library doesn't support destroying it anywhere but the final suspend point, so all of this code is unnecessary and we really do just want a tail call to delete.

Aug 06 '22 09:08 jacobsa

@llvm/issue-subscribers-coroutines

Aug 06 '22 14:08 llvmbot

Regarding the footnote about overly verbose code generation in the destroy function: I realized that this is not specific to ~Co, but is true of any object with a non-trivial destructor in the coroutine frame, and filed #56980 to cover that. So please forget that here.

I still think that HALO could apply in the case of the program above, and it's unfortunate that the optimization seems a little brittle. I wonder if it is specifically tied to destructors, despite the fact that await_resume happens before the destructor is run, and that is the problem?

Aug 07 '22 02:08 jacobsa

For this case, the reason why HALO fails to perform is that the compiler can't be sure the coroutine frame is guaranteed to be resumed. The key condition for HALO is that the lifetime of the coroutine frame is covered by the current function. However, it is pretty hard to do precise analysis in acceptable time. So it may be common that there are cases the programmer feel like it is OK to do coroutine elision but the compiler refuses to.

To solve the problem, I tried to add https://wg21.link/p2594r0 to the std. Besides that, there looks other solution which is based on attributes like:

__attribute__((always_elide))
MyTask Baz() { co_await Qux(); }

MyTask __attribute__((noinline)) Bar() { co_await Baz(); }

or

MyTask Baz() { co_await Qux(); }

MyTask __attribute__((noinline)) Bar() {
    [[always_elide]] co_await Baz();
}

Aug 16 '22 06:08 ChuanqiXu9

Thanks Chuanqi. Did anything come of your presentation?

I guess the problem with my example is that the compiler doesn't know if await_resume will actually be run—I didn't consider that before. In the case of my library it definitely will, because it's not possible to destroy coroutines early. So I suppose this is related to #56980.

Aug 16 '22 09:08 jacobsa

Thanks Chuanqi. Did anything come of your presentation?

Not yet. I don't have enough time that day.

Compiler are generally not good at IPO (Inter Procedural Optimization) and many things trivial to programmers are not trivial for compilers.

Aug 16 '22 09:08 ChuanqiXu9

FYI this is still an issue for me.

I had worked around this for a while by ensuring the coroutine's handle was destroyed in the destructor of the awaitable returned by await_transform when awaiting it, which seems to be what clang wants for HALO. But this requires a chain of coroutines (Foo awaits Bar awaits Baz) that is destroyed early due to cancellation to be destroyed in caller-before-callee order, relying on the awaitable's destructor to make this work out. But that is bad because if there is any other destructor in the coroutine frame that needs to run after the awaitable's, clang can't use tail calls and you wind up with a stack overflow if the chain of coroutines is long.

Instead I am now back to destroying the callee from await_resume when the caller is resumed, with cancellation taking a separate path that destroys a callee's coroutine and then the caller's coroutine, avoiding stack overflow even for unbounded chains of calls. But it means I no longer get HALO. (Perhaps it's actually impossible to have HALO with this arrangement, which is an unfortunate trade-off if true.)

Dec 27 '22 08:12 jacobsa

(Sorry, I don't mean to derail the thread, just to justify why I want to destroy in await_resume. If anybody has better ideas for avoiding stack overflow when destroying a long chain of suspended coroutines I'd love to hear them.)

Dec 27 '22 08:12 jacobsa

Does anyone have an example of a simple task coroutine (no generator) where clang does indeed do HALO? Not an example so simple that the whole code is replaced by a return of the result? I have not been able to create such a case.

Apr 23 '23 15:04 FunMiles

I have the same experience as FunMiles. I adapted the example from https://github.com/llvm/llvm-project/issues/56262 to compile in clang-17: https://godbolt.org/z/xWh9ssTb8

Without the std::cout line in a generator body, there are no allocations, but only because the entire coroutine chain is optimized out.

Just adding a log line inside the outermost generator causes all the nested generators to become dynamically-allocated as well. Putting the log line in the innermost generator (seq) triggers no allocations.

As for non-generator tasks, logging seems to inhibit HALO even for trivial coroutine implementations and functions that don't suspend: https://godbolt.org/z/3Pv5We4Kc

Jun 19 '23 17:06 cstanfill

Let's track the issue in https://github.com/llvm/llvm-project/issues/64586.

Aug 10 '23 09:08 ChuanqiXu9

llvm-project
llvm-project copied to clipboard

Clang fails to apply HALO when coroutine is destroyed from await_resume

llvm-project llvm-project copied to clipboard

Clang fails to apply HALO when coroutine is destroyed from await_resume

llvm-project
llvm-project copied to clipboard