Crash when resuming coroutine

Open Spacechild1 opened this issue 6 months ago • 4 comments

Context: I've built a scheduler that takes Lua functions and dispatches them at specified time points. The functions can be wrapped coroutines.

I've tried to implement concurrency primitives like semaphores or condition variables. The idea is that the current task (= wrapped coroutine) puts itself on a wait list and yields back to the scheduler. Another task may then take a task from the wait list and resume it (= signalling).

Now, this works fine if the signalling task is started outside the waiting task. But when I schedule it from within the waiting task, I get a segfault when it tries to resume the waiting task.

Here's a minimal code example, tested with c1f95a773c6f8f4fde8ca3efe872e7286afe4444:

#include <sol/sol.hpp>

#include <iostream>
#include <vector>

std::vector<sol::function> schedQueue;

void sched(sol::function task)
{
    std::cout << "sched\n";
    schedQueue.push_back(std::move(task));
}

int main(int argc, const char** argv)
{
    sol::state state;
    state.open_libraries();

    state["sched"] = sched;

#if 1
    // this crashes!
    state.script(
R"(
sched(coroutine.wrap(function(task)
    print("current task:", task)
    -- schedule waiting task to be resumed *within itself*
    sched(task)

    print("yield")
    coroutine.yield()
    print("resumed")
end))
)");

#else
    // but this works!
    state.script(
R"(
local task = coroutine.wrap(function(task)
    print("current task:", task)

    print("yield")
    coroutine.yield()
    print("resumed")
end)

sched(task)
-- schedule waiting task to be resumed *outside itself*
sched(task)
)");
#endif

    std::cout << "start scheduler:\n";

    while (!schedQueue.empty()) {
        std::cout << "dispatch task\n";
        auto task = schedQueue.front();
        schedQueue.erase(schedQueue.begin());
        if (sol::protected_function_result result = task(task); !result.valid()) {
            sol::error err = result;
            std::cout << "ERROR: " << err.what() << "\n";
            return 1;
        }
    }

    std::cout << "finished scheduler\n";

    return 0;
}

Here's the console output and stacktrace (GCC 14.1 Msys2):

sched
start scheduler:
dispatch task
current task:   function: 0000000000debef0
sched
yield
dispatch task

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff7b0ae800f in resume (L=0xdebb38, ud=0x5ff8d0) at C:/Repos/game-over/deps/lua/src/ldo.c:636
636             n = (*ci->u.c.k)(L, LUA_YIELD, ci->u.c.ctx); /* call continuation */
(gdb) bt
#0  0x00007ff7b0ae800f in resume (L=0xdebb38, ud=0x5ff8d0)
    at C:/Repos/game-over/deps/lua/src/ldo.c:636
#1  0x00007ff7b0ae6bb4 in luaD_rawrunprotected (L=0xdebb38,
    f=0x7ff7b0ae7f1c <resume(lua_State*, void*)>, ud=0x5ff8d0)
    at C:/Repos/game-over/deps/lua/src/ldo.c:142
#2  0x00007ff7b0ae8170 in lua_resume (L=0xdebb38, from=0xdebb38, nargs=1)
    at C:/Repos/game-over/deps/lua/src/ldo.c:664
#3  0x00007ff7b0b04d6e in auxresume (L=0xdebb38, co=0xdebb38, narg=1)
    at C:/Repos/game-over/deps/lua/src/lcorolib.c:39
#4  0x00007ff7b0b04efe in luaB_auxwrap (L=0xdebb38)
    at C:/Repos/game-over/deps/lua/src/lcorolib.c:76
#5  0x00007ff7b0ae77f2 in luaD_precall (L=0xdebb38, func=0xdebc90, nresults=-1)
    at C:/Repos/game-over/deps/lua/src/ldo.c:434
#6  0x00007ff7b0ae7b68 in luaD_call (L=0xdebb38, func=0xdebc90, nResults=-1)
    at C:/Repos/game-over/deps/lua/src/ldo.c:498
#7  0x00007ff7b0ae7be0 in luaD_callnoyield (L=0xdebb38, func=0xdebc90, nResults=-1)
    at C:/Repos/game-over/deps/lua/src/ldo.c:509
#8  0x00007ff7b0ae4157 in f_call (L=0xdebb38, ud=0x5ffb40)
    at C:/Repos/game-over/deps/lua/src/lapi.c:943
#9  0x00007ff7b0ae6bb4 in luaD_rawrunprotected (L=0xdebb38,
    f=0x7ff7b0ae4122 <f_call(lua_State*, void*)>, ud=0x5ffb40)
    at C:/Repos/game-over/deps/lua/src/ldo.c:142
#10 0x00007ff7b0ae83e5 in luaD_pcall (L=0xdebb38, func=0x7ff7b0ae4122 <f_call(lua_State*, void*)>,
    u=0x5ffb40, old_top=80, ef=64) at C:/Repos/game-over/deps/lua/src/ldo.c:729
#11 0x00007ff7b0ae4223 in lua_pcallk (L=0xdebb38, nargs=1, nresults=-1, errfunc=1, ctx=0, k=0x0)
    at C:/Repos/game-over/deps/lua/src/lapi.c:969
#12 0x00007ff7b0b2725e in sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >::luacall<true> (this=0x5ffda0, argcount=1, result_count_=-1, h=...)
    at C:/Repos/game-over/deps/sol2/include/sol/protected_function.hpp:315
#13 0x00007ff7b0b26e8a in sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >::invoke<true>(sol::types<>, std::integer_sequence<unsigned long long>, long long, sol::detail::protected_handler<true, sol::basic_reference<false> >&) const (this=0x5ffda0,
    n=1, h=...) at C:/Repos/game-over/deps/sol2/include/sol/protected_function.hpp:346
#14 0x00007ff7b0b26919 in sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >::call<, sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >&>(sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >&) const (this=0x5ffda0)
    at C:/Repos/game-over/deps/sol2/include/sol/protected_function.hpp:229
#15 0x00007ff7b0b272a6 in sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >::operator()<sol::basic_protected_function<sol::basic_reference<false>, false, sol::basic_reference<false> >&> (this=0x5ffda0)
    at C:/Repos/game-over/deps/sol2/include/sol/protected_function.hpp:213
#16 0x00007ff7b0ae168a in main (argc=1, argv=0xde2340)
    at C:/Repos/game-over/tests/test-lua-coroutine-crash2.cpp:43

The stacktrace is basically the same as in my actual code base.

Note that sched only stores the task (sol::function) in a std::vector for later execution. I don't really understand why it should make a difference where sched is called.

Spacechild1, Jul 03 '25 22:07

Interestingly, it works when I use my own coroutine wrapper (in C++) instead of coroutine.wrap:

#include <sol/sol.hpp>

#include <iostream>
#include <vector>

class Coro {
public:
    Coro(sol::function fn)
    {
        thread_ = sol::thread::create(fn.lua_state());
        coro_ = sol::coroutine(thread_.state(), fn);
    }

    sol::protected_function_result operator()(Coro& coro)
    {
        return coro_.call(coro);
    }

private:
    sol::thread thread_;
    sol::coroutine coro_;
};

std::vector<Coro> schedQueue;
// sol::function currentTask;

void sched(Coro coro)
{
    std::cout << "sched\n";
    schedQueue.push_back(std::move(coro));
}

int main(int argc, const char** argv)
{
    sol::state state;
    state.open_libraries();

    state["sched"] = sched;
    // state["currentTask"] = &currentTask;

    state.new_usertype<Coro>(
        "Coro", sol::call_constructor,
        sol::constructors<void(sol::function)>{});

    state.script(
R"(
sched(Coro(function(task)
    print("current task:", task)
    -- schedule waiting task to be resumed *within itself*
    sched(task)

    print("yield")
    coroutine.yield()
    print("resumed")
end))
)");

    std::cout << "start scheduler:\n";

    while (!schedQueue.empty()) {
        std::cout << "dispatch task\n";
        auto task = schedQueue.front();
        schedQueue.erase(schedQueue.begin());
        // currentTask = task;
        if (sol::protected_function_result result = task(task); !result.valid()) {
            sol::error err = result;
            std::cout << "ERROR: " << err.what() << "\n";
            return 1;
        }
    }

    std::cout << "finished scheduler\n";

    return 0;
}

Spacechild1, Jul 04 '25 00:07

See item 2 in this comment: https://github.com/ThePhD/sol2/issues/890#issuecomment-552064783

Essentially, the issue is that coroutines have their own lua_State and their own registry. Any sol references (object/function/userdata/thread/whatever) passed to C++ from the coroutine will be associated with that state, as will sol::this_state. When the coroutine dies/finishes, that state and all the references inside its stack and registry are deleted. Your schedQueue is then full of dangling references to Lua values (sol::function/sol::coroutine) that no longer exist, and very bad things may happen when those are accessed or destroyed.

The simple workaround for this problem is to use the main_ variations of these sol types, which are always stored in Lua's "main thread" and won't get invalidated by script activity. If you use a lot of coroutines, you should audit your code to switch over to these types; err toward storing any long-term references to Lua values in this way.

EvanBalster, Nov 20 '25 02:11

Oh, wow, thank you so much! I knew that each coroutine has its own stack (at least it should have if you need independent execution), but I didn't know that it also has its own registry! In hindsight this totally makes sense.

As you've mentioned in https://github.com/ThePhD/sol2/issues/890#issuecomment-552064783, this is a huge pitfall.

To be fair, it is explained in https://sol2.readthedocs.io/en/latest/threading.html?highlight=main_object#working-with-multiple-lua-threads and https://sol2.readthedocs.io/en/latest/api/reference.html. However, I didn't really understand what any of this meant before your comment, which indicates that the documentation could be improved :) Would you open an issue about that? Or should I do that?

Quoting EvanBalster: "The simple workaround for this problem is to use the main_ variations of these sol types, which are always stored in Lua's 'main thread' and won't get invalidated by script activity."

Another possibility mentioned in https://sol2.readthedocs.io/en/latest/threading.html?highlight=main_object#working-with-multiple-lua-threads is to use the xmove constructor to pin the reference to a particular Lua thread. In my case that would be:

state["sched"] = [&state](sol::function task) {
    std::cout << "sched\n";
    schedQueue.push_back(sol::function(state, task));
};

Spacechild1, Nov 20 '25 09:11

I'll go ahead and open an issue.

EvanBalster, Nov 22 '25 02:11