Sol needs a coroutine safety guide
Following from the discussion in https://github.com/ThePhD/sol2/issues/890#issuecomment-3555487828 and https://github.com/ThePhD/sol2/issues/1711#issuecomment-3556723670 ...
Sol's tutorials depict an approach to writing script bindings that can lead to heap corruption if those bindings are invoked from a coroutine, even if the coroutine is created and operated by scripts. This can come as a nasty surprise to library users who neglect to read the documentation subpages about threads because they aren't using sol's coroutine API.
The technical problem as I understand it (take this with a grain of salt): heap corruption can occur whether the coroutine is created and operated by scripts or by the host app, so long as the host app either stores `sol::reference`-based values somewhere in its own data model or (worse yet) passes native data types containing any type of `sol::reference` back to a script. Sol bindings written in the 'standard style' will tend to use the `lua_State` of the coroutine that invoked them, and create these references in a non-main registry. When the coroutine completes, that registry may be deleted or recycled, causing all of these references to go out of scope. This invalidates those `sol::reference`s; what happens next is undefined behavior, but it typically corrupts a heap or freelist, causing a crash with unpredictable timing deep in the runtime.
This is a major pitfall, responsible for a number of issues on the repo. I think it could be resolved by adding a prominent item to Sol's table of contents with a title like "how to write coroutine-safe code".
This could either lead to an updated version of the thread page or to a new page written as a guide. Ideally it should explain when and why types like main_reference and this_main_state are necessary without venturing into topics like CPU threading.
"Heap corruption" might be a little too far. What you create are dangling references (and using them is a use-after-free). In my opinion, the page should include:
- Each coroutine has its own Lua state
- References (e.g. tables or functions) are stored in the "current" Lua state.
- References prefixed with `main_` are stored in the main Lua state.
- If a reference is used across Lua invocations (i.e. create reference → Lua executes → use reference), you'd almost always want the `main_` variant. This assumes you're running unknown code.
- Examples of where you can accidentally create dangling references.
- Examples of how to do it correctly.
- Mention that creating usertypes should almost always use a `main_state` (because these might create references that should be visible everywhere).
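
The anchoring rule in the bullets above can be pictured with a toy model. This is not sol2 code: `ToyThread`, `ToyRef`, `ToyMainRef`, and `toy_main_of` are hypothetical names invented for illustration, and `std::weak_ptr` stands in for the raw `lua_State*` that a real reference holds, so the dead state can be detected instead of triggering undefined behavior. A minimal sketch, assuming a reference is only usable while the thread it is anchored to is alive:

```cpp
#include <memory>
#include <string>

// Toy stand-in for a lua_State: every thread can find its main thread.
struct ToyThread {
    std::string name;
    std::weak_ptr<ToyThread> main;  // left empty for the main thread itself
};

// Hypothetical helper: resolve any thread to its main thread.
inline std::shared_ptr<ToyThread> toy_main_of(const std::shared_ptr<ToyThread>& t) {
    auto m = t->main.lock();
    return m ? m : t;
}

// Models a plain reference: anchored to whichever thread created it.
struct ToyRef {
    std::weak_ptr<ToyThread> anchor;
    explicit ToyRef(const std::shared_ptr<ToyThread>& creator) : anchor(creator) {}
    bool usable() const { return !anchor.expired(); }
};

// Models a main_ reference: re-anchored to the main thread when constructed.
struct ToyMainRef {
    std::weak_ptr<ToyThread> anchor;
    explicit ToyMainRef(const std::shared_ptr<ToyThread>& creator)
        : anchor(toy_main_of(creator)) {}
    bool usable() const { return !anchor.expired(); }
};

struct DemoResult {
    bool plain_usable;
    bool main_usable;
};

// Create both kinds of reference from inside a "coroutine", then let it die.
inline DemoResult run_anchor_demo() {
    auto main_thread = std::make_shared<ToyThread>(ToyThread{"main", {}});
    auto coroutine = std::make_shared<ToyThread>(ToyThread{"coroutine", main_thread});

    ToyRef plain(coroutine);         // "create reference" on the coroutine
    ToyMainRef anchored(coroutine);  // same call site, main_ variant

    coroutine.reset();  // the coroutine finishes and is collected

    return {plain.usable(), anchored.usable()};
}
```

Calling `run_anchor_demo()` reports the plain reference as unusable and the main-anchored one as usable once the coroutine is gone. In real sol2 code the plain case would be a dangling `lua_State*` rather than a clean `false`.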
I am a little confused now because I've come across some apocrypha that says coroutines do use the same registry as the main lua_State, but this seems to conflict with my firsthand experience using sol with luajit. Perhaps the coroutine's registry "inherits" from the main one..? I would write a guide myself if I had a stronger understanding of the specifics.
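
On the registry question, a toy model may help separate two things that are easy to conflate: where the referenced value lives, and which thread handle the reference remembers. The sketch below assumes that all threads of one Lua instance share a single registry (which is how stock Lua is generally understood to work; LuaJIT is assumed to behave the same way) and that a reference keeps the `lua_State*` it was created with next to its registry slot. All `Toy*` names are hypothetical, and `std::weak_ptr` stands in for the raw pointer so the stale handle can be observed safely.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy registry: one table of slots shared by every thread of a Lua instance.
struct ToyRegistry {
    std::vector<std::string> slots;
    std::size_t add(std::string value) {
        slots.push_back(std::move(value));
        return slots.size() - 1;
    }
};

// Toy thread (main state or coroutine): own identity, shared registry.
struct ToyThread {
    std::shared_ptr<ToyRegistry> registry;
};

// Toy reference: a registry slot plus the thread handle that created it.
struct ToyRef {
    std::weak_ptr<ToyThread> thread;
    std::size_t slot;
};

struct RegistryDemo {
    bool same_registry;            // do both threads see one registry?
    bool value_still_in_registry;  // does the slot survive the coroutine?
    bool handle_still_valid;       // does the remembered thread survive?
};

inline RegistryDemo run_registry_demo() {
    auto registry = std::make_shared<ToyRegistry>();
    auto main_thread = std::make_shared<ToyThread>(ToyThread{registry});
    auto coroutine = std::make_shared<ToyThread>(ToyThread{registry});

    // The reference is created through the coroutine's handle.
    ToyRef ref{coroutine, coroutine->registry->add("bogo")};
    bool same = (main_thread->registry == coroutine->registry);

    coroutine.reset();  // the coroutine finishes and is collected

    bool value_ok = (main_thread->registry->slots[ref.slot] == "bogo");
    bool handle_ok = !ref.thread.expired();
    return {same, value_ok, handle_ok};
}
```

In this model the registry is shared and the value is still there after the coroutine dies; only the thread handle stored in the reference has gone stale. If sol2 behaves this way, that would reconcile the shared-registry claim with crashes seen when a reference created on a coroutine is released later.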
The main offender in my program was a C++ function exposed to Lua that returned a class containing a collection of Lua references. That class was never retained by C++ code, instead going into the possession of Lua's garbage collector. If the coroutine ended, the references (created from the coroutine state) would cause issues when the garbage collector finalized the C++ object. This caused some kind of double-free heisenbug which would variously corrupt either the registry's freelist or luajit's bulk allocator heap.
Here's a stripped-down version of that "coroutine-unsafe" code:

    #include <vector>
    #include <sol/sol.hpp>

    // An ever-changing list we expose to Lua from time to time.
    static std::vector<Bogo*> bogos = {...};

    // A userdata type that represents a snapshot of the list. Pretend it has some useful methods.
    struct LuaBogos {
        std::vector<sol::userdata> bogos;
    };

    // A function Lua scripts can call to get the snapshot.
    // Unsafe: `lua` is the state of whichever thread made the call, so inside a
    // coroutine every reference stored below is created from the coroutine state.
    LuaBogos get_bogos(sol::this_state lua)
    {
        LuaBogos result;
        for (auto &bogo : bogos)
            result.bogos.emplace_back(sol::make_userdata(lua, bogo));
        return result;
    }