Sol needs a coroutine safety guide

Open EvanBalster opened this issue 1 month ago • 2 comments

Following from the discussion in https://github.com/ThePhD/sol2/issues/890#issuecomment-3555487828 and https://github.com/ThePhD/sol2/issues/1711#issuecomment-3556723670 ...

Sol's tutorials depict an approach to writing script bindings that can lead to heap corruption if those bindings are invoked from a coroutine, even if the coroutine is created and operated by scripts. This can come as a nasty surprise to library users who neglect to read the documentation subpages about threads because they aren't using sol's coroutine API.

The technical problem as I understand it (take this with a grain of salt): Heap corruption can occur whether the coroutine is created and operated by scripts or the host app, so long as the host app either stores sol::reference-based values somewhere in its own datamodel or (worse yet) passes native data types containing any type of sol::reference back to a script. Sol bindings written in the 'standard style' will tend to use the lua_State of the coroutine that invoked them, and create these references in a non-main registry. When the coroutine completes, that registry may be deleted or recycled, causing all of these references to go out of scope. This invalidates those sol::references; what happens next is undefined behavior, but it typically corrupts a heap or freelist, causing a crash with unpredictable timing deep in the runtime.

This is a major pitfall, responsible for a number of issues on the repo. I think it could be resolved by adding a prominent item to Sol's table of contents with a title like "how to write coroutine-safe code".

This could either lead to an updated version of the thread page or to a new page written as a guide. Ideally it should explain when and why types like main_reference and this_main_state are necessary without venturing into topics like CPU threading.

EvanBalster avatar Nov 22 '25 02:11 EvanBalster

"Heap corruption" might be a little too far. What you create are dangling references (and using them is a use-after-free). In my opinion, the page should include:

  • Each coroutine has its own Lua state.
  • References (e.g. tables or functions) are stored in the "current" Lua state.
  • References prefixed with main_ are stored in the main Lua state.
  • If a reference is used across Lua invocations (i.e. create reference → Lua executes → use reference), you almost always want the main_ variant. This assumes you're running unknown code.
  • Examples of where you can accidentally create dangling references.
  • Examples of how to do it correctly.
  • Mention that creating usertypes should almost always use a main_state (because these might create references that should be visible everywhere).
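
To make the first few points concrete, here is a toy model in plain C++. It is only a sketch and uses no sol2 or Lua API: State, Registry, Ref, MainRef, new_main, new_coroutine and demo are all hypothetical names invented for illustration. It assumes that a plain reference remembers the state it was created through, that a main_ reference resolves to the main state at construction, and that the registry itself is shared by every coroutine (as it is in stock Lua). A std::weak_ptr stands in for the raw lua_State* so the dangling case can be observed without actual undefined behavior.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Toy model only: none of these types are sol2 or Lua API.
struct Registry {
    std::unordered_map<int, std::string> slots;
    int next = 1;
};

struct State {
    std::shared_ptr<Registry> registry; // shared by the main state and its coroutines
    std::shared_ptr<State> main;        // null when this state is the main state
};

std::shared_ptr<State> new_main() {
    auto s = std::make_shared<State>();
    s->registry = std::make_shared<Registry>();
    return s;
}

std::shared_ptr<State> new_coroutine(const std::shared_ptr<State>& parent) {
    auto s = std::make_shared<State>();
    s->registry = parent->registry;                 // same registry as the parent
    s->main = parent->main ? parent->main : parent; // remember the main state
    return s;
}

// Models a plain reference: it keeps a handle to whichever state created it.
struct Ref {
    std::weak_ptr<State> state; // stands in for a raw state pointer
    int slot;
    Ref(const std::shared_ptr<State>& s, std::string value)
        : state(s), slot(s->registry->next++) {
        s->registry->slots[slot] = std::move(value);
    }
    bool usable() const { return !state.expired(); }
};

// Models a main_ reference: it resolves to the main state up front.
struct MainRef : Ref {
    MainRef(const std::shared_ptr<State>& s, std::string value)
        : Ref(s->main ? s->main : s, std::move(value)) {}
};

struct Outcome {
    bool plain_usable;
    bool main_usable;
    bool value_still_in_registry;
};

Outcome demo() {
    auto main_state = new_main();
    auto co = new_coroutine(main_state);
    Ref plain(co, "bogo");      // created through the coroutine's state
    MainRef pinned(co, "bogo"); // created from the coroutine, pinned to main
    int slot = plain.slot;
    co.reset();                 // the coroutine finishes and is collected
    return { plain.usable(), pinned.usable(),
             main_state->registry->slots.count(slot) == 1 };
}
```

In this model, demo() reports that the plain reference is no longer usable once the coroutine is gone, that the pinned one still is, and that the referenced value is still sitting in the shared registry. In other words, what dangles here is the reference's handle to its state, not the registry entry. Whether sol2 and LuaJIT behave exactly this way is something the guide would need to confirm.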

Nerixyz avatar Nov 22 '25 09:11 Nerixyz

I am a little confused now because I've come across some apocrypha saying that coroutines do use the same registry as the main lua_State, but this seems to conflict with my firsthand experience using sol with LuaJIT. Perhaps the coroutine's registry "inherits" from the main one? I would write a guide myself if I had a stronger understanding of the specifics.

The main offender in my program was a C++ function exposed to Lua that returned a class containing a collection of Lua references. That class was never retained by C++ code, instead going into the possession of Lua's garbage collector. If the coroutine ended, the references (created from the coroutine state) would cause issues when the garbage collector finalized the C++ object. This caused some kind of double-free heisenbug that would variously corrupt either the registry's freelist or LuaJIT's bulk allocator heap.

Here's a stripped-down version of that "coroutine-unsafe" code:

// An ever-changing list we expose to Lua from time to time.
static std::vector<Bogo*> bogos = {...};

// A userdata type that represents a snapshot of the list.  Pretend it has some useful methods.
struct LuaBogos {std::vector<sol::userdata> bogos;};

// A function Lua scripts can call to get the snapshot.
LuaBogos get_bogos(sol::this_state lua)
{
   LuaBogos result;
   for (auto &bogo : bogos) result.bogos.emplace_back(sol::make_userdata(lua, bogo));
   return result;
}

EvanBalster avatar Nov 24 '25 00:11 EvanBalster