qsync: `txn_limbo_is_empty()` and `txn_limbo_last_synchro_entry()` are unsafe

Open Gerold103 opened this issue 3 months ago • 0 comments

This is a generalization of #11807. That ticket was about a crash which was happening when txn_limbo_is_empty() is false, but txn_limbo_last_synchro_entry() is NULL. Which sounds strange, but it is theoretically possible when the limbo only has async txns or "fake" entries (created by txn_limbo_flush()).

That shouldn't really happen under normal work, but might happen with some weird sequence of fiber scheduling while box.cfg.replication_synchro_queue_max_size is actually reached.

The main issue why this can't really be all fixed under #11807 is that only a single case, described in that ticket, is easily reproducible. All the other cases in the code around these two C functions seem to be either not possible (the last synchro txn is just never null in those cases), or there is no known way to reproduce them without fake error injections doing unrealistic things.

The proposal is the following:

Consider removing these 2 functions entirely from everywhere but txn_limbo.c. Look at each individual case of their usages and rework it so they don't depend on that, or so those cases use some new more high-level public functions of txn_limbo.
Consider speeding up #11596 and create a new type of error injection which would "block" only a single fiber when the injection is reached, but would not affect the next fibers. That would be very helpful. For example, when one wants to block just the first fiber entering the volatile limbo queue, but not block the other ones (to ensure they actually wait for the first one to leave, not that they simply are blocked on the same injection). This should simplify writing some test cases.

Sep 29 '25 21:09 Gerold103