server MDEV-37541 Race of rolling back and committing transaction to binlog

This is the 1st of two commits to demonstrate the failure which is that a rolling back transaction while serializes its completion in Engine before a contender transaction the two are logged into binary log in reverse order.

[x] The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

Release Notes

TODO: What should the release notes say about this change? Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended. Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

[ ] This is a new feature or a refactoring, and the PR is based against the main branch.
[ ] This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

[ ] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
[ ] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

Sep 15 '25 17:09 andrelkin

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Sep 15 '25 17:09 CLAassistant

andrelkin @.***> writes:

This is the 1st of two commits to demonstrate the failure which is that a rolling back transaction while serializes its completion in Engine before a contender transaction the two are logged into binary log in reverse order.<!--

diff --git a/sql/handler.cc b/sql/handler.cc index 5d7138490fdd2..c8ddb0b5ebd5f 100644 --- a/sql/handler.cc +++ b/sql/handler.cc @@ -2306,11 +2306,18 @@ int ha_rollback_trans(THD thd, bool all) if (is_real_trans) / not a statement commit */ thd->stmt_map.close_transient_cursors();
int err;

if (has_binlog_hton(ha_info) &&
   (err= binlog_hton->rollback(binlog_hton, thd, all)))
{
 my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
 error= 1;
} for (; ha_info; ha_info= ha_info_next) {
 int err;
 handlerton *ht= ha_info->ht();
 if ((err= ht->rollback(ht, thd, all)))
 if (ht != binlog_hton && (err= ht->rollback(ht, thd, all)))

If binlog needs to do some operation before engines roll back (and I agree this is needed here), then that operation should not be done inside the binlog_hton->rollback() call. This is really ugly, first we iterate a list to find some specific element and operate on it, then we iterate the list again and have a check to skip that specific element. Instead, use an explicit call into the binlog that does the operations needed, at the point it time where it should happen.

(And yes, I know this is done similar in commit_one_phase_2(), and it is as wrong there as it is here. Repeating a sin doesn't make it right ;-).

Also, do we really need to do this in 10.6? That doesn't seem right. This looks like a bug that has always been there, and IIUC it only occurs when mixing transactional and non-transactional tables in the same "transaction", something that is in any case fragile and problem-ridden.

Kristian.

Sep 24 '25 08:09 knielsen

@knielsen

do we really need to do this in 10.6?

Yes we unfortunately do, there's an active support case. And for that reason I suggest we keep the ugly style for 10.6 alone. After all its pattern is proven to be effectual and robust.

For 10.11 and up I am preparing much better shaped code that addresses the search for binlog hton. It should be ready for review in a separate PR.

I hope the two branches compromise makes sense to you.

Sep 29 '25 19:09 andrelkin

andrelkin @.***> writes:

andrelkin left a comment (MariaDB/server#4301)

@knielsen

do we really need to do this in 10.6?

Yes we unfortunately do, there's an active support case. And for

That's not a valid reason.

I suggest we keep the ugly style for 10.6 alone. After all its pattern is proven to be effectual and robust.

For 10.11 and up I am preparing much better shaped code that addresses the search for binlog hton.

I hope the two branches compromise makes sense to you.

Ok, sounds reasonable.

Kristian.

Sep 30 '25 08:09 knielsen

int err;
if (has_binlog_hton(ha_info) &&

   (err= binlog_hton->rollback(binlog_hton, thd, all)))

{

 my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);

```
 error= 1;
```
} for (; ha_info; ha_info= ha_info_next) {

 int err;
 handlerton *ht= ha_info->ht();

 if ((err= ht->rollback(ht, thd, all)))

 if (ht != binlog_hton && (err= ht->rollback(ht, thd, all)))

If binlog needs to do some operation before engines roll back (and I agree this is needed here), then that operation should not be done inside the binlog_hton->rollback() call. This is really ugly, first we iterate a list to find some specific element and operate on it, then we iterate the list again and have a check to skip that specific element. Instead, use an explicit call into the binlog that does the operations needed, at the point it time where it should happen.

Andrei, please hold off with this patch for a bit.

I just talked to Monty, he mentioned that the binlog is supposed to always be first in the ha_info list. Possibly something at some point broke that, and this could be the root cause of this bug (or regression in this case).

Monty wants to look into this and discuss with Serg. If we can ensure that binlog is always first in the list, then it automatically should fix this case, and could fix other corner cases as well.

Kristian.

Sep 30 '25 10:09 knielsen

If we can ensure

Not that it's impossible but I've been pessimistic about it as there're a number of ways to screw this nice property.

Just one simple case. ~Let's have binlog_ignore_db=R and trx updates first Innodb in the database I to binlog and Rocksdb in R - so to not.~ ~Then~ A two engine - like Innodb and Innodb - transaction's trans->ha_list would be built as RocksDB, Binlog, Innodb.

Sep 30 '25 12:09 andrelkin

Well, the above RocksDB, Binlog, Innodb sequence of ha_info actually takes place without any binlog_ignore_db=R or if an Aria statement instead of RocksDB. That is

Begin;
  insert into t_innodb...;
  insert into t_aria;
Commit;

Now

(gdb) p thd->transaction->all->ha_list->m_ht->commit
+p thd->transaction->all->ha_list->m_ht->commit
$1 = (int (*)(THD *, bool)) 0x5555583dba8a <maria_commit(THD*, bool)>

My feeling remains the same: it costs nothing to locate Binlog (to do that w/o list scan naturally, rather remembering its introduction to ha_list) and exclude hton->complete action on it in the loop (like it is currently). That has been my current plan B.

B because I had A which aimed to keep Binlog always in the head via slight changes in trans_register_ha(). This seemed to be a promising approach except ha_savepoint() and ha_rollback_to_savepoint() are defined to rely on the current always prepend method to trans->ha_list. I turned to B, as I could not quickly find/decide-on how/whether the function could be adapted to A. E.g with SAVEPOINT *sv to contain a list of trans->ha_info copies. Or we may consider to tolerate some of my_error(ER_ERROR_DURING_ROLLBACK...inha_rollback_to_savepoint()`.

Sep 30 '25 16:09 andrelkin

Added Monty and Serg to the discussion about

I just talked to Monty, he mentioned that the binlog is supposed to always be first in the ha_info list.

Sep 30 '25 16:09 andrelkin

I gave it some more thinking. Here is a summary of transaction::all::ha_list ordering state of affairs and my suggestion how to proceed. First of all, to remind the issue is incorrect binlogging on master spotted in a two transactions scenario with a specific partly non-transactional workload. When one of transactions rolls back it finds its binlog and engine branches ordered "unnaturally" (Binlog is not first in the transaction branches list) which ensues incorrect logging.

There exist few patterns when transaction::all::ha_list is (Engine_n, Binlog, Engine_n-1, ...) and which don't have to be this way. Like

Begin;
  insert into t_innodb...;
  insert into t_aria ...;
Commit;

where it's (Aria, Binlog, Innodb). The precise "all" list at commit must be (Binlog, Innodb) because the Aria statement is obviously outside the main transaction. Its is clear how to improve these cases.

There also exist a harder pattern like mentioned two 2pc engines with one them such that its updates are filtered out, on either master or slave, of binlog which leads to the binlog-in-the-middle list. In order to handle the latter pattern I see only two main options available

enforce binlog ha_info be always in the head of transaction->all.ha_list, or
remember (or just search as the current server does) for it to execute its binlog_hton->"complete"() first in ha_{commit,rollback}_trans() loops.

If we go via p.1, we'll to fix ha_savepoint() etc, as this feature relies on the prepared-only policy in growing the list. If we go via p.2 we'd have to exempt the binlog ha_info from "complete"() inside the loop, similarly to how it's currently done in commit_one_phase_2()

for (binlog_hton->"complete"() when present ; ha_info ; ha_info= ha_info->next)
    if (hton is not binlog)
         hton->complete()

I don't see why the 2nd, being clearly of lesser efforts, should be lesser favorable.

Dear reviewers, please respond with your ideas or voting.

Oct 08 '25 17:10 andrelkin

Somewhat polished version of p.2 of my last comments is pushed to bb-10.11-andrei with the idea it to be 10.11+ version while bb-10.6-MDEV-37541 to cover 10.6 alone.

Oct 17 '25 08:10 andrelkin

Disclaimer: I only looked at the first commit. For now I'm trying to understand a problem and what a solution should be, not to discuss a particular solution.

I see a few rather suspicious issues demonstrated by the first commit:

Why binlog starts a transaction for a memory table? It's not going to log anything into the trx cache. What does "transaction for memory table" conceptually mean in binlog? A bit more generally, to include Andrei's case 2, what does "transaction that is not going to be logged into the trx cache" mean?
There's no tc->prepare() call. Even though a transaction must be logged by a tc. This looks wrong. If a transaction needs to be written into binlog, it should go through tc->prepare() whether it's commit or rollback. Kristian wrote something similar above too.

Oct 17 '25 12:10 vuvova

The immediate issue is indeed that a Memory update caused the binlog branch registration at logging in ROW format and that's due to a legacy pre- (I did not track it further down) 9e62cf67b3cf commit that executes

if (in_multi_stmt_transaction_mode()))
       trans_register_ha(this, TRUE, binlog_hton);
     trans_register_ha(this, FALSE, binlog_hton);

So we do that for any engine modification. The practice introduced back in time may (does) need revising just as you and Kristian point. As I mention above this dubious pattern extended further over engines like Aria that "all"-registers themselves too.

However as comments mention this issue is wider than how it appears here. See the {Rocks, Binlog, Innodb} Rocksdb's binlog-filtered updates last in transaction example.

Oct 17 '25 16:10 andrelkin

binlog-filtered doesn't seem to matter, but, indeed, with two engines one can have binlog in the middle. Still, Kristian's and my point stands — Rollback should do full 2pc if it needs to write something to binlog.

Oct 17 '25 19:10 vuvova

binlog-filtered doesn't seem to matter

:-(

I over-complicated it: {Innodb, Binlog, Rocksdb} is a natural order for two engine trx like

begin;
  insert into t_rocks  set a=1;
  insert into t_innodb set a=1;
rollback;

my point stands — Rollback should do full 2pc if it needs to write something to binlog.

For what reason actually? If you mean crash-recovery of the transactional engine part, then past recovery either prepare()-d as you suggest or un-prepare()-d transaction must complete with rollback? The unprepared one always does so, regardless of what is in binlog. The prepared one would have to be decided..

The policy of ROLLBACK-ended logging, no need to remind you, is merely to record side effects which of course are non-transactional by definition. And in that sense such logging is similar to any non-transactional logging.

PS. I read a key Kristian's note

use an explicit call into the binlog that does the operations needed

(rather than binlog_hton->rollback() that is) , as something similar to what has happened bo binlog_hton->commit(). Just we cannot "deprecate" the rollback method at least without changes around ha_rollback_to_savepoint.

Oct 18 '25 12:10 andrelkin

What do you mean, for what reason, that's how 2pc works. if there's one participant that cannot do 2pc, it should be in the middle. That's why the protocol is what it is — every engine prepares, binlog commits (writes to binlog, it's tc->log call), then everyone commits.

As for rollbacks, there's normally no need to do that, every participant (including binlog) simply throws transaction changes away.

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed. And binlog needs to commit (write transaction down persistently) too, so it's a commit, of a sort. And must be subjected to full 2pc.

{Innodb, Binlog, Rocksdb} case is not a problem, it'll be a true rollback and nothing will be written to binlog. It doesn't matter whether nothing will be written to binlog first or last.

Oct 20 '25 07:10 vuvova

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed.

What engine committed? None of the trans->all.ha_list does so.

As for rollbacks, there's normally no need to do that

just as you say the transaction is fine to rollback anyway it likes.

What happens is that along with the transactional changes there's a non-transactional modification which may be placed in the transactional cache. And the whole cache with ROLLBACK termination is logged to reflect that fact. We might have designed instead a method to extract that modification from the cache and log it alone. Obviously it need not any prepare().

{Innodb, Binlog, Rocksdb} case is not a problem

Not a problem without side effects. But we have ones in this bug... I should've mentioned that just in case in my last comments).

Oct 20 '25 13:10 andrelkin

What engine committed? None of the trans->all.ha_list does so.

Aria. Or MyISAM. Any non-transactional engine. "Committed" in the sense that changes made between START TRANSACTION and ROLLBACK were made persistent.

Not a problem without side effects. But we have ones in this bug...

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

Oct 21 '25 13:10 vuvova

Aria. Or MyISAM. Any non-transactional engine.

Right. So we can say it one-phase-committed.

We have two different ways to log such 1pc engine statement when they mingle a 2pc engine transaction. And I have been pointing to that either the ROLLBACK-termination (e.g necessary for multi-table update) or the second "standard" logging method ahead of the 2pc transaction (of a pure 1pc engine modification if it mingles the 2pc trx) need not any prepare by the 2pc trx. It rolls back when it likes. The ROLLBACK-termination can be conceived as if it's an extended standard method. Extended in the sense we log some extra events whose properties only require roll-back wherever the whole event will be replayed.

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

When one of two 2pc transactions that depend on each other through Innodb engine, and the dependency parent is represented by the list of handlertons as { Innodb, Binlog, Rocksdb} also having a side effect, the 2 engine parent trx may end up in binlog as it happens in the bug. That is ROLLBACK-terminatedly logged after the child. Indeed, Innodb branch is rolled back, the child Innodb updates and logs, the parent completes its entire work with rolling back Rocksdb and the logging.

Oct 21 '25 16:10 andrelkin

@vuvova , just one little thing, there is a detail that seems to have been left out

For now I'm trying to understand a problem

but may be critical to the understanding.

And it is that is not the Memory table that produces the side effect in the test. It's DROP TEMPORARY TABLE which causes trans_cannot_safely_rollback() to claim rollback is unsafe. The Memory table role is to "simulate" a 2nd 2pc-capanle engine.

Oct 22 '25 16:10 andrelkin