MDEV-32830 I. refactor XA binlogging for better integration with BGC/replication/recovery
This commit is part I of a series of four that addresses MDEV-31949 in two main directions: XA parallel slave performance and XA transaction crash-recovery.
This part improves upon the MDEV-742 design's XA binlogging to facilitate crash-recovery (the actual binlog-based recovery comes in part IV, MDEV-33168 et al).
With the refactoring changes, when binlog is ON, the handling of an XA transaction's execution, including binlogging, is made conceptually uniform with the normal BEGIN-COMMIT transaction. That is, at XA-PREPARE the transaction is first prepared in the engines, and after that the accumulated replication events are written to the binary log, naturally without any completion event, as the outcome is not yet known. When the XA-"COMPLETE", that is XA-COMMIT or XA-ROLLBACK, later follows, the binary logging of the respective Query event takes place first. One can view this scheme as a normal transaction's logging split in the middle into two parts (with nothing really happening between them but the passage of time). After the second chunk is sent to the binlog, the transaction is committed (or rolled back) in the engine.
With binlog enabled, both phases' logging goes through binlog-group-commit, where the XA-PREPARE "sub-transaction" merely groups for binary logging and thus skips the engine action, while XA-"COMPLETE" does both: the logging and an ordered "complete". This behavior is also consistent between completions from native and external connections. Being a participant in binlog-group-commit means that either XA phase is recoverable (not implemented here) from the active binlogs determined by binlog-checkpoint. For the latter, specifically, this patch removes the custom unlogging of the XA-prepare group; see entry.need_unlog= 0 et al in MYSQL_BIN_LOG::write_transaction_to_binlog().
In addition to the above, the corner case of an engine read-only XA transaction is addressed. Previously it was streamlined by logging an empty XA-PREPARE group of binlog events concluded by an XA-"COMPLETE" query event. Now, when a preparing XA transaction is found to have only read-only engine branches, or none at all, it is marked for rollback as the XA_RDONLY optimization:
- nothing gets logged at prepare time; an XA_RDONLY note is generated; and
- it is rolled back at disconnect.
For XA-COMPLETE to tell whether the prepare phase was logged or not, the XID state object is extended with a boolean flag which is part of the internal interface for the recovery implementation. The flag is normally raised by XA-PREPARE when flushing to the binlog, and will also be raised at binlog recovery (once that is fully implemented).
Notable changes:

sql/handler.cc
- ha_prepare() a. is ensured to execute binlog_hton::prepare() as the last of the XA branches to prepare; b. an engine read-only transaction is marked in the xid state for rollback and an ER_XA_RDONLY note is generated.
- conversely, ha_rollback_trans() executes binlog_hton::rollback() as the first branch (the commit method was already equipped to do so).
- ditto for the external completion of XA via ha_commit_or_rollback_by_xid(); the function is made somewhat recursive: it may first be invoked at the top level to take on the binlog hton "completion", and then be called once again from its own stack, now with is_xap_binlogged() false, to carry out the engine commit.
- xarecover_handlerton() now only simulates a successful find of the user xid in the binlog.

sql/log.cc
- binlog_commit()/binlog_rollback() et al are simplified and cleaned up (e.g. the introduction of binlog_complete_by_xid()). In particular, binlog_{commit,rollback}() are reduced to just a single piece of XA footprint each. The methods recognize naturally empty transaction caches at XA completion and proceed anyway into the binlog-group-commit machinery; binlog_commit's binlog_commit_flush_trx_cache() decides which type of transaction and which XA phase is being handled, so that a proper group event closure is computed.
- MYSQL_BIN_LOG::trx_group_commit_with_engines() takes care of raising or dropping the XID::binlogged flag via xid_cache_update_xa_binlog_state().
- the new run_xa_complete_ordered() encapsulates the XA specifics of executing the engine ordered commit. It is defined with asserts due to MDEV-32455. It is not made a TC_LOG member because the scope of this work is limited. The new function mirrors the logic of the normal run_commit_ordered() in that it skips engine completion for engines lacking hton::commit_ordered(), in favor of doing that at the top level of ha_commit_trans()/ha_rollback_trans().

sql/log_event_server.cc
- use the transaction cache at XA-PREPARE handling (specifically when the XA-END Query event is logged).

sql/xa.cc
- XID_cache_element is extended with an is-binlogged flag, and a few related functions are added for use by binlogging and recovery (xid_cache_update_xa_binlog_state());
- the external-action branches of trans_xa_commit()/trans_xa_rollback() are converted into calls to a new, largely common function.

sql/xa.h
- the xid_cache_insert(XID *xid, bool is_binlogged) parameter list is extended for XA binlog recovery.
- [x] The Jira issue number for this PR is: MDEV-______
Description
TODO: fill description here
Release Notes
TODO: What should the release notes say about this change? Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.
How can this PR be tested?
TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended. Consult the documentation on "Writing good test cases".
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
- [ ] This is a new feature or a refactoring, and the PR is based against the main branch.
- [ ] This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.
PR quality check
- [ ] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
- [ ] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.
:x: emoonrain
:x: andrelkin
This is a huge patch, >7k lines. In line with what we also discussed some weeks ago in the replication meetings, I'll start by making sure things around the patch are clear, i.e. description and buildbot/testing.
As I understand, this is a follow-up to an earlier review of a related patch, but that was now quite a long time ago, so forgive me for asking for re-iterating any discussions/explanations that were also given back then.
There seem to be a lot of failures in the buildbot on the branch?
https://buildbot.mariadb.org/#/grid?branch=bb-12.1-MDEV-32830_xa https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4041%2Fhead
andrelkin @.***> writes:
This one improves upon MDEV-742 design's XA binlogging to facilitate to the crash-recovery (the actual binlog-based recovery is coming in the part IV of MDEV-33168 et al).
Does this mean that this patch does not make XA + binlog crash-safe on the master?
Are there any other major user-visible behaviour changes in this patch, or is it only internal code-refactoring?
With the refactoring changes, when binlog is ON, handling of execution of a XA transaction, including binlogging, is made conceptually uniform with the normal BEGIN-COMMIT transaction.
So what are the motivations for doing this? Is it mainly to make the code clearer? Are there other benefits? Performance? Facilitating later improvements (and if so, which)? Others?
That is At XA-PREPARE the transaction first is prepared in engines and after that accumulated replication events are written to the binary log, naturally without any completion event as it's unknown yet. When later XA-"COMPLETE" that is XA-COMMIT and XA-PREPARE follows up,
I don't understand what it means that XA-COMPLETE is XA-COMMIT and XA-PREPARE?
the binary logging of respective Query event takes place first.
I don't understand what it means "logging of respective Query event takes place first". Respective to what? Is this the Query event with the query string "XA COMMIT '<xid>'"?
One can perceive such scheme as if a normal transaction logging is split in the middle into two parts (and nothing really happens in between of them but time passed by). And after the second chunk is sent to binlog the transaction gets committed (or rolled back) in engine.
With binlog is enabled both phases' loggings go through binlog-group-commit, where XA-PREPARE "sub-transaction" merely groups for binary logging so skips the engine action while XA-"COMPLETE" does both, that is the logging and an ordered "complete".
Can you please describe more concretely what happens during the binlog group commit for XA transactions? What are the exact steps taken, which major functions in server layer and in engine are being called in what order?
For example, one of the primary purposes of the binlog group commit is to ensure that binlog and engine contain the transactions in the same order; that is the purpose of the commit_ordered() call. This ensures for example that a non-locking mariabackup will be consistent with a specific GTID in the binlog. Does this patch similarly ensure that XA PREPARE happens in the engine in the same order as it appears in the binlog? What about XA COMMIT or XA ROLLBACK? If so, how is this ensured?
Another primary motivation is on the (parallel) slave to ensure that commits on the slave happen in the same order as in the master's binlog. Does this patch ensure the same property for the XA PREPARE/COMMIT/ROLLBACK? If so, how?
What is the behavior for cross-engine XA transactions (eg. XA transaction that modify two or more transactional engines)? Is it handled in this patch, or if not is there a design for how it could be handled?
This behavior is also consistent between completions from the native and external connections.
What does this mean, completions from native and external connections?
Being a participant of binlog-group-commit designates either XA phase is recoverable (not implemented here) from active binlogs determined by binlog-checkpoint. For the latter specifically this patch removes custom unlogging of XA-prepare group. See entry.need_unlog= 0 et al in MYSQL_BIN_LOG::write_transaction_to_binlog().
The purpose of the binlog checkpoints is to save one fsync() at the end of the two-phase commit between engine(s) and binlog. A binlog checkpoint means that any (normal) transaction in that binlog has now been fsync()'ed to the InnoDB redo log.
What is the corresponding mechanism for user XA in this patch? Please describe the exact semantics of binlog checkpoints wrt. XA PREPARE/COMMIT/ROLLBACK, and how the algorithm will be that makes each XA phase recoverable.
What was the purpose of the custom unlogging that is removed in this patch, and why is it no longer necessary?
Now when a preparing XA transaction is found to have only read-only engine branches or none it is marked for rollback as XA_RDONLY optimization:
- nothing gets logged at the prepare time an XA_RDONLY note is generated and
- it's rolled back at disconnect
Where is the XA_RDONLY note generated? Just internally, in the XA hash holding the current state of the XA transaction?
Then what happens if the server crashes after the read-only XA PREPARE, what will be the state of the transaction once the server comes up?
For XA-COMPLETE to tell whether the prepare phase was logged or not the XID state object is extended with a boolean flag which is a part internal interface for recovery implementation. The flag is normally raised by XA-prepare at flushing to binlog and also at binlog recovery (will be done so it's fully implemented).
Maybe this was intended to answer the previous question, but I don't understand or do not see any concrete description of how it works.
Notable changes:
- Kristian.
There seem to be a lot of failures in the buildbot on the branch? buildbot.mariadb.org#/grid?branch=bb-12.1-MDEV-32830_xa buildbot.mariadb.org#/grid?branch=refs%2Fpull%2F4041%2Fhead
'Failed test nm' and explicit binlog.mdev-32830_qa_tests are the same thing.
To sum up on two more tests from the replication suites,
I did not look into
[ fail ] rpl.rpl_xa_survive_disconnect_mixed_engines
https://buildbot.mariadb.org/#/builders/534/builds/27445/steps/7/logs/stdio
At line 172: The result of the 1st execution does not match with
the result of the 2nd execution of ps-protocol:
while the symptoms are not really worrying; and
[ fail ] rpl.rpl_drop_temp
-Slave_open_temp_tables 0
+Slave_open_temp_tables 1
caused by ^.
I need to make some tests.
Does this mean that this patch does not make XA + binlog crash-safe on the master? Are there any other major user-visible behavior changes in this patch, or is it only internal code-refactoring?
Right, it's a prerequisite. Ultimately the binlog-based crash-recovery is in MDEV-33168. From the user's pov, it only changes post-crash traces of XA-PREPARE (as a reminder, in the BASE it can be in the binlog while having no persistent status in the engine).
With the refactoring changes, when binlog is ON, handling of execution of a XA transaction, including binlogging, is made conceptually uniform with the normal BEGIN-COMMIT transaction.
So what are the motivations for doing this? Is it mainly to make the code clearer? Are there other benefits? Performance? Facilitating later improvements (and if so, which)? Others?
The main goal is to achieve recovery from the binlog in the normal transaction's rollback way. This feature was delayed as hopes had previously been placed on the MDEV-18959 roll-forward method (we agreed to put it aside for the time being). The binlogging module's readability definitely benefits from this. Performance will certainly improve compared to pre-MDEV-742, which did not implement the ordered commit for XA:s, and probably in comparison with the BASE (of the patch) too. Let's discuss that in more detail when I publish part III with the necessary performance changes for InnoDB.
that is XA-COMMIT and XA-PREPARE follows up,
I don't understand what it means that XA-COMPLETE is XA-COMMIT and XA-PREPARE?
Typo, sorry. XA-COMPLETE is XA-COMMIT or XA-ROLLBACK of course.
the binary logging of respective Query event takes place first.
I don't understand what it means "logging of respective Query event takes place first". Respective to what
So the adjective "respective" attaches to the completion type. E.g.
"XA COMMIT '<xid>'"
is written first (to match Xid_log_event in the normal case).
Can you please describe more concretely what happens during the binlog group commit for XA transactions? What are the exact steps taken, which major functions in server layer and in engine are being called in what order?
Here is a gdb stack of a commit from the native connection (defined as the connection that executes all phases of the XA), which is logically compatible with the normal trx:
innobase_commit_ordered
run_xa_complete_ordered
TC_LOG::run_commit_ordered
...
binlog_commit
commit_one_phase_2
ha_commit_one_phase
trans_xa_commit
The external connection (defined as a connection from which a prepared,
disconnected XA is completed) commit stack differs in that
hton::commit_by_xid() runs in place of innobase_commit_ordered(). The
upcoming part III is going to make it clear that hton::commit_by_xid() is
an ordered-commit method, so it differs from the normal trx' one solely
in the argument list.
For example, one of the primary purposes of the binlog group commit is to ensure that binlog and engine contain the transactions in the same order; that is the purpose of the commit_ordered() call. This ensures for example that a non-locking mariabackup will be consistent with a specific GTID in the binlog. Does this patch similarly ensure that XA PREPARE happens in the engine in the same order as it appears in the binlog? What about XA COMMIT or XA ROLLBACK? If so, how is this ensured?
As you can guess from the above stack and the mentioned equivalence of XA-"COMPLETE" and Xid-log-event, it is so (the same order) for XA-COMMIT and XA-ROLLBACK. As to XA-PREPARE, the ordering of prepared XA:s in the engine is as random as it is for normal transactions. This is sufficient to achieve the normal trx' logic of backup restoration: when an XA transaction is present in the backup but missing in the binlog, it is to be rolled back; otherwise it remains alive (and of course in the Prepared state, differently from the normal case).
Another primary motivation is on the (parallel) slave to ensure that commits on the slave happen in the same order as in the master's binlog. Does this patch ensure the same property for the XA PREPARE/COMMIT/ROLLBACK? If so, how?
The parallel slave execution of XA-COMPLETE is governed similarly to the
normal trx' commit. The only deviation on the binlogging slave is that the
XA executes hton::commit_by_xid() as an ordered commit; that the hton
method qualifies as such is proven in the upcoming part III. On the
non-binlogging slave, wait-for-prior-commit is done explicitly in
ha_commit_or_rollback_by_xid().
As to the XA-PREPARE-terminated group of events, it is a transaction from the binlogging pov, so the slave does not need any specific rule to write it to the binlog correctly. On the non-binlogging slave an explicit wait-for-prior-commit is "reused" via:
THD::wait_for_prior_commit
ha_commit_one_phase
ha_commit_trans
rpl_slave_state::record_gtid
Xid_apply_log_event::do_record_gtid
Xid_apply_log_event::do_apply_event
Log_event::apply_event
apply_event_and_update_pos_apply
apply_event_and_update_pos_for_parallel
rpt_handle_event
Of course it is as yet unrecoverable, waiting for MDEV-21777.
What is the behavior for cross-engine XA transactions (eg. XA transaction that modify two or more transactional engines)? Is it handled in this patch, or if not is there a design for how it could be handled?
Without crashes, this refactoring (part I) must be able to handle such a load, no differently from MDEV-742, which has tests for that.
However now, see mdev-36802_multiple_xa_engine, I've had to delve deeper to
recognize some nitty-gritties whose further coverage I'd like to hand over to
part III (where I suggest we talk it over thoroughly). Namely, for those
engines, like Spider, that have commit-by-xid but no ordered-commit, the
external completion or parallel slave execution would have, after binlogging,
to execute the whole commit, again to emulate the normal transaction.
Being a participant of binlog-group-commit designates either XA phase is recoverable (not implemented here) from active binlogs determined by binlog-checkpoint. For the latter specifically this patch removes custom unlogging of XA-prepare group. See entry.need_unlog= 0 et al in MYSQL_BIN_LOG::write_transaction_to_binlog().
The purpose of the binlog checkpoints is to save one fsync() at the end of the two-phase commit between engine(s) and binlog. A binlog checkpoint means that any (normal) transaction in that binlog has now been fsync()'ed to the InnoDB redo log.
Sure, and that does not contradict what was stated in the text you quote.
What is the corresponding mechanism for user XA in this patch? Please describe the exact semantics of binlog checkpoints
I note that because XA-COMPLETE has no specifics beyond the normal transaction
commit, except for a small matter of how to find the xid (which is MDEV-33168's pain).
The rest is just the same:
1. group-binlog,
2. rotate binlog if necessary,
3. group-ordered-commit, and
4. request-checkpoint if the rotation happened.
As to XA-PREPARE, the difference is that step 3 is skipped.
What was the purpose of the custom unlogging that is removed in this patch, and why is it no longer necessary?
I think that should be much clearer now, as the strategy is, to reiterate, to unify with the normal transaction.
Now when a preparing XA transaction is found to have only read-only engine branches or none it is marked for rollback as XA_RDONLY optimization: - nothing gets logged at the prepare time an XA_RDONLY note is generated and - it's rolled back at disconnect
Where is the XA_RDONLY note generated? Just internally, in the XA hash holding the current state of the XA transaction?
Right.
Then what happens if the server crashes after the read-only XA PREPARE, what will be the state of the transaction once the server comes up?
The answer is that a read-only XA would already be gone after disconnecting from the server. The user knows of that via a note generated by XA-PREPARE.
This is an MDEV-33168 objective, for which this refactoring provides a piece of logic around XID_cache_element::notify_xap_binlogged. At post-crash restart, the recovery would not find any such xid either in the engine (InnoDB, as you know, does not recover pure locks) or in the binlog.
For XA-COMPLETE to tell whether the prepare phase was logged or not the XID state object is extended with a boolean flag which is a part internal interface for recovery implementation. The flag is normally raised by XA-prepare at flushing to binlog and also at binlog recovery (will be done so it's fully implemented).
Maybe this was intended to answer the previous question, but I don't understand or do not see any concrete description of how it works.
Right. A lighter "recovery" (still recovery from the user's pov) is a reconnect with
XA RECOVER. It would not show a read-only XA, thanks to the boolean flag.
XA-COMPLETE also consults it.
andrelkin @.***> writes:
the binary logging of respective Query event takes place first.
I don't understand what it means "logging of respective Query event takes place first". Respective to what
So the adjective "respective" attaches to the completion type. E.g.
"XA COMMIT '<xid>'" is written first (to match Xid_log_event in the normal case).
Aha, so I think you mean that the XA COMMIT Query event is written to the binlog before the XA COMMIT is executed in the engine(s)?
And is this changed in the patch, eg. in the existing code in 10.5+, is the XA COMMIT Query event written to the binlog before or after the XA COMMIT is executed in the engine(s)?
Can you please describe more concretely what happens during the binlog group commit for XA transactions? What are the exact steps taken, which major functions in server layer and in engine are being called in what order?
Here is a gdb stack of commit from the native (defined as the connection that executes all phase of the xa) connection that it logically compatible to the normal trx:
innobase_commit_ordered
run_xa_complete_ordered
TC_LOG::run_commit_ordered
...
binlog_commit
commit_one_phase_2
ha_commit_one_phase
trans_xa_commit
Hm, so you omitted part ("...") of the stack trace. So I'm wondering if the "..." includes the binlog group commit, queue_for_group_commit() and/or trx_group_commit_leader()?
Since you wrote this in the description:
"With binlog is enabled both phases' loggings go through binlog-group-commit"
The call path uses ha_commit_one_phase(), which I guess is because the XA COMMIT or XA ROLLBACK does not do a two-phase commit with the binlog (it cannot, as it is already prepared), but it can still take part in the binlog group commit, just without the log_and_order() / unlog() part.
For example, of the the primary purposes of the binlog group commit is to ensure that binlog and engine contains the transactions in the same order, that is the purpose of the commit_ordered() call. This ensures for example that a non-locking mariabackup will be consistent with a specific GTID in the binlog. Does this patch similarly ensure that XA PREPARE happens in the engine in the same order as it appears in the binlog? What about XA COMMIT or XA ROLLBACK? If so, how is this ensured?
To the XA-PREPARE, the ordering of prepared XA:s in Engine is as random as it is on normal transactions. This is sufficient to achieve the same normal trx' logics of backup restoration: when an xa transaction is present in backup and is missed in binlog it's to be rolled back, otherwise it remains alive (and of course in Prepared state, differently from the normal case).
This I do not understand.
I understand that during crash recovery, an XA PREPARE will be kept or rolled back depending on what is in the binlog. IIUC this will be implemented not in this patch but in a follow-up.
But when restoring a backup, we do not have the binlog available, right?
I am considering an example where we have XA PREPARE 't1' and XA PREPARE 't2'.
Suppose we have the following sequence of execution on a (parallel) slave:
t2 prepare in engine
t2 release backup lock in queue_for_group_commit()   (*)
t1 prepare in engine
t1 release backup lock in queue_for_group_commit()
leader binlogs t1, t2
The binlog order here (on master and slave both) is 't1' followed by 't2'. If we do a mariabackup at point (*), then it seems there is no way to provision another slave from that backup, as there is no valid GTID position corresponding to it. A position after XA PREPARE 't2' will skip replicating XA PREPARE 't1'. A position before XA PREPARE 't1' will fail on duplicate XA PREPARE 't2'.
So did I miss something, and there is a mechanism to ensure that a backup of a slave can be used to provision another slave? If so, what mechanism?
Or there is no such mechanism in this patch, but will be in a follow-up patch? If so, what mechanism?
Or this is a limitation, and a mariabackup of a (parallel) slave can not always be used to provision a new slave? (This is after all already the case for mysqldump).
Note that this is not a criticism of the patch at this point one way or the other; it is an attempt to understand what the intention of the patch is, to be able to do the review. This is a huge patch, >8k lines, very difficult to review correctly. It is very important during a complex review to understand what the purpose of the patch is. Otherwise, whenever something looks odd, the reviewer is left to guess whether it is a bug in the patch, a deliberate limitation of the patch, or merely a misunderstanding on the part of the reviewer.
- Kristian.
the XA COMMIT Query event is written to the binlog before the XA COMMIT is executed in the engine(s)?
Sure. In the recovery sense this binlog event is (going to be, via MDEV-33168) equivalent to Xid-log-event. The BASE of that patch might always have had normal-transaction-style recovery for XA-COMPLETE. The problem was with XA-PREPARE, whose binlogging and engine phases were ordered incompatibly with this method. The patch addresses that to make the XA-PREPAREd "sub-transaction" recoverable that way.
if the "..." includes the binlog group commit, queue_for_group_commit() and/or trx_group_commit_leader()?
It does include them. Sorry, I rushed to remove that piece of evidence from the scene.
But when restoring a backup, we do not have the binlog available, right?
Actually, I naively thought the binlog would be around. However, there is a method to sort out your example, and that is the Xid-list-log-event envisioned in MDEV-33168. Look:
If we do a mariabackup at point (*)
so while t2 is XA-PREPAREd in the backup, it is still not in the list. We would then roll back t2, just as server recovery is going to do. The xid list event will mention xid:s that have been prepared in the binlog, via xid_cache_update_xa_binlog_state().
Thanks for the case. This is worth reporting as another ticket.
As to your thorough approach, I completely agree with you, especially given the constructive feedback that you are always prolific with!