Async rollback prepared transactions during binlog crash recovery
MDEV-33853 Async rollback prepared transactions during binlog crash recovery
Summary
During server recovery, active transactions are rolled back automatically by the InnoDB background rollback thread. Prepared transactions are committed or rolled back, as appropriate, by binlog recovery. Binlog recovery runs in the main thread before the server can provide service to users. If a big transaction has to be rolled back, the server will be unavailable for a long time.
This patch provides a way to roll back the prepared transactions asynchronously, so that the rollback does not block server startup.
Design
- Handler::recover_rollback_by_xid(): This patch provides a new handler interface for rolling back transactions in the recovery phase. InnoDB just sets the transaction's state to active. The transaction is then rolled back by the background rollback thread.
- Handler::signal_tc_log_recover_done(): This function is called after opening the tc log (typically the binlog) has finished. By the time it is called, all transactions that are to be rolled back have been reverted to the ACTIVE state, so it starts the rollback thread to roll them back.
- Background rollback thread: With this patch, the background rollback thread is deferred until binlog recovery has finished. It is started by innobase_tc_log_recovery_done().
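The flow described in the design can be sketched as a small, self-contained toy model. This is only an illustration under stated assumptions, not the patch's code: `ToyEngine`, `TrxState`, `binlog_recover()` and every member name below are hypothetical, and the real handlerton signatures are not reproduced here. The sketch shows the two ideas from the list: the recovery-time rollback call only flips a prepared transaction to the active state, and the actual undo work happens in a background thread that is started only after tc log recovery has signalled completion.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical, simplified model. None of these names are MariaDB types.
enum class TrxState { Prepared, Active, Committed, RolledBack };

class ToyEngine {
public:
  ~ToyEngine() { shutdown(); }

  void add_prepared(uint64_t xid) {
    std::lock_guard<std::mutex> g(mu_);
    trx_[xid] = TrxState::Prepared;
  }

  std::vector<uint64_t> prepared_xids() {
    std::lock_guard<std::mutex> g(mu_);
    std::vector<uint64_t> out;
    for (const auto &kv : trx_)
      if (kv.second == TrxState::Prepared) out.push_back(kv.first);
    return out;
  }

  bool recover_commit_by_xid(uint64_t xid) {
    return transition(xid, TrxState::Prepared, TrxState::Committed);
  }

  // Models the idea behind recover_rollback_by_xid(): no undo work is done
  // here, the transaction is only moved back to the active state.
  bool recover_rollback_by_xid(uint64_t xid) {
    return transition(xid, TrxState::Prepared, TrxState::Active);
  }

  // Models the idea behind signal_tc_log_recover_done(): every transaction
  // that must be rolled back is already Active, so the deferred background
  // rollback thread can be started now.
  void signal_tc_log_recover_done() {
    assert(!rollback_thread_.joinable());  // started at most once
    rollback_thread_ = std::thread([this] { rollback_all_active(); });
  }

  // Waits for the background rollback, if it was ever started.
  void shutdown() {
    if (rollback_thread_.joinable()) rollback_thread_.join();
  }

  TrxState state(uint64_t xid) {
    std::lock_guard<std::mutex> g(mu_);
    return trx_.at(xid);
  }

private:
  bool transition(uint64_t xid, TrxState from, TrxState to) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = trx_.find(xid);
    if (it == trx_.end() || it->second != from) return false;
    it->second = to;
    return true;
  }

  void rollback_all_active() {
    std::lock_guard<std::mutex> g(mu_);
    for (auto &kv : trx_)
      if (kv.second == TrxState::Active) kv.second = TrxState::RolledBack;
  }

  std::mutex mu_;
  std::map<uint64_t, TrxState> trx_;
  std::thread rollback_thread_;
};

// Toy stand-in for binlog recovery: commit what the binlog contains and
// hand everything else over for asynchronous rollback.
inline void binlog_recover(ToyEngine &engine,
                           const std::set<uint64_t> &xids_in_binlog) {
  for (uint64_t xid : engine.prepared_xids()) {
    if (xids_in_binlog.count(xid))
      engine.recover_commit_by_xid(xid);
    else
      engine.recover_rollback_by_xid(xid);
  }
}
```

In this model, startup would call `binlog_recover()` and then `signal_tc_log_recover_done()`, and could begin serving requests without waiting for `shutdown()`, which is the point of making the rollback asynchronous.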
- [x] The Jira issue number for this PR is: MDEV-33853
Description
TODO: fill description here
How can this PR be tested?
TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended. Consult the documentation on "Writing good test cases".
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
- [x] This is a new feature and the PR is based against the latest MariaDB development branch.
- [ ] This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.
PR quality check
- [x] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
- [x] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.
@SongLibing are you ok to finish this?
Sorry for the delay. I have been very busy these days. I can probably finish the update next week.
Hi @dr-m, the patch is updated, please have a look. @knielsen, Marko suggested that you review the patch, please have a look.
I took a look at the patch 5105774, I only looked at the non-InnoDB parts in detail.
I think the patch looks solid; it already does all the things I would expect it to do. I had one suggestion to clarify the comments on the new handlerton calls introduced.
This part of the commit message I did not understand:
"With this patch, background rollback thread will not exit unless ddl log recovery is finished. It guarantees that all recovered prepared transactions which reverted to active state will be rolled back by the background thread."
Maybe you can elaborate/explain how this patch relates to the ddl log recovery?
It is not related to the ddl log in the latest patch; the commit message has been updated.
Looks good. Thanks LiBing!
Sorry, in the last comment I meant a mere approval.
I see that this is rebased on top of the currently latest 11.6. There are some failures that I do not observe for the 11.6 branch. I can also reproduce a hang myself by executing the following:
./mtr --parallel=auto --repeat=10 mariabackup.slave_provision_nolock{,,,,,,,,,,,,,,,,}
I checked the stack traces of one hung mariadbd process, and I did not see anything obvious inside InnoDB. Outside InnoDB, I am not that familiar with the code. To debug this, I suggest that you use ./mtr --rr and then killall -ABRT mariadbd to get an rr replay trace of a hung mariadbd process. With that trace, it should be possible to understand what exact sequence of events is leading to the hang.
I see that you were able to fix the occasional hangs during the test mariabackup.slave_provision_nolock.
What was the problem, and how was it fixed? Was it a bug in the patch that was fixed, or just an unrelated issue picked up during a rebase?
It's hard to follow what happened since my first review since there are no new commits, just force-pushes (but I will try...)
> I see that you were able to fix the occasional hangs during the test mariabackup.slave_provision_nolock.

The test case starts mysqld with --tc-heuristic-recover=ROLLBACK, which is a case the patch didn't handle correctly. This patch only postpones the start of the rollback thread, so it is possible for an error to happen before the rollback thread is started. InnoDB cleanup assumes that the rollback thread exists, but that is not true in this case, and that caused the failure.
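The class of problem described here, cleanup code that assumes a deferred thread was started, can be illustrated with a minimal sketch. It does not show how the patch actually fixed the issue, which is not visible in this conversation. `DeferredWorker` and its members are hypothetical names used only for illustration. The point is that once thread start is postponed, the cleanup path has to tolerate the "never started" case, for example by checking `std::thread::joinable()` before joining.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical illustration only: a worker whose start is deferred, so an
// early error path may reach cleanup() before start() was ever called.
class DeferredWorker {
public:
  ~DeferredWorker() { cleanup(); }

  void start() {
    assert(!thread_.joinable());  // started at most once
    thread_ = std::thread([this] { ran_.store(true); });
  }

  // Returns true if a thread had been started and was joined, and false if
  // there was nothing to wait for. Joining a std::thread that is not
  // joinable would throw std::system_error, hence the check.
  bool cleanup() {
    if (!thread_.joinable()) return false;
    thread_.join();
    return true;
  }

  bool ran() const { return ran_.load(); }

private:
  std::thread thread_;
  std::atomic<bool> ran_{false};
};
```

With this shape, a startup that fails before the deferred start point can still call `cleanup()` safely, and a normal startup joins the worker as before.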
@SongLibing, would it be possible to change the target branch to main and rebase this pull request on top of that?
@dr-m I rebased the PR to main.
@SongLibing Thank you. I had updated the branch bb-11.6-MDEV-33853 earlier today in order to facilitate some testing. The two branches only differ by the current main head db5d1cde4505fdd04bdb3389b28da004ca8ec579.