percona-server
percona-server copied to clipboard
Ps 8.0 9214
PS-9214 : Alter table online results in "duplicate key" error on the primary key (only index).
https://perconadev.atlassian.net/browse/PS-9214
Problem:
ALTER TABLE with rebuilds InnoDB table using INPLACE algorithm occasionally might fail with unwarranted duplicate primary key error if there are concurrent insertions into the table, even though these insertions do not icause any PK conflict.
Analysis:
New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread case. This implementation iterates over all the rows in the table, in general case, handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused, from time to time, to commit InnoDB MTR/ release page latches it holds. This is necessary to give a way to concurrent actions on the B-tree scanned or before flushing rows of new version of table from in-memory buffer to the B-tree. In order to resume iteration after such pause persistent cursor position saved before pause is restored.
The cause of the problem described above lies in how PCursor::savepoint() and PCursor::resume() methods perform this saving and restore of cursor position.
Instead of storing position of current record pointed by cursor savepoint() stores the position of record that precedes it, and then resume() does restore in two steps - 1) restores position of this preceding record (or its closest precedessor if it was purged meanwhile) and then 2) moves one step forward assuming that will get to the record at which cursor pointed originally.
Such approach makes sense when we try to save/restore cursor pointing to page's supremum record before switching to a new page, as it allows to avoid problems with records being skipped when the next page is merged into the current one while latches are released (so records which we have not scanned yet are moved over supremum record, but not over the record which originally preceded supremum).
Let us take look at an example. Let us assume that we have two pages p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup]. Out thread which is one of the parallel ALTER TABLE worker threads has completed scan of p1, so its cursor positioned on p1:'sup' record. Now it needs to switch to page p2, but also give a way to threads concurrently updating the table. So it needs to make cursor savepoint, commit mini-transaction and release the latches.
If naive approach to making savepoint is used and we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on p1 : 'sup' record, then the following might happen: concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge of this page into the page p1. If this happens while latches are released this merge would go through and and resulting in page p1 with the following contents p1 : [inf, a, b, c, d, e, sup]. Savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect) and after restoration of cursor the iteration will continue, on the next page, skipping records 'd' and 'e'.
With non-naive approach implemented at the moment, PCursor::savepoint() does a step back before calling btr_pcur_t::store_position(), so for cursor positioned on p1 : 'sup' it is actually position corresponding to p1 : 'c' what is saved. If the merge happens when latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position() gets cursor pointing to p1 : 'c' as result, and then it tries to compensate for step-back in PCursor::savepoint() and moves cursor one step forward making it to point to p1 : 'd'. Code which does scanning detects the situation that we savepoint()/resume() resulted in jump from supremum record to user record and resume iteration from p1 : 'd' without skipping any records.
However, it is not necessary and becomes problematic we try to save/restore cursor pointing to user record from within record processing callback. In this case record which position we are trying to save when savepoint() method is called can be considered already processed as corresponding value already will be inserted into output buffer soon after restore. When a concurrent insert adds a new record between the record which position we have inteded to save by calling savepoint() and its precedessor which position this call stored internally, the later call to resume() will position cursor at this newly inserted record. This will lead to the resumed scan revisiting original record once again. As result the code will attempt to add this original record into the output buffer one more time and get duplicate key error.
Let us take a look at an example once again. Let us assume that parallel ALTER TABLE thread is scanning page p1 with the following contents - p1 : [inf, a, b, c, d, sup]. It has processed record 'c' (so cursor points to p1: 'c') and decides that it needs take savepoint/ commit mini-transaction and release latches in order to flush in-memory buffer with new versions of the records. PCursor::savepoint() does a step back and calls btr_pcur_t::store_position() which saves position corresponding to p1 : 'b'. After the latches are released concurrent insert might happen into the page adding record 'b1', between records 'b' and 'c', resulting in page looking like p1: [inf, a, b, b1, c, d, sup]. After that PCursor::resume() will call restore_position() which will restore cursor pointing to p1 : 'b' and then will try to compensate for step-back in savepoint() by moving cursor one step forward to p1 : 'b1'. The scanning code will continue its iteration using this cursor, by moving to the next record - p1 :'c' and trying to process it once again, resulting in duplicate key error.
Fix:
This patch solves the problem by adjusting PCursors::savepoint()/resume() logic not to do this step back on save/step forward on restore if we are trying to save/restore cursor pointing to a user record (in which case it is not necessary). This process still used when we are trying to save/ restore cursor pointing to page infimum (where it is useful).
It also adds some comments explaining how this code works and a few debug asserts enforcing its invariants.