mysql-5.6 icon indicating copy to clipboard operation
mysql-5.6 copied to clipboard

MyRocks does commit step non-durably under server group 2PC protocol

Open laurynas-biveinis opened this issue 9 months ago • 1 comments

When server commits transactions involving binlog, MyRocks, and possibly InnoDB, it uses group 2PC protocol. During its prepare step, the server asks the storage engines not to flush/sync individual transactions, which MyRocks complies with:

static int rocksdb_prepare(handlerton *const hton MY_ATTRIBUTE((__unused__)),
                           THD *const thd, bool prepare_tx) {
...
    if (tx->is_two_phase()) {
      if (thd->durability_property == HA_IGNORE_DURABILITY) {
        tx->set_sync(false);
      }

But, after this MyRocks fails to re-enable sync for the commit step of the same transaction, effectively committing without durability guarantee. Part of the commit payload (as opposed to the prepare payload) is the binlog position. Since this is 2PC across multiple engines, the exact effect depends on what was the transaction:

  1. MyRocks DDL (involves binlog, MyRocks, and InnoDB):
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=ROCKSDB;

--source include/kill_and_restart_mysqld.inc

DROP TABLE t1;

Appears to be fine, in recovery all SEs agree on the binlog position, XA recovery does nothing.

  1. MyRocks DML (involves binlog, MyRocks, may or may not involve InnoDB):
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=ROCKSDB;

BEGIN;
INSERT INTO t1 VALUES (1);
COMMIT;

--source include/kill_and_restart_mysqld.inc

DROP TABLE t1;

In recovery all SEs agree on the binlog position, but XA recovery commits one transaction.

  1. Multi-engine DML (involves binlog, MyRocks, and InnoDB):
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=ROCKSDB;
CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=InnoDB;

BEGIN;
INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);
COMMIT;

--source include/kill_and_restart_mysqld.inc

SELECT * FROM t1;
SELECT * FROM t2;

DROP TABLE t1, t2;

In recovery the binlog position in MyRocks is outdated, XA recovery commits one transaction.

In all cases the binlog-based XA recovery is able to clean up. However, if the server data dictionary is moved to MyRocks, the bug becomes much more serious data-corrupting one when InnoDB DDL is involved, because during startup MyRocks recovers individually from its WAL, DD is initialized without the last committed transaction, then InnoDB attempts to recover using the outdated DD, and only then binlog XA recovery runs, which is too late.

I'll make a PR

laurynas-biveinis avatar Sep 22 '23 11:09 laurynas-biveinis

It looks like this behavior is an undocumented deliberate design decision.The server layer requests commits to be non-durable too in addition to prepares, when binlog is enabled. InnoDB does the same thing: https://bugs.mysql.com/bug.php?id=75519

Since InnoDB is the DDSE, it also unconditionally flushes the redo log on any DD operation. This way it addresses the scenario above where it recovers individually from the redo log, initializes the data dictionary, and only then binlog crash recovery runs. It looks like MyRocks DDSE will have to flush on every DD operation commit too.

I will still make a PR but only to check the server durability request individually at commit time too, instead of assuming that prepare and commit requested the same.

Any other changes we'd make eventually will depend on the resolution of https://bugs.mysql.com/bug.php?id=75519, if any

laurynas-biveinis avatar Sep 27 '23 12:09 laurynas-biveinis