vitess
WIP: incremental logical backup and point in time recovery
Description
This PR introduces incremental backups, towards point in time recovery (which may or may not happen in this PR).
Incremental backups are only available for:
- `builtin` backup method
- Running MySQL GTID
As a reminder, a `builtin` backup is a full backup taken by shutting down the server and copying over files.
The incremental backup is done by creating a backup of binary log files. The MySQL server is neither stopped nor interrupted by this operation.
An incremental backup is taken like so:
vtctlclient -- Backup --incremental_from_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-615" zone1-0000000102
In the above, we ask Vitess to create an incremental backup that starts from position `MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-615`. Since the backup method is to copy complete binlog files, the backup may actually start at an earlier position. Vitess does not require the position to fall on a binary log rotation.
The backup fails if:
- The first available binlog file does not cover the required position (i.e. binlog files have been rotated and purged, and the incremental GTIDs no longer exist)
- No binary log file contains the requested position
- There is nothing to back up (there are no new GTID entries in the binary logs on top of the requested position, i.e. the requested position is either at the end of the existing binary logs or beyond it).
The backup's manifest has been updated with new fields. Here's a new manifest:
{
"BackupMethod": "builtin",
"Position": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-883",
"FromPosition": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867",
"Incremental": true,
"BackupTime": "2022-08-25T12:55:05Z",
"FinishedTime": "2022-08-25T12:55:05Z",
"ServerUUID": "1ea0631b-22b6-11ed-933f-0a43f95f28a3",
"TabletAlias": "zone1-0000000102",
"CompressionEngine": "pargzip",
"FileEntries": [
..
]
}
- The above is an incremental backup's manifest, as indicated by `"Incremental": true`.
- `"FileEntries"` will only list binary logs.
- `"FromPosition"` indicates the first position covered by the backup. It is smaller than or equal to the requested `--incremental_from_pos`. This value is empty for a full backup.
- `ServerUUID` is new and self-explanatory, added for convenience.
- `TabletAlias` is new and self-explanatory, added for convenience.
Added some unit tests. WIP to add endtoend tests.
Related Issue(s)
TBD
Checklist
- [ ] "Backport me!" label has been added if this change should be backported
- [ ] Tests were added or are not required
- [ ] Documentation was added or is not required
Deployment Notes
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [x] Ensure that the Pull Request has a descriptive title.
- [ ] If this is a change that users need to know about, please apply the `release notes (needs details)` label so that merging is blocked unless the summary release notes document is included.
If a new flag is being introduced:
- [ ] Is it really necessary to add this flag?
- [ ] Flag names should be clear and intuitive (as far as possible)
- [ ] Help text should be descriptive.
- [ ] Flag names should use dashes (`-`) as word separators rather than underscores (`_`).
If a workflow is added or modified:
- [ ] Each item in `Jobs` should be named in order to mark it as `required`.
- [ ] If the workflow should be required, the maintainer team should be notified.
Bug fixes
- [ ] There should be at least one unit or end-to-end test.
- [ ] The Pull Request description should include a link to an issue that describes the bug.
Non-trivial changes
- [ ] There should be some code comments as to why things are implemented the way they are.
New/Existing features
- [ ] Should be documented, either by modifying the existing documentation or creating new documentation.
- [x] New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to `_vt` tables and RPCs need to be backward compatible.
- [ ] `vtctl` command output order should be stable and `awk`-able.
- [ ] RPC changes should be compatible with vitess-operator
- [ ] If a flag is removed, then it should also be removed from VTop, if used there.
Milestone progress, as we now have endtoend tests to validate PITR restore.
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld (308.09s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/full_backup (31.03s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/first_incremental_backup (1.22s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/make_writes,_succeed (1.21s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/fail,_no_binary_logs_to_backup (1.17s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/make_writes_again,_succeed (1.19s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/auto_position,_succeed (1.21s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/fail_auto_position,_no_binary_logs_to_backup (1.18s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/auto_position,_make_writes_again,_succeed (1.20s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/from_full_backup_position (1.21s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR (108.00s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-55,_3_records (18.51s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-57,_5_records (17.52s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-53,_1_records (17.44s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-56,_4_records (17.52s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-54,_2_records (18.50s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-58,_6_records (18.51s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/remove_full_position_backups (0.01s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2 (110.87s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-58,_6_records (18.52s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-53,_1_records (18.43s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-54,_2_records (18.48s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-55,_3_records (18.46s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-56,_4_records (18.50s)
Sep 21 10:33:38 --- PASS: TestIncrementalBackupMysqlctld/PITR-2/MySQL56/252b3998-3998-11ed-b790-0a43f95f28a3:1-57,_5_records (18.48s)
Sep 21 10:33:38 PASS
Sep 21 10:33:38 ok vitess.io/vitess/go/test/endtoend/backup/pitr 308.102s
Sep 21 10:33:38 2022/09/21 10:33:38 local.backup_pitr: PASSED in 5m9.5s
The manifest of a full backup now also includes a `PurgedPosition` field, which is the value of MySQL `@@gtid_purged` where applicable. This value is essential to evaluating a recovery path from incremental backups. The reason is that backed up binary logs may have an incomplete view of `Previous-GTIDs`. For example, the primary may have a GTID set of `f3a47136-3993-11ed-9678-0a43f95f28a3:1-53`, while the first binary log on a replica can state `f3a47136-3993-11ed-9678-0a43f95f28a3:3-53` (notice `3-53`). This happens when the replica is reset and starts from scratch with purged GTIDs. It is a valid situation, but it causes confusion: are the binary logs actually covering, say, a backup for `...:1-53`, or are they missing anything?
Knowing that the full backup was made with purged GTIDs `...:1-2` sets our mind at rest, and validates that the binary log legitimately follows up on that full backup.
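The reasoning above can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: a single server UUID, GTID sequences reduced to plain integers, and hypothetical names (`gtidRange`, `unexplainedGap`) that are not taken from the Vitess code base. It checks whether the GTIDs missing from a binary log's `Previous-GTIDs` are fully accounted for by the purged set recorded in the full backup.

```go
package main

import "fmt"

// gtidRange models uuid:First-Last (inclusive) for a single server UUID.
// Hypothetical type, for illustration only.
type gtidRange struct {
	First, Last int
}

// unexplainedGap returns the GTIDs that appear neither in the purged set
// (assumed to be uuid:1-purgedLast, or empty when purgedLast is 0) nor in
// the binary log's Previous-GTIDs. The boolean is false when there is no gap.
func unexplainedGap(prev gtidRange, purgedLast int) (gtidRange, bool) {
	gap := gtidRange{First: purgedLast + 1, Last: prev.First - 1}
	if gap.First > gap.Last {
		return gtidRange{}, false
	}
	return gap, true
}

func main() {
	prev := gtidRange{First: 3, Last: 53} // replica's first binlog states ...:3-53

	// Full backup manifest recorded PurgedPosition ...:1-2.
	_, missing := unexplainedGap(prev, 2)
	fmt.Println("gap with purged 1-2:", missing)

	// Without any purged information, ...:1-2 cannot be accounted for.
	gap, missing := unexplainedGap(prev, 0)
	fmt.Println("gap without purged info:", missing, gap)
}
```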
Overview of changes in this PR
This PR is now ready to review. Here's what it provides:
- Ability to run an incremental backup: a backup of binary logs from either:
  - A given position (GTID), or
  - Position of last good backup ("auto" position)
- Ability to restore to a specific position (GTID)
- Additions to the backup manifest
- Un-explode function calls by passing `RestoreFromBackupRequest` proto
- `endtoend` tests for incremental backup / PITR
Let's drill down, in constructive order. Please refer to the RFC for background on why we are doing this.
Additions to the backup manifest
The following fields are added:
- `Incremental`: boolean, identifies whether this is a full or incremental backup
- `PurgedPosition`: specific to MySQL, the `gtid_purged` at the time of backup. This is essential information for a point in time restore
- `FromPosition`: in an incremental backup, this is the position from which the backup applies. The increment is from `FromPosition` (exclusive) up to `Position` (inclusive)
- `ServerUUID`: useful information for debugging/analyzing, the UUID of the MySQL server being backed up
- `TabletAlias`: useful information for debugging/analyzing, the alias of the tablet issuing the backup
- `Keyspace`: useful information for debugging/analyzing
- `Shard`: useful information for debugging/analyzing
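For readers who want to inspect these fields programmatically, here is a minimal Go sketch that decodes a manifest with the standard `encoding/json` package. The struct name `manifestSketch`, the helper names, and the Go field types are assumptions made for illustration; the actual manifest struct in Vitess may be shaped differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestSketch lists the manifest fields named in this description.
// Hypothetical type: field types are guessed from the sample manifest.
type manifestSketch struct {
	BackupMethod   string
	Position       string
	PurgedPosition string
	FromPosition   string
	Incremental    bool
	ServerUUID     string
	TabletAlias    string
	Keyspace       string
	Shard          string
}

// parseManifest decodes a manifest JSON document; unknown keys are ignored.
func parseManifest(data []byte) (manifestSketch, error) {
	var m manifestSketch
	err := json.Unmarshal(data, &m)
	return m, err
}

// isConsistent checks the relationship stated in this description:
// FromPosition is non-empty for an incremental backup and empty for a full one.
func isConsistent(m manifestSketch) bool {
	return m.Incremental == (m.FromPosition != "")
}

func main() {
	data := []byte(`{
		"BackupMethod": "builtin",
		"Position": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-883",
		"FromPosition": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867",
		"Incremental": true,
		"ServerUUID": "1ea0631b-22b6-11ed-933f-0a43f95f28a3",
		"TabletAlias": "zone1-0000000102"
	}`)
	m, err := parseManifest(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Incremental, m.FromPosition, isConsistent(m))
}
```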
Running an incremental backup
The `Backup` command now supports the `--incremental_from_pos` flag, which can receive a valid position or the value `auto`. For example:
$ vtctlclient -- Backup --incremental_from_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-615" zone1-0000000102
$ vtctlclient -- Backup --incremental_from_pos "auto" zone1-0000000102
When the value is `auto`, the position is evaluated as the last successful backup's `Position`. The idea with incremental backups is to create a contiguous (overlaps allowed) sequence of backups that store all changes since the last full backup.
The incremental backup copies complete binary log files. It neither takes MySQL down nor places any locks, and it does not interrupt traffic on the MySQL server. It first rotates the binary logs, then copies everything from the requested position up to the last completed binary log.
The backup thus does not necessarily start exactly at the requested position. It starts with the first binary log that has entries newer than the requested position. It is OK if the binary logs include transactions prior to the requested position: the restore process discards any duplicates.
Normally, you can expect the backups to be precisely contiguous. Consider an `auto` value: due to the nature of log rotation and the fact that we copy complete binlog files, the next incremental backup starts with the first binary log not covered by the previous backup, which itself copied the preceding binlog file in full. Again, it is completely valid to enter any good position.
The incremental backup fails if it is unable to obtain binary logs from the given position (i.e. binary logs have been purged).
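The selection of binary logs described above can be sketched as follows. This is a simplified model and not the Vitess implementation: positions are plain integers standing in for `uuid:1-N` GTID sets, the input is assumed to be the ordered list of completed binary logs after rotation, and the names `binlogFile` and `selectBinlogs` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// binlogFile is a hypothetical summary of one completed binary log.
// PrevLast is the last GTID sequence in its Previous-GTIDs header,
// Last is the last GTID sequence written in the file itself.
type binlogFile struct {
	Name     string
	PrevLast int
	Last     int
}

var (
	errNothingToBackup = errors.New("no new GTID entries on top of requested position")
	errPositionPurged  = errors.New("binary logs covering requested position are not available")
)

// selectBinlogs returns the files to copy for an incremental backup that
// starts at position from. The first returned file may begin before from,
// since complete files are copied.
func selectBinlogs(files []binlogFile, from int) ([]binlogFile, error) {
	for i, f := range files {
		if f.Last > from {
			if f.PrevLast > from {
				// Entries between from and this file were purged.
				return nil, errPositionPurged
			}
			return files[i:], nil
		}
	}
	return nil, errNothingToBackup
}

func main() {
	files := []binlogFile{
		{Name: "binlog.000003", PrevLast: 600, Last: 610},
		{Name: "binlog.000004", PrevLast: 610, Last: 620},
		{Name: "binlog.000005", PrevLast: 620, Last: 640},
	}
	for _, from := range []int{615, 640, 500} {
		selected, err := selectBinlogs(files, from)
		fmt.Println(from, selected, err)
	}
}
```

In this model, a request for position 615 selects the file whose `Previous-GTIDs` ends at 610, which is why the manifest's `FromPosition` can be smaller than the requested position.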
The manifest of an incremental backup has a non-empty `FromPosition` value, and an `Incremental = true` value.
Running a point in time recovery
We call this "point in time" but it is in fact "point in position". The `RestoreFromBackup` command now supports these new flags:
- `--restore_to_pos`: request to restore the server up to the given position (inclusive) and not one step further.
- `--dry_run`: when `true`, calculate the restore process, if possible, evaluate a path, but exit without actually making any changes to the server.
Examples:
$ vtctlclient -- RestoreFromBackup --restore_to_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-220" zone1-0000000102
The restore process seeks a restore path: a sequence of backups (handles/manifests) consisting of one full backup followed by zero or more incremental backups, that can bring the server up to the requested position, inclusive.
The command fails if it cannot evaluate a restore path. Possible reasons:
- there are gaps in the incremental backups
- existing backups don't reach as far as the requested position
- all full backups exceed the requested position (so there's no way to get to an earlier position)
The command outputs the restore path.
There may be multiple restore paths, in which case the command prefers the path with the fewest backups. This says nothing about the number and size of the binary logs involved.
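To make the notion of a restore path concrete, here is a minimal Go sketch of one way to compute a path with the fewest backups. It is not the code in this PR: positions are modeled as plain integers rather than GTID sets, and `backup`, `findRestorePath` and `errNoRestorePath` are hypothetical names. For each eligible full backup, it greedily picks the incremental backup that connects to the current position and reaches furthest.

```go
package main

import (
	"errors"
	"fmt"
)

// backup is a hypothetical, simplified view of a manifest.
// From is FromPosition (exclusive, ignored for full backups), To is Position (inclusive).
type backup struct {
	Name        string
	Incremental bool
	From        int
	To          int
}

var errNoRestorePath = errors.New("no restore path found")

// findRestorePath returns one full backup followed by zero or more
// incremental backups that together reach target, preferring fewer backups.
func findRestorePath(backups []backup, target int) ([]backup, error) {
	var best []backup
	for _, full := range backups {
		if full.Incremental || full.To > target {
			continue // a full backup past the target cannot be rewound
		}
		path := []backup{full}
		cur := full.To
		for cur < target {
			var next *backup
			for i := range backups {
				b := &backups[i]
				// Must connect without a gap (overlap allowed) and make progress.
				if b.Incremental && b.From <= cur && b.To > cur && (next == nil || b.To > next.To) {
					next = b
				}
			}
			if next == nil {
				path = nil // gap, or backups do not reach the target
				break
			}
			path = append(path, *next)
			cur = next.To
		}
		if path != nil && (best == nil || len(path) < len(best)) {
			best = path
		}
	}
	if best == nil {
		return nil, errNoRestorePath
	}
	return best, nil
}

func main() {
	backups := []backup{
		{Name: "full", To: 50},
		{Name: "inc1", Incremental: true, From: 50, To: 53},
		{Name: "inc2", Incremental: true, From: 53, To: 55},
		{Name: "inc3", Incremental: true, From: 55, To: 58},
		{Name: "inc-wide", Incremental: true, From: 50, To: 56},
	}
	path, err := findRestorePath(backups, 57)
	fmt.Println(path, err)
}
```

The last backup in a path may extend beyond the target, since the restore applies binary log events only up to the requested position.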
`RestoreFromBackup --restore_to_pos` ends with:
- the restored server in an intentionally broken replication setup
- tablet type is `DRAINED`
The reasoning is that we want to restore up to a specific point and not a single step further. We therefore do not let the restored server join the replication stream. We also do not want it to serve traffic, though we do want to be able to query it explicitly.
Testing
Unit tests validate logic such as constructing a restore path from a list of manifests, or finding sequences of binary log files.
We introduce a new `endtoend` test shard: `backup_pitr`. This test validates incremental backups as well as point in time restores.
The test works by running a cluster and taking backups off one of the replicas. We ingest data into the database such that we know what data to expect in each backup.
We validate pre-calculated or `auto` positions for incremental backups. We validate invalid positions and failures.
We then validate restores by randomly restoring into any one of our pre-recorded positions. We know what data to expect per restore and validate that. The reason for random restores is so that we validate we can restore into both older and newer positions, and ensure that all existing data is erased.
`Cluster (vttablet_prscomplex)` fails repeatedly, but I see it has nothing to do with this PR.
Yes, it was broken by #11592. @harshit-gangal is planning to revert that and fix it properly.
Not sure which files require other reviews. We should probably add more code owners for vtadmin proto files, basically duplicating the owners of the proto files. cc @harshit-gangal
I still need to write release notes, as well as website docs.
Release notes added.
bwahahaha! It is merged!