
adding new mysql shell backup engine

Open rvrangel opened this issue 1 year ago • 29 comments

Description

This is a PR that implements a new backup engine for use with MySQL Shell, which is mentioned in the feature request here: https://github.com/vitessio/vitess/issues/16294

It works a bit differently from the existing engines in Vitess, in that it only stores the metadata of how the backup was created (location + parameters used), and during the restore it uses that location plus other parameters (MySQL Shell's options differ between a dump and a restore, so we can't reuse exactly the same ones).
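
To make that concrete, below is a rough sketch of the kind of metadata the engine records in the MANIFEST (field names here are illustrative rather than the exact types in this PR):

// MySQLShellBackupManifest is a sketch of the extra data this engine needs to
// record: where mysqlsh wrote the dump and how it was produced. Field names
// are hypothetical.
type MySQLShellBackupManifest struct {
    // BackupLocation is the directory or object-store prefix passed to
    // mysqlsh's dump utility.
    BackupLocation string
    // DumpFlags are the flags used when creating the dump; the restore uses a
    // separate set of flags, since dump and load options differ in mysqlsh.
    DumpFlags string
}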

Related Issue(s)

Fixes #16294

Checklist

  • [ ] "Backport to:" labels have been added if this change should be back-ported to release branches
  • [ ] If this change is to be back-ported to previous releases, a justification is included in the PR description
  • [ ] Tests were added or are not required
  • [ ] Did the new or modified tests pass consistently locally and on CI?
  • [ ] Documentation was added or is not required

Deployment Notes

rvrangel avatar Jun 28 '24 16:06 rvrangel

Review Checklist

Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.

General

  • [x] Ensure that the Pull Request has a descriptive title.
  • [x] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • [ ] Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • [x] Apply the release notes (needs details) label if users need to know about this change.
  • [ ] New features should be documented.
  • [ ] There should be some code comments as to why things are implemented the way they are.
  • [ ] There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • [ ] Is this flag really necessary?
  • [ ] Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • [ ] Each item in Jobs should be named in order to mark it as required.
  • [ ] If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • [ ] Protobuf changes should be wire-compatible.
  • [ ] Changes to _vt tables and RPCs need to be backward compatible.
  • [ ] RPC changes should be compatible with vitess-operator
  • [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • [ ] vtctl command output order should be stable and awk-able.

vitess-bot[bot] avatar Jun 28 '24 16:06 vitess-bot[bot]

Codecov Report

Attention: Patch coverage is 25.07740% with 242 lines in your changes missing coverage. Please review.

Project coverage is 69.43%. Comparing base (6d43afc) to head (bc2cff2). Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/mysqlshellbackupengine.go 24.10% 211 Missing :warning:
go/vt/mysqlctl/query.go 0.00% 21 Missing :warning:
go/vt/mysqlctl/fakemysqldaemon.go 33.33% 4 Missing :warning:
go/vt/mysqlctl/backup.go 83.33% 2 Missing :warning:
go/vt/mysqlctl/builtinbackupengine.go 0.00% 2 Missing :warning:
go/vt/mysqlctl/xtrabackupengine.go 0.00% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16295      +/-   ##
==========================================
- Coverage   69.51%   69.43%   -0.08%     
==========================================
  Files        1569     1570       +1     
  Lines      202529   202945     +416     
==========================================
+ Hits       140784   140913     +129     
- Misses      61745    62032     +287     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jun 28 '24 16:06 codecov[bot]

These are good questions, thanks Shlomi!

  • I wasn't sure, but checking the official MySQL Docker images, it seems to be included actually:

    $ docker run -it mysql:8.0 mysqlsh --version
    Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
    mysqlsh   Ver 8.0.38 for Linux on x86_64 - for MySQL 8.0.38 (MySQL Community Server (GPL))
    

    In relation to the version dependency, my understanding is that MySQL Shell needs to be at least the same version as the MySQL Server, but it can be newer. We have successfully been using MySQL Shell 8.4 with Percona Server 8.0 releases.

    We don't use vitess-operator, so for us it means we need to make sure the required binaries are installed anyway (like mysqld and xtrabackup). But I imagine that its being included in the official Docker images means it will be less of an issue?

  • That's a good point I didn't realise. While it is possible to read the @.json file from a directory once the backup is completed (if we are writing to disk), it is less straightforward when we are storing the backups on an object store. Because mysqlsh doesn't work the same way (it doesn't provide you with a single stream that can be cleanly uploaded to whatever storage engine we are using), the thought was to bypass the storage engine in vitess (except for the MANIFEST, which we still write using it) and just use this metadata to help the engine locate and restore the backup instead (see the sketch at the end of this comment). If this were only saving to disk, it would be much easier but also very limiting.

    If we were to do this, we would need to write code to fetch the @.json from the object stores where there is support overlap between mysqlsh and vitess (S3, Azure, GCP, etc.), and some might be missing. Perhaps a better idea would be to include this backup engine without support for PITR in the beginning and file an upstream feature request for mysqlsh to print or save a copy of the executed GTID once the backup is done, which we could then capture more easily (similar to the xtrabackup engine)?

    For additional context, as proposed on https://github.com/vitessio/vitess/pull/16428 and described in the use case, we plan to use this mostly to keep two backup types around for each shard, but always restoring by default from xtrabackup unless we require the logical backups for a specific reason.

    We also considered mysqldump, which to be honest would fit the vitess backup engine workflow a lot better, but it was just too slow. This benchmark from Percona also highlights the same thing, and for us backing up/restoring was so slow it didn't make sense.
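
As a reference for the @.json point above, a minimal sketch of reading the executed GTID set from a directory-based dump might look like this (the helper and struct names are hypothetical; mysqlsh's dump metadata stores the set under a gtidExecuted key):

package mysqlctl

import (
    "encoding/json"
    "os"
    "path/filepath"
)

// dumpMetadata holds the subset of @.json we care about.
type dumpMetadata struct {
    GtidExecuted string `json:"gtidExecuted"`
}

// readDumpGTID reads the executed GTID set recorded by mysqlsh. This only
// works for directory-based backups, which is exactly the limitation
// discussed above for object stores.
func readDumpGTID(dumpDir string) (string, error) {
    data, err := os.ReadFile(filepath.Join(dumpDir, "@.json"))
    if err != nil {
        return "", err
    }
    var meta dumpMetadata
    if err := json.Unmarshal(data, &meta); err != nil {
        return "", err
    }
    return meta.GtidExecuted, nil
}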

rvrangel avatar Jul 19 '24 10:07 rvrangel

We also considered mysqldump which to be honest would fit the vitess backup engine workflow a lot better, but it was just too slow.

Have you looked at mysqlpump? (Next gen mysqldump, included in standard builds).

shlomi-noach avatar Jul 22 '24 12:07 shlomi-noach

I wasn't sure, but checking the official MySQL Docker images, it seems to be included actually:

Oh, that's nice! The reason I thought it wasn't included is that mysqlsh/mysqlshell is not included in the standard MySQL build.

That's a good point I didn't realise. While it is possible to read the @.json file from a directory once the backup is completed (if we are writing to disk), it is less straightforward when we are storing the backups on an object store.

I feel like it's OK to have some solution "with limitations". We should strive to support as much functionality as possible, though. So IMHO we should strive to include the GTID when the backup goes into a directory. This should be possible to do, which then means the backup should fail if for some reason we can't fetch the GTID or validate it (correct GTID form), i.e. return BackupUnusable if we are unable to fetch and validate the GTID entry.

I'd like @deepthi to weigh in her opinion.

Assuming we do decide to move forward, then I'd next expect a CI/endtoend test please, as follows:

  • Take a look at https://github.com/vitessio/vitess/blob/c51ee4b7e3da381843ca6ae37fa404715bf8d459/go/test/endtoend/backup/pitr_xtrabackup/backup_pitr_xtrabackup_test.go and https://github.com/vitessio/vitess/blob/c51ee4b7e3da381843ca6ae37fa404715bf8d459/go/test/endtoend/backup/pitr/backup_pitr_test.go. You will create a similar third path & file that uses a (new) SetupType: backup.MySQLShellBackup.
  • Similarly to https://github.com/vitessio/vitess/blob/c51ee4b7e3da381843ca6ae37fa404715bf8d459/.github/workflows/cluster_endtoend_backup_pitr.yml, there needs to be a test for backup_pitr_mysqlshell
  • Also configured in test/config.json.

When these are all added, a new CI job will run to test mysqlshell-based backup, restores, and point-in-time recoveries. These can (and should) use the directory-based backup configuration, one which does make the GTID available.

If this test passes, then you will have validated the full cycle of backup and restore, as well as the correctness of the captured GTID.

Edit: since mysqlshell does not come bundled in the mysql distribution, we'd need to further download/install mysqlshell in the GitHub workflow file.

S3, Azure, GCP can be left without GTID support for now.

We'd need a documentation PR that clearly indicates the limitations of this method.

shlomi-noach avatar Jul 22 '24 13:07 shlomi-noach

@shlomi-noach yeah, we looked into mysqlpump, but it has been deprecated (the page you linked also has a notice) and it is still slower than MySQL Shell. But since it is likely going to be removed in a future MySQL version, we thought it would be better not to introduce a new feature using it :)

I think the proposal to read the GTID when backing up to a directory, but not when using an object store, is fair; let me look into it and make the necessary changes. If all looks good I will proceed with working on the CI/endtoend tests, I just wanted to get some initial feedback before doing so. Also curious what @deepthi thinks about this approach.

Edit: since mysqlshell does not come bundled in the mysql distribution, we'd need to further download/install mysqlshell in the GitHub workflow file.

Is that something that needs to happen as part of this PR or something separate?

rvrangel avatar Jul 22 '24 15:07 rvrangel

Since we have not fully finished the deletion of mysqld in the vitess/lite Docker Images, the mysqlsh binary will have to be included in the vitess/lite image regardless of whether it's included in the official MySQL Docker Images or not. Since we are letting people choose between using an official MySQL image or the vitess/lite image for their Docker/K8S deployment, we must have the binary in both.

@frouioui We don't use the docker images, but I am not sure I understand: if mysqld is being removed, does that mean users are expected to use a different image for running mysqld?

Regarding the change in the vitess-operator, I opened a separate PR for it: https://github.com/planetscale/vitess-operator/pull/586; is that all that needs to be changed?

rvrangel avatar Jul 22 '24 15:07 rvrangel

I just wanted to get some initial feedback before doing so. Also curious what @deepthi thinks about this approach.

@rvrangel absolutely! Let's wait for more input.

Edit: since mysqlshell does not come bundled in the mysql distribution, we'd need to further download/install mysqlshell in the GitHub workflow file.

Is that something that needs to happen as part of this PR or something separate?

In the scope of this PR, please. I was merely pointing out that our standard workflow won't install mysql-shell. Oh, and there's a broader discussion about how our workflows are generated; see test/ci_workflow_gen.go. I can help you with that part if you need.

shlomi-noach avatar Jul 22 '24 15:07 shlomi-noach

We don't use the docker images, but I am not sure I understand: if mysqld is being removed, does that mean users are expected to use a different image for running mysqld?

@rvrangel, historically we have always shipped mysqld and all required binaries in our Docker Images (vitess/lite). My point is that we need to make sure mysqlsh is available in the vitess/lite Docker Image so that people can use all the Vitess features on Docker/K8S. Currently, mysqlsh is not in our vitess/lite Docker Image:

$> docker run -it --user=vitess vitess/lite:latest bash
vitess@e92ffa0b2f14:/$ mysqlsh
bash: mysqlsh: command not found

EDIT: I said I was going to push to this PR, but I ended up not doing so in case you have ongoing work on the branch.

We need to add mysql-shell to the list of packages we want to install in the install_dependencies.sh script. We also need to create the /home/vitess directory in the lite image and give it the right permissions. Below is the git diff for the fix.

diff --git a/docker/lite/Dockerfile b/docker/lite/Dockerfile
index d5c46cac13..5fc83123b1 100644
--- a/docker/lite/Dockerfile
+++ b/docker/lite/Dockerfile
@@ -42,7 +42,7 @@ RUN /vt/dist/install_dependencies.sh mysql80
 
 # Set up Vitess user and directory tree.
 RUN groupadd -r vitess && useradd -r -g vitess vitess
-RUN mkdir -p /vt/vtdataroot && chown -R vitess:vitess /vt
+RUN mkdir -p /vt/vtdataroot /home/vitess && chown -R vitess:vitess /vt /home/vitess
 
 # Set up Vitess environment (just enough to run pre-built Go binaries)
 ENV VTROOT /vt/src/vitess.io/vitess
diff --git a/docker/utils/install_dependencies.sh b/docker/utils/install_dependencies.sh
index b686c2418b..91e6e2b8c7 100755
--- a/docker/utils/install_dependencies.sh
+++ b/docker/utils/install_dependencies.sh
@@ -86,6 +86,7 @@ mysql57)
         /tmp/mysql-client_${VERSION}-1debian10_amd64.deb
         /tmp/mysql-community-server_${VERSION}-1debian10_amd64.deb
         /tmp/mysql-server_${VERSION}-1debian10_amd64.deb
+        mysql-shell
         percona-xtrabackup-24
     )
     ;;
@@ -112,6 +113,7 @@ mysql80)
         /tmp/mysql-community-server-core_${VERSION}-1debian11_amd64.deb
         /tmp/mysql-community-server_${VERSION}-1debian11_amd64.deb
         /tmp/mysql-server_${VERSION}-1debian11_amd64.deb
+        mysql-shell
         percona-xtrabackup-80
     )
     ;;

You can ensure that the fix works by doing the following:

$> docker build -t vitess/lite-test:latest -f docker/lite/Dockerfile .
$> docker run -it --user="vitess" docker.io/vitess/lite-test:latest bash
vitess@e31f32ddcc91:/$ mysqlsh 
Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
MySQL Shell 8.0.37

Copyright (c) 2016, 2024, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  JS > 
Bye!

Doing this, the version of mysqlsh is 8.0.37, which does not necessarily match the version of MySQL we have in the image (8.0.30), but it respects what you said earlier:

my understanding is that MySQL Shell needs to be at least the same version as the MySQL Server, but it can be newer

Note that if you are on a different architecture than amd64, the Docker build will fail. You should merge main into this branch to get the commit that fixes it if you need it.

For @shlomi-noach: in the output I pasted just above, where I run mysqlsh inside our Docker Image, we immediately get the warning Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory. Would that potentially be an issue for our use case?

Regarding the change in the vitess-operator, I opened a separate PR for it: planetscale/vitess-operator#586; is that all that needs to be changed?

Thank you for opening this! 🙇🏻 I will take a look and comment on that PR directly.

frouioui avatar Jul 22 '24 17:07 frouioui

@shlomi-noach and @frouioui have pretty much covered all of the main concerns. My opinion is that in principle this is a good feature addition. The only concern I have left is whether this feature works only with the file backup storage type. I'm basing this off Shlomi's comment:

S3, Azure, GCP can be left without GTID support for now.

Can you explain what will and will not work with S3, GCP and Azure?

deepthi avatar Jul 22 '24 17:07 deepthi

S3, Azure, GCP can be left without GTID support for now.

Can you explain what will and will not work with S3, GCP and Azure?

@deepthi I will meanwhile explain what my understanding is: that in S3, Azure, and GCP, we won't have GTID information in the backup manifest, which means we won't be able to use such backups for point-in-time recoveries.

shlomi-noach avatar Jul 23 '24 04:07 shlomi-noach

@shlomi-noach sorry I haven't made too much progress on this today, but I wanted to bring up another idea we discussed, in relation to the GTIDs.

What do you think if instead of capturing the executed GTID sets from the @.json file, we were to stop replication on the tablet, capture the executed GTID sets directly from MySQL, and then start the backup process? Once MySQL Shell has acquired the global read lock, it should have the same GTID set we had, and then we can start the replication threads again. Any thoughts on this approach?

rvrangel avatar Jul 24 '24 15:07 rvrangel

sorry I haven't made too much progress on this today

We're here for the long run. Take your time.

what do you think if instead of capturing the executed GTID sets from the @.json file, we were to stop replication on the tablet

That's a good idea, with questions. Does that mean backups need to necessarily run on replicas? Or will mysqlshell also be able to run on primaries? Meaning, is this a new limitation we will now impose on mysqlshell backups, or is that an already existing limitation? I'm cheating a bit in this question, because we normally run backups from a BACKUP tablet type, which is always a replica.

When taking a backup from a replica, if you need to stop replication, which is fine, you first need to take the replica out of serving mode and return it to service afterwards. Usually the tablet type is BACKUP anyway, which means it's not serving, but it could be. Let's do the comparison with the builtin backup engine, which goes further and actually shuts down the mysql server. See what happens there in terms of draining the tablet / taking it out of and back into service?

shlomi-noach avatar Jul 25 '24 05:07 shlomi-noach

Fair enough, we don't use vtbackup, so we don't actually provision an entirely new tablet of the BACKUP type. That said, there are some differences between the engines: the builtin backup engine will automatically transition the host to BACKUP because its ShouldDrainForBackup() returns true, while the xtrabackup engine's doesn't.
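
For context, that drain behaviour is decided per engine, roughly like this (the exact method signature is an assumption on my part, and what the new engine should do here is part of what's being weighed below):

// The builtin engine shuts mysqld down, so the tablet has to be drained first.
func (be *BuiltinBackupEngine) ShouldDrainForBackup() bool { return true }

// The xtrabackup engine runs online, so no drain is required.
func (be *XtrabackupEngine) ShouldDrainForBackup() bool { return false }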

So I guess our options are:

collect the GTIDs from the output directory

  • can be used on any tablet type
  • can be taken online, similar to the xtrabackup engine
  • won't support PITR for object stores, only when backing up to a directory

stop replication and collect GTIDs from MySQL

  • backup can't be taken from a primary
  • replication needs to be stopped (either briefly at the beginning or for the entire backup) to ensure the GTID sets are consistent
  • will support PITR regardless of where it is stored.

The second approach sounds like a good compromise, but first I wanted to ask a clarifying question: while the GTID sets are not in the MANIFEST, we actually pass them back to Vitess when it is doing a restore (that's because we have time to fetch them from MySQL between when MySQL Shell has finished the restore and when we hand it over to Vitess). That means ExecuteRestore() will return the MANIFEST with the correct GTIDs. Does the Vitess PITR process expect to be able to read the GTIDs from the MANIFEST before it attempts to restore any backups, or does that happen later in the process?

rvrangel avatar Jul 25 '24 11:07 rvrangel

Does the Vitess PITR process expect to be able to read the GTIDs from the MANIFEST before it attempts to restore any backups

Yes, it does, as part of computing the potential recovery path (one full restore + zero or more incremental restores).

shlomi-noach avatar Jul 25 '24 11:07 shlomi-noach

I hit an issue where we can't start/stop replication while there is a backup lock in place in MySQL (created by MySQL Shell), so we wouldn't be able to use that approach.

A different solution I have been testing is to acquire a global read lock (FLUSH TABLES WITH READ LOCK) and then release it as soon as mysqlsh releases its own; this way we only block writes for a couple of seconds and we get consistent GTID sets. It also works on all tablet types, including primaries. I am making adjustments to make sure the endtoend tests cover everything.
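
A minimal sketch of that locking sequence (using database/sql for illustration rather than the engine's actual plumbing; the helper and its wiring are hypothetical):

package mysqlctl

import (
    "context"
    "database/sql"
)

// captureGTIDForDump blocks writes, reads the executed GTID set, and returns a
// release func that the caller invokes once mysqlsh has acquired its own lock.
// FLUSH TABLES WITH READ LOCK is session-scoped, so everything must run on the
// same connection.
func captureGTIDForDump(ctx context.Context, db *sql.DB) (gtid string, release func() error, err error) {
    conn, err := db.Conn(ctx)
    if err != nil {
        return "", nil, err
    }
    if _, err := conn.ExecContext(ctx, "FLUSH TABLES WITH READ LOCK"); err != nil {
        conn.Close()
        return "", nil, err
    }
    if err := conn.QueryRowContext(ctx, "SELECT @@global.gtid_executed").Scan(&gtid); err != nil {
        conn.ExecContext(ctx, "UNLOCK TABLES")
        conn.Close()
        return "", nil, err
    }
    release = func() error {
        // Called as soon as mysqlsh reports it holds its own consistent lock,
        // so writes are only blocked for a couple of seconds.
        defer conn.Close()
        _, err := conn.ExecContext(ctx, "UNLOCK TABLES")
        return err
    }
    return gtid, release, nil
}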

rvrangel avatar Jul 31 '24 16:07 rvrangel

@shlomi-noach I have added the endtoend tests and fixed the issues with the other tests; I would appreciate it if you could take a look when you have the time. I changed the approach as described in my last comment, which will support different tablet types (including PRIMARY) and cause the least amount of disruption to the tablet, while being able to consistently save the GTID sets and work with PITR.

rvrangel avatar Aug 14 '24 10:08 rvrangel

@rvrangel awesome, thank you! I'll do another full review by early next week (already done for the week).

shlomi-noach avatar Aug 15 '24 14:08 shlomi-noach

I hit an issue where we can't start/stop replication while there is a backup lock in place in MySQL (created by MySQL Shell), so we wouldn't be able to use that approach.

The other thing you would have ended up hitting is vtorc restarting replication that was stopped by this process. So I'm glad you found an alternate solution.

deepthi avatar Aug 15 '24 16:08 deepthi

  • [ ] When we get to adding release notes, we should mark this new feature as "experimental"

shlomi-noach avatar Aug 20 '24 05:08 shlomi-noach

Could you please rename pitr_mysqlshell to backup_pitr_mysqlshell so that it aligns with backup_pitr_xtrabackup?

shlomi-noach avatar Aug 21 '24 09:08 shlomi-noach

Yeah, I actually fixed it to be backup_pitr_mysqlshell but didn't re-generate the workflows; it should be correct now.

rvrangel avatar Aug 21 '24 12:08 rvrangel

I see the new backup_pitr_mysqlshell passes CI -- this is wonderful news! I'll look into the logs to get more details.

shlomi-noach avatar Aug 21 '24 14:08 shlomi-noach

It seems the static checks are failing now, complaining about a different version of Go in the CI templates, but I don't remember making any change that would cause that. Would you know what the problem could be? Running tools/check_go_versions.sh doesn't produce any errors.

rvrangel avatar Aug 21 '24 16:08 rvrangel

it seems the static checks are failing now, complaining about a different version of Go in the CI templates, but I don't remember making any change that would cause that

Quickest advice: merge the latest from main? Hopefully that will make the problem go away. Otherwise we'll have to look into it 😆

shlomi-noach avatar Aug 21 '24 16:08 shlomi-noach

merge the latest from main

And make sure to then make generate_ci_workflows again.

shlomi-noach avatar Aug 21 '24 16:08 shlomi-noach

yeah, that fixed it, all tests are passing now 🎉

rvrangel avatar Aug 21 '24 18:08 rvrangel

Gentle reminder @rvrangel in case this got lost among all the other comments: https://github.com/vitessio/vitess/pull/16295#pullrequestreview-2194441789 is still pending.

Let me know if I can help in any way with the vitess-operator stuff. I am happy to help 💪🏻

frouioui avatar Aug 21 '24 21:08 frouioui

@rvrangel In addition to what @frouioui pointed out, there are two other things that are needed before we can merge

  • an update to the release notes summary file with a brief note on this functionality
  • a website docs PR that updates all relevant backup/restore pages in the v21 (Development) docs.

deepthi avatar Aug 27 '24 17:08 deepthi

Hey all! So we're all realizing this PR has been ongoing for a while. None of us was expecting it to take this long, and we keep finding new stuff we were unaware of at the beginning.

So first of all, please know we all appreciate your efforts and patience. We're all learning on the go (I've personally never used mysqlshell before, let alone mysqlshell backups).

The other thing is I'm thinking there are multiple ongoing discussions on this issue. I wanted to propose we all take a brief pause and figure out what the remaining issues are. Also, are y'all interested in syncing to flesh out the remaining issues? Or would you prefer to keep the discussion here?

shlomi-noach avatar Sep 03 '24 17:09 shlomi-noach