Bug Report: PITR from timestamp or position not working
Overview of the Issue
Taking incremental backups (builtin engine) and full backups (xtrabackup engine). PITR using either a timestamp or a position fails. This is a K8s deployment with the Vitess operator.
Reproduction Steps
- Deploy the following vschema (a sketch for applying it and the schema below follows this list):
{
  "tables": {
    "product": {},
    "customer": {},
    "corder": {}
  }
}
- Deploy Schema:
create table if not exists product(
sku varbinary(128),
description varbinary(128),
price bigint,
primary key(sku)
) ENGINE=InnoDB;
create table if not exists customer(
customer_id bigint not null auto_increment,
email varbinary(128),
primary key(customer_id)
) ENGINE=InnoDB;
create table if not exists corder(
order_id bigint not null auto_increment,
customer_id bigint,
sku varbinary(128),
price bigint,
primary key(order_id)
) ENGINE=InnoDB;
- Run a full backup manually the first time, then run a cron job that takes a full backup every 12h and incrementals every hour. Add a cron job that inserts random data into the tables every 30 min, which ensures the incrementals won't fail (they fail if there has been no change in the database since the last backup).
- Try to PITR from a timestamp:
vtctldclient RestoreFromBackup --restore-to-timestamp "timestamp" tablet-alias
- View the error
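For reference, a hedged sketch of how the vschema and schema above could be applied, assuming they are saved locally as vschema.json and schema.sql (the file names are illustrative):
# Apply the vschema and schema to the commerce keyspace:
vtctldclient ApplyVSchema --vschema-file vschema.json commerce
vtctldclient ApplySchema --sql-file schema.sql commerce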
Binary Version
vitess version v18.0.1
Latest tag of the Vitess operator.
Operating System and Environment details
K8S environment
Log Fragments
Here is the log output when running PITR from timestamp:
https://gist.github.com/GenLN/7b9ba323aee7778390a9269379b16c2d
I am unable to reproduce, but I think this issue description leaves a lot of implementation freedom.
To avoid waiting days for reproduction, I used this script:
#!/bin/bash
vtctldclient BackupShard commerce/0
for f in {1..3}; do
  echo "new full backup cycle"
  for i in {1..12}; do
    echo "new incremental backup cycle"
    sleep 30
    mysql -e "insert into corder values(rand()*1000000,rand()*1000000,rand()*1000000,rand()*1000000);"
    mysql -e "insert into corder values(rand()*1000000,rand()*1000000,rand()*1000000,rand()*1000000);"
    sleep 30
    vtctldclient BackupShard commerce/0 --incremental-from-pos=auto
  done
  vtctldclient BackupShard commerce/0
done
Which:
- Runs a full backup (using BackupShard)
- Runs 12 incremental backups at ~1 min intervals
- Runs a full backup
- Loops: runs 12 incremental backups again, and so on
Essentially this turns 12h -> 12m, 1h -> 1m, 30min -> 1min. The runtime totals about 36 minutes (more due to overhead). On my test host, the run spanned Dec 17 12:31:34 to Dec 17 13:07:51, or in RFC3339 format, 2023-12-17T12:31:34Z to 2023-12-17T13:07:51Z.
My tablets are:
$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+---------+------------------+------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+---------+------------------+------------+----------------------+
| zone1 | commerce | 0 | PRIMARY | SERVING | zone1-0000000100 | <redacted> | 2023-12-17T12:24:32Z |
| zone1 | commerce | 0 | REPLICA | SERVING | zone1-0000000101 | <redacted> | |
| zone1 | commerce | 0 | RDONLY | SERVING | zone1-0000000102 | <redacted> | |
+-------+----------+-------+------------+---------+------------------+------------+----------------------+
I ran the following restores:
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:31:35Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:32:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:35:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:42:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:43:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:44:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:45:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:46:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:47:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:48:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:49:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:50:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:51:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:52:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:53:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:54:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T12:55:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T13:04:00Z" zone1-0000000101
vtctldclient RestoreFromBackup --restore-to-timestamp "2023-12-17T13:07:00Z" zone1-0000000101
and all went well and without error.
Let's try to focus on the mysqlbinlog error you got:
mysqlbinlog command failed: exit status 1\nfailed to apply binlog file /tmp/restore-incremental-3212119968/vt/vtdataroot/vt_2469782763/bin-logs/vt-2548885007-bin.000091
Can you find anything in the mysqld logs? Is it possible mysqlbinlog is not in your path? (Can you check that mysqlbinlog is on the same path as mysqld?)
Could you check whether any of the above makes sense?
Hi, I will retry everything from the beginning and update with the results.
Hi @shlomi-noach, I ran the test again and still have issues restoring a PITR backup.
I will describe my steps and also paste some gists of the components I define, plus logs from the restore attempts.
All vtctldclient commands are run from my local machine; before that I set up port-forwards using the script ./pf.sh:
$ ./pf.sh &
alias vtctldclient="vtctldclient --server=localhost:15999"
alias vtctlclient="vtctlclient --server=localhost:15999"
alias mysql="mysql -h 127.0.0.1 -P 15306 -u user"
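(For reference, a minimal pf.sh along these lines could do it; the service selectors are assumptions based on the standard operator labels:)
#!/bin/bash
# Port-forward vtctld gRPC (15999) and vtgate MySQL (15306) to localhost.
kubectl port-forward --address localhost \
  "$(kubectl get service --selector="planetscale.com/component=vtctld" -o name | head -n1)" 15999 &
kubectl port-forward --address localhost \
  "$(kubectl get service --selector="planetscale.com/component=vtgate,!planetscale.com/cell" -o name | head -n1)" 15306:3306 &
wait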
- Deploy the k8s operator
- Deploy the Vitess cluster using values.yaml. Here you will notice I use a different dir for error/slow logs, as I have a fluentbit sidecar container which parses those logs and ships them to a Loki instance (mounting an emptyDir for these logs)...
- Deploy the schema and vschema for the keyspaces commerce/- and customer, sharded equally (-80, 80-) -- schemas and vschemas taken from the Vitess docs for Vitess on k8s
- Deploy a cronjob that inserts dummy data (simulating writes on Vitess)
- Deploy the pitr-backup-cronjob
- Try to RestoreFromBackup; I have these output logs from attempts: 1, 2, 3, 4, 5
$ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:30:00Z" zone1-2469782763
$ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:30:15Z" zone1-0790125915
$ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T06:00:15Z" zone1-0790125915
$ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T09:00:14Z" zone1-2469782763
$ vtctldclient RestoreFromBackup --restore-to-timestamp "2024-02-15T09:00:16Z" zone1-2859626137
- Restore failed
- Here is the mysqld error.log from the commerce/- vttablet, and the vttablet error.log
Here is some info:
$ mysql -e "show vitess_tablets"
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+
| Cell | Keyspace | Shard | TabletType | State | Alias | Hostname | PrimaryTermStartTime |
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+
| zone1 | commerce | - | PRIMARY | SERVING | zone1-2548885007 | REDACTED | 2024-02-14T11:06:07Z |
| zone1 | commerce | - | REPLICA | SERVING | zone1-2469782763 | REDACTED | |
| zone1 | commerce | - | REPLICA | SERVING | zone1-0790125915 | REDACTED | |
| zone1 | customer | -80 | PRIMARY | SERVING | zone1-0120139806 | REDACTED | 2024-02-14T14:30:45Z |
| zone1 | customer | -80 | REPLICA | SERVING | zone1-2859626137 | REDACTED | |
| zone1 | customer | -80 | REPLICA | SERVING | zone1-2289928654 | REDACTED | |
| zone1 | customer | 80- | PRIMARY | SERVING | zone1-4277914223 | REDACTED | 2024-02-14T14:30:04Z |
| zone1 | customer | 80- | REPLICA | SERVING | zone1-0118374573 | REDACTED | |
| zone1 | customer | 80- | REPLICA | SERVING | zone1-2298643297 | REDACTED | |
+-------+----------+-------+------------+---------+------------------+--------------+----------------------+
Here is the list of backups for commerce/-, taken every 30 min. (The backup script keeps a counter: if the counter doesn't exist or equals 12, it runs a full backup; otherwise it runs an incremental backup. A sketch of that logic follows the list.)
$ vtctldclient GetBackups commerce/-
Handling connection for 15999
2024-02-14.144515.zone1-0790125915
2024-02-14.145018.zone1-2469782763
2024-02-14.145518.zone1-0790125915
2024-02-14.150014.zone1-0790125915
2024-02-14.153015.zone1-2469782763
2024-02-14.160015.zone1-2469782763
2024-02-14.163014.zone1-2469782763
2024-02-14.170015.zone1-2469782763
2024-02-14.173014.zone1-0790125915
2024-02-14.180015.zone1-0790125915
2024-02-14.183016.zone1-0790125915
2024-02-14.190015.zone1-2469782763
2024-02-14.193016.zone1-2469782763
2024-02-14.200014.zone1-2469782763
2024-02-14.203016.zone1-2469782763
2024-02-14.210014.zone1-2469782763
2024-02-14.213015.zone1-0790125915
2024-02-14.220014.zone1-0790125915
2024-02-14.223014.zone1-0790125915
2024-02-14.230015.zone1-2469782763
2024-02-14.233016.zone1-0790125915
2024-02-15.000014.zone1-0790125915
2024-02-15.003014.zone1-0790125915
2024-02-15.010015.zone1-0790125915
2024-02-15.013014.zone1-0790125915
2024-02-15.020015.zone1-2469782763
2024-02-15.023014.zone1-0790125915
2024-02-15.030015.zone1-2469782763
2024-02-15.033014.zone1-0790125915
2024-02-15.040015.zone1-0790125915
2024-02-15.043014.zone1-0790125915
2024-02-15.050017.zone1-2469782763
2024-02-15.053014.zone1-0790125915
2024-02-15.060015.zone1-0790125915
2024-02-15.063015.zone1-0790125915
2024-02-15.070014.zone1-0790125915
2024-02-15.073014.zone1-0790125915
2024-02-15.080015.zone1-2469782763
2024-02-15.083014.zone1-2469782763
2024-02-15.090014.zone1-2469782763
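A hedged sketch of that counter logic (the counter file path is made up for illustration):
#!/bin/bash
# Every 12th run, or when no counter exists yet, take a full backup;
# otherwise take an incremental backup. COUNTER_FILE is hypothetical.
COUNTER_FILE=/vt/backup-counter
count=$(cat "$COUNTER_FILE" 2>/dev/null || echo 12)
if [ "$count" -ge 12 ]; then
  vtctldclient BackupShard commerce/-
  echo 1 > "$COUNTER_FILE"
else
  vtctldclient BackupShard --incremental-from-pos=auto commerce/-
  echo $((count + 1)) > "$COUNTER_FILE"
fi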
Here is the bin-logs dir listing from the vttablet I tried to restore from a timestamp (binlog retention isn't configured, so I'm using the Vitess default; see the check after the listing):
$ ls vt/vtdataroot/vt_0790125915/bin-logs/
vt-0790125915-bin.000006 vt-0790125915-bin.000007 vt-0790125915-bin.000008 vt-0790125915-bin.index
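(The effective retention can be checked directly from MySQL; 2592000 seconds, i.e. 30 days, is the MySQL 8.0 default:)
mysql -e "select @@binlog_expire_logs_seconds;"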
Here is a listing of /tmp, which was referenced in the error message at the end of the restore attempt logs:
$ ls /tmp
vt-gh-ost
vttablet.ERROR
vttablet.INFO
vttablet.WARNING
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.ERROR.20240214-112904.1
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.INFO.20240214-112904.1
vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.WARNING.20240214-112904.1
$ cat /tmp/vttablet.dev-vitess-vttablet-zone1-0790125915-0d0930ad.vitess.log.ERROR.20240214-112904.1
Log file created at: 2024/02/14 11:29:04
Running on machine: dev-vitess-vttablet-zone1-0790125915-0d0930ad
Binary: Built with gc go1.21.4 for linux/amd64
Previous log: <none>
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0214 11:29:04.512681 1 syslogger.go:149] can't connect to syslog
Incremental backups run using this function from the script. (I read the keyspace/shard label values from the vtorc pods, since there is exactly one vtorc pod per keyspace/shard, and use those values to run BackupShard.)
# Function to perform incremental backup
perform_incremental_backup() {
  for vtorc_pod in $vtorc_pods; do
    echo "Processing vtorc pod: $vtorc_pod"
    # Fetch the keyspace and shard labels from the vtorc pod
    keyspace=$(kubectl get pod "$vtorc_pod" -o jsonpath='{.metadata.labels.planetscale\.com/keyspace}')
    shard=$(kubectl get pod "$vtorc_pod" -o jsonpath='{.metadata.labels.planetscale\.com/shard}')
    # Remove 'x' letters from the shard label
    shard="${shard//x/}"
    echo "=========================== Executing INCREMENTAL BACKUP command on $keyspace / $shard ======================================"
    # Run the BackupShard command with keyspace/shard as a single argument
    vtctldclient --server="${service_name}:${port}" --alsologtostderr BackupShard --incremental-from-pos=auto "${keyspace}/${shard}"
  done
}
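For context, $vtorc_pods is not defined in the snippet above; a hypothetical way to populate it, assuming the operator's standard component label, would be:
# List vtorc pod names by the operator's component label (assumed label).
vtorc_pods=$(kubectl get pods \
  --selector="planetscale.com/component=vtorc" \
  -o jsonpath='{.items[*].metadata.name}')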
Hi @GenLN! Thanks for following up. In trying to narrow things down, were you able to follow my flow on https://github.com/vitessio/vitess/issues/14765#issuecomment-1859178702 and run that exact script on your servers?
Did you happen to check the mysqlbinlog binary as per https://github.com/vitessio/vitess/issues/14765#issuecomment-1859178702?
Realistically, I'm not going to be able to reproduce your specific setup on a k8s cluster, so I'm trying to eliminate what I think are irrelevant aspects of your setup, and narrow down to some basics.
Let's again focus on the specific error. From your logs above:
mysqlbinlog command failed: exit status 1: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2625521499/vt/vtdataroot/vt_0790125915/bin-logs/vt-2469782763-bin.000004
Thank you!
Hi @shlomi-noach, I have also tried your flow for testing PITR. The same error happened:
E0216 10:57:46.259043 77402 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0790125915 error: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2976517831/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001: rpc error: code = Unknown desc = mysqlbinlog command failed: exit status 1: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-2976517831/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001: rpc error: code = Unknown desc = mysqlbinlog command failed: exit status 1
Where should I check the mysqlbinlog path? My local machine, which executes the vtctldclient commands, has mysqlbinlog. As I said, I'm using this script to port-forward vtctld, vtgate, and vtadmin, and I use the vtctldclient tool present on my machine along with the other vt binaries:
local@machine$ ls /usr/local/vitess/bin
mysqlctl topo2topo vtadmin vtbench vtcombo vtctlclient vtctldclient vtgate vttablet zk zkctld
mysqlctld vtaclcheck vtbackup vtclient vtctl vtctld vtexplain vtorc vttestserver zkctl
local@machine$ vtctldclient --version
vtctldclient version Version: 18.0.0 (Git revision 9a6f5262f7707ff80ce85c111d2ff686d85d29cc branch 'HEAD') built on Mon Nov 6 12:16:43 UTC 2023 by runner@fv-az422-64 using go1.21.3 linux/amd64
local@machine$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog
local@machine$ mysqlbinlog --version
mysqlbinlog Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)
local@machine$ echo $PATH
/usr/local/vitess/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin
These are the images of the cluster:
images:
  vtctld: vitess/vtctld:v18.0.1
  vtadmin: vitess/vtadmin:v18.0.1
  vtgate: vitess/lite:v18.0.1
  vttablet: vitess/lite:v18.0.1
  vtbackup: vitess/lite:v18.0.1
  vtorc: vitess/lite:v18.0.1
  mysqld:
    mysql80Compatible: vitess/lite:v18.0.1
  mysqldExporter: prom/mysqld-exporter:v0.14.0
vtctld doesn't have mysqlbinlog; vtadmin doesn't have mysqlbinlog; all vitess/lite images do have mysqlbinlog...
Where should I check mysqlbinlog path?
On the same container that runs your mysqld or mysqlctld instances.
This is a k8s environment; mysqld is a container inside the vttablet pod. The same docker image, vitess/lite:v18.0.1, is used for the mysqld and vttablet containers.
VTTablet pod containers:
- vttablet
- mysqld
- mysqld_exporter
mysqld container
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ echo $PATH
/vt/src/vitess.io/vitess/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
vttablet container
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ whereis mysqlbinlog
mysqlbinlog: /usr/bin/mysqlbinlog
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$ echo $PATH
/vt/src/vitess.io/vitess/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
We have a visibility problem here. We're looking for the mysqlbinlog error output. Right now I believe you should be able to find it in the mysqld container logs.
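(For example, using the pod from this thread, something along these lines might surface it:)
# Hypothetical check: grep the mysqld container's logs for mysqlbinlog output.
kubectl logs dev-vitess-vttablet-zone1-0790125915-0d0930ad -c mysqld | grep -i mysqlbinlog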
Regardless, vitess should clearly output any error messages from mysqlbinlog. Please consider this patch: https://github.com/vitessio/vitess/commit/3652f6800f9b35cf70e0ffcf9a47936cdff20cf2 (I will create an orderly PR later on).
If you're able to apply this patch, build, and deploy to your k8s environment, I think we'll have a better idea of the error.
Related issue: https://github.com/vitessio/vitess/issues/15277
PR: https://github.com/vitessio/vitess/pull/15278
Hi @shlomi-noach, I have updates on this issue. I used this image setup on k8s with Vitess v19.0.0. Important note: the docs say you can freely use the mysql:8.0.30 or even mysql:8.0.34 image, but incremental backups fail with those images, with output showing no path for mysqlbinlog...
images:
  vtctld: vitess/lite:v19.0.0
  vtadmin: vitess/vtadmin:v19.0.0
  vtgate: vitess/lite:v19.0.0
  vttablet: vitess/lite:v19.0.0
  vtbackup: vitess/lite:v19.0.0
  vtorc: vitess/lite:v19.0.0
  mysqld:
    mysql80Compatible: vitess/lite:v19.0.0-mysql80
  mysqldExporter: prom/mysqld-exporter:v0.14.0
With this, when I try to run the command
vtctldclient RestoreFromBackup --restore-to-timestamp "2024-03-08T10:23:00Z" zone1-0790125915
I get this output at the end:
Handling connection for 15999
E0308 11:35:55.012632 110704 main.go:56] rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0790125915: Can't restore backup: failed to apply binlog file /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002: rpc error: code = Unknown desc = mysqlbinlog command failed: with error output: mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)
ERROR: Could not open log file
: exit status 1
The whole log output is available in these gists.
@GenLN the output does not show that the mysqlbinlog binary could not be found. It shows that it WAS found, and in fact it was mysqlbinlog itself that returned this error:
mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)
For example:
❯ mysqlbinlog /tmp/noexist 1>/dev/null
mysqlbinlog: File '/tmp/noexist' not found (OS errno 2 - No such file or directory)
ERROR: Could not open log file
Why it's returning that I cannot say as I don't have access to your systems. Is it somehow not correct to use that specific binary log file? Is that not the right location? Does the file actually exist? I have no idea.
This is not clearly a bug to me at this point, and it may require a lot of back and forth, so it might be better to move it to the Vitess Slack.
If this requires a lot of back and forth we can move to Slack -- but I'd say this is a bug. The RestoreFromBackup command clearly expects the file to exist, and asks mysqlbinlog to apply it. I have no insight into why the file cannot be found -- could you e.g. find /tmp/restore-incremental-3487966834/?
@mattlord Sorry for the misunderstanding about this note:
Important note: the docs say you can freely use the mysql:8.0.30 or even mysql:8.0.34 image, but incremental backups fail with those images, with output showing no path for mysqlbinlog...
The no path for mysqlbinlog error happens when using the mysql:8.0.30 or mysql:8.0.34 image for mysqld. (It doesn't happen when using the vitess/lite:v19.0.0-mysql80 image for mysqld.)
@shlomi-noach there isn't any such folder or file in either the mysqld container or the vttablet container.
mysqld container:
$ ls -alh /tmp/
total 8.0K
drwxrwxrwt 1 root root 4.0K Mar 11 10:32 .
drwxr-xr-x 1 root root 4.0K Mar 8 10:19 ..
$ ls -alh /vt/vtdataroot/vt_0790125915/
total 36K
drwxr-sr-x 7 vitess vitess 4.0K Mar 11 10:32 .
drwxrwsr-x 3 root vitess 4.0K Mar 8 10:19 ..
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 bin-logs
drwxr-s--- 8 vitess vitess 4.0K Mar 11 10:32 data
drwxr-sr-x 4 vitess vitess 4.0K Mar 11 10:32 innodb
-rw-r--r-- 1 vitess vitess 2.9K Mar 11 10:32 my.cnf
-rw-r----- 1 vitess vitess 5 Mar 11 10:32 mysql.pid
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 relay-logs
-rw-r--r-- 1 vitess vitess 0 Mar 11 10:32 restore_in_progress
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 tmp
$ ls -alh /vt/vtdataroot/vt_0790125915/tmp/
total 8.0K
drwxr-sr-x 2 vitess vitess 4.0K Mar 11 10:32 .
drwxr-sr-x 7 vitess vitess 4.0K Mar 11 10:32 ..
$ cat /vt/vtdataroot/vt_0790125915/restore_in_progress
vitess@dev-vitess-vttablet-zone1-0790125915-0d0930ad:/$
vttablet:
$ ls -alh /tmp/
total 8.9M
drwxrwxrwt 1 root root 4.0K Mar 11 10:32 .
drwxr-xr-x 1 root root 4.0K Mar 8 10:19 ..
-rw-r--r-- 1 vitess vitess 57 Mar 11 09:58 percona-version-check
-rwxr-xr-x 1 vitess vitess 8.9M Mar 8 10:19 vt-gh-ost
/vt is a shared folder for each container in the vttablet pod, so the output from /vt/vtdataroot/... is the same for mysqld and vttablet.
Also, the mysqld container gets restarted during the restore process; this could be a potential reason why the /tmp folder looks empty.
@GenLN right. By the time you go looking for it, /tmp/restore-incremental-* will have already been cleaned up by the restore process.
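One way to catch it in flight, as a rough sketch, is to poll from inside the container while the restore runs:
# Poll for the transient restore directory and snapshot it before cleanup.
while true; do
  ls -alhR /tmp/restore-incremental-* 2>/dev/null && break
  sleep 0.5
done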
Looking at the log gist file:
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:781767909} file:"backup.go" line:475 value:"Restore: applying 2 incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803742162} file:"builtinbackupengine.go" line:931 value:"Restoring incremental backup to position: 5f3b7323-dd35-11ee-9ec5-aae9b09cabe1:1-29"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803757012} level:WARNING file:"builtinbackupengine.go" line:1000 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:803835914} file:"builtinbackupengine.go" line:1050 value:"Copying file 0: vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:830929001} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:831001633} file:"builtinbackupengine.go" line:1143 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:831053795} file:"builtinbackupengine.go" line:777 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1709894154 nanoseconds:835281495} file:"restore.go" line:278 value:"Restore: got a restore manifest: <nil>, err=rpc error: code = Unknown desc = mysqlbinlog command failed: with error output: mysqlbinlog: File '/tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002' not found (OS errno 2 - No such file or directory)\nERROR: Could not open log file\n: exit status 1\nfailed to apply binlog file /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002, waitForBackupInterval=0s"
So the restore process was supposed to copy the file vt-0790125915-bin.000002 into /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/, but when mysqlbinlog was told to apply said file at /tmp/restore-incremental-3487966834/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002, the file could not be found.
There is no error in the restore process to indicate that it was unable to create said file. So we're in the dark again. It feels like perhaps a k8s permission issue? Perhaps the restore process did create the correct files, but then mysqlbinlog did not have the authorization to read them?
I did not ask: are you comfortable building a vitess binary? I'd add a bunch of log entries from the moment the files are copied, to validate their existence up to the point where we instruct mysqlbinlog to read them, and beyond.
@shlomi-noach sure, just drop the files; I will pull them, build the docker image, and use it for the Vitess cluster.
Does it matter which user is set up as xtrabackup_user? In this testing Vitess cluster I have set xtrabackup_user: vt_dba.
Also, is this normal content for an incremental backup directory? I notice the incremental backup file is named 0, which I guessed references shard name "0", but in k8s, initial shards of single-shard keyspaces are named "-", not "0":
Name Size Type
0 588 B application/x-gzip
MANIFEST 1.2 KB text/plain; charset=utf-8
Does it matter which user is set up as xtrabackup_user?
As far as I can see, no.
Is this normal content for an incremental backup directory?
Yes. The 0 file name has nothing to do with shard 0. If you look at a full backup, you'll see files named 0, 1, 2, and so forth.
I think I have it. You said:
/vt is a shared folder for each container in the vttablet pod, so the output is the same for mysqld and vttablet
But the incremental restore does not extract the files onto /vt. Instead, it extracts them onto /tmp/restore-incremental-*. This is done by vtctld or the vttablet container. The mysqld container does not see those files.
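This is easy to confirm by hand, since each container gets its own /tmp unless a volume is mounted there (the marker file below is purely illustrative):
# Create a marker in the vttablet container's /tmp...
kubectl exec dev-vitess-vttablet-zone1-0790125915-0d0930ad -c vttablet -- touch /tmp/marker
# ...and observe that the mysqld container cannot see it:
kubectl exec dev-vitess-vttablet-zone1-0790125915-0d0930ad -c mysqld -- ls /tmp/marker
# ls: cannot access '/tmp/marker': No such file or directory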
So that means it's the wrong location to download the incremental backup files that mysqlbinlog should apply. Perhaps that location is fine for a docker deployment (as you confirmed, it works as expected in a docker env), but not for k8s?
Maybe I can try to persist the /tmp folder and share it between the containers in the pod, at least to see whether that works?
Why don't you try https://github.com/vitessio/vitess/pull/15440, see if that works?
The xtrabackupengine uses the tmpdir value in the my.cnf file: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/xtrabackupengine.go
https://dev.mysql.com/doc/refman/8.0/en/temporary-files.html
That defaults to /tmp. You could change it to /vt/tmp in your cluster definition: https://github.com/planetscale/vitess-operator/blob/main/docs/api.md#mysqldspec
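(The effective value can be checked through the thread's existing mysql alias:)
mysql -e "select @@tmpdir;"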
@mattlord I think you're referencing an unrelated setting. The path we're using is generated by vitess, not by MySQL. Vitess reads a backup manifest, then copies the backup files (binary logs in our case) under said generated path, then requests mysqld to run mysqlbinlog using those files.
Moreover, this is done by the built-in engine, not the xtrabackup engine. I can add a configuration option to the builtin engine as an alternative to https://github.com/vitessio/vitess/pull/15440.
Ah, I see: https://github.com/vitessio/vitess/blob/46975b20fa6b9dd77dc666cedded8b8f16bc757a/go/vt/mysqlctl/builtinbackupengine.go#L1008-L1013
That's the only place the restore-incremental string literal is used.
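(That can be verified with a plain grep over a checkout of the source tree:)
grep -rn 'restore-incremental' go/vt/mysqlctl/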
Why don't you try https://github.com/vitessio/vitess/pull/15440, see if that works?
So I should build a Vitess image using these changes and run the test? Using this procedure, building the base image from bootstrap and then lite? Vitess docker build
So I should build a Vitess image using these changes and run the test?
Yes please.
Using this procedure, building the base image from bootstrap and then lite?
Honestly, I'm not sure. This is why I asked whether you're comfortable building a vitess binary.
I think this solved the issue :+1: I built an image following those instructions, with the changes you made... ran backups and successfully restored PITR, or so I believe... One vttablet is now set to the DRAINED type after the PITR. Here is the last log from the PITR:
commerce/- (zone1-0790125915): time:{seconds:1710170821 nanoseconds:31718605} file:"backup.go" line:450 value:"Restore: starting mysqld for mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170823 nanoseconds:34273131} file:"backup.go" line:456 value:"Restore: running mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170823 nanoseconds:34904826} file:"backup.go" line:463 value:"Restore: restarting mysqld after mysql_upgrade"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:42300139} file:"backup.go" line:475 value:"Restore: applying 6 incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74517229} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74537690} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:74620512} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000001"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124447381} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124621285} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:124717567} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149121436} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-833467721/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000001"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149138636} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-833467721"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149354721} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:149379262} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-28"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182249347} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182271147} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:182352859} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254408236} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254531709} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:254575220} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273075721} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-1572766817/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273090761} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-1572766817"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273307676} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:273327407} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-30"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354513357} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354532807} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:354615069} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:422943859} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:423037871} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:423118783} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:441822098} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-2102494854/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000002"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:441841879} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-2102494854"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:442064364} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:442083524} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-32"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473260860} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473279410} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:473361122} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000004"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525794483} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525905336} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:525985598} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:544864697} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-751988353/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000004"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:544890838} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-751988353"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:545132853} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:545149194} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-34"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569414389} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569432309} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:569509981} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-2548885007-bin.000005"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:603878621} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:603963263} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:604023374} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624610263} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-2566389258/vt/vtdataroot/vt_0790125915/bin-logs/vt-2548885007-bin.000005"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624624424} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-2566389258"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624812008} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:624829268} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-36"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656648969} file:"builtinbackupengine.go" line:933 value:"Restoring incremental backup to position: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656667249} level:WARNING file:"builtinbackupengine.go" line:1003 value:"engine \"pargzip\" doesn't support decompression, using \"pgzip\" instead"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:656740561} file:"builtinbackupengine.go" line:1053 value:"Copying file 0: vt-0790125915-bin.000003"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:748701142} file:"compression.go" line:222 value:"Decompressing backup using engine \"pgzip\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:748927507} file:"builtinbackupengine.go" line:1146 value:"closing decompressor"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:749001599} file:"builtinbackupengine.go" line:778 value:"Done taking Backup \"0\""
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770191402} file:"builtinbackupengine.go" line:958 value:"Applied binlog file: /vt/vtdataroot/restore-incremental-3497800045/vt/vtdataroot/vt_0790125915/bin-logs/vt-0790125915-bin.000003"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770242563} file:"builtinbackupengine.go" line:964 value:"Restored incremental backup files to: /vt/vtdataroot/restore-incremental-3497800045"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770500529} file:"builtinbackupengine.go" line:993 value:"Restore: returning replication position fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770520089} file:"backup.go" line:484 value:"Restore: applied incremental backup: fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-38"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770529479} file:"backup.go" line:486 value:"Restore: done applying incremental backups"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770539710} file:"backup.go" line:489 value:"Restore: removing state file"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770574670} file:"backup.go" line:496 value:"Restore: complete"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770610801} file:"restore.go" line:278 value:"Restore: got a restore manifest: &{2024-03-11.151135.zone1-2548885007 xtrabackup fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26 fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26 <nil> false 2024-03-11T15:11:35Z 2024-03-11T15:11:40Z f4bd2f84-dfb8-11ee-9e26-0610ce1e24bf zone1-2548885007 commerce - Ver 8.0.30 MySQL Community Server - GPL true <nil>}, err=<nil>, waitForBackupInterval=0s"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770633272} file:"restore.go" line:298 value:"Restore: pos=MySQL56/fda9934f-dfb8-11ee-b285-d679a6d29dd9:1-26"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:770645932} file:"restore.go" line:317 value:"Restore: disabling replication"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:853305236} file:"restore.go" line:357 value:"Restore: will set tablet type to DRAINED as this is a point in time recovery"
commerce/- (zone1-0790125915): time:{seconds:1710170826 nanoseconds:853361248} file:"restore.go" line:360 value:"Restore: changing tablet type to DRAINED for cell:\"zone1\" uid:790125915"
Looks good! OK, we need to decide what the actual solution should be. As far as I can see, it's either https://github.com/vitessio/vitess/pull/15440, or as @mattlord suggested, add a command line flag for you to configure the path yourself.
Keep in mind that this only works if the vitess/lite image (with #15440 included) is used for mysqld. The same bug hits when the official mysql image is used (I will test that with this PR, as I believe the vttablet container starts the restore procedure, so maybe this fix will work with the official image too)...
Also, another question: could we expect this to be released as a patch? I don't see any breaking changes here, just a bugfix for incremental restore in a k8s env. I will also do a few more tests with the image I built from your PR and bring more updates if I encounter issues. I still need to perform a full PITR, as I haven't had a chance to get to that step yet. 😄
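As a hedged sketch, the post-PITR state can be sanity-checked like this (tablet alias taken from this thread):
# Confirm the restored tablet's type after PITR (it is set to DRAINED):
vtctldclient GetTablet zone1-0790125915
# List all tablets to see the shard's overall state:
mysql -e "show vitess_tablets"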