
Bug Report: v18.0.0-rc1 S3 backup failed to upload

Open L3o-pold opened this issue 1 year ago • 30 comments

Overview of the Issue

After upgrading from v17.0.3 to v18.0.0-rc1, our backups fail to upload to S3-like storage (not AWS, but something closer to MinIO).

E1005 07:20:00.774018      14 main.go:56] rpc error: code = Unknown desc = TabletManager.Backup on commerce-0000001010 error: MultipartUpload: upload multipart failed
	upload id: ZGQ5OWUxNWYtY2IzNS00MDdjLWI1OGItMWNkZDA4M2ZlYzRk
caused by: RequestCanceled: request context canceled
caused by: context canceled: MultipartUpload: upload multipart failed
	upload id: ZGQ5OWUxNWYtY2IzNS00MDdjLWI1OGItMWNkZDA4M2ZlYzRk
caused by: RequestCanceled: request context canceled
caused by: context canceled

Reproduction Steps

/vt/bin/vtctldclient --server=vtctld:15999 --logtostderr=true BackupShard --concurrency=1 "commerce/-"

Running vttablet and vtctld with the following flags:

--backup_storage_implementation=s3 \
--backup_engine_implementation=xtrabackup \
--s3_backup_aws_region=us-east-1 \
--s3_backup_storage_bucket=default \
--s3_backup_force_path_style=true \
--s3_backup_aws_endpoint=https://* \
--s3_backup_storage_root=*

Binary Version

vtctldclient version Version: 18.0.0-rc1 (Git revision 6ab165ade925b35a00cf447827d874eba13998b6 branch 'heads/v18.0.0-rc1') built on Tue Oct  3 15:00:58 UTC 2023 by vitess@buildkitsandbox using go1.21.1 linux/amd64

Operating System and Environment details

kubernetes / docker official vitess image

Log Fragments

No response

L3o-pold avatar Oct 05 '23 07:10 L3o-pold

@L3o-pold can you please capture the logs from vttablet as well? None of us maintainers have access to a MinIO environment, so this will be tough to debug.

deepthi avatar Oct 05 '23 21:10 deepthi

Our minio dependency hasn't changed in years. However, the aws-sdk-go dependency has changed, so that would be the first place to check: https://github.com/vitessio/vitess/blame/main/go.mod#L14
You can see the diff here: https://github.com/vitessio/vitess/commit/9d2029aff8a805c6f646eef47e1b90c76f80d3e9

If you are able to build an image with the older version of aws-sdk-go, and that works, that will give us a starting point. But in any case we'll need the vttablet logs, not just the vtctld logs.

deepthi avatar Oct 05 '23 21:10 deepthi

@deepthi we are not using MinIO itself, but an AWS-compatible storage like it. The vttablet logs are the same:

vttablet W1006 00:03:11.814578       1 rpc_server.go:72] TabletManager.Backup(concurrency:1)(on commerce-0000001012 from ) error: MultipartUpload: upload multipart failed
vttablet 	upload id: M2MzYjJkY2YtODI4Yi00NGE5LWFkMjgtYTIwYzU2NmM4Yjdh
vttablet caused by: RequestCanceled: request context canceled
vttablet caused by: context canceled

I'll try to see if I can reproduce it with a simple MinIO Docker instance, and check with the team in charge of our storage solution to see if they have any logs on their side.

L3o-pold avatar Oct 06 '23 07:10 L3o-pold

I can reproduce it with a MinIO Docker instance:

    minio:
        image: minio/minio:RELEASE.2023-09-30T07-02-29Z
        command: ["server", "/data", "--console-address", ":9001"]
        environment:
            MINIO_ROOT_USER: "test"
            MINIO_ROOT_PASSWORD: "test"

L3o-pold avatar Oct 06 '23 08:10 L3o-pold

From my testing, downgrading aws-sdk-go has no impact, but reverting https://github.com/vitessio/vitess/pull/12500 works.

L3o-pold avatar Oct 06 '23 14:10 L3o-pold

I'm going to test with vanilla S3 to see if that exhibits the same problem. If yes, we will have to revert at least a portion of #12500 and then figure out a way forward.

deepthi avatar Oct 09 '23 16:10 deepthi

I have tested PR #12500 (or rather, its exact counterpart on the PlanetScale fork) and backups with vanilla S3 work just fine. However, my test db is very small, and the whole vtbackup process (download the previous backup from S3, restore it, take a new backup, and upload it to S3) takes 10 minutes. How big is your backup? "Context canceled" tells us that a timeout is being hit. The default timeout (--action_timeout) for all vtctl commands is 1 hour. Does your command take longer than that? Or does it error out after exactly 1 hour? If yes, you need to extend that timeout. If not, let me know and we'll see if we can think of something else.

deepthi avatar Oct 09 '23 18:10 deepthi

@deepthi 2GB, and it fails in less than 2ms once xtrabackup has finished.

edit: I tried with an empty database and hit the same issue.

L3o-pold avatar Oct 09 '23 18:10 L3o-pold

@deepthi Is PlanetScale using the new vtctld gRPC API to do the backup?

harshit-gangal avatar Oct 10 '23 17:10 harshit-gangal

@L3o-pold can you try whether builtinbackup works? You did specify the flags in your issue description, but I missed the fact that you are in fact using xtrabackup. My test was with vtbackup (which the operator uses) + the builtin backup engine, not xtrabackup.

Also:

  • were you using the same command with v17? Or were you by any chance using vtctlclient instead of vtctldclient?

deepthi avatar Oct 10 '23 17:10 deepthi

@deepthi --backup_engine_implementation=xtrabackup was specified in the issue, yeah :) I'll try with the builtin backup engine.

L3o-pold avatar Oct 10 '23 17:10 L3o-pold

unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352680626} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352697998} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962252 nanoseconds:352710499} file:"builtinbackupengine.go" line:609 value:"failed to backup files due to error."
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:357451136} file:"builtinbackupengine.go" line:487 value:"resetting mysqld super_read_only to false"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:359553344} file:"builtinbackupengine.go" line:422 value:"resetting mysqld super_read_only to false"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:361549882} level:ERROR file:"backup.go" line:180 value:"backup is not usable, aborting it: [RequestCanceled: request context canceled\ncaused by: context canceled;RequestCanceled: request context canceled\ncaused by: context canceled]"
unsharded/- (local-0000000301): time:{seconds:1696962254 nanoseconds:376869259} level:ERROR file:"backup.go" line:188 value:"failed to finish backup: [MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.\n\tstatus code: 400, request id: 178CD22DF0FC0D2A, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8]: %!v(MISSING)"
E1010 18:24:14.456267      40 main.go:56] rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled

L3o-pold avatar Oct 10 '23 18:10 L3o-pold

this change works:

[screenshot of the code change]

L3o-pold avatar Oct 11 '23 13:10 L3o-pold

@L3o-pold there are 3 variables here

  • vanilla s3 vs your storage
  • builtinbackup vs xtrabackup -> you have shown that this doesn't matter
  • vtctlclient vs vtctldclient - can you try running with vtctlclient and tell us what happens?

The reason I don't want to just revert the context change is that the new code is "correct".

deepthi avatar Oct 11 '23 15:10 deepthi

@deepthi

/vt/bin/vtctlclient --server=vtctld:15999 BackupShard -- --concurrency=1 --allow_primary "unsharded/-"
W1011 15:53:05.200409      68 log.go:39] Failed to read in config : Config File "vtconfig" Not Found in "[/]". This is optional, and can be ignored if you are not using config files. For a detailed explanation, see https://github.com/vitessio/vitess/blob/main/doc/viper/viper.md#config-files.
E1011 15:53:11.339150      68 main.go:96] E1011 15:53:11.338962 backup.go:110] E1011 15:53:11.338637 backup.go:180] backup is not usable, aborting it: [RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled]
E1011 15:53:11.342682      68 main.go:96] E1011 15:53:11.342484 backup.go:110] E1011 15:53:11.342019 backup.go:188] failed to finish backup: [MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
        status code: 400, request id: 178D18845BAAE316, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8]: %!v(MISSING)
BackupShard Error: rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled
E1011 15:53:11.432335      68 main.go:105] remote error: rpc error: code = Unknown desc = TabletManager.Backup on local-0000000301 error: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled: RequestCanceled: request context canceled
caused by: context canceled;RequestCanceled: request context canceled
caused by: context canceled

L3o-pold avatar Oct 11 '23 15:10 L3o-pold

I just tried the same thing against S3 (without --allow_primary, so it happened on a replica) and the backup succeeds. So at this point I'm going to say that this seems to be specific to your storage provider, especially given that you seem to be getting an error returned from the call to UploadWithContext.

deepthi avatar Oct 12 '23 01:10 deepthi

Multiple storage providers, actually: we can replicate the behavior with both Swift and MinIO.

L3o-pold avatar Oct 12 '23 06:10 L3o-pold

Similar to https://github.com/vitessio/vitess/issues/8391

L3o-pold avatar Oct 12 '23 13:10 L3o-pold

In order to scope this down a bit further we can try this

  • my test was with the PlanetScale fork; I can re-test with vanilla Vitess
  • is it possible for you to test with an actual S3 bucket instead of MinIO/Swift?

deepthi avatar Oct 12 '23 16:10 deepthi

We are not using Amazon S3 because we are on premises. Maybe it works with pure Amazon S3 but not with S3-compatible storage... it used to work in v16 and v17, though.

L3o-pold avatar Oct 12 '23 18:10 L3o-pold

This is a little bit of a gray area: on the one hand, we don't "officially" support S3-like non-AWS storage backends; on the other hand, this has worked for a long time and you and others are relying on it. So for now, I'm willing for us to revert the change that is causing this error and go back to using Upload instead of UploadWithContext, and then we can try to figure out why the latter is not working as expected. Will you be able to do a PR to make just that change? We don't want to revert the stats changes from #12500.
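
For context, the difference is roughly the following; this is a minimal sketch against aws-sdk-go v1's s3manager with made-up bucket/key handling, not the actual Vitess s3.go:

package example

import (
    "context"
    "strings"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func upload(ctx context.Context, bucket, key string) error {
    uploader := s3manager.NewUploader(session.Must(session.NewSession()))
    input := &s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
        Body:   strings.NewReader("payload"),
    }

    // Pre-#12500: the upload is not tied to any caller context.
    // _, err := uploader.Upload(input)

    // Post-#12500: the upload is tied to ctx, so it aborts with
    // "RequestCanceled: request context canceled" as soon as ctx is cancelled.
    _, err := uploader.UploadWithContext(ctx, input)
    return err
}

With the first (commented-out) call, a cancelled caller context is simply ignored, which would explain why reverting makes the symptom disappear.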

cc @maxenglander

deepthi avatar Oct 15 '23 20:10 deepthi

@L3o-pold there's one more thing you should do. Can you run this again with some logging added? Log the context and its deadline in s3.go just before we call Upload/UploadWithContext, and at the beginning of the rpc call in rpc_backup.go. That will tell us if we are somehow overriding the original context or its deadline.

rpc_backup.go

func (tm *TabletManager) Backup(ctx context.Context, logger logutil.Logger, req *tabletmanagerdatapb.BackupRequest) error {
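
Something along these lines would do; the helper and call sites here are just a sketch, not actual Vitess code:

package main

import (
    "context"
    "log"
    "time"
)

// logCtx prints a context and its deadline (if any). The idea is to call it at
// the top of TabletManager.Backup in rpc_backup.go and again in s3.go right
// before Upload/UploadWithContext, to see whether the original deadline survives.
func logCtx(where string, ctx context.Context) {
    if deadline, ok := ctx.Deadline(); ok {
        log.Printf("%s: ctx=%v deadline=%v (in %v)", where, ctx, deadline, time.Until(deadline))
        return
    }
    log.Printf("%s: ctx=%v (no deadline)", where, ctx)
}

func main() {
    // Example only: a 1h deadline, like the default --action_timeout.
    ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
    defer cancel()
    logCtx("Backup", ctx)
}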

deepthi avatar Oct 16 '23 14:10 deepthi

I'll check for this tomorrow.

L3o-pold avatar Oct 17 '23 09:10 L3o-pold

@deepthi

rpc_backup.go:44] Backup: [s3] context.Background.WithValue(type transport.connectionKey, val <not Stringer>).WithValue(type peer.peerKey, val <not Stringer>).WithDeadline(2023-10-18 05:47:58.845009687 +0000 UTC m=+3647.749218339 [59m59.989854174s]).WithValue(type metadata.mdIncomingKey, val <not Stringer>).WithValue(type grpc.streamKey, val <not Stringer>).WithValue(type grpc.rpcInfoContextKey, val <not Stringer>).WithValue(type callinfo.key, val <not Stringer>)

s3.go:176] AddFile: [s3] context.Background.WithValue(type transport.connectionKey, val <not Stringer>).WithValue(type peer.peerKey, val <not Stringer>).WithDeadline(2023-10-18 05:47:58.845009687 +0000 UTC m=+3647.749218339 [59m55.920480222s]).WithValue(type metadata.mdIncomingKey, val <not Stringer>).WithValue(type grpc.streamKey, val <not Stringer>).WithValue(type grpc.rpcInfoContextKey, val <not Stringer>).WithValue(type callinfo.key, val <not Stringer>).WithCancel

L3o-pold avatar Oct 18 '23 04:10 L3o-pold

So the logs tell us that we are not messing up the context when passing it in, and that it does have a 1-hour timeout. That makes it more likely that this is somehow not being supported on the server side, i.e. by the specific storage solution you are using. Do you get any logs on the storage server that could help?

deepthi avatar Oct 19 '23 14:10 deepthi

@deepthi the only thing I have in the (MinIO) server-side logs is an HTTP 499 status (client/vttablet disconnection) https://github.com/vitessio/vitess/issues/8391#issuecomment-1759374421

L3o-pold avatar Oct 19 '23 14:10 L3o-pold

Then the next thing I can think of is some sort of gRPC keep-alive timeout: the client (vttablet) cancels because there's no response within some shorter timeout on the gRPC connection. There was an issue of that sort a while ago (https://github.com/vitessio/vitess/issues/9195), but the error message there is different.

deepthi avatar Oct 19 '23 15:10 deepthi

The error message looks like an explicit cancel on the context was somehow triggered; it doesn't look like an expiry. And these contexts are not really something the server even sees, it's a purely client-side concern.
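
For reference, the two cases surface as different errors in plain Go (nothing Vitess-specific here), which is why the message points at a cancel rather than a timeout:

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // Explicit cancel: this is the "context canceled" the backup error shows.
    ctx1, cancel := context.WithCancel(context.Background())
    cancel()
    fmt.Println(ctx1.Err()) // context canceled

    // Expired deadline: an --action_timeout expiry would look like this instead.
    ctx2, cancel2 := context.WithTimeout(context.Background(), time.Millisecond)
    defer cancel2()
    time.Sleep(5 * time.Millisecond)
    fmt.Println(ctx2.Err()) // context deadline exceeded
}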

I'd be curious to know whether, in your case, EndBackup ever ends up being called. That is what should then wait for the uploads to complete.

dbussink avatar Oct 25 '23 15:10 dbussink

@dbussink EndBackup does not seem to be called

L3o-pold avatar Oct 26 '23 14:10 L3o-pold

@dbussink EndBackup does not seem to be called

That would indicate that we also don't wait on the background goroutine that does the uploads, which means the request completes and the context is cancelled. That explains why this happens.

To be honest, I don't think this is related to S3 / MinIO or whichever storage implementation is in use. This is a bug that happens independently of that. cc @deepthi
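
A minimal standalone sketch of that failure mode (the names are made up, only the shape matches): uploads run on goroutines tied to the request context, nothing waits for them, and the context is cancelled as soon as the request returns.

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// uploadFile stands in for UploadWithContext: it fails as soon as the passed
// context is cancelled, otherwise it "uploads" for a while and succeeds.
func uploadFile(ctx context.Context, name string) error {
    select {
    case <-time.After(200 * time.Millisecond):
        return nil
    case <-ctx.Done():
        return fmt.Errorf("upload %s: %w", name, ctx.Err())
    }
}

// backup mimics the buggy flow: uploads start on background goroutines tied to
// the request context, but no EndBackup-style step waits for them before the
// request returns.
func backup(ctx context.Context) {
    var wg sync.WaitGroup
    for _, f := range []string{"backup_manifest", "ibdata1"} {
        wg.Add(1)
        go func(f string) {
            defer wg.Done()
            if err := uploadFile(ctx, f); err != nil {
                fmt.Println(err) // prints "... context canceled", like the reported error
            }
        }(f)
    }
    // Bug: returning without wg.Wait() lets the caller cancel ctx while the
    // uploads are still in flight; waiting here is what would avoid the error.
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    backup(ctx)
    cancel()                           // the RPC has returned, so its context is cancelled
    time.Sleep(300 * time.Millisecond) // give the doomed uploads time to report
}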

dbussink avatar Oct 26 '23 14:10 dbussink