Bug Report: Race condition on uploading backup and manifest
Overview of the Issue
We have encountered a rare issue where a backup completed with only the MANIFEST written to S3, but not the actual backup file!
Looking at the logs, it seems we tried to complete the upload of backup.xbstream.gz after the MANIFEST was written, breaking the contract highlighted here.
Reproduction Steps
Not easy to reproduce since it depends on S3 throttling us, but it might be possible to write a test that simulates this kind of behaviour.
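For instance, such a test could use a fake S3-style handle whose uploads complete in the background and can be forced to fail, then assert that the MANIFEST is never written. A minimal sketch, where fakeHandle and its methods are invented for illustration and are not Vitess's real API:

```go
package backup

import (
	"errors"
	"sync"
	"testing"
)

// fakeHandle mimics an S3-style backup handle: addFile kicks off an
// asynchronous "upload", and any error only becomes visible in wait().
type fakeHandle struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	stored   map[string]bool
	failFile string // the background upload of this file is throttled
	err      error
}

func (h *fakeHandle) addFile(name string) {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done()
		h.mu.Lock()
		defer h.mu.Unlock()
		if name == h.failFile {
			h.err = errors.New("Throttling: Rate exceeded") // simulated S3 throttle
			return
		}
		h.stored[name] = true
	}()
}

// wait blocks until all pending uploads finish and reports any failure,
// playing the role EndBackup plays for real storage backends.
func (h *fakeHandle) wait() error {
	h.wg.Wait()
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.err
}

func TestManifestNotWrittenWhenUploadThrottled(t *testing.T) {
	h := &fakeHandle{stored: map[string]bool{}, failFile: "backup.xbstream.gz"}

	// Upload the data file and wait for it *before* touching the
	// MANIFEST, which is the ordering the backup engine should enforce.
	h.addFile("backup.xbstream.gz")
	if err := h.wait(); err == nil {
		t.Fatal("expected the simulated throttling error to surface")
	}
	if h.stored["MANIFEST"] {
		t.Fatal("MANIFEST must not be written when a data file upload failed")
	}
}
```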
Binary Version
Running v15 from our Slack branch.
Operating System and Environment details
Not OS-related.
Log Fragments
I0920 19:04:40.356237 3431799 xtrabackupengine.go:357] xtrabackup stderr: 2024-09-20T19:04:40.355928-07:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
I0920 19:04:40.750777 3431799 xtrabackupengine.go:709] Found position: <<redacted>>
I0920 19:04:40.750838 3431799 xtrabackupengine.go:146] Closing backup file backup.xbstream.gz
I0920 19:04:40.750850 3431799 xtrabackupengine.go:201] Writing backup MANIFEST
I0920 19:04:40.751397 3431799 xtrabackupengine.go:237] Backup completed
I0920 19:04:40.751421 3431799 xtrabackupengine.go:146] Closing backup file MANIFEST
W0920 19:04:41.425825 3431799 rpc_server.go:80] TabletManager.Backup(concurrency:4)(on us_east_1c-0169388481 from ) error: MultipartUpload: upload multipart failed
    upload id: <<redacted>>
caused by: Throttling: Rate exceeded
    status code: 400, request id: <<redacted>>
@frouioui looks like you might have run into a similar issue and fixed it in #16806?
EDIT: Taking another look, they look like different race conditions.
Hello @rvrangel, thanks for reporting this. I spent some time investigating, and the issue is rooted in how we do AddFile on S3 and Ceph. In short, when backing up Vitess, both the builtin and xtrabackup engine code call AddFile to upload each individual file (or stripe), and this function returns a writer to which we write. However, the S3 and Ceph implementations of AddFile write to the remote storage asynchronously, meaning we may return from backupFiles once everything has been written to the buffer, but before everything has been written/processed by S3/Ceph.

To fix this, we must call bh.EndBackup() at the end of backupFiles to make sure S3/Ceph are done processing the files before moving on to the MANIFEST.
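To illustrate the ordering, here is a minimal sketch. BackupHandle and memHandle below are local stand-ins that only approximate the relevant parts of Vitess's backupstorage.BackupHandle; the signatures are illustrative, not copied from the repo:

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// BackupHandle stands in for the storage handle: AddFile returns a writer
// whose contents S3/Ceph upload asynchronously, so Close() returning is
// not a completion signal; EndBackup is the barrier that is.
type BackupHandle interface {
	AddFile(ctx context.Context, name string, size int64) (io.WriteCloser, error)
	EndBackup(ctx context.Context) error
}

// backupFiles uploads every data file, then blocks on EndBackup. Without
// that final call, the caller could go on to write the MANIFEST while
// backup.xbstream.gz is still in flight, or failing with a throttle.
func backupFiles(ctx context.Context, bh BackupHandle, files map[string][]byte) error {
	for name, data := range files {
		w, err := bh.AddFile(ctx, name, int64(len(data)))
		if err != nil {
			return err
		}
		if _, err := w.Write(data); err != nil {
			w.Close()
			return err
		}
		if err := w.Close(); err != nil {
			return err
		}
	}
	// The fix: wait until the backend has fully processed every upload
	// before returning, so the MANIFEST is only ever written afterwards.
	return bh.EndBackup(ctx)
}

// memHandle is a toy in-memory implementation that makes the sketch
// runnable; a real S3/Ceph handle would upload in background goroutines
// and surface their errors from EndBackup.
type memHandle struct{ files map[string][]byte }

type memWriter struct {
	h    *memHandle
	name string
	buf  []byte
}

func (w *memWriter) Write(p []byte) (int, error) { w.buf = append(w.buf, p...); return len(p), nil }
func (w *memWriter) Close() error                { w.h.files[w.name] = w.buf; return nil }

func (h *memHandle) AddFile(_ context.Context, name string, _ int64) (io.WriteCloser, error) {
	return &memWriter{h: h, name: name}, nil
}
func (h *memHandle) EndBackup(context.Context) error { return nil }

func main() {
	ctx := context.Background()
	h := &memHandle{files: map[string][]byte{}}
	if err := backupFiles(ctx, h, map[string][]byte{"backup.xbstream.gz": []byte("data")}); err != nil {
		fmt.Println("backup failed, MANIFEST never written:", err)
		return
	}
	// Only now, with every upload confirmed, do we write the MANIFEST.
	w, _ := h.AddFile(ctx, "MANIFEST", -1)
	fmt.Fprint(w, `{"BackupMethod":"xtrabackup"}`)
	w.Close()
	fmt.Println("uploaded files:", len(h.files))
}
```

Per the explanation above, in the actual S3/Ceph implementations it is the EndBackup call that waits for the asynchronous uploads, so a throttling error like the one in the logs surfaces before any MANIFEST is written.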
The refactor I did in https://github.com/vitessio/vitess/pull/17271 will fix this issue for the builtin backup engine. The xtrabackup engine fix should be part of another PR.
thanks for looking into it @frouioui! could you link it here when you have a PR for the xtrabackup engine as well?