backupccl: introduce BACKUP-LOCK file
Only one backup job is allowed to write to a backup location. Prior to this change, the backup job relied on the presence of a BACKUP-CHECKPOINT file to detect a concurrent backup job writing to the same location. This was problematic in subtle ways.
In 22.1, we moved backup destination resolution and the writing of the checkpoint file to the backup resumer. Before writing the checkpoint file, we would check whether anyone else had laid claim to the location. However, all operations in a job resumer need to be idempotent because a job can be resumed an arbitrary number of times, either due to transient errors or user intervention. One can imagine (and we have seen more than once in recent roachtests) a situation where a job:
1) Checks for other BACKUP-CHECKPOINT files in the location, but finds none.
2) Writes its own BACKUP-CHECKPOINT file.
3) Gets resumed before it gets to update BackupDetails to indicate it has completed 1) and 2).
So, when the job repeats 1), it will now see its own BACKUP-CHECKPOINT file and claim another backup is writing to the location, foolishly locking itself out.
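The self-lockout above can be reproduced with a small in-memory sketch. Everything here is hypothetical and for illustration only: `location` stands in for the backup destination, and `claimWithCheckpoint` is not the actual backupccl function, just the check-then-write sequence described in 1) and 2).

```go
package main

import (
	"errors"
	"fmt"
)

// location is a hypothetical in-memory stand-in for a backup destination:
// the set of file names currently present there.
type location map[string]bool

const checkpointFile = "BACKUP-CHECKPOINT"

var errConcurrentBackup = errors.New("another backup appears to be writing to this location")

// claimWithCheckpoint mimics the pre-change claim logic: fail if a
// BACKUP-CHECKPOINT file exists, otherwise write one. It cannot tell who
// wrote the file, so it is not idempotent.
func claimWithCheckpoint(loc location) error {
	if loc[checkpointFile] {
		return errConcurrentBackup
	}
	loc[checkpointFile] = true
	return nil
}

func main() {
	loc := location{}
	// First run of the resumer: steps 1) and 2) succeed.
	fmt.Println("first attempt:", claimWithCheckpoint(loc))
	// The job is resumed before recording its progress and repeats step 1).
	// It now finds its own checkpoint file and refuses to proceed.
	fmt.Println("after resume:", claimWithCheckpoint(loc))
}
```

The second call fails even though no other job ever touched the location, which is the behavior seen in the roachtests.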
A similar situation can happen in a mixed version state where the node performs 1) and 2) during planning, and the planner txn retries.
Before we discuss the solution it is important to highlight the mixed version states to consider:
- Backups planned/executed by 21.2.x and 22.1.0 nodes will continue to check BACKUP-CHECKPOINT files before laying claim to a location.
- Backups planned/executed by 21.2.x and 22.1.0 nodes will continue to write BACKUP-CHECKPOINT files as their way of claiming a location.
This change introduces a BACKUP-LOCK file that going forward will be
used to check and lay claim to a location. The BACKUP-LOCK file will
be suffixed with the jobID of the backup job. With this change a backup job will
check for the existence of BACKUP-LOCK files suffixed with a job ID other
than its own before laying claim to a location. We continue to read
the BACKUP-CHECKPOINT file so as to respect the claim laid by backups
started on older binary nodes. Naturally, the job also continues to write
a BACKUP-CHECKPOINT file, which prevents older nodes from starting concurrent
backups.
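
A minimal sketch of the new claim check, under stated assumptions. The `location` type, `claimWithLock`, and the exact `BACKUP-LOCK-<jobID>` naming are hypothetical; the description only says the file is suffixed with the job ID. The description also does not say how a resumed job tells its own BACKUP-CHECKPOINT apart from one written by an older node, so this sketch assumes that finding its own lock file means the claim was already made and the checkpoint check can be skipped.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// location is a hypothetical in-memory stand-in for a backup destination.
type location map[string]bool

const (
	lockPrefix     = "BACKUP-LOCK-" // assumed naming; only the jobID suffix is given
	checkpointFile = "BACKUP-CHECKPOINT"
)

var errConcurrentBackup = errors.New("another backup appears to be writing to this location")

func lockFileName(jobID int64) string {
	return fmt.Sprintf("%s%d", lockPrefix, jobID)
}

// claimWithLock is safe to call any number of times for the same job.
func claimWithLock(loc location, jobID int64) error {
	own := lockFileName(jobID)
	// A lock file with any other job ID means someone else owns the location.
	for name := range loc {
		if strings.HasPrefix(name, lockPrefix) && name != own {
			return errConcurrentBackup
		}
	}
	// Assumption: our own lock file means we already claimed the location on
	// an earlier run, so any checkpoint file present is ours.
	if loc[own] {
		return nil
	}
	// No lock files at all: respect a claim made by an older binary node.
	if loc[checkpointFile] {
		return errConcurrentBackup
	}
	// Lay claim. The checkpoint file keeps older nodes out.
	loc[own] = true
	loc[checkpointFile] = true
	return nil
}

func main() {
	loc := location{}
	fmt.Println("job 1, first run:", claimWithLock(loc, 1))
	fmt.Println("job 1, resumed:", claimWithLock(loc, 1))
	fmt.Println("job 2:", claimWithLock(loc, 2))
}
```

In this sketch the lock file is written before the checkpoint file, so a job interrupted between the two writes still recognizes its own claim when it is resumed.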
Release note: None
Release justification: This is a forward port of a feature that is already shipped in 22.1.
Co-authored-by: Aditya Maru
This was a fairly manual forward port of #81994.
Thank you for doing this 🙌 LGTM though I only gave it a once over assuming it was functionally the same as 81994.
bors r=adityamaru