Two backups running simultaneously when the sidecar container loses its lease?
I was just looking at the live logs of the stash sidecar container, as it was backing up a volume:
I0304 14:23:33.143651 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
As it was doing this, it lost the backup lock because of some timeout talking to the API server:
I0304 14:24:41.343247 1 request.go:614] Waited for 73.429648ms due to client-side throttling, not priority and fairness, request: PUT:https://10.43.0.1:443/api/v1/namespaces/samba/configmaps/lock-statefulset-samba-backup
E0304 14:24:41.612130 1 leaderelection.go:367] Failed to update lock: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
I0304 14:24:41.854371 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0304 14:24:51.247217 1 backupsession.go:390] Lost leadership
I0304 14:24:52.330837 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677939803
I0304 14:24:56.453093 1 backupsession.go:347] Attempting to acquire leadership for backup
I0304 14:24:56.909165 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0304 14:24:57.519198 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0304 14:24:57.701089 1 backupsession.go:379] Got leadership, preparing for backup
I0304 14:25:04.823962 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
So it looks like this restarted the backup, but did not actually kill the old one, so now there are two running backups according to ps aux:
47 root 1:15 /bin/restic backup /mnt/data/timemachine --quiet --json --
59 root 0:00 /bin/restic backup /mnt/data/timemachine --quiet --json --
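(For what it's worth, this is quick to check from outside the Pod too; a minimal sketch, assuming the sidecar container is named stash and the Pod is samba-0 in the samba namespace:)
# Hypothetical check: list restic backup processes inside the stash sidecar container.
$ kubectl exec -n samba samba-0 -c stash -- ps aux | grep 'restic backup'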
Can you reproduce this on your end?
A similar thing just happened during a long-running restic forget. (The lost lease is due to my API server being flaky, which unfortunately I can't do anything about at the moment. But I figured you might be interested in the behavior of Stash this results in.)
So first, a restic forget is running...
I0304 15:38:13.672163 1 commands.go:178] Cleaning old snapshots according to retention policy
[golang-sh]$ /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
The lock is lost, reacquired, and then the backup starts over...
E0304 16:42:32.077405 1 leaderelection.go:330] error retrieving resource lock samba/lock-statefulset-samba-backup: client rate limiter Wait returned an error: context deadline exceeded
I0304 16:42:32.087620 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0304 16:42:32.117033 1 backupsession.go:390] Lost leadership
I0304 16:42:32.125189 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677944161
I0304 16:42:32.256509 1 backupsession.go:347] Attempting to acquire leadership for backup
I0304 16:42:32.283558 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
E0304 16:42:32.433162 1 recorder.go:78] events is forbidden: User "system:serviceaccount:samba:default" cannot create resource "events" in API group "" in the namespace "default"
I0304 16:42:32.468349 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0304 16:42:32.472645 1 backupsession.go:379] Got leadership, preparing for backup
But it fails, because the forget is also still running in the background and holds the repository lock.
I0304 16:42:32.672383 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Fatal: unable to create lock in backend: repository is already locked exclusively by PID 233 on samba-0 by root (UID 0, GID 0)
lock was created at 2023-03-04 16:38:19 (4m19.217251168s ago)
storage ID 52420233
W0304 16:42:38.718369 1 backupsession.go:434] Failed to take backup for BackupSession backup-1677944161. Reason: Fatal: unable to create lock in backend: repository is already locked exclusively by PID 233 on samba-0 by root (UID 0, GID 0) lock was created at 2023-03-04 16:38:19 (4m19.217251168s ago) storage ID 52420233
I0304 16:42:38.723286 1 status.go:99] Updating post backup status.......
I0304 16:42:38.953631 1 backupsession.go:390] Lost leadership
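(Side note: restic can list and clear repository locks directly, which is handy for confirming who holds the lock in a situation like this. A minimal sketch, assuming the same repository and credential environment variables the sidecar already uses:)
# List the lock IDs currently present in the repository (--no-lock avoids taking a lock itself).
$ restic list locks --no-lock --cache-dir /stash-tmp/restic-cache
# Remove stale locks only; --remove-all would also remove live locks, which would be dangerous
# while the forget/prune is genuinely still running, as it is here.
$ restic unlock --cache-dir /stash-tmp/restic-cache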
Then, some time later, that restic forget actually returns, showing that the sidecar container is still aware of the old, ongoing BackupSession...
I0304 16:42:38.953631 1 backupsession.go:390] Lost leadership
I0304 17:44:14.668413 1 commands.go:412] sh-output: [{"tags":null,"host":"samba-0","paths":["/mnt/data/private"],"keep":[....]"matches":["last snapshot"],"counters":{}}]}]
I0304 17:44:15.265525 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677944161
I0304 17:44:15.269583 1 status.go:317] Verifying repository integrity...........
Hello @sgielen. Thank you for reporting the issue. We will try to reproduce the issue on our end and get back to you.
@hmsayem did you succeed in reproducing the issue on your end? :)
@sgielen. Unfortunately, we have not had the opportunity to reproduce the issue yet. We will start working on the issue as soon as possible and keep you updated on any progress. Thank you for your patience.
So it looks like this restarted the backup, but did not actually kill the old one
Did it happen for the same BackupSession?
So it looks like this restarted the backup, but did not actually kill the old one
Did it happen for the same BackupSession?
Yes, absolutely. It is very reproducible here, now that I've noticed this is what happens. I think it has been occurring for a long time, and I suspect it might also be what caused #1488?
You can see it in the logs here, too, from this morning:
I0321 11:00:56.399129 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679396456
I0321 11:00:56.415584 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 11:00:56.416241 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 11:00:56.905441 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 11:00:56.907545 1 backupsession.go:379] Got leadership, preparing for backup
I0321 11:00:57.923548 1 commands.go:100] Checking whether the backend repository exist or not....
[golang-sh]$ /bin/restic snapshots --json --no-lock --cache-dir /stash-tmp/restic-cache
I0321 11:01:02.055393 1 commands.go:412] sh-output: [snip]
I0321 11:01:02.147208 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
I0321 11:03:41.105896 1 commands.go:412] sh-output: {"message_type":"status","seconds_elapsed":85,"percent_done":0}
{"message_type":"status","seconds_elapsed":146,"percent_done":1,"total_files":21596,"files_done":21596,"total_bytes":1448142982967,"bytes_done":1448142982967}
{"message_type":"summary","files_new":0,"files_changed":0,"files_unmodified":21596,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":8,"data_blobs":0,"tree_blobs":0,"data_added":0,"total_files_processed":21596,"total_bytes_processed":1448142982967,"total_duration":150.481592673,"snapshot_id":"d2dc24e3"}
I0321 11:03:41.779846 1 backupsession.go:429] Backup completed successfully for BackupSession backup-1679396456
I0321 11:03:42.033425 1 status.go:99] Updating post backup status.......
I0321 11:03:42.970604 1 status.go:378] Waiting for all other targets/hosts to complete their backup.....
I0321 11:03:42.989620 1 status.go:297] Applying retention policy.....
I0321 11:03:43.050430 1 commands.go:178] Cleaning old snapshots according to retention policy
[golang-sh]$ /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
E0321 11:30:57.874138 1 leaderelection.go:330] error retrieving resource lock samba/lock-statefulset-samba-backup: Get "https://10.43.0.1:443/api/v1/namespaces/samba/configmaps/lock-statefulset-samba-backup": context deadline exceeded
I0321 11:31:00.590746 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0321 11:31:08.329558 1 backupsession.go:390] Lost leadership
I0321 11:31:22.554811 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679396456
E0321 11:31:24.690379 1 recorder.go:78] events is forbidden: User "system:serviceaccount:samba:default" cannot create resource "events" in API group "" in the namespace "default"
I0321 11:31:25.338329 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 11:31:25.821707 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 11:31:26.256067 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 11:31:26.284248 1 backupsession.go:379] Got leadership, preparing for backup
I0321 11:31:37.865459 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Note how it starts backup-1679396456, then starts the restic forget, and, when the lease fails to renew, restarts the same backup entirely. This is the only BackupSession that's ongoing; no BackupSessions are pending. At the time of writing it is set to Failed even though the original restic forget is still running:
$ kubectl get backupsession -n samba
NAME                INVOKER-TYPE          INVOKER-NAME   PHASE    DURATION   AGE
backup-1679189534   BackupConfiguration   backup         Failed   45m23s     2d10h
backup-1679362325   BackupConfiguration   backup         Failed   10s        10h
backup-1679396456   BackupConfiguration   backup         Failed   34m50s     88m
$ kubectl exec -ti -n samba samba-0 -c stash -- ps aux
PID   USER     TIME  COMMAND
    1 root      5:54 /stash run-backup --invoker-name=backup --invoker-kind=BackupConfiguration --target-name=samba --target-namespace=samba --target-kind=StatefulSet --enable-cache=true
  224 root      0:00 sh
  251 root     36:14 /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
  281 root      0:00 ps aux
@sgielen can you please share the YAML of the failed BackupSession (backup-1679396456) with us?
apiVersion: stash.appscode.com/v1beta1
kind: BackupSession
metadata:
  creationTimestamp: "2023-03-21T11:00:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: stash-backup
    app.kubernetes.io/managed-by: stash.appscode.com
    stash.appscode.com/invoker-name: backup
    stash.appscode.com/invoker-type: BackupConfiguration
    stash.appscode.com/target-kind: StatefulSet
    stash.appscode.com/target-name: samba
    stash.appscode.com/target-namespace: ""
  name: backup-1679396456
  namespace: samba
  ownerReferences:
  - apiVersion: stash.appscode.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: BackupConfiguration
    name: backup
    uid: 5e47ff1a-5ee8-4fb9-a73f-3407b36f44fb
  resourceVersion: "22222554"
  uid: 47f47cf8-a18f-4af9-8dff-51b90f215e36
spec:
  invoker:
    apiGroup: stash.appscode.com
    kind: BackupConfiguration
    name: backup
status:
  conditions:
  - lastTransitionTime: "2023-03-21T11:01:02Z"
    message: Repository exist in the backend.
    reason: BackendRepositoryFound
    status: "True"
    type: BackendRepositoryInitialized
  - lastTransitionTime: "2023-03-21T11:35:46Z"
    message: Successfully cleaned up backup history according to backupHistoryLimit.
    reason: SuccessfullyCleanedBackupHistory
    status: "True"
    type: BackupHistoryCleaned
  - lastTransitionTime: "2023-03-21T11:35:46Z"
    message: Successfully pushed metrics.
    reason: SuccessfullyPushedMetrics
    status: "True"
    type: MetricsPushed
  phase: Failed
  sessionDeadline: "2023-03-23T11:00:56Z"
  sessionDuration: 34m50s
  targets:
  - phase: Failed
    postBackupActions:
    - ApplyRetentionPolicy
    - VerifyRepositoryIntegrity
    - SendRepositoryMetrics
    preBackupActions:
    - InitializeBackendRepository
    ref:
      apiVersion: apps/v1
      kind: StatefulSet
      name: samba
      namespace: samba
    stats:
    - error: 'failed to complete backup for host samba-0. Reason: signal: terminated'
      hostname: samba-0
      phase: Failed
    totalHosts: 1
It was terminated, by the way, because I killed the restic backup. This is also visible in the logs below:
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
[...]
W0321 11:33:52.288366 1 backupsession.go:434] Failed to take backup for BackupSession backup-1679396456. Reason: signal: terminated
I0321 11:34:19.413866 1 status.go:99] Updating post backup status.......
I killed the restic backup because it was running at the same time as the restic forget, which held the repository lock, so it would have failed anyway.
I believe it should be possible to reproduce this issue by deleting the Lease in the same namespace as the backup Pod. I'm currently trying that.
I believe it should be possible to reproduce this issue by deleting the Lease in the same namespace as the backup Pod. I'm currently trying that.
That didn't work, unfortunately; the process just created a new lock-statefulset-samba-backup. Perhaps it would work if the Lease were deleted and there were no permissions to create a new one... I'm trying to reproduce the scenario I have here, where the API server sometimes simply doesn't answer for a few seconds (because of high load), so reacquiring the lock fails.
Got it! If you change the holder of the lease using kubectl edit lease -n samba lock-statefulset-samba-backup, the issue is reproduced:
$ kubectl get lease -n samba -w
NAME                            HOLDER        AGE
lock-statefulset-samba-backup   not-samba-0   75s
I0321 13:05:43.971306 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
[ change holder here ]
I0321 13:07:11.351787 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0321 13:07:11.351909 1 backupsession.go:390] Lost leadership
I0321 13:07:11.351998 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679403934
I0321 13:07:11.385326 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 13:07:11.390535 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 13:07:27.268253 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 13:07:27.271271 1 backupsession.go:379] Got leadership, preparing for backup
I0321 13:07:27.384546 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Now, two backups are ongoing at the same time. Can you reproduce this, too?
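(In case it helps with reproducing: a non-interactive equivalent of the kubectl edit step above, assuming the lock is backed by a coordination.k8s.io/v1 Lease whose spec.holderIdentity is the HOLDER column shown by kubectl get lease:)
# Hand the lease to a different (hypothetical) holder to simulate the sidecar losing leadership.
$ kubectl patch lease lock-statefulset-samba-backup -n samba --type merge \
    -p '{"spec":{"holderIdentity":"not-samba-0"}}'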
It was terminated, by the way, because I killed the restic backup
So the backup keeps restarting indefinitely unless you kill it explicitly?
So the backup keeps restarting indefinitely unless you kill it explicitly?
Every time the process loses its lease, it restarts the backup while keeping the old one running. Eventually, all of them will succeed or fail, but that's beside the point: the root of the issue is that multiple backups end up running at the same time.
Now, two backups are ongoing at the same time. Can you reproduce this, too?
I will try to reproduce it and keep you updated.
Thank you!