Two backups running simultaneously when the sidecar container loses its lease?
I was just looking at the live logs of the stash sidecar container, as it was backing up a volume:
I0304 14:23:33.143651 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
As it was doing this, it lost the backup lock because of some timeout talking to the API server:
I0304 14:24:41.343247 1 request.go:614] Waited for 73.429648ms due to client-side throttling, not priority and fairness, request: PUT:https://10.43.0.1:443/api/v1/namespaces/samba/configmaps/lock-statefulset-samba-backup
E0304 14:24:41.612130 1 leaderelection.go:367] Failed to update lock: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
I0304 14:24:41.854371 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0304 14:24:51.247217 1 backupsession.go:390] Lost leadership
I0304 14:24:52.330837 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677939803
I0304 14:24:56.453093 1 backupsession.go:347] Attempting to acquire leadership for backup
I0304 14:24:56.909165 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0304 14:24:57.519198 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0304 14:24:57.701089 1 backupsession.go:379] Got leadership, preparing for backup
I0304 14:25:04.823962 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
So it looks like this restarted the backup, but did not actually kill the old one, so now there are two running backups according to ps aux:
47 root 1:15 /bin/restic backup /mnt/data/timemachine --quiet --json --
59 root 0:00 /bin/restic backup /mnt/data/timemachine --quiet --json --
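(For what it's worth, this is quick to check from outside the Pod too; a minimal sketch, assuming the sidecar container is named stash and the Pod is samba-0 in the samba namespace:)
# Hypothetical check: list restic backup processes inside the stash sidecar container.
$ kubectl exec -n samba samba-0 -c stash -- ps aux | grep 'restic backup'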
Can you reproduce this on your end?
A similar thing just happened during a long-running restic forget. (The lost lease is due to my API server being flaky, which unfortunately I can't do anything about at the moment. But I figured you might be interested in the behavior of Stash this results in.)
So first, a restic forget is running...
I0304 15:38:13.672163 1 commands.go:178] Cleaning old snapshots according to retention policy
[golang-sh]$ /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
The lock is lost, reacquired, and then the backup starts over...
E0304 16:42:32.077405 1 leaderelection.go:330] error retrieving resource lock samba/lock-statefulset-samba-backup: client rate limiter Wait returned an error: context deadline exceeded
I0304 16:42:32.087620 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0304 16:42:32.117033 1 backupsession.go:390] Lost leadership
I0304 16:42:32.125189 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677944161
I0304 16:42:32.256509 1 backupsession.go:347] Attempting to acquire leadership for backup
I0304 16:42:32.283558 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
E0304 16:42:32.433162 1 recorder.go:78] events is forbidden: User "system:serviceaccount:samba:default" cannot create resource "events" in API group "" in the namespace "default"
I0304 16:42:32.468349 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0304 16:42:32.472645 1 backupsession.go:379] Got leadership, preparing for backup
But it fails, because the forget is also still running in the background and holds the repository lock.
I0304 16:42:32.672383 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Fatal: unable to create lock in backend: repository is already locked exclusively by PID 233 on samba-0 by root (UID 0, GID 0)
lock was created at 2023-03-04 16:38:19 (4m19.217251168s ago)
storage ID 52420233
W0304 16:42:38.718369 1 backupsession.go:434] Failed to take backup for BackupSession backup-1677944161. Reason: Fatal: unable to create lock in backend: repository is already locked exclusively by PID 233 on samba-0 by root (UID 0, GID 0) lock was created at 2023-03-04 16:38:19 (4m19.217251168s ago) storage ID 52420233
I0304 16:42:38.723286 1 status.go:99] Updating post backup status.......
I0304 16:42:38.953631 1 backupsession.go:390] Lost leadership
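(Side note: restic can list and clear repository locks directly, which is handy for confirming who holds the lock in a situation like this. A minimal sketch, assuming the same repository and credential environment variables the sidecar already uses:)
# List the lock IDs currently present in the repository (--no-lock avoids taking a lock itself).
$ restic list locks --no-lock --cache-dir /stash-tmp/restic-cache
# Remove stale locks only; --remove-all would also remove live locks, which would be dangerous
# while the forget/prune is genuinely still running, as it is here.
$ restic unlock --cache-dir /stash-tmp/restic-cache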
Then, some time later, that restic forget actually returns, showing that the sidecar container is still aware of the old, ongoing BackupSession...
I0304 16:42:38.953631 1 backupsession.go:390] Lost leadership
I0304 17:44:14.668413 1 commands.go:412] sh-output: [{"tags":null,"host":"samba-0","paths":["/mnt/data/private"],"keep":[....]"matches":["last snapshot"],"counters":{}}]}]
I0304 17:44:15.265525 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1677944161
I0304 17:44:15.269583 1 status.go:317] Verifying repository integrity...........
Hello @sgielen. Thank you for reporting the issue. We will try to reproduce the issue on our end and get back to you.
@hmsayem did you succeed in reproducing the issue on your end? :)
@sgielen. Unfortunately, we have not had the opportunity to reproduce the issue yet. We will start working on the issue as soon as possible and keep you updated on any progress. Thank you for your patience.
So it looks like this restarted the backup, but did not actually kill the old one
Did it happen for the same BackupSession?
So it looks like this restarted the backup, but did not actually kill the old one
Did it happen for the same BackupSession?
Yes, absolutely. It is very reproducible here, now that I've noticed this is what happens. I think it has been occurring for a long time, and I suspect it might also be what caused #1488?
You can see it in the logs here, too, from this morning:
I0321 11:00:56.399129 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679396456
I0321 11:00:56.415584 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 11:00:56.416241 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 11:00:56.905441 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 11:00:56.907545 1 backupsession.go:379] Got leadership, preparing for backup
I0321 11:00:57.923548 1 commands.go:100] Checking whether the backend repository exist or not....
[golang-sh]$ /bin/restic snapshots --json --no-lock --cache-dir /stash-tmp/restic-cache
I0321 11:01:02.055393 1 commands.go:412] sh-output: [snip]
I0321 11:01:02.147208 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
I0321 11:03:41.105896 1 commands.go:412] sh-output: {"message_type":"status","seconds_elapsed":85,"percent_done":0}
{"message_type":"status","seconds_elapsed":146,"percent_done":1,"total_files":21596,"files_done":21596,"total_bytes":1448142982967,"bytes_done":1448142982967}
{"message_type":"summary","files_new":0,"files_changed":0,"files_unmodified":21596,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":8,"data_blobs":0,"tree_blobs":0,"data_added":0,"total_files_processed":21596,"total_bytes_processed":1448142982967,"total_duration":150.481592673,"snapshot_id":"d2dc24e3"}
I0321 11:03:41.779846 1 backupsession.go:429] Backup completed successfully for BackupSession backup-1679396456
I0321 11:03:42.033425 1 status.go:99] Updating post backup status.......
I0321 11:03:42.970604 1 status.go:378] Waiting for all other targets/hosts to complete their backup.....
I0321 11:03:42.989620 1 status.go:297] Applying retention policy.....
I0321 11:03:43.050430 1 commands.go:178] Cleaning old snapshots according to retention policy
[golang-sh]$ /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
E0321 11:30:57.874138 1 leaderelection.go:330] error retrieving resource lock samba/lock-statefulset-samba-backup: Get "https://10.43.0.1:443/api/v1/namespaces/samba/configmaps/lock-statefulset-samba-backup": context deadline exceeded
I0321 11:31:00.590746 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0321 11:31:08.329558 1 backupsession.go:390] Lost leadership
I0321 11:31:22.554811 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679396456
E0321 11:31:24.690379 1 recorder.go:78] events is forbidden: User "system:serviceaccount:samba:default" cannot create resource "events" in API group "" in the namespace "default"
I0321 11:31:25.338329 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 11:31:25.821707 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 11:31:26.256067 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 11:31:26.284248 1 backupsession.go:379] Got leadership, preparing for backup
I0321 11:31:37.865459 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Note how it starts backup-1679396456, then starts the restic forget, and, when the lease fails to renew, restarts the same backup entirely. This is the only BackupSession that's ongoing; no BackupSessions are pending. At the time of writing it is set to Failed even though the original restic forget is still running:
$ kubectl get backupsession -n samba
NAME                INVOKER-TYPE          INVOKER-NAME   PHASE    DURATION   AGE
backup-1679189534   BackupConfiguration   backup         Failed   45m23s     2d10h
backup-1679362325   BackupConfiguration   backup         Failed   10s        10h
backup-1679396456   BackupConfiguration   backup         Failed   34m50s     88m
$ kubectl exec -ti -n samba samba-0 -c stash -- ps aux
PID   USER     TIME  COMMAND
    1 root      5:54 /stash run-backup --invoker-name=backup --invoker-kind=BackupConfiguration --target-name=samba --target-namespace=samba --target-kind=StatefulSet --enable-cache=true
  224 root      0:00 sh
  251 root     36:14 /bin/restic forget --quiet --json --keep-last 2 --prune --cache-dir /stash-tmp/restic-cache
  281 root      0:00 ps aux
@sgielen can you please share the YAML of the failed BackupSession (backup-1679396456) with us?
apiVersion: stash.appscode.com/v1beta1
kind: BackupSession
metadata:
  creationTimestamp: "2023-03-21T11:00:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: stash-backup
    app.kubernetes.io/managed-by: stash.appscode.com
    stash.appscode.com/invoker-name: backup
    stash.appscode.com/invoker-type: BackupConfiguration
    stash.appscode.com/target-kind: StatefulSet
    stash.appscode.com/target-name: samba
    stash.appscode.com/target-namespace: ""
  name: backup-1679396456
  namespace: samba
  ownerReferences:
  - apiVersion: stash.appscode.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: BackupConfiguration
    name: backup
    uid: 5e47ff1a-5ee8-4fb9-a73f-3407b36f44fb
  resourceVersion: "22222554"
  uid: 47f47cf8-a18f-4af9-8dff-51b90f215e36
spec:
  invoker:
    apiGroup: stash.appscode.com
    kind: BackupConfiguration
    name: backup
status:
  conditions:
  - lastTransitionTime: "2023-03-21T11:01:02Z"
    message: Repository exist in the backend.
    reason: BackendRepositoryFound
    status: "True"
    type: BackendRepositoryInitialized
  - lastTransitionTime: "2023-03-21T11:35:46Z"
    message: Successfully cleaned up backup history according to backupHistoryLimit.
    reason: SuccessfullyCleanedBackupHistory
    status: "True"
    type: BackupHistoryCleaned
  - lastTransitionTime: "2023-03-21T11:35:46Z"
    message: Successfully pushed metrics.
    reason: SuccessfullyPushedMetrics
    status: "True"
    type: MetricsPushed
  phase: Failed
  sessionDeadline: "2023-03-23T11:00:56Z"
  sessionDuration: 34m50s
  targets:
  - phase: Failed
    postBackupActions:
    - ApplyRetentionPolicy
    - VerifyRepositoryIntegrity
    - SendRepositoryMetrics
    preBackupActions:
    - InitializeBackendRepository
    ref:
      apiVersion: apps/v1
      kind: StatefulSet
      name: samba
      namespace: samba
    stats:
    - error: 'failed to complete backup for host samba-0. Reason: signal: terminated'
      hostname: samba-0
      phase: Failed
    totalHosts: 1
It was terminated, by the way, because I killed the restic backup. This is also visible in the logs below:
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
[...]
W0321 11:33:52.288366 1 backupsession.go:434] Failed to take backup for BackupSession backup-1679396456. Reason: signal: terminated
I0321 11:34:19.413866 1 status.go:99] Updating post backup status.......
I killed the restic backup because it was running at the same time as the restic forget, which held the repository lock, so it would have failed anyway.
I believe it should be possible to reproduce this issue by deleting the Lease in the same namespace as the backup Pod. I'm currently trying that.
I believe it should be possible to reproduce this issue by deleting the Lease in the same namespace as the backup Pod. I'm currently trying that.
That didn't work, unfortunately; the process just created a new lock-statefulset-samba-backup. Perhaps it would work if the Lease were deleted and there were no permissions to create a new one... I'm trying to reproduce the scenario I have here, where the API server sometimes simply doesn't answer for a few seconds (because of high load), so reacquiring the lock fails.
Got it! If you change the holder of the lease using kubectl edit lease -n samba lock-statefulset-samba-backup, the issue is reproduced:
$ kubectl get lease -n samba -w
NAME                            HOLDER        AGE
lock-statefulset-samba-backup   not-samba-0   75s
I0321 13:05:43.971306 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
[ change holder here ]
I0321 13:07:11.351787 1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0321 13:07:11.351909 1 backupsession.go:390] Lost leadership
I0321 13:07:11.351998 1 backupsession.go:176] Sync/Add/Update for Backup Session backup-1679403934
I0321 13:07:11.385326 1 backupsession.go:347] Attempting to acquire leadership for backup
I0321 13:07:11.390535 1 leaderelection.go:248] attempting to acquire leader lease samba/lock-statefulset-samba-backup...
I0321 13:07:27.268253 1 leaderelection.go:258] successfully acquired lease samba/lock-statefulset-samba-backup
I0321 13:07:27.271271 1 backupsession.go:379] Got leadership, preparing for backup
I0321 13:07:27.384546 1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --exclude /mnt/data/private/sjors/library --cache-dir /stash-tmp/restic-cache --cleanup-cache
Now, two backups are ongoing at the same time. Can you reproduce this, too?
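(In case it helps with reproducing: a non-interactive equivalent of the kubectl edit step above, assuming the lock is backed by a coordination.k8s.io/v1 Lease whose spec.holderIdentity is the HOLDER column shown by kubectl get lease:)
# Hand the lease to a different (hypothetical) holder to simulate the sidecar losing leadership.
$ kubectl patch lease lock-statefulset-samba-backup -n samba --type merge \
    -p '{"spec":{"holderIdentity":"not-samba-0"}}'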
It was terminated, by the way, because I killed the restic backup
So the backup keeps restarting indefinitely unless you kill it explicitly?
So the backup keeps restarting indefinitely unless you kill it explicitly?
Every time the process loses its lease, it restarts the backup while keeping the old one running. Eventually, all of them will succeed or fail, but that's beside the point: the root of the issue is that multiple backups end up running at the same time.
Now, two backups are ongoing at the same time. Can you reproduce this, too?
I will try to reproduce it and keep you updated.
Thank you!