stash icon indicating copy to clipboard operation
stash copied to clipboard

Failed BackupSession is still marked as "running"

Open sgielen opened this issue 3 years ago • 6 comments

I have the following in my stash container logs:

I0930 05:51:15.930910       1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --cache-dir /tmp/restic-cache --cleanup-cache
E0930 05:53:15.949682       1 leaderelection.go:367] Failed to update lock: context deadline exceeded
I0930 05:53:17.302915       1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0930 05:53:17.509894       1 backupsession.go:398] Lost leadership
I0930 05:53:18.075803       1 backupsession.go:187] Sync/Add/Update for Backup Session backup-1664245947
I0930 05:53:19.934081       1 backupsession.go:214] Skip processing BackupSession samba/backup-1664245947. Reason: BackupSession has been processed already for host "samba-0"
{"message_type":"error","error":{"Op":"lstat","Path":"/mnt/data/timemachine/sjors/[snip]","Err":2},"during":"archival","item":"/mnt/data/timemachine/sjors/[snip]"}
{"message_type":"error","error":{"Op":"lstat","Path":"/mnt/data/timemachine/sjors/[snip]","Err":2},"during":"archival","item":"/mnt/data/timemachine/sjors/[snip]"}
Warning: at least one source file could not be read
W0930 09:48:21.666033       1 backupsession.go:442] Failed to take backup for BackupSession backup-1664505145. Reason: Warning: at least one source file could not be read
I0930 09:48:21.693846       1 status.go:99] Updating post backup status.......

Inside the stash container, only /stash is running, nothing restic.

However, the BackupSession is still Phase: Running:

Status:
  Conditions:
    Last Transition Time:  2022-09-30T02:32:32Z
    Message:               Repository exist in the backend.
    Reason:                BackendRepositoryFound
    Status:                True
    Type:                  BackendRepositoryInitialized
  Phase:                   Running
  Session Deadline:        2022-10-02T02:32:25Z

Could this be a bug?

sgielen avatar Sep 30 '22 15:09 sgielen

@hmsayem and @piyush1146115 Please take a look when you are available.

hossainemruz avatar Sep 30 '22 16:09 hossainemruz

Hello @sgielen! Did the BackupSession get stuck in the Running phase?

hmsayem avatar Sep 30 '22 18:09 hmsayem

Stash runs backup as a non-root user. Thus, Stash fails to read the targeted data, If the target data directory is not readable to all the users. Solution is to run the backup process as the same user as the targeted application or run the backup process as the root user.
Please check the troubleshooting guide for more details: https://stash.run/docs/v2022.09.29/guides/troubleshooting/source-data-read-failed/

hmsayem avatar Oct 03 '22 05:10 hmsayem

Hello @sgielen! Did the BackupSession get stuck in the Running phase?

Yes, it was in the Running phase and got stuck.

Stash runs backup as a non-root user. Thus, Stash fails to read the targeted data, If the target data directory is not readable to all the users. Solution is to run the backup process as the same user as the targeted application or run the backup process as the root user.

It does not seem to me to be an access issue: the backup runs as root (securityContext runAsUser 0 and runAsGroup 0), has succeeded multiple times before, and only 2 out of the thousands of files could not be read this time. The files are on a remote volume, so it is more likely to me that a network issue caused the backup to fail. A failing backup would then be expected, but not one stuck in Running.

In the meantime, the backup timeout expired so the Running backup was removed, but now backups are failing because the repository is locked (repository is already locked by... for a PID that is not running). I know how to fix this -- I'm mentioning it because perhaps this could give an idea why the backup was stuck in Running -- maybe restic did not exit in the expected way?

sgielen avatar Oct 03 '22 07:10 sgielen

Restic creates a lock at the beginning of its operation. Since the last backup couldn't get completed and failed due to timeout, Restic coudn't remove the lock.

We need to check the BackupSession that failed due to timeout. Please share the BackupSession with us, if that happens again.

hmsayem avatar Oct 03 '22 08:10 hmsayem

When I created my issue here, restic already wasn't running anymore according to ps aux, I reported this in the original issue. So, at that point, the Backup was still Running (see the partial output of the kubectl describe in the original issue) while restic was already exited.

Unfortunately indeed the timed out BackupSession has been removed from the cluster since then, but I will update this issue if it happens again.

sgielen avatar Oct 03 '22 08:10 sgielen