Failed backup stated as succeeded
### Description
Backups are reported as Succeeded even when the backup command failed. The logs below are just an example; the underlying reason this particular backup failed has already been fixed. However, it is problematic to describe a defective backup as successful.
### Additional Context
_No response_
### Logs
❯ kubectl get backups.k8up.io -n authelia-public
NAME                    SCHEDULE REF   COMPLETION   PREBACKUP   AGE
postgres                               Succeeded    Finished    12h
postgres-backup-qbq6q   postgres       Succeeded    Finished    12h
2023-11-18T23:02:31Z INFO k8up Starting k8up… {"version": "2.7.2", "date": "2023-10-09T10:13:29Z", "commit": "45d99dd90dbb2a080e6832c34e96b371216a3e0b", "go_os": "linux", "go_arch": "amd64", "go_version": "go1.19.13", "uid": 65532, "gid": 0}
2023-11-18T23:02:31Z INFO k8up.restic initializing
2023-11-18T23:02:31Z INFO k8up.restic setting up a signal handler
2023-11-18T23:02:31Z INFO k8up.restic.restic using the following restic options {"options": [""]}
2023-11-18T23:02:31Z INFO k8up.restic.restic.RepoInit.command restic command {"path": "/usr/local/bin/restic", "args": ["init", "--option", ""]}
2023-11-18T23:02:31Z INFO k8up.restic.restic.RepoInit.command Defining RESTIC_PROGRESS_FPS {"frequency": 0.016666666666666666}
2023-11-18T23:02:32Z INFO k8up.restic.restic.unlock unlocking repository {"all": false}
2023-11-18T23:02:32Z INFO k8up.restic.restic.unlock.command restic command {"path": "/usr/local/bin/restic", "args": ["unlock", "--option", ""]}
2023-11-18T23:02:32Z INFO k8up.restic.restic.unlock.command Defining RESTIC_PROGRESS_FPS {"frequency": 0.016666666666666666}
2023-11-18T23:02:36Z INFO k8up.restic.restic.snapshots getting list of snapshots
2023-11-18T23:02:36Z INFO k8up.restic.restic.snapshots.command restic command {"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:36Z INFO k8up.restic.restic.snapshots.command Defining RESTIC_PROGRESS_FPS {"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z INFO k8up.restic.k8sClient listing all pods {"annotation": "k8up.io/backupcommand", "namespace": "authelia-public"}
2023-11-18T23:02:43Z INFO k8up.restic.k8sClient adding to backup list {"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z INFO k8up.restic.k8sExec executing command {"command": "sh, -c, chmod 600 /var/lib/postgresql/.pgpass && pg_dump --clean", "namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z INFO k8up.restic.restic.stdinBackup starting stdin backup {"filename": "/authelia-public-pgdump", "extension": ".sql"}
2023-11-18T23:02:43Z INFO k8up.restic.restic.stdinBackup.command restic command {"path": "/usr/local/bin/restic", "args": ["backup", "--option", "", "--stdin-filename", "/authelia-public-pgdump.sql", "--host", "authelia-public", "--json", "--stdin"]}
2023-11-18T23:02:43Z INFO k8up.restic.restic.stdinBackup.command Defining RESTIC_PROGRESS_FPS {"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z INFO k8up.restic.pgdump-77788b7db9-n4tp6.stderr chmod: changing permissions of '/var/lib/postgresql/.pgpass': Read-only file system
2023-11-18T23:02:43Z ERROR k8up.restic.k8sExec streaming data failed {"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6", "error": "command terminated with exit code 1"}
github.com/k8up-io/k8up/v2/restic/kubernetes.PodExec.func1
/home/runner/work/k8up/k8up/restic/kubernetes/pod_exec.go:74
2023-11-18T23:02:48Z INFO k8up.restic.restic.stdinBackup.progress restic output {"msg": "{\"message_type\":\"error\",\"error\":{\"Op\":\"read\",\"Path\":\"/authelia-public-pgdump.sql\",\"Err\":{}},\"during\":\"archival\",\"item\":\"/authelia-public-pgdump.sql\"}"}
2023-11-18T23:02:48Z ERROR k8up.restic.restic.stdinBackup.progress /authelia-public-pgdump.sql during archival read {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:544
2023-11-18T23:02:48Z INFO k8up.restic.restic.stdinBackup.progress backup finished {"new files": 0, "changed files": 0, "errors": 1}
2023-11-18T23:02:48Z INFO k8up.restic.restic.stdinBackup.progress stats {"time": 2.627027521, "bytes added": 0, "bytes processed": 0}
2023-11-18T23:02:48Z INFO k8up.restic.restic.MountCollector stats mount dir doesn't exist, skipping stats {"dir": "/data"}
2023-11-18T23:02:49Z INFO k8up.restic.restic.stdinBackup.progress restic output {"msg": "Warning: at least one source file could not be read"}
2023-11-18T23:02:49Z INFO k8up.restic backups of annotated jobs have finished successfully
2023-11-18T23:02:49Z INFO k8up.restic.restic.backup starting backup
2023-11-18T23:02:49Z INFO k8up.restic.restic.backup backupdir does not exist, skipping. Sending snapshot list {"dirname": "/data"}
2023-11-18T23:02:49Z INFO k8up.restic.restic.snapshots getting list of snapshots
2023-11-18T23:02:49Z INFO k8up.restic.restic.snapshots.command restic command {"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:49Z INFO k8up.restic.restic.snapshots.command Defining RESTIC_PROGRESS_FPS {"frequency": 0.016666666666666666}
### Expected Behavior
Failed backups should be reported as Failed instead.
### Steps To Reproduce
_No response_
### Version of K8up
v2.7.2
### Version of Kubernetes
v1.27.7+k3s1
### Distribution of Kubernetes
K3s
Hi
I looked into this problem, and after testing the scenario I found that it comes down to how the restic command is invoked.
When the backupcommand annotation is executed, the command's output is piped into restic's stdin, and restic stores the streamed data in a snapshot regardless of whether the command succeeded.
Unfortunately, this can't be fixed on the restic side alone. However, k8up produces a backup summary that reflects the status of the backup; you can use a webhook or Prometheus to retrieve it.
I also have a proposal for fixing this: we could add an annotation that controls error handling for the backup command and deletes the snapshot when the backup command fails. This would be backward compatible.
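To make the mechanism and the missing check concrete, here is a minimal, self-contained sketch (not the actual k8up code): it streams a placeholder dump command into `restic backup --stdin` and fails the job if either process exits non-zero. The restic binary, the repository environment variables, and the pg_dump command are assumptions.

```go
// Sketch: stream a backup command's output into `restic backup --stdin`,
// then fail the overall job if EITHER process exits non-zero.
// Assumes restic is on PATH and RESTIC_REPOSITORY/RESTIC_PASSWORD are set.
package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	pr, pw := io.Pipe()

	// Stand-in for the command configured via the k8up.io/backupcommand annotation.
	dump := exec.Command("sh", "-c", "pg_dump --clean")
	dump.Stdout = pw
	dump.Stderr = os.Stderr

	restic := exec.Command("restic", "backup", "--stdin",
		"--stdin-filename", "/example-pgdump.sql", "--json")
	restic.Stdin = pr
	restic.Stdout = os.Stdout
	restic.Stderr = os.Stderr

	if err := restic.Start(); err != nil {
		log.Fatal(err)
	}
	if err := dump.Start(); err != nil {
		log.Fatal(err)
	}

	dumpErr := dump.Wait() // non-nil when the dump command exits non-zero
	pw.Close()             // signal EOF to restic's stdin
	resticErr := restic.Wait()

	// The crucial part: restic may have written a (partial or empty) snapshot
	// even though the dump failed, so the dump's exit code must fail the job too.
	if dumpErr != nil || resticErr != nil {
		log.Fatalf("backup failed: dump=%v, restic=%v", dumpErr, resticErr)
	}
	log.Println("backup succeeded")
}
```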
@xoxys @Kidswiss
Just wanted to +1 this. I had this problem happen to me with two different workloads for different reasons (xz not being available, and a wrong env syntax for postgres).
If I hadn't checked manually with restic, I wouldn't have noticed this! I think having failed backups marked as such would be a great UX improvement.
I also bumped into this problem, in my case probably due to file permission issues. I did spot the following in the logs:
INFO k8up.restic.restic.backup.progress restic output {"msg": "Warning: at least one source file could not be read"}
It would be great if there were a way to make the issue more visible. I only noticed it while testing the restore procedure.
Concerning the backup procedure, I solved the issue by specifying this under the spec of the Backup object (and the Schedule object as well):
```yaml
podSecurityContext:
  fsGroup: 0
  runAsUser: 0
```
I wanted to back up my Nextcloud, where the data directory is only readable by www-data:www-data, so I ran my Backup as user 0 (root) on purpose. This only works if you are allowed to run the k8up jobs as root on the cluster; otherwise, try a different user ID and group ID.
This is a very important issue. Any backup system must inspire trust. If the system reports backups as successful when they actually did not work, there is no trust to speak of, and without trust the system will have no users.
Maybe there is a way to wrap restic in another process that reads and parses its output to determine whether the operation succeeded?
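Along those lines: restic's --json mode prints one JSON object per line with a message_type field (the error lines are visible in the logs above), so a wrapper could count error messages instead of trusting the exit code alone. A rough sketch, not the parser k8up itself uses:

```go
// Sketch: read `restic backup --json` output line by line, count
// "error" messages, and exit non-zero if any were seen.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type resticMessage struct {
	MessageType string `json:"message_type"` // "status", "error", "summary", ...
}

func main() {
	// Usage: restic backup --json ... 2>&1 | this-wrapper
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // status lines can be long

	errorCount := 0
	for scanner.Scan() {
		var msg resticMessage
		// Non-JSON lines (plain warnings etc.) are simply skipped.
		if err := json.Unmarshal(scanner.Bytes(), &msg); err != nil {
			continue
		}
		if msg.MessageType == "error" {
			errorCount++
		}
	}

	if errorCount > 0 {
		fmt.Fprintf(os.Stderr, "restic reported %d error(s), treating backup as failed\n", errorCount)
		os.Exit(1)
	}
	fmt.Println("no errors reported by restic")
}
```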
Hello,
We are currently evaluating k8up for adoption at our organisation, and we ran into this in our testing.
It looks promising, and apart from a few final checks we're likely to go with it. This issue is a blocker for adoption for us, so I'd like to contribute to fixing it here.
In case we do submit a PR, would there be bandwidth/cycles on the project's end to review it?
What would be the criteria for this issue to be complete? Would it be for the Backup resource to report Completion: Failed, as per https://docs.k8up.io/k8up/2.11/references/status.html, when the backup command exits with a non-zero return code?
Anything else I should watch out for? Or can I simply submit a PR?
@tobru I saw you were moving the status, so I wanted to tag you for visibility. Would you be able to help us get an idea of whether this is possible?
EDIT: we're still evaluating, so no promises here - but in a working day or two I'll know whether we go with it, and then I'd help get this fixed.
```yaml
---
apiVersion: k8up.io/v1
kind: PreBackupPod
metadata:
  name: mysqldump
spec:
  # backupCommand: sh -c 'mysqldump -u$USER -p$PW -h $DB_HOST --all-databases'
  backupCommand: 'sh -c "printf whatup && exit 1"'
  pod:
    spec:
      containers:
        - env:
            - name: MARIADB_USER
              value: root
            - name: DB_HOST
              value: mariadb.default.svc.internal:3306
            - name: MARIADB_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mariadb-pass
                  key: password
          image: mariadb
          command:
            - 'sleep'
            - 'infinity'
          imagePullPolicy: Always
          name: mysqldump
```
Just as an extra note, the PreBackupPod above is an easy repro; it's what I used when I first noticed this issue on our end.
With it, the failure neither increments the errors count that the BackupStatus struct stores (visible in the pod logs, where it stays at 0) nor gets reflected as a failed status on the Kubernetes Backup resource.
I have pushed an e2e test that detects this problem here: https://github.com/danielpodwysocki-gini/k8up/
I will try to fix that and submit a PR soon.
@danielpodwysocki-gini sorry, I didn't see your last comment.
I've already pushed a WIP fix here: https://github.com/k8up-io/k8up/pull/1027
It includes a proper e2e test that detects the issue.
I expect to get this merged by the end of next week at the latest.
Great, thank you!
Much more elegant than what I've been attempting (only halfway successfully so far), so you came up with something nicer just in time.
Here's another case we discovered where a failed backup is reported as succeeded: https://github.com/uselagoon/build-deploy-tool/issues/361
If the mounted volume isn't accessible due to permissions (wrong user, or something else), the restic command errors during the scan, but the backup still continues. This means the backup data in the snapshot is empty, and because no failure was reported we had no way of knowing there was an issue until it was too late and someone wanted to recover data. If the volume can't be read during the scan, I would expect the backup to actually fail, not continue and save an empty snapshot. (A sketch of one possible check follows the log excerpt below.)
1.7243634935042348e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during scan {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
1.724363493509051e+09 INFO k8up.restic.restic.backup.progress progress of backup {"percentage": "0.00%"}
1.7243634938520162e+09 ERROR k8up.restic.restic.backup.progress /data/nginx during archival {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
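One possible safeguard for this case would be to inspect restic's JSON summary and refuse to treat an empty snapshot as success. A minimal sketch, under the assumption that the summary message exposes total_files_processed and total_bytes_processed fields (verify the field names against the restic version in use):

```go
// Sketch: after `restic backup --json`, find the summary message and
// treat a snapshot that processed zero files and zero bytes as a failure.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type backupSummary struct {
	MessageType         string `json:"message_type"`
	TotalFilesProcessed int    `json:"total_files_processed"`
	TotalBytesProcessed int64  `json:"total_bytes_processed"`
	SnapshotID          string `json:"snapshot_id"`
}

func main() {
	// Usage: restic backup --json /data | this-check
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)

	var summary *backupSummary
	for scanner.Scan() {
		var s backupSummary
		if err := json.Unmarshal(scanner.Bytes(), &s); err != nil {
			continue
		}
		if s.MessageType == "summary" {
			summary = &s
		}
	}

	switch {
	case summary == nil:
		fmt.Fprintln(os.Stderr, "no summary message: backup did not complete")
		os.Exit(1)
	case summary.TotalFilesProcessed == 0 && summary.TotalBytesProcessed == 0:
		fmt.Fprintf(os.Stderr, "snapshot %s is empty, treating backup as failed\n", summary.SnapshotID)
		os.Exit(1)
	default:
		fmt.Printf("snapshot %s: %d files processed\n", summary.SnapshotID, summary.TotalFilesProcessed)
	}
}
```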
Hi @shreddedbacon
If you set promURL to a valid Prometheus Pushgateway, you should be able to query k8up_backup_restic_last_errors to see how many files failed. We have a sample alert in the examples folder: https://github.com/k8up-io/k8up/blob/master/config/samples/prometheus/rules.yaml#L4
If you need more help with that, please open another issue, as this one is about not detecting when a prebackup command fails.
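For reference, a minimal sketch of checking that metric via the Prometheus HTTP query API; the Prometheus address is an assumption, and the metric's label set depends on your setup:

```go
// Sketch: query Prometheus for k8up's last-errors metric and exit
// non-zero if any backup reported failed files.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	prom := "http://prometheus.monitoring.svc:9090" // assumed in-cluster Prometheus address
	query := url.QueryEscape("k8up_backup_restic_last_errors > 0")

	resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", prom, query))
	if err != nil {
		fmt.Fprintln(os.Stderr, "query failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		fmt.Fprintln(os.Stderr, "decoding response failed:", err)
		os.Exit(1)
	}

	for _, r := range pr.Data.Result {
		val := "?"
		if len(r.Value) == 2 {
			val = fmt.Sprint(r.Value[1])
		}
		fmt.Printf("backup errors reported: %v -> %s\n", r.Metric, val)
	}
	if len(pr.Data.Result) > 0 {
		os.Exit(1)
	}
}
```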