k8up icon indicating copy to clipboard operation
k8up copied to clipboard

Failed backup stated as succeeded

Open xoxys opened this issue 2 years ago • 7 comments

Description

Backups are stated as Succeeded even if the backup commands failed. This is just an example, the reason why the backup failed has already been fixed. However, it is problematic to describe a defective backup as successful.

Additional Context

No response

Logs

❯ kubectl get backups.k8up.io -n authelia-public 
NAME                    SCHEDULE REF   COMPLETION   PREBACKUP   AGE
postgres                               Succeeded    Finished    12h
postgres-backup-qbq6q   postgres       Succeeded    Finished    12h
2023-11-18T23:02:31Z	INFO	k8up	Starting k8up…	{"version": "2.7.2", "date": "2023-10-09T10:13:29Z", "commit": "45d99dd90dbb2a080e6832c34e96b371216a3e0b", "go_os": "linux", "go_arch": "amd64", "go_version": "go1.19.13", "uid": 65532, "gid": 0}
2023-11-18T23:02:31Z	INFO	k8up.restic	initializing
2023-11-18T23:02:31Z	INFO	k8up.restic	setting up a signal handler
2023-11-18T23:02:31Z	INFO	k8up.restic.restic	using the following restic options	{"options": [""]}
2023-11-18T23:02:31Z	INFO	k8up.restic.restic.RepoInit.command	restic command	{"path": "/usr/local/bin/restic", "args": ["init", "--option", ""]}
2023-11-18T23:02:31Z	INFO	k8up.restic.restic.RepoInit.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
2023-11-18T23:02:32Z	INFO	k8up.restic.restic.unlock	unlocking repository	{"all": false}
2023-11-18T23:02:32Z	INFO	k8up.restic.restic.unlock.command	restic command	{"path": "/usr/local/bin/restic", "args": ["unlock", "--option", ""]}
2023-11-18T23:02:32Z	INFO	k8up.restic.restic.unlock.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
2023-11-18T23:02:36Z	INFO	k8up.restic.restic.snapshots	getting list of snapshots
2023-11-18T23:02:36Z	INFO	k8up.restic.restic.snapshots.command	restic command	{"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:36Z	INFO	k8up.restic.restic.snapshots.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z	INFO	k8up.restic.k8sClient	listing all pods	{"annotation": "k8up.io/backupcommand", "namespace": "authelia-public"}
2023-11-18T23:02:43Z	INFO	k8up.restic.k8sClient	adding to backup list	{"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z	INFO	k8up.restic.k8sExec	executing command	{"command": "sh, -c, chmod 600 /var/lib/postgresql/.pgpass && pg_dump --clean", "namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z	INFO	k8up.restic.restic.stdinBackup	starting stdin backup	{"filename": "/authelia-public-pgdump", "extension": ".sql"}
2023-11-18T23:02:43Z	INFO	k8up.restic.restic.stdinBackup.command	restic command	{"path": "/usr/local/bin/restic", "args": ["backup", "--option", "", "--stdin-filename", "/authelia-public-pgdump.sql", "--host", "authelia-public", "--json", "--stdin"]}
2023-11-18T23:02:43Z	INFO	k8up.restic.restic.stdinBackup.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z	INFO	k8up.restic.pgdump-77788b7db9-n4tp6.stderr	chmod: changing permissions of '/var/lib/postgresql/.pgpass': Read-only file system
2023-11-18T23:02:43Z	ERROR	k8up.restic.k8sExec	streaming data failed	{"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6", "error": "command terminated with exit code 1"}
github.com/k8up-io/k8up/v2/restic/kubernetes.PodExec.func1
	/home/runner/work/k8up/k8up/restic/kubernetes/pod_exec.go:74
2023-11-18T23:02:48Z	INFO	k8up.restic.restic.stdinBackup.progress	restic output	{"msg": "{\"message_type\":\"error\",\"error\":{\"Op\":\"read\",\"Path\":\"/authelia-public-pgdump.sql\",\"Err\":{}},\"during\":\"archival\",\"item\":\"/authelia-public-pgdump.sql\"}"}
2023-11-18T23:02:48Z	ERROR	k8up.restic.restic.stdinBackup.progress	/authelia-public-pgdump.sql during archival read	{"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
	/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
	/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
	/opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:429
io.Copy
	/opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
	/opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
	/opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:544
2023-11-18T23:02:48Z	INFO	k8up.restic.restic.stdinBackup.progress	backup finished	{"new files": 0, "changed files": 0, "errors": 1}
2023-11-18T23:02:48Z	INFO	k8up.restic.restic.stdinBackup.progress	stats	{"time": 2.627027521, "bytes added": 0, "bytes processed": 0}
2023-11-18T23:02:48Z	INFO	k8up.restic.restic.MountCollector	stats mount dir doesn't exist, skipping stats	{"dir": "/data"}
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.stdinBackup.progress	restic output	{"msg": "Warning: at least one source file could not be read"}
2023-11-18T23:02:49Z	INFO	k8up.restic	backups of annotated jobs have finished successfully
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.backup	starting backup
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.backup	backupdir does not exist, skipping. Sending snapshot list	{"dirname": "/data"}
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.snapshots	getting list of snapshots
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.snapshots.command	restic command	{"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:49Z	INFO	k8up.restic.restic.snapshots.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}


### Expected Behavior

Failed backups should be stated as failed instead.

### Steps To Reproduce

_No response_

### Version of K8up

v2.7.2

### Version of Kubernetes

v1.27.7+k3s1

### Distribution of Kubernetes

K3s

xoxys avatar Nov 19 '23 11:11 xoxys

Hi

I checked this problem, and after testing this situation I recognized this problem happened because of restic command.

When the backupcommand annotation is executed, the stdin of the command pipe into restic command, and the restic store stream data in the snapshot

Unfortunately, can't fix this problem because of restic. But k8up has a summary backup for detecting the status of the backup. You can use Webhook or Prometheus to get the status of the backup

Also, I have a solution for fixing this problem: We can add an annotation to handle errors in the backup command and delete the snapshot when the backup command fails. This backward compatible

@xoxys @Kidswiss

poyaz avatar Dec 11 '23 07:12 poyaz

Just wanted to plusone this. I had this problem happen to me with two different workloads for different reasons (xz not being available, and a wrong env syntax for postgres).

If I hadn't checked manually with restic, I wouldn't have noticed this! I think having failed backups marked as such would be a great UX improvement.

nadiamoe avatar May 13 '24 09:05 nadiamoe

Also bumped into this problem, in my case probably due to file permission issues, did spot the following in the logs:

INFO    k8up.restic.restic.backup.progress    restic output    {"msg": "Warning: at least one source file could not be read"}   

It would be great if there would be a way to make the issue more visible. I only noticed this during tests of the restore procedure.

johbo avatar Jun 03 '24 17:06 johbo

Concerning the backup procedure, I solve the issue by specifying this under the spec of the Backup object (and Schedule object as well).

  podSecurityContext:
    fsGroup: 0
    runAsUser: 0

I wanted to backup my Nextcloud where the data directory is only permitted for www-data:www-data. So I ran my Backup with the user 0 (root) on purpose. I think it only works if you have the permission to execute the k8up jobs as root on the cluster. Otherwise you should try with other user id and group id.

damsien avatar Jun 25 '24 17:06 damsien

This is a very important issue. Any backup system must inspire trust. If the system reports successful backup completion but actually does not work, there will be no trust to speak of. Without trust, the system will have no users.

Maybe there is a way to wrap restic in another process that will read and parse its logs to determine the success of the operation?

vizh avatar Sep 05 '24 10:09 vizh

Hello,

We are currently evaluating k8up for adoption at our organisation. We ran into this in our testing.

It looks promising and besides a few last checks, we're likely to go with it. This issue is a blocker for adoption for us, so I'd like to contribute to fixing it here.

I wanted to ask in case we do submit a PR, would there be bandwidth/cycles on the project's end to review it?

What would be the criteria for this issue to be complete? Would it be for the backup resource to report as Completion: Failed as per https://docs.k8up.io/k8up/2.11/references/status.html when the backup command exits with a bad return code?

Anything else I want to watch out for ? Or can I simply submit a PR?

@tobru I saw you were moving the status, so wanted to tag you for visibility. Would you be able to help us to get an idea of if this is possible?

EDIT: we're still evaluating, so no promises here - but in a working day or two I'll know if we go with it and then I'd help get this fixed.

danielpodwysocki-gini avatar Oct 18 '24 08:10 danielpodwysocki-gini

---
apiVersion: k8up.io/v1
kind: PreBackupPod
metadata:
  name: mysqldump
spec:
  #backupCommand: sh -c 'mysqldump -u$USER -p$PW -h $DB_HOST --all-databases'
  backupCommand: 'sh -c "printf whatup && exit 1"'
  pod:
    spec:
      containers:
        - env:
            - name: MARIADB_USER
              value: root
            - name: DB_HOST
              value: mariadb.default.svc.internal:3306
            - name: MARIADB_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mariadb-pass
                  key: password
          image: mariadb
          command:
            - 'sleep'
            - 'infinity'
          imagePullPolicy: Always
          name: mysqldump

Just as an extra note, here's an easy repro/what I used when I first noticed this issue on our end.

This doesn't raise either the errors count (can be found in pod logs and is 0 with this) that the BackupStatus struct stores or get the correct status reflected in the k8s backup resource.

danielpodwysocki-gini avatar Oct 18 '24 09:10 danielpodwysocki-gini

I have pushed an e2e test that detects this problem here: https://github.com/danielpodwysocki-gini/k8up/

I will try to fix that and submit a PR soon.

danielpodwysocki-gini avatar Nov 26 '24 15:11 danielpodwysocki-gini

@danielpodwysocki-gini sorry didn't see your last comment.

I've already pushed a WIP fix here: https://github.com/k8up-io/k8up/pull/1027

Including a proper e2e test that detects the issue.

I expect to get this merged at the end of next week at the latest.

Kidswiss avatar Dec 05 '24 15:12 Kidswiss

Great, thank you!

Much more elegant compared to what I've been attempting (only halfway successfully so-far, so you came with something nicer just in time.)

danielpodwysocki-gini avatar Dec 05 '24 17:12 danielpodwysocki-gini

Just another potential case for a failed backup that succeeds that we discovered: https://github.com/uselagoon/build-deploy-tool/issues/361

If the mounted volume isn't accessible due to permissions (incorrect user, or other), the restic command errors during the scan, but the backup still continues. This means that the backup data in the snapshot is empty, and because there was no failure we were unable to know there was an issue until it was too late and someone wanted to recover data. If the volume fails during a scan, I would have expected that the backup should actually fail. Not continue and save an empty snapshot.

1.7243634935042348e+09	ERROR	k8up.restic.restic.backup.progress	/data/nginx during scan 	{"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
	/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
	/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
	/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
	/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
	/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
	/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544
1.724363493509051e+09	INFO	k8up.restic.restic.backup.progress	progress of backup	{"percentage": "0.00%"}
1.7243634938520162e+09	ERROR	k8up.restic.restic.backup.progress	/data/nginx during archival 	{"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
	/home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
	/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
	/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:429
io.Copy
	/opt/hostedtoolcache/go/1.19.2/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
	/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
	/opt/hostedtoolcache/go/1.19.2/x64/src/os/exec/exec.go:544

shreddedbacon avatar Dec 09 '24 04:12 shreddedbacon

Hi @shreddedbacon

If you set promURL to a valid prometheus push gateway, you should be able to query k8up_backup_restic_last_errors to get how many files failed. We have a sample alert in the examples folder: https://github.com/k8up-io/k8up/blob/master/config/samples/prometheus/rules.yaml#L4

If you need more help with that, please open another issue, as this issue here is about not detecting if a prebackup command fails.

Kidswiss avatar Dec 09 '24 08:12 Kidswiss