Fix possible incorrect behaviour of backup_and_restore.sh with --delete-days parameter

Open selivan opened this issue 3 years ago • 0 comments

Summary

backup_and_restore.sh has a --delete-days parameter that works in a very straightforward way:

find ${BACKUP_LOCATION}/mailcow-* -maxdepth 0 -mmin +$((${1}*60*24)) -exec rm -rvf {} \;

In addition, the script performs rotation without checking whether the backup was successful.

This sets users up for two very bad scenarios:

Something goes wrong, and every backup taken from that point on is broken. For example, mariabackup no longer succeeds. Backup rotation still works, though, so in N days the user will have N corrupted backups and zero good backups.
The server goes offline for longer than the --delete-days value used in the cron job. Once it is back online, the next run deletes all backups except the one just taken.

The two scenarios can combine: the server goes offline for N+M days, then it comes back online, but docker no longer works because /var/lib/docker failed to mount. The backup volume, however, mounted correctly, so the cron job now creates a broken backup and deletes all the good ones.

I suggest:

Add a check to the script that the backup was successful, and run rotation only if it was
Replace the --delete-days parameter with a safer approach like --number-backups-to-keep
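
To illustrate both suggestions, here is a minimal sketch, not a patch against the real script. The function names rotate_keep_newest and backup_then_rotate are hypothetical, and it assumes backup directories are named mailcow-<timestamp> with a timestamp that sorts lexically in chronological order (for example YYYY-MM-DD-HH-MM-SS).

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not part of
# backup_and_restore.sh. Function bodies run in subshells "( ... )" so that
# their variables and positional parameters do not leak to the caller.

# Delete all but the newest KEEP backup directories in LOCATION.
# Assumes names like mailcow-2022-04-26-18-04-00, so that the lexical order
# produced by glob expansion is also the chronological order.
rotate_keep_newest() (
  location="$1"
  keep="$2"

  # Refuse anything that is not a positive integer, so a typo can never
  # turn into "keep zero backups".
  case "$keep" in
    ''|*[!0-9]*) echo "keep must be a positive integer" >&2; exit 2 ;;
  esac
  [ "$keep" -ge 1 ] || { echo "keep must be at least 1" >&2; exit 2; }

  # Collect existing backup directories, oldest first, in "$@".
  set --
  for dir in "$location"/mailcow-*; do
    [ -d "$dir" ] && set -- "$@" "$dir"
  done

  # Remove the oldest entries until only KEEP remain.
  while [ "$#" -gt "$keep" ]; do
    rm -rf -- "$1"
    shift
  done
)

# Run the given backup command and rotate only if it exits with status 0.
backup_then_rotate() (
  location="$1"
  keep="$2"
  shift 2
  if "$@"; then
    rotate_keep_newest "$location" "$keep"
  else
    echo "backup failed, skipping rotation" >&2
    exit 1
  fi
)
```

With count-based rotation, a server that was offline for a long time still keeps its most recent backups, and a failed run never deletes anything. A call could look like `backup_then_rotate "$BACKUP_LOCATION" 7 some_backup_command`, where some_backup_command stands in for whatever performs the actual backup.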

Motivation

Users will be less likely to find themselves without correct backup.

Also, if the backup script provides an exit code indicating backup success or failure, it can be integrated with monitoring.
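
As a rough idea of what that integration could look like, the sketch below wraps a command, records its exit code in a status file, and passes the same code on to the caller. The helper name run_and_record and the status file format are assumptions made for illustration; the script does not produce anything like this today.

```shell
#!/bin/sh
# Sketch only: run_and_record is a hypothetical helper, and the status file
# format is an assumption, not something backup_and_restore.sh writes.

# Run a command, record its exit code and finish time (seconds since the
# epoch) in STATUS_FILE, and exit with the same code so that cron or a
# monitoring agent can react to it.
run_and_record() (
  status_file="$1"
  shift
  rc=0
  "$@" || rc=$?
  printf 'exit_code=%s finished_at=%s\n' "$rc" "$(date +%s)" > "$status_file"
  exit "$rc"
)
```

A monitoring check could then alert when exit_code is non-zero or when finished_at is older than the expected backup interval, which also covers the case where the cron job stops running altogether.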

Additional context

When I was just starting to work in IT, I learned the hard way that backups should be rotated only after checking that the newly taken backup is correct. Let's spare everybody else that experience :)

Apr 26 '22 18:04 selivan