zfs-backup
Auto recovery in case of locally-missing remote snapshot, other improvements
This solves a few issues and adds a bit more automation to remote-local snapshot matching:
- Implemented auto recovery in case of a locally-missing remote snapshot
- `zfs hold` the local snapshot before the send operation, and `zfs release` it on program exit, to ensure existence and retention
- Improved consistency: if the target machine is `localhost`, ssh is not used at all (`ssh -n` is not necessary anymore)
- Improved compatibility: use `sh` instead of `ksh`, fall back to `sudo` if `pfexec` is not found, and the `zfs` command location is now auto-discovered (a sketch of the fallbacks follows this list)
- Removed redundant code: now $RECENT is used programmatically, instead of branching
- Documented configurable parameters in the config file, and added a reminder to set properties on datasets
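The privilege and `zfs` discovery fallbacks look roughly like this (a simplified sketch, not the exact code in the PR):

```sh
#!/bin/sh
# Sketch: prefer pfexec (Solaris) and fall back to sudo when it is absent.
if command -v pfexec >/dev/null 2>&1; then
    PRIV=pfexec
else
    PRIV=sudo
fi

# Auto-discover the zfs binary, trying common locations as a fallback.
ZFS=""
for candidate in "$(command -v zfs)" /sbin/zfs /usr/sbin/zfs; do
    if [ -n "$candidate" ] && [ -x "$candidate" ]; then
        ZFS=$candidate
        break
    fi
done
[ -n "$ZFS" ] || { echo "zfs not found" >&2; exit 1; }

$PRIV "$ZFS" list -H -o name
```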
The auto recovery feature basically only changes the way the base of the incremental send is obtained: if the latest remote snapshot is missing on the local machine, instead of just going into maintenance mode, zfs-backup tries to find an older snapshot that exists locally and uses that as the `zfs send -I` base instead. Up to $MAXAGE-1 snapshots older than the latest remote are checked; you can configure MAXAGE in the config file. A rough sketch of the lookup follows.
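Variable and dataset names here are illustrative, not the actual ones used in the script:

```sh
#!/bin/sh
# Sketch of the recovery lookup: walk the remote snapshots from newest
# to oldest and stop at the first one that also exists locally.
MAXAGE=5                                  # 0 would mean "check all"

# Assumed input: remote snapshot names, newest first, one per word.
REMOTE_SNAPS="tank/data@03 tank/data@02 tank/data@01"
LOCAL_FS="tank/data"

local_snaps=$(zfs list -H -o name -t snapshot -r "$LOCAL_FS")
base=""
tries=0
for rsnap in $REMOTE_SNAPS; do
    tries=$((tries + 1))
    if [ "$MAXAGE" -gt 0 ] && [ "$tries" -gt "$MAXAGE" ]; then
        break                             # gave up after MAXAGE candidates
    fi
    if printf '%s\n' "$local_snaps" | grep -qx "$rsnap"; then
        base="$rsnap"                     # newest snapshot present on both sides
        break
    fi
done

if [ -n "$base" ]; then
    echo "incremental base: $base"        # used as: zfs send -I "$base" ...
else
    echo "no common snapshot, entering maintenance mode" >&2
fi
```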
Sorry, I noticed there was a bug as I was testing with multiple datasets (`trap` is called only once). I'll fix it ASAP and open another pull request; for now I'll leave this open in case there's anything you'd like to change before that.
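For context, in POSIX sh a second `trap` on the same condition replaces the first instead of stacking, which is presumably what happens here with multiple datasets. A tiny illustration (not the actual script code):

```sh
#!/bin/sh
# The second trap on EXIT replaces the first one, so this script
# prints only "releasing b" on exit.
trap 'echo releasing a' EXIT
trap 'echo releasing b' EXIT
```

So the per-snapshot `zfs release` handlers would have to be merged into a single EXIT handler.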
Thank you for your contribution!
I've done a quick read-through and seen enough little things here and there that I don't want to merge as-is, and I need to take time to read the recovery logic in-depth. It's a good idea but we need to make sure it is doing what we expect it to. Also, I'm not sure MAXAGE is the right name if it's based on number of snapshots vs. date?
Quick notes:
- "No need to change this" isn't 100% accurate for LOCK and PID, as if you have multiple backup jobs (e.g. local and offsite) you want different LOCK & PID files for each. There's also a couple typos in various comments.
- When running from cron, directories like /sbin may not be in the PATH, so we shouldn't rely on finding zfs with
which
. Having sensible defaults based on OS (for at least Solaris/FreeBSD/Linux) is on my to-do list. - I believe
zfs send
automatically does its own hold, so I don't know if yours is necessary, but if so, you probably wantzfs hold -r
(andzfs release -r
). I would have to check but it's possible the auto-hold is only on the snapshot being sent at a particular moment (and not its children), so a recursive hold may be desirable. - Please explain removing
ssh -n
. Note that the if/else block you eliminated also outputs different error messages for local vs. remote, which I thought would be helpful. And I'm pretty sure ssh is already skipped for localhost? - Storing the entire snapshot listing in a variable (e.g. $all_locals, $all_remotes) could hit variable length limits, meaning some snapshots are missed (vs. piping to grep and head/awk before assigning the variable). From a quick Google this appears to be platform-dependent on ARG_MAX, e.g. 256k on FreeBSD. It's probably an edge case that would be unlikely to be encountered, but I'd rather not introduce potential bugs.
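Something along these lines is what I'd expect for the recursive hold (dataset, tag, and host names are made up):

```sh
#!/bin/sh
# Sketch: hold the snapshot (and its descendants) for the duration of
# the send, releasing the hold on exit.
SNAP="tank/data@2013-01-01"
TAG="zfs-backup"

zfs hold -r "$TAG" "$SNAP" || exit 1
trap 'zfs release -r "$TAG" "$SNAP"' EXIT

zfs send -R "$SNAP" | ssh backup-host "zfs receive -d backup"
```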
The recovery logic is very simple: if the latest remote snapshot does not exist locally, check whether the next older remote one does, and retry up to $MAXAGE times (or until all remote snapshots have been checked, if MAXAGE=0). I agree that it's not the best name... maybe something like MAXREMOTERECENT, as it resembles your RECENT variable in some ways?
About those issues:
- True, my bad. I just never needed to do backups to multiple remotes yet, so leaving it as it is seemed obvious, but others have other needs.
- True as well. A quick test also rules out `/usr/bin/env which`, as it does not include `/sbin` either. A solution could be a simple `which zfs || which /sbin/zfs || which /usr/sbin/zfs`.
- I introduced the `hold` as a more consistent way of error checking. The previous behaviour was to just re-check the existence of $snap2 after $snap1 has been determined, and then do the send. Theoretically the local snapshot could also be rotated out after that check but before the send, which can never happen with the `hold`, as it is atomic. Adding the `-r` would also be good.
- Wow, I actually missed the fact that the `localhost` branch avoided the `ssh -n`, sorry. That change is then superfluous, though I think that using $REMZFS_CMD consistently instead of branching is more DRY-friendly.
- An edge case, but it could still happen. Still, doing information retrieval before everything else, instead of piping command outputs, has the advantage of letting you use that information in more than one place, while optimizing the calls to the `zfs` command. For example, while checking whether the remote snapshots exist locally one by one, at every iteration the `zfs list` output of the local host is grepped against the remote snapshot name. You can save time by buffering that command's output in a variable or, to avoid hitting the variable limit, in a temporary file (see the sketch below). What do you think?
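Roughly what I have in mind, as a minimal sketch (names are illustrative):

```sh
#!/bin/sh
# Sketch: run zfs list once, buffer the output in a temp file, then
# grep the file for each remote candidate instead of re-running zfs list.
LOCAL_FS="tank/data"
REMOTE_SNAPS="tank/data@03 tank/data@02 tank/data@01"

locals=$(mktemp) || exit 1
trap 'rm -f "$locals"' EXIT

zfs list -H -o name -t snapshot -r "$LOCAL_FS" > "$locals"

for rsnap in $REMOTE_SNAPS; do
    if grep -qx "$rsnap" "$locals"; then
        echo "$rsnap exists locally"
    fi
done
```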