
Auto recovery in case of locally-missing remote snapshot, other improvements

Open mvitale1989 opened this issue 7 years ago • 3 comments

This solves a few issues and adds a bit more automation to remote-local snapshot matching:

  • Implemented auto recovery in case of locally-missing remote snapshot
  • zfs hold local snapshot before send operation, and zfs release on program exit, to ensure existence and retention.
  • Improved consistency: if target machine is localhost, ssh is not used at all (ssh -n not necessary anymore)
  • Improved compatibility: use sh instead of ksh, fallback to sudo if pfexec is not found, zfs command location is now auto-discovered
  • Removed redundant code: now $RECENT is used programmatically, instead of branching
  • Documented configurable parameters in config file, and added reminder to set properties on datasets

The auto recovery feature only changes the way the base of the incremental send is obtained: if the latest remote snapshot is missing on the local machine, instead of just going into maintenance mode, zfs-backup tries to find an older snapshot that exists locally, and uses that as the zfs send -I base instead. Up to $MAXAGE-1 snapshots older than the latest remote are checked; MAXAGE is configurable in the config file.
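In portable sh, that lookup might look like the sketch below. The function name find_send_base and the argument layout are illustrative (not the PR's actual code), and the exact MAXAGE counting semantics may differ slightly from the implementation:

```shell
# Hypothetical sketch of the recovery lookup. Given newline-separated
# remote snapshot names (newest first) and the local listing, return the
# newest remote snapshot that also exists locally, checking at most
# $maxage candidates (0 = check all).
find_send_base() {
    remote_snaps=$1   # newline-separated remote snapshot names, newest first
    local_snaps=$2    # newline-separated local snapshot names
    maxage=$3         # max candidates to check; 0 means check all
    checked=0
    for snap in $remote_snaps; do
        if [ "$maxage" -gt 0 ] && [ "$checked" -ge "$maxage" ]; then
            break
        fi
        # Exact, literal, whole-line match against the local listing
        if printf '%s\n' "$local_snaps" | grep -Fqx "$snap"; then
            printf '%s\n' "$snap"
            return 0
        fi
        checked=$((checked + 1))
    done
    return 1   # no common base found: fall back to maintenance mode
}
```

The returned name would then serve as the base of the incremental, e.g. zfs send -I "$base" "$latest_local".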

mvitale1989 avatar May 07 '17 16:05 mvitale1989

Sorry, I noticed there was a bug as I was testing with multiple datasets (trap is called only once). I'll fix it ASAP and open another pull request; for now I'll leave this open in case there's anything you'd like to change before that.

mvitale1989 avatar May 08 '17 00:05 mvitale1989

Thank you for your contribution!

I've done a quick read-through and seen enough little things here and there that I don't want to merge as-is, and I need to take time to read the recovery logic in-depth. It's a good idea but we need to make sure it is doing what we expect it to. Also, I'm not sure MAXAGE is the right name if it's based on number of snapshots vs. date?

Quick notes:

  • "No need to change this" isn't 100% accurate for LOCK and PID, as if you have multiple backup jobs (e.g. local and offsite) you want different LOCK & PID files for each. There are also a couple of typos in various comments.
  • When running from cron, directories like /sbin may not be in the PATH, so we shouldn't rely on finding zfs with which. Having sensible defaults based on OS (for at least Solaris/FreeBSD/Linux) is on my to-do list.
  • I believe zfs send automatically does its own hold, so I don't know if yours is necessary, but if so, you probably want zfs hold -r (and zfs release -r). I would have to check but it's possible the auto-hold is only on the snapshot being sent at a particular moment (and not its children), so a recursive hold may be desirable.
  • Please explain removing ssh -n. Note that the if/else block you eliminated also outputs different error messages for local vs. remote, which I thought would be helpful. And I'm pretty sure ssh is already skipped for localhost?
  • Storing the entire snapshot listing in a variable (e.g. $all_locals, $all_remotes) could hit variable length limits, meaning some snapshots are missed (vs. piping to grep and head/awk before assigning the variable). From a quick Google this appears to be platform-dependent on ARG_MAX, e.g. 256k on FreeBSD. It's probably an edge case that would be unlikely to be encountered, but I'd rather not introduce potential bugs.
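The hold/release pattern under discussion could be sketched as below. The user-hold tag name "zfs-backup", the $ZFS override, and the with_hold helper are all illustrative assumptions, not code from the PR; the -r flags follow the recursive-hold suggestion above:

```shell
ZFS=${ZFS:-zfs}   # command override; illustrative, handy for testing

# Run a command while holding the snapshot tree recursively, releasing
# afterwards even if the command fails. Scoping the release per call is
# one way around the single-EXIT-trap problem with multiple datasets.
with_hold() {
    snap=$1; shift
    "$ZFS" hold -r zfs-backup "$snap" || return 1
    "$@"
    rc=$?
    "$ZFS" release -r zfs-backup "$snap"
    return $rc
}
```

Usage might look like: with_hold tank/data@now sh -c 'zfs send ... | ssh ...'. A trap-based variant would instead need to accumulate release commands across datasets, since a second trap on EXIT replaces the first.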

adaugherity avatar May 08 '17 19:05 adaugherity

The recovery logic is very simple: if latest remote snapshot does not exist locally, then check if the first older remote one does, and retry up to $MAXAGE times (or until all remote snapshots have been checked, if MAXAGE=0). I agree that it's not the best name...maybe something like MAXREMOTERECENT, as it resembles your RECENT variable in some ways?

About those issues:

  • True, my bad. I've just never needed to back up to multiple remotes yet, so leaving it as-is seemed obvious, but others have other needs.
  • True as well. A quick test also rules out /usr/bin/env which, as it does not include /sbin either. A simple solution could be which zfs || which /sbin/zfs || which /usr/sbin/zfs.
  • I introduced the hold as a more consistent way of error checking. The previous behaviour was to just re-check the existence of $snap2 after $snap1 had been determined, and then do the send. Theoretically the local snapshot could still be rotated out after that check but before the send, which can never happen with a hold, as it is atomic. Adding the -r would also be good.
  • Wow, I actually missed the fact that the localhost branch avoided the ssh -n, sorry. That change is superfluous then, though I think that using $REMZFS_CMD consistently instead of branching is more DRY-friendly.
  • An edge case, but it could still happen. Still, doing the information retrieval up front, instead of piping command outputs, has the advantage of letting you use that information in more than one place, while minimizing calls to the zfs command. For example, while checking one-by-one whether the remote snapshots exist locally, at every iteration the zfs list output of the local host is grepped against the remote snapshot name. You can save time by buffering that command's output in a variable or, to avoid hitting the variable limit, in a temporary file. What do you think?
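The temp-file buffering idea could look like the sketch below. The helper name, hold of the listing, and file handling are illustrative assumptions; the zfs list invocation is shown as a comment so the sketch stands alone:

```shell
# Buffer the local snapshot listing once in a temp file; grep the file
# per remote snapshot instead of re-running zfs list each iteration,
# and without storing the whole listing in a shell variable.
local_list=$(mktemp) || exit 1
trap 'rm -f "$local_list"' EXIT

# Populate with a single zfs call, e.g.:
#   zfs list -H -t snapshot -o name > "$local_list"

# Exact, literal, whole-line match against the buffered listing.
snapshot_exists_locally() {
    grep -Fqx "$1" "$local_list"
}
```

Compared to a variable, the temp file sidesteps any length limit at the cost of cleanup handling (here a trap on EXIT removes the file).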

mvitale1989 avatar May 08 '17 21:05 mvitale1989