
Backup interruption handling

Open mfr-itr opened this issue 7 years ago • 12 comments

Are interruptions handled correctly? For example, my laptop uploads half of a backup, then it's powered down or the lid is closed for three days. When it's opened again, will it resume the daily backup or start a new one? Same question for an unreliable network, like a Wi-Fi connection that reconnects every 5 minutes. It would be good to have a section about this in the README; it helps in deciding whether your software fits one's needs.

mfr-itr avatar Oct 30 '17 11:10 mfr-itr

I think this is all well described in btrbk(1), Action "run":

Step 2: Create Backups [...] After comparing the backups to the source snapshots, btrbk transfers all missing snapshots needed to satisfy the configured target retention policy [...]

This means that it will resume the daily backup from 3 days ago, and/or create a new backup if the target_preserve or target_preserve_min configuration says so (e.g. both if target_preserve 7d, but only new backup if target_preserve 2d).

Note that interrupted transfers can leave incomplete ("garbled") backups behind; these need to be deleted by running btrbk clean.

digint avatar Oct 30 '17 11:10 digint

Oh, OK, I thought it only kept the last snapshot locally; that makes sense. Did you check if btrfs send correctly handles unreliable connections? Are the garbled backups kept only if the corresponding local snapshot was deleted before completion, or are there other cases when it can happen?

mfr-itr avatar Oct 30 '17 12:10 mfr-itr

Did you check if btrfs send correctly handles unreliable connections?

I did some tests, but not very thorough ones, since btrbk uses ssh, which I fully trust: ssh either provides a reliable pipe or exits with an error. What btrbk does is basically a btrfs send <snapshot> | ssh btrfs receive <path>, so if the network is flaky, ssh ensures that the remote side still gets a reliable stream.
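A minimal sketch of that pipeline as a shell function, to show how a failure on either side can be detected. The function name, host and paths are hypothetical, and set -o pipefail is an addition for this sketch; it is not a claim about how btrbk itself builds or checks its command line.

```shell
#!/bin/bash
# Sketch only: transfer_snapshot and its arguments are hypothetical names.
# pipefail makes the pipeline report failure if *either* side fails.
set -o pipefail

# transfer_snapshot <snapshot> <ssh-host> <target-dir>
transfer_snapshot() {
    local snapshot=$1 host=$2 target_dir=$3
    # A non-zero exit from btrfs send or from ssh fails the whole pipeline.
    if btrfs send "$snapshot" | ssh "$host" btrfs receive "$target_dir"; then
        echo "transfer complete: $snapshot"
    else
        echo "transfer failed: $snapshot (target may hold an incomplete subvolume)" >&2
        return 1
    fi
}
```

A failed call is the situation in which an incomplete ("garbled") subvolume may remain on the target.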

Are the garbled backups kept only if the corresponding local snapshot was deleted before completion, or are there other cases when it can happen?

The garbled backups are kept if btrbk is not able to delete them after an unsuccessful transfer (if the btrfs send | ssh btrfs receive command fails for any reason). This can happen if:

  • btrbk was killed (e.g. by hitting Ctrl-C) while a transfer was ongoing
  • ssh exits because the network is down (e.g. if ServerAliveCountMax is set in /etc/ssh/ssh_config), and the subsequent ssh connection needed to delete the garbled subvolume fails (which is usually the case if the network is down). In this case, btrbk prints a WARNING that deletion of garbled subvolumes failed.

digint avatar Oct 30 '17 13:10 digint

By unreliable network, I mean that the ssh connection is closed like every 5min (for example, by disconnecting/reconnecting the wifi network), so that no upload can complete without at least one failure (assuming each upload takes > 5min).

mfr-itr avatar Oct 30 '17 14:10 mfr-itr

Same behavior: ssh connections do indeed survive closing and reopening of interfaces (just checked on both the ssh client and the server).

However, if I remember correctly, systemd (which I don't use) kills the ssh server (sshd) on network disconnects, so you'll get a garbled subvolume on disconnects here.

digint avatar Oct 30 '17 15:10 digint

So the upload process will start again?

mfr-itr avatar Oct 31 '17 09:10 mfr-itr

Yes, the upload process will be resumed on next run.

In order to make sure that btrbk does not abort on "garbled" subvolumes, some people always do a btrbk clean in their backup scripts, e.g.:

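#!/bin/bash
# Delete incomplete ("garbled") backups left by interrupted transfers
btrbk clean
# Create snapshots and transfer any missing backups
btrbk run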
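On a flaky link, the same idea could be extended with a few retries. This is only a sketch: MAX_ATTEMPTS, RETRY_DELAY and the function name are hypothetical, and it assumes that btrbk run exits non-zero when a backup did not complete. Each retry re-sends the affected snapshot from the beginning.

```shell
#!/bin/bash
# Sketch only: MAX_ATTEMPTS and RETRY_DELAY are hypothetical knobs.
MAX_ATTEMPTS=${MAX_ATTEMPTS:-3}
RETRY_DELAY=${RETRY_DELAY:-60}   # seconds to wait between attempts

backup_with_retries() {
    local attempt
    for ((attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)); do
        # Remove incomplete backups left by an earlier failed transfer.
        btrbk clean
        # Assumption: a non-zero exit status means a backup did not complete.
        if btrbk run; then
            return 0
        fi
        echo "btrbk run failed (attempt $attempt of $MAX_ATTEMPTS)" >&2
        if ((attempt < MAX_ATTEMPTS)); then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}
```

Calling backup_with_retries from cron or a timer would then replace the two plain commands above.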

digint avatar Oct 31 '17 20:10 digint

However, if I remember correctly, systemd (which I don't use) kills the ssh server (sshd) on network disconnects, so you'll get a garbled subvolume on disconnects here.

Maybe it's distro configuration specific, but I don't think that's the case now. It's certainly not the case on Fedora 34.

Yes, the upload process will be resumed on next run.

I think there may be some misunderstanding about the term "resume" in the above discussion. To me, and I think to @mfr-itr(?), "resume" means continuing the upload of a snapshot from the point where it was interrupted, uploading only the outstanding changes for that specific snapshot that hadn't been uploaded when btrbk gave up or failed on it. I think you (@digint) are using "resume" where I would use "restart", since I don't think btrfs send has any mechanism to resume a send. Rather, btrbk has to restart all outstanding btrfs sends from the beginning. As @mfr-itr said in their initial comment, a section on this in the README would help in deciding whether the software fits one's needs. For anyone who has to contend with a particularly flaky network or with particularly large snapshot sends, "garbled" backups, as you refer to them, may be the norm, and they may want to choose a different backup strategy.

Regarding the text that is currently in the README, "Resume backups (for removable and mobile devices)": given the above conversation, it's not clear to me what "resume" means here, or why it's restricted to removable and mobile devices.

jwatt avatar Aug 09 '21 14:08 jwatt

You are right, the term "resume" is a bit misleading here:

  • In the btrbk docs, "resume" refers to the btrbk resume action, meaning "resume missing backups, i.e. re-sending of snapshots (which are never deleted on interrupted transfer)". This is well documented, I'd say. It is not necessarily related to an interrupted send/receive; it can just as well mean "the backup disk was not attached when btrbk was run".
  • With an interrupted transfer over ssh in mind, "resume" kind of implies "resume the transfer where it stopped". As you correctly stated, btrfs send cannot do that, as it is agnostic of the target's state. (This has been discussed in #149: btrbk could buffer the whole send stream on both sides and only start btrfs receive once the file is fully transferred.)
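The buffering idea from the second point could look roughly like the sketch below. This is not something btrbk does; the function name, STREAM_DIR and all paths are hypothetical, and rsync is swapped in here as one possible way to continue a partial file copy. It handles full sends only (no parent snapshot) and needs room for the whole stream file on both sides.

```shell
#!/bin/bash
# Sketch of the "buffer the whole send stream" idea. btrbk does NOT do this.
# Function name, STREAM_DIR and all paths are hypothetical.
STREAM_DIR=${STREAM_DIR:-/var/tmp}

# buffered_transfer <snapshot> <ssh-host> <remote-spool-dir> <remote-target-dir>
# Assumes paths without spaces, since ssh re-parses the remote command line.
buffered_transfer() {
    local snapshot=$1 host=$2 spool=$3 target=$4
    local name stream
    name="$(basename "$snapshot").btrfs-stream"
    stream="$STREAM_DIR/$name"

    # 1. Serialize the snapshot to a local file (skipped if a complete one exists).
    if [ ! -f "$stream" ]; then
        btrfs send -f "$stream.tmp" "$snapshot" || return 1
        mv -- "$stream.tmp" "$stream" || return 1
    fi

    # 2. Copy the file. After a dropped connection, rerunning this step lets
    #    rsync continue the partial remote file instead of starting over.
    rsync --partial --append-verify "$stream" "$host:$spool/" || return 1

    # 3. Only now run btrfs receive, reading the fully transferred file.
    ssh "$host" btrfs receive -f "$spool/$name" "$target" || return 1

    # 4. Remove both copies of the stream file.
    ssh "$host" rm -f "$spool/$name" || return 1
    rm -f -- "$stream"
}
```

Because btrfs receive only starts once the file is complete, a dropped connection during step 2 leaves a partial file in the spool directory rather than a garbled subvolume.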

@jwatt if I get you correctly you basically stumbled upon "Resume backups (for removable and mobile devices)". Any ideas how I could rephrase this? (not native english speaking here, and can't come up with anything better right now, "restart" sounds wrong to me as well).

digint avatar Aug 09 '21 16:08 digint

How about the following phrasing:

"Backups to destinations that are sometimes there and sometimes away (each single snapshot needs to be sent uninterrupted)"

I'm not a native speaker either, but I absolutely thought/hoped that the resuming meant resuming an interrupted send/receive.

daniellandau avatar Nov 15 '21 07:11 daniellandau

This issue was discussed in the buttersink project, and it seems they came up with a possible solution: AmesCornish/buttersink/issues/34

beda17 avatar Sep 13 '22 18:09 beda17

@jwatt if I get you correctly you basically stumbled upon "Resume backups (for removable and mobile devices)". Any ideas how I could rephrase this? (not native english speaking here, and can't come up with anything better right now, "restart" sounds wrong to me as well).

Just to add my two cents, I've always disliked the confusing way btrbk uses the word "resume", and I think misunderstandings like this will continue until the terminology is changed.

Even if the command can't be renamed for compatibility reasons, I'd suggest using a more explicit term in documentation. Examples that spring to my mind: "sync", "update", "push", "send". For "send" it seems obvious to me that snapshots already on the receiving side won't be included, and it would drastically reduce the potential for confusion while taking advantage of the matching btrfs terminology.

luxagen avatar Jan 17 '23 14:01 luxagen