btrbk
Backup interruption handling
Are interruptions handled correctly? For example, my laptop uploads half of a backup and is then powered down (or has its lid closed) for three days. When it is opened again, will it resume the daily backup or start a new one? Same question for an unreliable network, like a wifi that reconnects every 5 minutes. It would be good to have a section about this in the README; it would help people decide whether the software meets their needs.
I think this is all well described in btrbk(1), Action "run":
Step 2: Create Backups [...] After comparing the backups to the source snapshots, btrbk transfers all missing snapshots needed to satisfy the configured target retention policy [...]
This means that it will resume the daily backup from 3 days ago, and/or create a new backup if the target_preserve or target_preserve_min configuration says so (e.g. both if target_preserve 7d, but only the new backup if target_preserve 2d).
Note that interrupted transfers can leave incomplete ("garbled") backups behind; these need to be deleted by running btrbk clean.
Oh, OK, I thought it only kept the last snapshot locally, that makes sense. Did you check if btrfs send correctly handles unreliable connections?
Are the garbled backups kept only if the corresponding local snapshot was deleted before completion, or are there other cases when it can happen?
Did you check if btrfs send correctly handles unreliable connections?
I did some tests, but not that thoroughly, as btrbk uses ssh, which I fully trust: ssh either provides a reliable pipe or exits with errors. What btrbk does is basically a btrfs send <snapshot> | ssh btrfs receive <path>, so if the network is flaky, ssh ensures that the remote part still gets a reliable stream.
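As a rough illustration of the pipeline described above (not btrbk's actual code), the transfer can be sketched as a shell function. The function name send_snapshot and its arguments are hypothetical, and the use of pipefail is an assumption about how one might detect a failure on either side of the pipe:

```shell
#!/bin/bash
# Simplified illustration of the send/receive pipeline; NOT btrbk's implementation.
# send_snapshot and its three arguments are hypothetical names.
# Assumes snapshot and destination paths contain no whitespace.
send_snapshot() {
    local snapshot=$1 host=$2 dest=$3
    (
        # With pipefail, the pipeline reports failure if either
        # "btrfs send" or "ssh ... btrfs receive" exits non-zero.
        set -o pipefail
        btrfs send "$snapshot" | ssh "$host" btrfs receive "$dest"
    )
}
```

A non-zero return value here means the stream did not arrive completely, so whatever btrfs receive created on the remote side has to be treated as garbled.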
Are the garbled backups kept only if the corresponding local snapshot was deleted before completion, or are there other cases when it can happen?
The garbled backups are kept if btrbk is not able to delete them after an unsuccessful transfer (if the btrfs send | ssh btrfs receive command fails for any reason). This can happen if:
- btrbk was killed (e.g. by hitting Ctrl-C) while a transfer was ongoing
- ssh exits because the network is down (e.g. if ServerAliveCountMax is set in /etc/ssh/ssh_config), and the subsequent ssh connection needed to delete the garbled subvolume fails (which is usually the case if the network is down). In this case, btrbk prints a WARNING that deletion of garbled subvolumes failed.
By unreliable network, I mean that the ssh connection is closed like every 5min (for example, by disconnecting/reconnecting the wifi network), so that no upload can complete without at least one failure (assuming each upload takes > 5min).
Same behavior, ssh connections do indeed survive closing / reopening of interfaces (just checked on both ssh client as well as server).
However, if I remember correctly, systemd (which I don't use) kills the ssh server (sshd) on network disconnects, so you'll get a garbled subvolume on disconnects here.
So the upload process will start again?
Yes, the upload process will be resumed on next run.
In order to make sure that btrbk does not abort on "garbled" subvolumes, some people always do a btrbk clean in their backup scripts, e.g.:
#!/bin/bash
btrbk clean
btrbk run
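For a connection that drops regularly, the clean/run sequence above could be wrapped in a retry loop. This is only a sketch: the function name backup_with_retries and its two parameters are hypothetical, and it assumes that btrbk run exits with a non-zero status when a backup task aborts. Each retry restarts the failed transfer from the beginning; it does not continue a partial one.

```shell
#!/bin/bash
# Hypothetical retry wrapper around "btrbk clean" / "btrbk run".
# Assumption: "btrbk run" returns non-zero if a transfer was aborted.
backup_with_retries() {
    local max_attempts=${1:-3}   # how often to try before giving up
    local delay=${2:-300}        # seconds to wait between attempts
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Remove garbled subvolumes left behind by an interrupted transfer.
        btrbk clean
        # Send all missing snapshots; an aborted send starts over from scratch.
        if btrbk run; then
            return 0
        fi
        echo "btrbk run failed (attempt $attempt of $max_attempts)" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Calling backup_with_retries 5 60 would try up to five times with a one-minute pause. If a single snapshot takes longer to send than the network typically stays up, retrying will not help.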
However, if I remember correctly, systemd (which I don't use) kills the ssh server (sshd) on network disconnects, so you'll get a garbled subvolume on disconnects here.
Maybe it's distro configuration specific, but I don't think that's the case now. It's certainly not the case on Fedora 34.
Yes, the upload process will be resumed on next run.
I think there may be some misunderstanding in the use of the term "resume" in the above discussion. To me, and I think to @mfr-itr(?), "resume" means to continue the upload of a snapshot where it was interrupted, only uploading the outstanding changes for that specific snapshot that hadn't been uploaded at the time btrbk gave up/failed on it. I think you (@digint) are using "resume" where I would use "restart", since I don't think btrfs send has any mechanism to resume a send. Rather, btrbk has to restart all outstanding btrfs sends from the beginning. As @mfr-itr said in their initial comment, a section on this in the README would help people decide whether the software meets their needs. For anyone who has to contend with a particularly flaky network or with particularly large snapshot sends, "garbled" backups, as you refer to them, may be the norm and they may want to choose a different backup strategy.
Regarding the text that is currently in the README "Resume backups (for removable and mobile devices)", given the above conversation it's not clear to me what "resume" means here, or why it's restricted to removable and mobile devices.
You are right, the term "resume" is a bit misleading here:
- In the btrbk docs, resume refers to the btrbk resume action, meaning "resume missing backups, i.e. re-sending of snapshots (which are never deleted on interrupted transfer)". This is well documented I'd say. It is not necessarily related to interrupted send/receive, but can as well mean "backup disk was not attached when btrbk was run".
- With interrupted transfer over ssh in mind, "resume" kind of implies "resume transfer where it was stopped". As you stated correctly, btrfs send cannot do that, as it's agnostic of the target status (well, it has been discussed in #149: btrbk could buffer the whole send-stream on both sides, and only start btrfs receive once the file is fully transferred).
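The buffering idea mentioned above could look roughly like the following when done by hand outside of btrbk. This is a sketch of the concept only, not a btrbk feature: the function name buffered_transfer, the LOCAL_BUFFER variable, the .btrfs-stream file suffix and all arguments are hypothetical. It sends a full (non-incremental) stream; an incremental one would additionally need a parent snapshot passed to btrfs send.

```shell
#!/bin/bash
# Sketch of "buffer the send-stream, then receive"; NOT part of btrbk.
# All names are hypothetical; paths are assumed to contain no whitespace.
buffered_transfer() {
    local snapshot=$1 host=$2 remote_buffer=$3 remote_dest=$4
    local name stream
    name="$(basename "$snapshot").btrfs-stream"
    stream="${LOCAL_BUFFER:-/var/tmp}/$name"

    # 1. Serialize the snapshot to a local file, unless an earlier attempt
    #    already completed this step. The .part suffix marks unfinished files.
    if [ ! -f "$stream" ]; then
        btrfs send -f "$stream.part" "$snapshot" || return 1
        mv "$stream.part" "$stream" || return 1
    fi

    # 2. Copy the file. rsync keeps a partially copied file (--partial) and
    #    continues it on the next attempt (--append-verify), so this step
    #    can be repeated until it succeeds.
    rsync --partial --append-verify "$stream" "$host:$remote_buffer/" || return 1

    # 3. Only now run btrfs receive, reading the complete stream from the file.
    ssh "$host" btrfs receive -f "$remote_buffer/$name" "$remote_dest" || return 1

    # 4. Remove both copies of the buffered stream.
    ssh "$host" rm -f "$remote_buffer/$name"
    rm -f "$stream"
}
```

The trade-off is disk space: both sides need room for the whole serialized stream in addition to the snapshot itself.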
@jwatt if I get you correctly you basically stumbled upon "Resume backups (for removable and mobile devices)". Any ideas how I could rephrase this? (not native english speaking here, and can't come up with anything better right now, "restart" sounds wrong to me as well).
How about the following phrasing:
"Backups to destinations that are sometimes there and sometimes away (each single snapshot needs to be sent uninterrupted)"
I'm not a native speaker either, but I absolutely thought/hoped that the resuming meant resuming an interrupted send/receive.
The issue was discussed at buttersink and they came up with a possible solution it seems. AmesCornish/buttersink/issues/34
@jwatt if I get you correctly you basically stumbled upon "Resume backups (for removable and mobile devices)". Any ideas how I could rephrase this? (not native english speaking here, and can't come up with anything better right now, "restart" sounds wrong to me as well).
Just to add my two cents, I've always disliked the confusing way btrbk uses the word "resume", and I think misunderstandings like this will continue until the terminology is changed.
Even if the command can't be renamed for compatibility reasons, I'd suggest using a more explicit term in documentation. Examples that spring to my mind: "sync", "update", "push", "send". For "send" it seems obvious to me that snapshots already on the receiving side won't be included, and it would drastically reduce the potential for confusion while taking advantage of the matching btrfs terminology.