A more flexible extension of --stdin to support generalised application-specific helpers?
I'd like to use a helper script to create atomic point-in-time snapshot backups of virtual machines whose block devices are backed by Ceph RBD images. The existing backup-vm script can't easily do this, since it backs up bind-mounted real files or block devices using --read-special. It might be possible to adapt backup-vm to this situation, but it already works in a somewhat hacky way (bind mounts feel heavy-handed as a means of backing up block devices), and I wondered whether another way of invoking borg create might be better.
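For context, this is roughly what the current approach looks like (a minimal sketch; the pool/image names, mount point and archive name are made up):

```bash
# Sketch of the existing bind-mount + --read-special pattern; names are
# illustrative only.
rbd snap create rbd/vm-disk@borgbackup        # atomic point-in-time snapshot
dev=$(rbd map rbd/vm-disk@borgbackup)         # maps read-only, e.g. /dev/rbd0
mkdir -p /mnt/vmbackup
touch /mnt/vmbackup/disk0.raw
mount --bind "$dev" /mnt/vmbackup/disk0.raw   # the heavy-handed bind mount
borg create --read-special ::vm-{now} /mnt/vmbackup
umount /mnt/vmbackup/disk0.raw
rbd unmap "$dev"
rbd snap rm rbd/vm-disk@borgbackup
```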
This seems to be a more-or-less general problem shared by various application-aware helpers, e.g. database backup tools, where the data may be generated on the fly or is otherwise not readily presentable in either of the two forms that borg currently accepts: a single 'file' fed to stdin (with --stdin and --stdin-name), or a directory containing special files and/or plain files (with --read-special).
A backup may need to include one or more block devices (and other data such as NVRAM storage), plus VM metadata, and optionally the contents of RAM and other related state.
It seems better to be able to store a single archive containing multiple "virtual" files, perhaps together with some metadata stored as additional files (e.g. the name and version of the external tool used, the timestamp of the atomic snapshot, and perhaps directions for the user on how to restore the data).
A couple of possible solutions:
- The helper passes file metadata to borg (including a file descriptor from which the content can be read) for each "virtual file" in the archive.
- A single data stream passed to borg via stdin, e.g. an archive (tar or similar) created on the fly, from which borg reads the data it needs. It might be tricky for borg to skip data it isn't interested in (e.g. because a particular file's modification time hasn't changed).
For the first option, the metadata could perhaps be JSON passed to borg (on the command line or via stdin), also containing the number of an open fd from which borg can read the data: the helper invokes borg, which inherits the necessary file descriptors, analogous to the gpg utility (e.g. gpg --passphrase-fd 8). Alternative mechanisms would also be possible.
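To make that concrete, a helper invocation might look something like this. This is purely a sketch for discussion: the --items-from-fd flag and the JSON shape are invented here, in the spirit of gpg --passphrase-fd; nothing like this exists in borg today.

```bash
# Hypothetical interface: borg would read item descriptions as JSON from one
# inherited fd (here fd 8) and each item's content from another (here fd 9).
cat > items.json <<'EOF'
[{"path": "disk0.raw", "mtime": "2021-02-01T03:00:00Z", "content_fd": 9}]
EOF
borg create --items-from-fd 8 ::vm-{now} \
    8< items.json \
    9< <(rbd export rbd/vm-disk@borgbackup -)
```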
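For the second option, something close is already possible with the existing stdin mode, at the cost of borg seeing only one opaque file (so it cannot skip unchanged members inside the stream):

```bash
# Stream an on-the-fly tar of a snapshot mount into borg's stdin mode; the
# whole stream is stored as a single file named by --stdin-name.
tar -C /mnt/vm-snapshot -cf - . | borg create --stdin-name vm-backup.tar ::vm-{now} -
```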
In some cases it might be possible to use FUSE to expose the data to be backed up as a "virtual filesystem", but that also feels a bit hacky, may not be possible for all data sources, and wouldn't allow passing advanced metadata to borg (e.g. byte ranges that have changed since the last backup operation, if the source is able to track these). It would also make the helper applications quite difficult to write.
Any thoughts?
I agree that it would be nice to have a "standard" for doing this type of thing, but I'm not convinced that this functionality needs to be in the core of borg. I have a wrapper script that backs up my KVM VMs; it dumps the VM config followed by snapshots of the disk images, and I name the archives so that you know all of the pieces go together. You could take it further by generating an "index" archive containing a text file that lists the names of the other archives in the set, along with instructions (which could be a script, I suppose) on how to put it all back together.
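Roughly, the pattern looks like this (a hypothetical sketch using libvirt's virsh and an LVM snapshot; the details of the actual script differ):

```bash
# A shared timestamp in the archive names ties the config and disk archives
# together as one logical backup set.
stamp=$(date +%Y-%m-%dT%H:%M:%S)
virsh dumpxml myvm | borg create --stdin-name myvm.xml ::myvm-config-$stamp -
borg create --read-special ::myvm-disk0-$stamp /dev/vg0/myvm-disk0-snap
```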
Just as an update, there has been progress related to this in the master branch:
- recursion and processing were separated
  - instead of the built-in fs recursion, it is now also possible to feed filenames via stdin, so it is e.g. now possible to pipe find output into borg and have borg back up exactly those fs files, #5492 (see the sketch after this list)
  - TODO: in a similar, but different (more flexible) way, file object descriptions could be fed into borg (e.g. JSON, fs filenames, FDs, ...)
- fixed-size chunker
  - better for disks than the variable-size chunker: lower CPU needs, better fit (DONE)
- sparsemap / filemap support, WIP #5561
  - unclear: if we do not have a seekable (fs) file, we can not use SEEK_HOLE/SEEK_DATA to skip holes / all-zero ranges
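Putting the first two items together, a helper can already snapshot its data, emit the resulting device or file paths, and have borg read exactly those with the fixed-size chunker. A sketch (flag and chunker-parameter names as merged on master via #5492 and the fixed-chunker work; paths are made up):

```bash
# Feed an explicit file list to borg instead of using its built-in recursion,
# and chunk with fixed-size blocks (4 MiB here), which suits disk images.
find /srv/vm-images -name '*.raw' | \
    borg create --paths-from-stdin \
                --chunker-params fixed,4194304 \
                ::images-{now}
```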
So, what is a good way to handle:
- VM disks (if they are not raw disk files)
- data streams