Feature Request: Differential Extraction
Currently, extraction of large archives can be quite slow, since in the worst case the entire archive must be fetched from the backup machine. For large-ish archives (>100 GB) and especially on slow networks, this can be very time consuming.
In most cases, I already possess a directory tree with most of the data I wish to "recover". However, it may not be clear to me what data is missing or corrupted. (In my case, I am backing up virtual machine disks.) In this case, only downloading the missing or corrupted data could reduce the required time and bandwidth by orders of magnitude.
I'm currently experimenting with borg mount + rsync, but for slow storage devices the extra disk read overhead of checksumming the entire archive more than cancels out the gain of transmitting only the differential. (I intend to test on a backup to an SSD and see if that helps, although I don't think I could keep backing up to an SSD for long, since it's only just large enough to fit the data now.)
However, since borg already records the rolling checksum of the data in the archive (which is also potentially cached on the client side), this extraction could be done with minimal additional I/O overhead, which would be awesome.
Related: #963
Entire archive: yes, but that only applies to the metadata stream, not the files' content data. To extract only part of an archive, you can use:
borg extract repo::archive what/i/want
borg mount + rsync may only help if you use the (default?) quick rsync mode that relies on mtime; then it won't read all the content data of the files.
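For example, roughly (mountpoint and paths are just placeholders; rsync's default quick check compares size and mtime only):
borg mount repo::archive /mnt/borg
rsync -a /mnt/borg/path/in/archive/ /existing/tree/
borg umount /mnt/borg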
borg does not record the rolling checksum (buzhash), but it has the sha256 of each file content chunk.
Maybe check this issue tracker to see whether we already have a feature request about "updating" / "syncing" an existing file tree to an archived file tree.
A while ago I worked on extract --continue (#1665) which would be close, but doesn't exactly do this (it only completes existing files and otherwise doesn't touch them, so corrupted files wouldn't be fixed). I guess if someone bothered to fix the issue mentioned in the PR it would be relatively easy to add this; it would just be chunking existing files and replacing those chunks in the file that don't match the chunks in the archive. Like so:
```python
item.chunks.reverse()  # we'll do destructive iteration 'cause it's easy
file = open(item.path, updating)
for chunk in chunkify(file):
    # chunk the existing file and find differing chunks
    correct_chunk = item.chunks.pop()
    if chunk.id != correct_chunk.id:
        file.seek(-chunk.size, relative)
        file.write(get(correct_chunk))
        # probably need to reset chunker here
while item.chunks:
    # file is too short, complete it
    chunk = item.chunks.pop()
    file.write(get(chunk))
```
This is more efficient than rsync could be, because it only needs to read the modified data from the repository, not all file data, to compute and apply the delta.
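A slightly more concrete sketch of that idea, as a variant that sidesteps the chunker-reset problem by walking the archive's chunk list (which fixes the offsets) instead of re-chunking the local file. `archive_chunks`, `fetch_chunk` and the plain sha256 id are illustrative stand-ins, not borg's actual internals:
```python
import hashlib

def chunk_id(data):
    # stand-in for borg's chunk id; the real id is a keyed hash, so this
    # plain sha256 is illustrative only
    return hashlib.sha256(data).digest()

def delta_update(path, archive_chunks, fetch_chunk):
    """Overwrite only the byte ranges of `path` that differ from the archive.

    archive_chunks: list of (id, size) pairs as recorded for the archived file
    fetch_chunk:    hypothetical callable returning chunk data for a given id
    """
    pos = 0
    with open(path, 'r+b') as f:
        for cid, size in archive_chunks:
            local = f.read(size)
            if len(local) != size or chunk_id(local) != cid:
                f.seek(pos)
                f.write(fetch_chunk(cid))  # fetch only the differing chunk
            pos += size
            f.seek(pos)
        f.truncate(pos)  # drop trailing data if the local file is longer
```
Like the pseudocode above, only differing chunks are fetched from the repository; the local file still has to be read once in full, which is unavoidable for verification.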
@ThomasWaldmann
> Entire archive: yes, but that only applies to the metadata stream, not the files' content data. To extract only part of an archive, you can use:
> borg extract repo::archive what/i/want
But this only works when I know what I want to restore?
> borg mount + rsync may only help if you use the (default?) quick rsync mode that relies on mtime; then it won't read all the content data of the files.
Ah, I ended up passing --ignore-times, since the first time around I had passed -u (out of habit) and nothing happened. However, these are virtual machine disks, so they're large and most are modified (although the modifications may be extremely small). I did another run with an SSD and the default settings, and it was indeed faster, but mostly because the untouched VMs were not restored. Rsync still needed to read the modified disk images in their entirety.
> borg does not record the rolling checksum (buzhash), but it has the sha256 of each file content chunk.
Ah sorry. However, I suppose that particular detail doesn't affect the potential performance gains.
> Maybe check this issue tracker to see whether we already have a feature request about "updating" / "syncing" an existing file tree to an archived file tree.
I already read all open issues with the keyword "extract". I didn't see any duplicate, other than the related feature: #963.
@enkore
Wouldn't this be able to do everything borg extract --continue does as well?
yes
Some notes:
- borg does not explicitly archive directory contents (like having a directory item that includes all fs object names in that directory); this information is only implicitly stored in the paths of all the archived fs items.
- thus, even the simple "delete superfluous stuff" case would need additional bookkeeping code.
- we would also need to switch from "just extract it" to a "compare to what we have" mode.
- we could maybe use the files cache to quickly skip unchanged files.
- for changed files, a first implementation could just kill the local file and do a normal extraction.
- later, this could be optimized to use the chunker on the local file and not fetch chunks from the repository that are already present in that file (a rough sketch of such a sync mode follows below).
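A rough sketch of what such a "sync to archive" mode could look like at the file level; `archive_items`, its attributes and `extract_item` are hypothetical stand-ins, not borg's real API:
```python
import os

def sync_tree(archive_items, dest, extract_item):
    """First cut: skip apparently unchanged files, re-extract the rest.

    archive_items: iterable of objects with .path, .size and .mtime (ns),
                   as a files-cache-like quick check would provide
    extract_item:  hypothetical callable extracting one item below dest
    """
    seen = set()
    for item in archive_items:
        target = os.path.join(dest, item.path)
        seen.add(os.path.normpath(target))
        try:
            st = os.lstat(target)
            unchanged = (st.st_size == item.size and
                         st.st_mtime_ns == item.mtime)
        except FileNotFoundError:
            unchanged = False
        if not unchanged:
            # first implementation: kill the local file and extract normally;
            # later this could chunk the local file and fetch only missing chunks
            if os.path.lexists(target):
                os.remove(target)
            extract_item(item, dest)
    # deleting superfluous local files would need extra bookkeeping:
    # walk dest and remove anything whose path is not in `seen`
```
The size/mtime check is only a heuristic, much like rsync's quick mode; files it wrongly skips would need the chunk-level comparison sketched earlier.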
What's the state of this? I got here from #2872
There is only what you see here, so it is still open.
I would like to mention a few additional use cases. Besides (virtual) disk images, this feature would be useful whenever the backed-up data is not readable with a simple pager but you only need to restore part of it. For instance:
- Restoring a database table when backing up the database storage location instead of SQL dumps
- Seafile stores data in chunks (git-like), so restoring a single file requires the entire data folder.
- Managing backups of large binary/proprietary save files.