Feature Request: Differential Extraction
Currently, extraction of large archives can be quite slow, since in the worst case the entire archive must be fetched from the backup machine. For large-ish archives (>100 GB) and especially on slow networks, this can be very time consuming.
In most cases, I already possess a directory tree with most of the data I wish to "recover". However, it may not be clear to me what data is missing or corrupted. (In my case, I am backing up virtual machine disks.) In this case, only downloading the missing or corrupted data could reduce the required time and bandwidth by orders of magnitude.
I'm currently experimenting with borg mount + rsync, but for slow storage devices the extra disk read overhead of checksumming the entire archive more than cancels out the gain of transmitting only the differential. (I intend to test on a backup to an SSD and see if that helps, although I don't think I could keep backing up to an SSD for long, since it's only just large enough to fit the data now.)
However, since borg already records the rolling checksum of the data in the archive (which is also potentially cached on the client side), this extraction could be done with minimal additional I/O overhead, which would be awesome.
Related: #963
Entire archive: yes, but that only applies to the metadata stream, not the files' content data. To extract only part of an archive, you can use:
borg extract repo::archive what/i/want
borg mount + rsync may only help if you use the (default?) quick rsync mode that relies on mtime; then it won't read all the content data of the files.
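For example, roughly (mountpoint and paths are just placeholders; rsync's default quick check compares size and mtime only):
borg mount repo::archive /mnt/borg
rsync -a /mnt/borg/path/in/archive/ /existing/tree/
borg umount /mnt/borg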
borg does not record the rolling checksum (buzhash), but it has the sha256 of each file content chunk.
Maybe check this issue tracker to see whether we already have a feature request about "updating" / "syncing" an existing file tree to an archived file tree.
A while ago I worked on extract --continue (#1665) which would be close, but doesn't exactly do this (it only completes existing files and otherwise doesn't touch them, so corrupted files wouldn't be fixed). I guess if someone bothered to fix the issue mentioned in the PR it would be relatively easy to add this; it would just be chunking existing files and replacing those chunks in the file that don't match the chunks in the archive. Like so:
```python
item.chunks.reverse()  # we'll do destructive iteration 'cause it's easy
file = open(item.path, updating)
for chunk in chunkify(file):
    # chunk the existing file and find differing chunks
    correct_chunk = item.chunks.pop()
    if chunk.id != correct_chunk.id:
        file.seek(-chunk.size, relative)
        file.write(get(correct_chunk))
        # probably need to reset chunker here
while item.chunks:
    # file is too short, complete it
    chunk = item.chunks.pop()
    file.write(get(chunk))
```
This is more efficient than rsync could be, because it only needs to read the modified data from the repository, not all file data, to compute and apply the delta.
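A slightly more concrete sketch of that idea, as a variant that sidesteps the chunker-reset problem by walking the archive's chunk list (which fixes the offsets) instead of re-chunking the local file. `archive_chunks`, `fetch_chunk` and the plain sha256 id are illustrative stand-ins, not borg's actual internals:
```python
import hashlib

def chunk_id(data):
    # stand-in for borg's chunk id; the real id is a keyed hash, so this
    # plain sha256 is illustrative only
    return hashlib.sha256(data).digest()

def delta_update(path, archive_chunks, fetch_chunk):
    """Overwrite only the byte ranges of `path` that differ from the archive.

    archive_chunks: list of (id, size) pairs as recorded for the archived file
    fetch_chunk:    hypothetical callable returning chunk data for a given id
    """
    pos = 0
    with open(path, 'r+b') as f:
        for cid, size in archive_chunks:
            local = f.read(size)
            if len(local) != size or chunk_id(local) != cid:
                f.seek(pos)
                f.write(fetch_chunk(cid))  # fetch only the differing chunk
            pos += size
            f.seek(pos)
        f.truncate(pos)  # drop trailing data if the local file is longer
```
Like the pseudocode above, only differing chunks are fetched from the repository; the local file still has to be read once in full, which is unavoidable for verification.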
@ThomasWaldmann
> Entire archive: yes, but that only applies to the metadata stream, not the files' content data. To extract only part of an archive, you can use:
> borg extract repo::archive what/i/want
But this only works when I know what I want to restore?
> borg mount + rsync may only help if you use the (default?) quick rsync mode that relies on mtime; then it won't read all the content data of the files.
Ah, I ended up passing --ignore-times, since the first time around I had passed -u (out of habit) and nothing happened. However, these are virtual machine disks, so they're large and most are modified (although the modifications may be extremely small). I did another run with an SSD and the default settings, and it was indeed faster, but mostly because the untouched VMs were not restored. Rsync still needed to read the modified disk images in their entirety.
> borg does not record the rolling checksum (buzhash), but it has the sha256 of each file content chunk.
Ah sorry. However, I suppose that particular detail doesn't affect the potential performance gains.
> Maybe check this issue tracker to see whether we already have a feature request about "updating" / "syncing" an existing file tree to an archived file tree.
I already read all open issues with the keyword "extract". I didn't see any duplicate, other than the related feature: #963.
@enkore
Wouldn't this be able to do everything borg extract --continue does as well?
yes
Some notes:
- borg does not explicitly archive directory contents (like having a directory item that includes all fs object names in that directory); this information is only implicitly stored in the paths of all the archived fs items.
- thus, even the simple "delete superfluous stuff" case would need additional bookkeeping code.
- we would also need to switch from "just extract it" to a "compare to what we have" mode.
- we could maybe use the files cache to quickly skip unchanged files.
- for changed files, a first implementation could just kill the local file and do a normal extraction.
- later, this could be optimized to use the chunker on the local file and not fetch chunks from the repository that are already present in that file (a rough sketch of such a sync mode follows below).
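A rough sketch of what such a "sync to archive" mode could look like at the file level; `archive_items`, its attributes and `extract_item` are hypothetical stand-ins, not borg's real API:
```python
import os

def sync_tree(archive_items, dest, extract_item):
    """First cut: skip apparently unchanged files, re-extract the rest.

    archive_items: iterable of objects with .path, .size and .mtime (ns),
                   as a files-cache-like quick check would provide
    extract_item:  hypothetical callable extracting one item below dest
    """
    seen = set()
    for item in archive_items:
        target = os.path.join(dest, item.path)
        seen.add(os.path.normpath(target))
        try:
            st = os.lstat(target)
            unchanged = (st.st_size == item.size and
                         st.st_mtime_ns == item.mtime)
        except FileNotFoundError:
            unchanged = False
        if not unchanged:
            # first implementation: kill the local file and extract normally;
            # later this could chunk the local file and fetch only missing chunks
            if os.path.lexists(target):
                os.remove(target)
            extract_item(item, dest)
    # deleting superfluous local files would need extra bookkeeping:
    # walk dest and remove anything whose path is not in `seen`
```
The size/mtime check is only a heuristic, much like rsync's quick mode; files it wrongly skips would need the chunk-level comparison sketched earlier.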
What's the state of this? I got here from #2872
There is only what you see here, so it is still open.
I would like to mention a few additional use cases. Besides (virtual) disk images, this feature would be useful whenever the backed-up data is not readable with a simple pager but you only need to restore part of it. For instance:
- Restoring a database table when backing up the database storage location instead of SQL dumps
- Seafile stores data in chunks (git-like), so restoring a single file requires the entire data folder.
- Managing backups of large binary/proprietary save files.