Parse tar data backed up via stdin

Open Kidswiss opened this issue 5 years ago • 27 comments

Output of restic version

restic 0.9.4 compiled with go1.11.4 on darwin/amd64

What should restic do differently? Which functionality do you think we should add?

If someone streams tar data to restic to do a backup:

tar -cf - -C /veryimportantfolder . | restic backup --stdin

The whole thing will be saved as a single file. This makes restoring a single file very tedious, because the whole tar has to be restored first, and it gets more painful the larger the tar file gets.

If restic would parse the tar file and "convert" the entries into restic native file trees, it would be possible to create a virtual folder snapshot. This way a tar file is backed up, but single file restore is still available.

What are you trying to do?

We use restic quite heavily in Kubernetes and OpenShift workloads where it's not always possible to give direct filesystem access to restic. So we stream quite a lot of tar files between containers to get the backups. This creates the problem described above.

This feature would complement #2123.

What do you think? Would something like this make sense?

Did restic help you or make you happy in any way?

Restic rocks :)

Kidswiss avatar Mar 29 '19 14:03 Kidswiss

Duplicate of #437.

cdhowie avatar Mar 29 '19 22:03 cdhowie

Ah, I actually like the idea. We even have an abstraction layer now (fs.FS) which could be used to implement a tar file system maybe.

fd0 avatar Apr 27 '19 20:04 fd0

This would also help in situations where firewall rules forbid connections from the system to be backed up to the backup storage, but not in the reverse direction. We could have a simple script on the system to be backed up that is invoked via ssh and writes a tar to stdout, which bypasses the need to make the whole system available through sshfs.

eikevons avatar May 01 '19 19:05 eikevons

@fd0 does that mean Restic plans to support streaming tar data to stdin?

alallier avatar Jun 18 '19 00:06 alallier

Something I do often on machines where I don't want to install software or credentials is ssh machine.home.arpa tar cv ~. It would be awesome to be able to pipe that into restic and have it understand it as a filesystem.

FiloSottile avatar Apr 26 '20 17:04 FiloSottile

This would also be great for backing up volumes from within docker which also uses tar under the hood, for example:

docker cp running-or-stopped-container:/path/to/volume - | restic backup --stdin

jinnko avatar Nov 08 '20 15:11 jinnko

I think it is also important that the tar data from stdin is not held completely in RAM, because a huge backup would not fit. This would allow streaming data from a remote source into a backup without storing the "source" on the local file system.
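To illustrate the memory concern (a sketch only, not how restic works internally), Python's `tarfile` can process a stream member by member and read each member in bounded blocks, so memory use stays roughly constant regardless of archive size. The names `digest_members`, `build_tar` and `BLOCK_SIZE` are assumptions made for this example.

```python
import hashlib
import io
import tarfile

BLOCK_SIZE = 64 * 1024  # arbitrary block size chosen for this sketch


def digest_members(stream, block_size=BLOCK_SIZE):
    """Yield (name, sha256 hex digest) for each regular file in a tar stream."""
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            reader = archive.extractfile(member)
            digest = hashlib.sha256()
            # Read the member in bounded blocks; at most block_size bytes are requested at a time.
            while True:
                block = reader.read(block_size)
                if not block:
                    break
                digest.update(block)
            yield member.name, digest.hexdigest()


def build_tar(files):
    """Build an in-memory tar from {name: bytes}; only used to make the sketch runnable."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, digest in digest_members(build_tar({"big.bin": b"a" * 200_000})):
        print(name, digest)
```

In stream mode each member has to be consumed before moving on to the next one, which matches the one-pass nature of a pipe.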

Legion2 avatar Dec 31 '20 15:12 Legion2

Note that this may not be a good solution for securely backing up remote systems. On a LAN it might work, but restic has no way to communicate to the sending side that it can skip a file based on the contents of the parent snapshot. The sender has to send every single byte regardless of what is already in the repository, and restic has to receive all of that data even if it is just going to discard it because the file didn't change. This could be incredibly slow over a WAN connection, and it also requires the sender to read all of the data from disk, which might be very slow.

This feature could be useful in some niche cases, but I would argue that it should not be used across the board for secure remote backups as it would be horribly inefficient. A different solution would be required to implement this efficiently.
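A small sketch of the inefficiency described above, under the simplifying assumption that change detection is a plain size/mtime comparison against a parent snapshot (the real logic may differ). The receiver can decide that a file is unchanged, but only after the sender has already pushed all of its bytes through the pipe. `plan_backup` and `build_tar` are hypothetical names.

```python
import io
import tarfile


def plan_backup(stream, parent):
    """Compare tar entries with a parent snapshot given as {path: (size, mtime)}.

    Returns (unchanged_paths, changed_paths, bytes_received).
    """
    unchanged, changed, bytes_received = [], [], 0
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # The sender transmits these bytes whether or not they turn out to be needed.
            bytes_received += member.size
            if parent.get(member.name) == (member.size, member.mtime):
                unchanged.append(member.name)
            else:
                changed.append(member.name)
    return unchanged, changed, bytes_received


def build_tar(files):
    """Build an in-memory tar from {name: (bytes, mtime)}; demo helper only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, (data, mtime) in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    stream = build_tar({"a": (b"abc", 1000), "b": (b"hello", 2000)})
    print(plan_backup(stream, {"a": (3, 1000), "b": (5, 1500)}))
```

Even though only one of the two files changed here, the byte count covers both, which is the cost being pointed out for WAN links.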

cdhowie avatar Jan 01 '21 15:01 cdhowie

This would also be nice for things like postgres tar dumps, eg something like

pg_dump --format=tar | restic backup --stdin

jcotton42 avatar May 18 '22 01:05 jcotton42

This would also be nice for backing up Proxmox VMs / LXCs:

vzdump 103 --mode snapshot --stdout | restic backup --stdin

jniggemann avatar Jun 21 '22 16:06 jniggemann

:+1: on stdin backups for database dumps. It's a great way to make a clean DB backup that doesn't disturb the app since it only holds a read lock.

@Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and only read the parts it needs. Tar doesn't have an index.

wmertens avatar Sep 27 '22 13:09 wmertens

@Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and only read the parts it needs. Tar doesn't have an index.

But then one needs to store the zip locally in order to have it mounted. (If one is dumping multi-TiB data sources, with tar one only needs the patience to stream it, meanwhile with zip one also needs to store it temporarily.)

The main use-case for streaming a tar, but backing it up via restic as if it were a proper file-system, is (as @FiloSottile has mentioned) being able to ssh into an untrusted server, create a full tar of the target file-system, stream it over ssh to a trusted staging server (but one that perhaps doesn't have the storage capacity to temporarily store the tar), and feed it to restic.

cipriancraciun avatar Sep 27 '22 20:09 cipriancraciun

But then one needs to store the zip locally in order to have it mounted

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

I understand the streaming use case, it's just that it seems a bit specific. Tar isn't the nicest format, and it won't support vzdump either because that's not tar. OTOH tar is really popular so if restic were to support something like that, tar seems a good candidate.

wmertens avatar Sep 27 '22 21:09 wmertens

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

Given how restic chunks the data, backing-up a large proper file-system, or a single zip with all the contents, wouldn't yield the same boundaries at least for the first and last chunk of each file.

Thus, if the zip creation is not deterministic, or if lots of small files keep changing, then the "single zip" route would just create lots of changed chunks, when in fact not that much has changed.

cipriancraciun avatar Sep 27 '22 23:09 cipriancraciun

Given how restic chunks the data, backing-up a large proper file-system, or a single zip with all the contents, wouldn't yield the same boundaries at least for the first and last chunk of each file.

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/

restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.
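For readers unfamiliar with the idea, here is a toy sketch of content-defined chunking. It is not restic's chunker: it recomputes a CRC over a small window instead of using a true rolling hash, it uses tiny toy parameters, and it omits the minimum and maximum chunk sizes a real chunker would enforce. The names `split_chunks`, `chunk_ids`, `WINDOW` and `MASK` are made up for the example.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes of context that decide a boundary (toy value)
MASK = 0x3FF  # roughly one boundary per KiB on random data (toy value)


def split_chunks(data):
    """Cut data wherever the checksum of the last WINDOW bytes has its low bits all zero."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        # A real rolling hash updates incrementally; recomputing keeps the sketch short.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


def chunk_ids(data):
    """Return the set of SHA-256 digests of all chunks, standing in for stored blob IDs."""
    return {hashlib.sha256(chunk).hexdigest() for chunk in split_chunks(data)}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(50_000))
shifted = b"inserted header" + original  # everything after this moves by 15 bytes

ids_original = chunk_ids(original)
ids_shifted = chunk_ids(shifted)
print(len(ids_original), "chunks,", len(ids_original & ids_shifted), "reused after the shift")
```

Because a boundary depends only on the bytes just before it, inserting data near the start disturbs the first chunk and leaves the later ones identical, which is why a fixed offset shift does not defeat deduplication.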

wmertens avatar Sep 28 '22 11:09 wmertens

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/ restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.

First of all, there is the issue of deterministic zip creation. If there are lots of small files, and their order changes non-deterministically, then deduplication certainly would not work properly, unless the chunk size is well below the average file size. (In the case of restic, the documentation states that it aims at a 1 MiB chunk size, thus well above the average small file size.)

Then there is the issue of the zip format itself. It seems that each file's data is prefixed by a file header which contains the modification time. Thus if something touches a file (without changing the contents), then that chunk will be seen as changed, and thus not deduplicated. If restic operates on a proper file-system, the data is not stored again; only a new file entry is created.
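The header point can be checked with a short sketch using Python's standard `zipfile` module: two uncompressed archives holding the same payload but different timestamps come out with different bytes. `stored_zip` is a helper name invented for this example.

```python
import io
import zipfile


def stored_zip(name, data, date_time):
    """Return the bytes of an uncompressed zip holding one file with the given timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        info = zipfile.ZipInfo(name, date_time=date_time)
        archive.writestr(info, data)
    return buffer.getvalue()


payload = b"same contents\n"
before = stored_zip("notes.txt", payload, (2022, 9, 29, 11, 0, 0))
after = stored_zip("notes.txt", payload, (2022, 9, 29, 11, 5, 0))

# Identical payload, different archive bytes: only the timestamp fields changed.
differing = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print("same length:", len(before) == len(after))
print("same bytes:", before == after)
print("differing byte offsets:", differing)
```

Any chunk that happens to contain one of those header bytes gets a new hash, even though the file contents are untouched.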

Also, given that restic aims at a chunk size of 1 MiB, changing a file of 1 KiB would imply storing a whole new chunk (from the zip stream), thus a 99.9% waste. On the other hand, if restic operates on a proper file-system, it would just store that 1 KiB and move on.

cipriancraciun avatar Sep 29 '22 11:09 cipriancraciun

@cipriancraciun very good points and they also hold for tar.

You make a good case indeed for restic supporting GNU tar input as a virtual filesystem :+1:

IMHO it would have to be as a separate flag though. If it were to parse any tar file as a subdirectory, there's no guarantee that it can generate the exact same tar file, and if the file were corrupted it would have to abort the backup. I suppose it could retry a failed tar as a regular file when reading from disk, but not when reading from stdin.

wmertens avatar Sep 29 '22 12:09 wmertens

If it were to parse any tar file as a subdirectory, there's no guarantee that it can generate the exact same tar file, and if the file were corrupted it would have to abort the backup.

This is exactly what this ticket proposes: to use tar over stdin as an alternative to walking the file-system (in essence a tar contains all the meta-data restic would obtain from the proper file-system). Thus after restic consumes the tar and creates the snapshot, there would be no more mention of the initial tar, and the newly created snapshot would be identical to a similar snapshot created by using the proper file-system.

cipriancraciun avatar Oct 04 '22 12:10 cipriancraciun

That would essentially mean implementing borg import-tar for restic.

MichaelEischer avatar Oct 04 '22 18:10 MichaelEischer

My idea for using this feature: Google takeout works by giving you access to a series of very large .tgz files that you have to download. You can (and should) download those to your local computer/NAS, but if you want to do an offsite backup, you're going to be pushing a tonne of data over your slow home internet upload speeds.

Instead, you can temporarily create a very small/cheap VPS in the cloud somewhere near the storagebox for your offsite backups and do something like:

curl https://path/to/takeout.tgz | gzip -d | restic backup --from-tar - ...

and none of your data needs to be stored on disk.

allisonkarlitskaya avatar Sep 18 '23 08:09 allisonkarlitskaya