bupstash

Comparison to other deduping backups?

Open Macavirus opened this issue 4 years ago • 13 comments

Specifically the following seem to occupy the same space:

  • borg
  • restic
  • duplicacy

Curious how your implementation seeks to improve these!

Macavirus avatar Nov 20 '20 06:11 Macavirus

Hi there,

I think bupstash is quite similar to borg and restic, though has some advantages:

  • bupstash has offline decryption keys unlike borg or restic as far as I am aware. Restic forced me to store a password on my machine making backups, which was unacceptable to me.
  • bupstash has encrypted metadata search unlike borg, though it seems restic might have this.
  • bupstash is 2x-6x faster than both borg and restic for uploads and restores. I will post benchmarks in the future.
  • bupstash uses around 10x less ram than either borg or restic in some quick tests.
  • In some tests I did, bupstash created 2x smaller repositories than borg or restic. (This is probably because small files are packed into single chunks that compress better)
  • 'bupstash gc' can be up to 100x faster than the restic equivalent; this is due to some fundamental design decisions.
  • I think bupstash offers better privacy due to hmac addresses and packing multiple small files into a single chunk, though I need to verify this is unique to bupstash.
  • I found bupstash works especially well over high latency network connections compared to restic.
  • I tried to make access controls easier to set up and understand than borg, and it seems restic does not have them at all.

(For most of the tests so far I was simply snapshotting a linux kernel source tree over ssh.)
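To illustrate the privacy point about hmac addresses, here is a minimal sketch of keyed chunk addressing. It assumes HMAC-SHA256 and a random 32-byte key purely for illustration; the keyed hash, key handling, and function names that bupstash actually uses may differ.

```python
import hashlib
import hmac
import os


def plain_address(chunk: bytes) -> str:
    """Unkeyed content address: anyone can compute it from the plaintext."""
    return hashlib.sha256(chunk).hexdigest()


def keyed_address(key: bytes, chunk: bytes) -> str:
    """Keyed content address (HMAC-SHA256 is an assumption for this sketch)."""
    return hmac.new(key, chunk, hashlib.sha256).hexdigest()


# Hypothetical keys and data, for illustration only.
key_a = os.urandom(32)
key_b = os.urandom(32)
chunk = b"contents of a well-known file"

# The same key and data always give the same address, so dedup still works.
same = keyed_address(key_a, chunk) == keyed_address(key_a, chunk)
# A different key gives an unrelated address for the same data.
differs = keyed_address(key_a, chunk) != keyed_address(key_b, chunk)
```

With unkeyed addresses, someone who can list the repository can hash a known file and check whether it is stored there. With a keyed address that check needs the key.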

bupstash also has some downsides:

  • bupstash does not have directory mounting like these tools, which is one reason bupstash is able to outperform them. This is because of our different internal data structures. A browse command is planned as a replacement.
  • bupstash is required on both sides of the network, unlike restic.
  • bupstash is alpha software so might take a while to stabilize.

One thing that is entirely up to the user to judge: I hoped to make bupstash easier to use, which may or may not be the case.

I haven't compared bupstash to duplicacy.

andrewchambers avatar Nov 20 '20 06:11 andrewchambers

Just wanted to chime in here to give a quick anecdote when comparing with kopia:

I am backing up a bunch of personal and work code projects along with their many build artifacts, plus a lot of tiny configuration files. I've set my kopia backups to use zstd compression and they do very well there (even if the creators said they use a speed-focused zstd implementation): I get initial repository size of ~2.58GB with kopia vs ~4.67GB with bupstash.

That normally wouldn't be a problem at all. But currently I am doing redundant backups to several cloud storage providers on the free tier, where space varies between 5GB and 10GB, and I'd love it if my backups were as small as humanly possible. Would you consider adding configurable levels of zstd compression to bupstash?

Very grateful for your work! bupstash is a breath of fresh air in a group of tools some of which aren't that easy and quick to start with. So far it's my favourite and I have tried most of the popular ones (except for borg) but when repository size is concerned, kopia is the clear winner for now. I do want to use bupstash; its tagging capability is amazing and I love it!


Side note: I am backing up things from a very fast NVMe SSD yet bupstash shows ~50MB/s bandwidth. Maybe some parallelization will help in achieving I/O saturation?


EDIT: In the interest of 100% fairness, I just discovered that kopia has a setting that ignores cache directories. Still not sure what that means and I'll read up on it but it might be a factor in the smaller final backup size. It also says there are currently no ignore rules so right now I am not sure how to interpret the output of kopia policy show <path_to_repo>.

dimitarvp avatar Jan 17 '21 14:01 dimitarvp

I get initial repository size of ~2.58GB with kopia vs ~4.67GB with bupstash

Thanks, it sounds like something worth investigating.

Would you consider adding configurable levels of zstd compression to bupstash?

Yeah, I am considering a --compression= option.

Very grateful for your work! bupstash is a breath of fresh air in a group of tools some of which aren't that easy and quick to start with.

Thanks, I appreciate this. One of my goals was just being easy to use, with not too many confusing options like some other tools I have tried.

andrewchambers avatar Jan 17 '21 20:01 andrewchambers

Just wanted to add duplicity to the comparison list.

gunar avatar Nov 15 '21 15:11 gunar

EDIT: In the interest of 100% fairness, I just discovered that kopia has a setting that ignores cache directories. Still not sure what that means and I'll read up on it but it might be a factor in the smaller final backup size. It also says there are currently no ignore rules so right now I am not sure how to interpret the output of kopia policy show <path_to_repo>.

That's where feature request https://github.com/andrewchambers/bupstash/issues/274 would come in handy, since it would make it possible to drastically reduce backup sizes by filtering junk/cache directories based on regexes.
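As a rough illustration of what regex-based directory filtering could look like, here is a small sketch that walks a tree and prunes matching directories before a backup tool would read them. The patterns and function names are hypothetical and are not taken from bupstash or from the feature request.

```python
import os
import re

# Hypothetical exclude patterns, matched against "/"-separated paths
# relative to the backup root.
EXCLUDE_PATTERNS = [
    re.compile(r"(^|/)node_modules$"),
    re.compile(r"(^|/)\.cache$"),
    re.compile(r"(^|/)target$"),
]


def is_excluded(rel_path, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern matches the relative directory path."""
    return any(p.search(rel_path) for p in patterns)


def iter_backup_files(root, patterns=EXCLUDE_PATTERNS):
    """Yield relative file paths under root, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        kept = []
        for d in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, d), root)
            if not is_excluded(rel.replace(os.sep, "/"), patterns):
                kept.append(d)
        # Assigning in place stops os.walk from descending into pruned dirs.
        dirnames[:] = kept
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, "/")
```

Pruning whole directories keeps the walk cheap, since excluded trees are never read at all.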

deajan avatar Jul 30 '22 17:07 deajan

I forgot to post that here, but I made a more detailed comparison between bupstash and kopia here:

  • https://github.com/nh2/bupstash-kopia-comparison

Somebody else made a comparison between bupstash, kopia, borg, and JMBB here:

  • https://masysma.lima-city.de/37/backup_tests_borg_bupstash_kopia.xhtml

nh2 avatar Aug 24 '22 00:08 nh2

@nh2 I think bupstash has slightly more parallelism now than it did then, but there is still more on the table to be added.

andrewchambers avatar Aug 24 '22 00:08 andrewchambers

Yes, that benchmark is from before some of those parallelism improvements.

nh2 avatar Aug 24 '22 00:08 nh2

@nh2 It must be backup comparison season. I've just finished a first round comparison of borg, restic, duplicacy, kopia and bupstash. I wanted to post results next week, but then this thread got updated.

@andrewchambers I am seeking comments from the devs so I can improve the backup benchmark for the next round. Your input is highly welcome (as an issue on the git repo).

See https://github.com/deajan/backup-bench

deajan avatar Aug 24 '22 08:08 deajan

@deajan I am not sure why the repository size changed for remote runs. It should not, so I am curious about the reason. The fact that it changed for another tool too might be a hint that there was some problem with the methodology.

andrewchambers avatar Aug 24 '22 09:08 andrewchambers

@andrewchambers The backup repos are reinitialized between local and remote runs, the git source is too. I'll have the whole source redownloaded from scratch after local backup runs instead of doing a git checkout just to make sure that there isn't anything I missed. Thanks for having a look, I intend to expand the benchmark script so next iterations will be as easy to benchmark. If you could perhaps review the init_bupstash_repository and backup_bupstash functions in order to check for sanity, that would be nice.

deajan avatar Aug 24 '22 09:08 deajan

Regarding https://github.com/andrewchambers/bupstash/issues/26#issuecomment-730882871 I wondered about the details of:

small files are packed into single chunks

I asked @andrewchambers on the chat channel about it, documenting answers here so they are easier to find:

So bupstash packs small files within the same directory into a chunk. bupstash internally uses directory boundaries as a sort of deduplication boundary that helps the dedup resync. As a consequence, it must split the chunk at directory boundaries and isn't able to pack more small files into it. I am considering other heuristics that can group directories. Technically it would be possible to pack many directories into a single chunk. It doesn't affect the repository format; it is something we can change.

bupstash concatenates all file data into a giant stream and dedups that by just splitting it into chunks. Our index is another stream that stores file names and their offsets into the data stream. It is quite different to every other backup tool I have seen; they tend to make a far more complex structure. I imagine it would be hard for them to apply the same trick. Of course, we have read amplification when you fetch a single file, since you might grab the chunk with X other files in it too.

Originally bupstash simply stored a deduplicated tarball. However, that doesn't support --pick, which I considered super important.

@nh2:

Within a directory, in which order does it concatenate files into a chunk? getdents() order, or sorted by file name, or something else?

@andrewchambers:

Sorted by filename. It just uses its own format for everything except for 'bupstash get', where it generates the tarball on the client side. It just keeps the .tar name as a default for historic reasons. I tried some fancier sorts, like sorted by reverse filename (to group by extension), but it didn't make much difference.
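To make the layout described in the chat easier to picture, here is a toy sketch of a data stream plus an index stream. It is heavily simplified and is not bupstash's actual format: it emits exactly one chunk per directory, omits the further splitting of large data, and all names are hypothetical.

```python
import posixpath
from collections import defaultdict


def build_streams(files, sort_key=None):
    """Pack files into per-directory chunks and build an offset index.

    files: dict mapping a "/"-separated path to its bytes.
    Returns (chunks, index) where index entries are (path, offset, length)
    and offset points into the concatenation of all chunks.
    """
    key = sort_key or (lambda name: name)
    by_dir = defaultdict(list)
    for path in files:
        by_dir[posixpath.dirname(path)].append(path)

    chunks, index, offset = [], [], 0
    for directory in sorted(by_dir):
        buf = bytearray()
        # Files within a directory are ordered by (a key of) their file name.
        ordered = sorted(by_dir[directory],
                         key=lambda p: key(posixpath.basename(p)))
        for path in ordered:
            data = files[path]
            index.append((path, offset + len(buf), len(data)))
            buf += data
        # A directory boundary always ends the current chunk.
        chunks.append(bytes(buf))
        offset += len(buf)
    return chunks, index


def read_file(chunks, index, wanted):
    """Fetch one file via the index (joins all chunks for simplicity)."""
    stream = b"".join(chunks)
    for path, offset, length in index:
        if path == wanted:
            return stream[offset:offset + length]
    raise KeyError(wanted)
```

A real tool would fetch only the chunk covering the requested range, which is where the read amplification comes from: the other small files packed into that chunk come along too. Passing `sort_key=lambda name: name[::-1]` approximates the reverse-filename ordering that groups files by extension.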

nh2 avatar Aug 24 '22 15:08 nh2

@deajan I am not sure why the repository size changed for remote runs. It should not, so I am curious about the reason. The fact that it changed for another tool too might be a hint that there was some problem with the methodology.

@andrewchambers I think I found the culprit. bupstash and duplicity create a lot of small files, whereas borg, kopia and restic create only a couple of files in their repository. Since I store those repositories remotely on ZFS instead of XFS, that might be the issue, because ZFS is configured out of the box with a 128k recordsize. I need to make a new set of benchmarks with a remote target using XFS.
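One way to check whether a size difference comes from the filesystem rather than from the backup tool is to compare the apparent size of the repository (the sum of file lengths) with the space actually allocated on disk. The sketch below does that for any directory; the function name is hypothetical, and it relies on `st_blocks`, which is only available on POSIX systems.

```python
import os


def repo_sizes(root):
    """Return (apparent_bytes, allocated_bytes, file_count) for a tree."""
    apparent = allocated = count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks counts 512-byte units on POSIX; it is missing on
            # Windows, in which case this sketch reports 0 allocated bytes.
            allocated += getattr(st, "st_blocks", 0) * 512
            count += 1
    return apparent, allocated, count
```

If the apparent sizes match between the local and remote runs while the allocated sizes differ, the gap is down to how the filesystem stores the files (block rounding, compression, and so on) and not to the data the tool wrote.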

deajan avatar Sep 06 '22 19:09 deajan