
Duplicacy

Open blackbit47 opened this issue 1 year ago • 2 comments

Hi @deajan, Awesome work!

I took a look at your script and I have a few suggestions:

1-For the backup and restore commands, please use the -threads option; start with 8 threads for your setup. It will significantly increase speed.

Increase -threads from 8 until you saturate the network link or see a decrease in speed.
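For example, a minimal sketch of what this looks like (the revision number is a placeholder; pick whichever revision you are restoring):

```shell
# Back up with 8 upload threads.
duplicacy backup -threads 8

# Restore revision 1 with 8 download threads (revision number is illustrative).
duplicacy restore -r 1 -threads 8
```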

2-During init, please play with the chunk size:

- -chunk-size, -c: the average size of chunks (default is 4M)
- -max-chunk-size, -max: the maximum size of chunks (default is chunk-size*4)
- -min-chunk-size, -min: the minimum size of chunks (default is chunk-size/4)

With homogeneous data, you should see smaller backups and better deduplication. See Chunk size details.
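As a sketch, these options are passed at init time (the snapshot id, storage URL, and size values below are placeholders, not recommendations):

```shell
# Illustrative only: "myrepo" and the storage URL are placeholders.
# A smaller average chunk can improve deduplication at the cost of more chunks.
duplicacy init -chunk-size 1M -max-chunk-size 4M -min-chunk-size 256K myrepo b2://my-bucket
```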

3-Some clarifications for your shopping list on Duplicacy:

1-Redundant index copies: Duplicacy doesn't use indexes (or a db).
2-Continue restore on bad blocks in repository: yes, and Erasure Coding.
3-Data checksumming: yes.
4-Backup mounting as filesystem: no (there is a FUSE implementation PR, but it is not likely short term).
5-File includes / excludes based on regexes: yes.
6-Automatically excludes CACHEDIR.TAG(3) directories: no.
7-Are metadata encrypted too?: yes.
8-Can encrypted / compressed data be guessed (CRIME/BREACH style attacks)?: no.
9-Can a compromised client delete backups?: no (with a public key and an immutable target; requires target setup).
10-Can a compromised client restore encrypted data?: no (with a public key).
11-Does the backup software support pre/post execution hooks?: yes, see Pre Command and Post Command Scripts.
12-Does the backup software provide a crypto benchmark?: there is a Benchmark command.

Important:

13-Duplicacy is serverless: less cost, less maintenance, less attack surface. This also means that Duplicacy will always be a bit slower, since it has to do a list before it uploads a particular chunk.
14-Duplicacy works with a ton of storage backends: infinitely scalable and more secure.
15-No indexes or databases.

16-You should test partial restore.
17-Test data should be a little bit more diverse. But I guess this is difficult.

Hope this helps a bit. Feel free to join the Forum.

Keep up the good work.

blackbit47 avatar Sep 07 '22 01:09 blackbit47

I've updated the comparison table with your remarks.

13-Duplicacy is serverless: less cost, less maintenance, less attack surface.
14-Duplicacy works with a ton of storage backends: infinitely scalable and more secure.

Does Duplicacy have a preferred self-hosted backend?

15-No indexes or databases.

I'm a bit puzzled. Since there are data chunks, there needs to be a description somewhere of what they are linked to... something like an index?

For now, I've added the -threads option for the next test round.

If I go the chunk size route, I'll have to do this for all backup solutions.

deajan avatar Sep 07 '22 18:09 deajan

Hi,

Indeed, the lack of an index or db is one of the most amazing design features of Duplicacy. Let me quote from the Lock-free deduplication algorithm:

"What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are not needed any more. Instead, to check if a chunk has already been uploaded before, one can just perform a file lookup via the file storage API using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage."
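The idea can be sketched in a few lines: name each chunk after the hash of its content, and the "has this chunk been uploaded?" question becomes a plain file lookup, with no shared index to lock. The paths and hash choice below are illustrative, not Duplicacy's actual storage layout:

```shell
# Minimal sketch of lock-free deduplication (illustrative, not Duplicacy's layout).
storage=$(mktemp -d)

upload_chunk() {
  chunk_file="$1"
  # Name the chunk after its content hash (content-addressable storage).
  hash=$(sha256sum "$chunk_file" | cut -d' ' -f1)
  if [ -e "$storage/$hash" ]; then
    # A simple file lookup replaces a centralized index query.
    echo "skip $hash (already stored)"
  else
    cp "$chunk_file" "$storage/$hash"
    echo "upload $hash"
  fi
}

printf 'hello' > /tmp/chunk1
upload_chunk /tmp/chunk1   # first time: uploaded
upload_chunk /tmp/chunk1   # second time: deduplicated by the lookup
```

Because the check is just a lookup against the storage itself, multiple clients can back up to the same target concurrently without coordinating through a locking mechanism.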

blackbit47 avatar Sep 07 '22 19:09 blackbit47