kind duplicacy and other tools benchmark report
Hello,
I'm currently doing benchmarks for deduplication backup tools, including duplicacy. I decided to write a script that would:
- Install the backup programs
- Prepare the source server
- Prepare local targets / remote targets
- Run backup and restore benchmarks
- Use publicly available data (Linux kernel sources as a git repo) and check out various git tags to simulate user changes in the dataset
The idea of the script is to produce reproducible results, the only changing factors being the machine specs and the network link between sources and targets.
So far, I've run two sets of benchmarks, each done locally and remotely. You can find the results at https://github.com/deajan/backup-bench
I'd love for you to review the recipe I used for duplicacy, and perhaps guide me on which parameters to use to get maximum performance. Any remarks / ideas / PRs are welcome.
I've also made a comparison table of some features of the backup solutions I'm benchmarking. I'm still missing some information for some of the backup programs. Would you mind having a look at the comparison table and filling in the question marks related to the features of duplicacy? Also, if duplicacy has an interesting feature I didn't list, I'll be happy to extend the comparison.
PS: I'm trying to be as unbiased as possible in these benchmarks; please forgive me if I didn't give your program the parameters it deserves.
Also, I've created the same issue in the git repo of every backup tool I'm testing, so every author / team / community member can judge / improve the instructions for better benchmarking.
Hi @deajan, Awesome work!
I took a look at your script and I have a few suggestions:
1-For the backup and restore commands, please use the -threads option, with 8 threads for your setup; it will significantly increase speed. Increase -threads from 8 until you saturate the network link or see a decrease in speed.
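For example, on an already-initialized repository the invocations could look like this (a sketch only; the revision number is a placeholder, and you should verify the flags against your duplicacy version):

```shell
# Back up with 8 upload threads (raise until the link saturates or speed drops)
duplicacy backup -stats -threads 8

# Restore revision 1 with 8 download threads
duplicacy restore -r 1 -threads 8
```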
2-During init, please play with the chunk size:
-chunk-size, -c
With homogeneous data, you should see smaller backups and better deduplication. See Chunk size details.
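A sketch of what the init call might look like (the 4M average chunk size and the SFTP URL are arbitrary examples, not recommendations; -c sets the average chunk size, while -min-chunk-size and -max-chunk-size bound it):

```shell
# Initialize a repository with a 4 MB average chunk size (example value only)
duplicacy init -c 4M -min-chunk-size 1M -max-chunk-size 16M \
    mysnapshot sftp://user@backupserver//backups/duplicacy
```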
3-Some clarifications for your shopping list on Duplicacy:
1-Redundant index copies: duplicacy doesn't use indexes.
2-Continue restore on bad blocks in repository: yes, and Erasure Coding.
3-Data checksumming: yes.
4-Backup mounting as filesystem: no (fuse implementation PR).
5-File includes / excludes based on regexes: yes.
6-Automatically excludes CACHEDIR.TAG(3) directories: no.
7-Are metadata encrypted too?: yes.
8-Can encrypted / compressed data be guessed (CRIME/BREACH-style attacks)?: no.
9-Can a compromised client delete backups?: no (with pub key and immutable target; requires target setup).
10-Can a compromised client restore encrypted data?: no (with pub key).
11-Does the backup software support pre/post execution hooks?: yes, see Pre Command and Post Command Scripts.
12-Does the backup software provide a crypto benchmark?: yes, there is a benchmark command.
Important:
13-Duplicacy is serverless: less cost, less maintenance, less attack surface.
14-Duplicacy works with a ton of storage backends: infinitely scalable and more secure.
15-No indexes or databases.
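The benchmark command mentioned in point 12 measures local chunking/encryption speed as well as storage throughput; a hedged sketch (thread counts are placeholders, check duplicacy benchmark's documented options for your version):

```shell
# Run duplicacy's built-in benchmark with multiple transfer threads
duplicacy benchmark -upload-threads 8 -download-threads 8
```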
Hope this helps a bit. Feel free to join the Forum.
Keep up the good work.
Thanks for your time; the table was updated. I've seen very bad restore speeds using duplicacy. Anything in mind I could try?
Also, would you have a link to something explaining why CRIME/BREACH-style attacks are not feasible?
Thinking of it, it seems that duplicacy has bigger repository sizes than its contenders. What's duplicacy's default compression algorithm and level, and how does one change it? All other programs use zstd:3.
Could I suggest trying out something like Backblaze's B2 with Duplicacy? I experimented today with restore times on SSH vs. B2, and B2 was 10x faster than SSH (and that was SSH to multiple remote hosts, just to confirm).
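If you want to add that comparison to the benchmark, duplicacy's B2 backend takes a b2:// storage URL; a sketch (bucket and snapshot names are placeholders, and B2 credentials are normally prompted for at init time):

```shell
# Initialize against a Backblaze B2 bucket (placeholder names)
duplicacy init mysnapshot b2://my-benchmark-bucket

# Then back up with multiple threads as usual
duplicacy backup -stats -threads 8
```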