Bundle or chunk
This PR modifies duplicacy's bundle-and-chunk algorithm into bundle-OR-chunk: files are processed in two passes, and chunks are split at file boundaries whenever possible (without regard to chunk size limits). See the lengthy discussion in issue #334.
In the first pass, files smaller than the minimum chunk size are bundled together; in the second pass, all larger files are processed and split into chunks as before.
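A minimal sketch of the two-pass classification, in Go (the type and function names here are hypothetical and only illustrate the idea, not the actual patch):

```go
package main

import "fmt"

// fileEntry is a stand-in for duplicacy's internal file entry type.
type fileEntry struct {
	Path string
	Size int64
}

// splitByChunkSize illustrates the two passes: files below the minimum
// chunk size are bundled together first, and everything else is chunked
// individually afterwards, so every large file starts at a chunk boundary.
func splitByChunkSize(files []fileEntry, minChunkSize int64) (small, large []fileEntry) {
	for _, f := range files {
		if f.Size < minChunkSize {
			small = append(small, f) // pass 1: bundle these together
		} else {
			large = append(large, f) // pass 2: chunk these as before
		}
	}
	return small, large
}

func main() {
	files := []fileEntry{
		{"a.txt", 512},
		{"photo.jpg", 4 << 20},
		{"b.txt", 2048},
	}
	small, large := splitByChunkSize(files, 1<<20) // assuming a 1 MiB minimum chunk size
	fmt.Println("bundled:", small)
	fmt.Println("chunked:", large)
}
```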
In addition, every file smaller than the minimum chunk size has its content hash checked against the current and previous snapshots, so a file whose content is already stored can be deduplicated directly.
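As a rough sketch of how such a content-hash lookup could work (hypothetical names throughout; SHA-256 is used purely for illustration and is not necessarily the hash duplicacy uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// knownFile records where previously stored content lives; the fields are
// hypothetical, not duplicacy's actual snapshot entry layout.
type knownFile struct {
	chunkIndex  int
	startOffset int
}

// lookupSmallFile checks whether content with the same hash is already
// referenced by the current or a previous snapshot; if so, the existing
// chunk reference can be reused instead of repacking the bytes.
func lookupSmallFile(index map[string]knownFile, content []byte) (knownFile, bool) {
	sum := sha256.Sum256(content)
	ref, ok := index[hex.EncodeToString(sum[:])]
	return ref, ok
}

func main() {
	index := map[string]knownFile{}

	// Pretend a previous snapshot stored this content in chunk 7 at offset 128.
	previous := []byte("small file content")
	sum := sha256.Sum256(previous)
	index[hex.EncodeToString(sum[:])] = knownFile{chunkIndex: 7, startOffset: 128}

	if ref, ok := lookupSmallFile(index, []byte("small file content")); ok {
		fmt.Printf("duplicate: reuse chunk %d at offset %d\n", ref.chunkIndex, ref.startOffset)
	}
}
```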
With these changes duplicacy handles copying, renaming and moving of files flawlessly. Small files are immediately recognized as duplicates based on the hash of their content, while larger files that might get split into multiple chunks always start at the beginning of a chunk and thus produce the same chunks as before. Unnecessary chunks should never be created just because files are packed differently than before.
It sounds like this would be able to be applied to storage with existing chunks. Is that correct? Or would enabling this require starting over or maybe using '-hash'?
From the discussion, I also believe that it will just work, without a reset of all the backups.
It will work with existing storage for sure, since technically everything is still compatible. However, the patch significantly changes where chunk boundaries are formed, so running with -hash is absolutely not recommended: small files should be recognized fine, but larger files will now be aligned at chunk boundaries, causing the chunks at the beginning and end of each file to not match previously stored chunks.
Switching to this implementation on the fly with an existing storage should probably not have any significant short-term negative effect (although I have not tested or measured it, and there are certainly some extreme scenarios where it might look bad on the first run). Anyone care to compare incremental dry-runs for this patch versus the original code and share the results?
Since normal incremental backups only process files that have been modified, those files must be bundled and/or split into chunks, and efficiency boils down to how many of the resulting chunks are identical to chunks already in the storage. The original bundle-AND-chunk algorithm places files more or less randomly within and across chunks, while this bundle-OR-chunk algorithm provides more stability by aligning file boundaries with chunk boundaries, which in my opinion should result in better deduplication.
I like the two-pass approach for its simplicity, but I have reservations about the small file deduplication. One side effect of the small file deduplication is that the referenced chunk will remain in the storage even after all other files contained in that chunk have been removed. It may be more space-efficient to repack small files into new chunks.
My crystal ball is broken: I cannot guess when and which files will be removed in a way that would let us drop an old chunk. Always repacking small files into new chunks is guaranteed to waste storage up front, in the hope that old chunks might someday be removed. Reusing parts of old chunks for identical files saves us that storage now, and someone else can worry about what the future holds. That is my vote.
Perhaps a refined prune could repack leftover parts of chunks allowing them to be properly discarded?
For this PR to move forward, I would like to keep only the two-pass changes (with a -two-pass option), and move the small file deduplication to a different PR. If this is too much work for you, we can leave this as it is and I'll create a new one just for the two-pass changes.
I do not really have time to work on this right now. Feel free to improve my solution in any way you wish, but I still argue all parts of this PR support each other and no significant part should be left out. Put everything behind a -alternate-universe option or whatever you feel comfortable with. :)
Related to the small file deduplication: may I suggest that files are packed together based on their folder? Because generally when a user deletes something, he deletes the files and then the folder.
More details:
- pack until either the minimum chunk size is reached, or the file that pushes the chunk over the minimum fits entirely into the same chunk (rather than being split across two chunks), but not more than that (i.e. do not keep packing towards averageChunkSize): this gives a little more reassurance that the (smaller, under my suggestion) chunks can be deleted sooner/more easily
- sort the small files by folder, so that small files belonging to the same folder (which in my opinion will be deleted together) are packed into the same chunk (see the sketch after this list)
- I am not sure if one folder per chunk is a good idea -- that might create waaaay too many chunks, but packing until minChunkSize could work
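A sketch of the folder-grouping step in Go (the function is hypothetical; it only shows the ordering, not duplicacy's actual packer):

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// sortSmallFilesByFolder orders small-file paths by their parent directory,
// so files that are likely to be deleted together (same folder) end up
// packed into the same chunks. Ties are broken by full path.
func sortSmallFilesByFolder(paths []string) {
	sort.SliceStable(paths, func(i, j int) bool {
		di, dj := filepath.Dir(paths[i]), filepath.Dir(paths[j])
		if di != dj {
			return di < dj
		}
		return paths[i] < paths[j]
	})
}

func main() {
	paths := []string{
		"photos/2019/a.jpg",
		"docs/readme.txt",
		"photos/2019/b.jpg",
		"docs/notes.txt",
	}
	sortSmallFilesByFolder(paths)
	fmt.Println(paths)
	// A packer could then cut a chunk once minChunkSize is reached,
	// preferably at a directory boundary.
}
```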
> Perhaps a refined prune could repack leftover parts of chunks allowing them to be properly discarded?
I've wanted to add another column to the tabular "check" output that calculates how much "wasted" space there is. Something that figures out how much data could be freed by deleting a chunk, but is locked up because some valid data is still in the chunk. Presenting this in a meaningful way could be tricky without getting very verbose.
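One plausible way to compute such a "wasted space" figure, assuming per-chunk reference accounting is available (hypothetical names; this is not how duplicacy's check currently tracks chunks):

```go
package main

import "fmt"

// chunkUsage describes one stored chunk: its total size and how many of
// its bytes are still referenced by files in any retained snapshot.
type chunkUsage struct {
	totalBytes      int64
	referencedBytes int64
}

// wastedBytes sums the bytes that cannot be freed: a chunk must be kept
// because part of it is still referenced, but the remainder is dead weight.
func wastedBytes(chunks []chunkUsage) int64 {
	var wasted int64
	for _, c := range chunks {
		if c.referencedBytes > 0 && c.referencedBytes < c.totalBytes {
			wasted += c.totalBytes - c.referencedBytes
		}
	}
	return wasted
}

func main() {
	chunks := []chunkUsage{
		{totalBytes: 4 << 20, referencedBytes: 4 << 20}, // fully used
		{totalBytes: 4 << 20, referencedBytes: 1 << 20}, // 3 MiB locked up
		{totalBytes: 2 << 20, referencedBytes: 0},       // unreferenced: prunable, not wasted
	}
	fmt.Printf("wasted: %d bytes\n", wastedBytes(chunks))
}
```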
I really like the bundle-or-chunk algorithm and would love to see it implemented. @gchen, what do you need for this to be merged?
@mobad Unless I misunderstand what you are asking for, those tests have been done extensively in issue https://github.com/gilbertchen/duplicacy/issues/334
As an aside, I'd also like to invite anyone interested in duplicacy to join the discussion forum at https://forum.duplicacy.com
This pull request has been mentioned on Duplicacy Forum. There might be relevant details there:
http://forum.duplicacy.com/t/how-to-investigate-growing-size/1299/6
I'm evaluating Duplicacy and would love to see this feature accepted, regardless of the following note, since moving/renaming files is quite common.
Test: I compiled @kairisku's branch and tested it on a repository containing a few photos (> min-chunk-size) and a few small files (i.e. < min-chunk-size). All the small files end up in a single chunk. Moving files into subfolders and/or renaming them did not in general result in new file chunks, and restore worked correctly. Very nice! But I did observe two quirks:
- The renamed file was not reported unless `-d` is specified. Perhaps always reporting this activity would make the behavior clearer (i.e. indicate that a changed file was found)? Example with `-d`:
  `DEBUG FILE_HASH Hash of file cA.txt matches previously known file xA.txt`
- Renaming/moving large files works as expected. However, for the small files, if `-hash` was specified, renaming/moving any but the last small file results in a new data chunk the size of all the files. (Renaming the last file doesn't change the order in which `-hash` scans the files.) In other words, it appears that with `-hash` it doesn't check whether the individual files are already in storage. Again, I would not want to hold up merging this code into the next release because of this odd edge case, but it seems worth noting. (And of course, it's no worse than the current release.)
Example: create and populate initial repository, then rename xA.txt to cA.txt then:
>..\bin\duplicacy backup -stats -dry-run
...
Files: 11 total, 34,168K bytes; 0 new, 0 bytes
File chunks: 6 total, 34,168K bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 3 total, 2K bytes; 1 new, 1K bytes, 1K bytes uploaded
All chunks: 9 total, 34,170K bytes; 1 new, 1K bytes, 1K bytes uploaded
Total running time: 00:00:01

>..\bin\duplicacy backup -stats -dry-run -hash
...
Backup for E:\go\data at revision 2 completed
Files: 11 total, 34,168K bytes; 11 new, 34,168K bytes
File chunks: 6 total, 34,168K bytes; 1 new, 463K bytes, 432K bytes uploaded
Metadata chunks: 3 total, 2K bytes; 2 new, 2K bytes, 1K bytes uploaded
All chunks: 9 total, 34,170K bytes; 3 new, 466K bytes, 434K bytes uploaded
(Windows 10, storage on disk, FWIW.)
An incremental backup using bundle_or_chunk with fixed-sized chunks (maxChunkSize == minChunkSize) recently failed with an error such as
The snapshot contains an error: The file xxx has a size of 640 but the total size of chunks is 0
I had difficulty producing a simplified example, but it was consistent on a 7GB dataset after two incrementals had succeeded. Eventually I traced it to issues related to zero-length files.
I've fixed it on my end but am stymied in two ways as to how to add it here.
- it appears to involve a "silent error" in the main branch, so do I submit a new pull request against the main branch (where it fixes something that may not matter) or against bundle_or_chunk, where it really matters (and which is also what I've based my fix on)?
- Regardless, if modifying bundle_or_chunk, for this or another change to this branch, do I submit a pull request to @kairisku's or @gilbertchen's GitHub repo? (Especially if a fix addresses a comment mentioned above!)
- If to Kai: GitHub won't let me fork Kai's fork since I've already forked Gilbert's original (of course I've pulled Kai's branches into my local repo in order to modify them). I can see that I can push and then submit a pull request to Kai, but I have never done this before. Can someone advise how this is normally handled? (Is there some way to submit directly to this PR, for example?)
P.S. if anyone's wondering, I'm the same person as the previous poster but this is my personal account, which seems more appropriate in retrospect.
When fixed-size chunking is activated, Duplicacy doesn't bundle and split files, because fixed-size chunking is sensitive to insertion and deletion. Instead, files are split and uploaded individually.
So there is no point in using bundle_or_chunk with fixed-size chunking.
@gilbertchen, thank you for the reply. I think Duplicacy is awesome and really appreciate what you have done. I'm also looking forward to getting the Web interface when it's out of beta.
Let me clarify my previous comment:
1. I have fixed the bug and am mainly asking whether to make a separate PR or add it to this one.
2. I have confirmed that the bug, and my fix, are entirely in your master branch (but see 4, below).
3. The problem is that references are generated to a zero-length chunk and then associated with non-zero-length files.
4. For example, I used Duplicacy 2.1.2 on my C:\Users... repository and then wrote code to evaluate the saved snapshot (a sketch of that kind of check follows this list). I get:
   2717 Entries reference Zero-length chunks. 0 of these files are actually zero-length. The files total 291,891,961 bytes.
5. While so far this is just a minor annoyance, if you were to merge this branch into master, it would fail as mentioned above (https://github.com/gilbertchen/duplicacy/pull/456#issuecomment-471156048).
6. As long as the bug is present, future revisions (or perhaps even the current version) could break in the same way.
7. My fix is relatively simple and straightforward.
8. (If you want to evaluate/use my proposed fix, go back to item 1. ;)
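To illustrate the kind of check referred to in item 4, here is a hedged Go sketch that scans simplified snapshot entries for the inconsistency described above; the types are hypothetical and this is not the author's actual evaluation code or fix:

```go
package main

import "fmt"

// fileRef is a simplified view of a snapshot entry: the file's recorded
// size and the sizes of the chunk pieces its content maps to.
type fileRef struct {
	path       string
	size       int64
	pieceSizes []int64
}

// findBrokenRefs reports entries whose chunk pieces do not add up to the
// recorded file size, which is the condition behind the error quoted above
// ("has a size of 640 but the total size of chunks is 0").
func findBrokenRefs(entries []fileRef) []string {
	var broken []string
	for _, e := range entries {
		var total int64
		for _, s := range e.pieceSizes {
			total += s
		}
		if total != e.size {
			broken = append(broken, fmt.Sprintf("%s: size %d, chunk total %d", e.path, e.size, total))
		}
	}
	return broken
}

func main() {
	entries := []fileRef{
		{path: "ok.txt", size: 640, pieceSizes: []int64{640}},
		{path: "xxx", size: 640, pieceSizes: []int64{0}}, // points at a zero-length chunk
	}
	for _, msg := range findBrokenRefs(entries) {
		fmt.Println(msg)
	}
}
```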
> When fixed-size chunking is activated, Duplicacy doesn't bundle and split files, because fixed-size chunking is sensitive to insertion and deletion. Instead, files are split and uploaded individually. So there is no point in using bundle_or_chunk with fixed-size chunking.
@gilbertchen At the risk of further muddying the waters, bundle_or_chunk, with its small-file deduplication, can fix this problem too. In fact, I've already coded it locally and am testing it now. So far it works spectacularly, cutting down the number of chunks by a factor of 10 for 1M fixed-size chunks and by 2-3x for 256KB chunks, without affecting incremental upload size compared to v2.1.2 (when backing up my C:\User\ folders). This opens the possibility of using fixed-size chunking of regular files for people wanting to minimize upload size without generating thousands of tiny files -- especially with cloud providers such as S3 that charge a minimum billable size per file. Another possible use case is a mixed repository of large "container" files and small files. (Or, in some future revision, packaging small chunks into bigger chunks for uploading, as in Duplicati, but that may never be worth doing.)
| version | strategy | chunks |
|---|---|---|
| v2.1.2 | fixed 1M | 57,748 |
| bundle_or_chunk v2* | fixed 1M | 5,800 |
| v2.1.2 | variable 1M | 5,847 |
* with my new code, initial tests.
> At the risk of further muddying the waters, bundle_or_chunk, with its small-file deduplication, can fix this problem too.

@arikorn I doubt that. When you bundle small files into fixed-size chunks, you'll have problems generating the same set of chunks independently on two computers from two almost identical sets of files (assume one computer misses a few files). Small-file deduplication won't help here, because the snapshot file from one computer may not be available to the other.
> @gilbertchen wrote: When fixed-size chunking is activated, Duplicacy doesn't bundle and split files, because fixed-size chunking is sensitive to insertion and deletion. ... So there is no point in using bundle_or_chunk with fixed-size chunking.
>
> @arikorn wrote: At the risk of further muddying the waters, bundle_or_chunk, with its small-file deduplication, can fix this problem too.
>
> @gilbertchen wrote: I doubt that. When you bundle small files into fixed-size chunks, you'll have problems generating the same set of chunks independently on two computers from two almost identical sets of files (assume one computer misses a few files). Small-file deduplication won't help here, because the snapshot file from one computer may not be available to the other.
TL;DR: Well, if you don't like the word "fix," perhaps you would concede "improve"? ;-) However, at the end of this message I'll suggest a way to address your concern. And just to keep the original question in mind: this was part of the reason for considering fixing a bug in Duplicacy (or more precisely, for considering a fix I have already implemented); see https://github.com/gilbertchen/duplicacy/pull/456#issuecomment-471824451, above.
So, @gilbertchen, it sounds like we agree that there are times when variable chunking is recommended and times when fixed-size chunking is recommended. Likewise, we agree that fixed-size chunking in Duplicacy v2.1.2 is advisable only under very restrictive conditions (i.e. both 1. very few files and 2. fixed-length files where insertions are unlikely). In that light, perhaps you'll agree that small-file deduplication could expand the use cases in which fixed-length chunking is a viable option: namely, by removing the "few files" restriction and allowing fixed-length chunking for files of all sizes where insertions are unlikely, as long as any of the following are true:
- only a single computer is associated with the storage or
- the multiple computers share the same file structure or even same snapshot name (such as my "jumpstarting" use-case, in which Duplicacy allows me to do the initial backup from a different computer) or
- when the total size of "shared" small files is small. Interestingly, the latter may be common, since by their nature, small files are, well, small! or
- One is willing to forgive some lost deduplication opportunities in exchange, for example, for smaller individual backups on each source computer. Again, we expect the excess to be small because the files that were missed are small, whereas the improved backup size is due to the fixed-size chunking of larger files.
Here's a real-world example, to make it more concrete: I backup a User directory in which the overwhelming majority of files are small, but they still make up only 15% of the total size. In this case, most of the large files are fixed-length DBs, and probably no files experience insertions. In my ongoing benchmarks, my implementation of fixed-size chunking with small-file deduplication results in the smallest uploads and smallest storage size.
Back to your specific concern: I can think of two ways to make small-file, fixed-chunk deduplication more compliant with your cross-computer requirement (or more specifically, to make it no worse than the penalty one would pay using variable chunking). One you already mentioned: check the other snapshots in the storage, which of course must be accessible to all computers that can access the storage. Perhaps simpler, though, would be to try to break chunks at directory boundaries once minChunkSize has been reached (see the sketch below). This could work similarly to the look-ahead I mentioned to you elsewhere. This latter approach would also make @TheBestPessimist happy (https://github.com/gilbertchen/duplicacy/pull/456#issuecomment-401684393) ;)
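A minimal Go sketch of that directory-boundary idea, under the assumption that small files arrive sorted by folder (hypothetical types and names; a real packer would stream file content rather than hold whole files):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// smallFile is a stand-in for a small file queued for bundling.
type smallFile struct {
	path string
	size int64
}

// packByDirectory bundles small files in order and cuts a chunk once the
// minimum chunk size has been reached and the next file starts a new
// directory, so chunks tend to hold whole folders.
func packByDirectory(files []smallFile, minChunkSize int64) [][]smallFile {
	var chunks [][]smallFile
	var current []smallFile
	var currentSize int64
	for i, f := range files {
		current = append(current, f)
		currentSize += f.size
		lastFile := i+1 == len(files)
		nextIsNewDir := !lastFile && filepath.Dir(files[i+1].path) != filepath.Dir(f.path)
		if currentSize >= minChunkSize && (nextIsNewDir || lastFile) {
			chunks = append(chunks, current)
			current, currentSize = nil, 0
		}
	}
	if len(current) > 0 {
		chunks = append(chunks, current) // leftover tail that never reached minChunkSize
	}
	return chunks
}

func main() {
	files := []smallFile{
		{"docs/a.txt", 600_000}, {"docs/b.txt", 600_000},
		{"photos/c.jpg", 900_000}, {"photos/d.jpg", 200_000},
	}
	for i, c := range packByDirectory(files, 1_000_000) {
		fmt.Printf("chunk %d: %v\n", i, c)
	}
}
```

Because the cut points depend only on folder layout and file sizes, two computers with almost identical trees would tend to cut chunks at the same places, which is the property the cross-computer concern is about.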
(Regarding my original point about fixing the misplaced chunks bug, I'll submit a PR, soon, and let you choose if and when you want it on its own merit.)
I'm new to Duplicacy. I downloaded it and started backing up to B2 a week ago. I'm on version "2.2.3 (58387C)". Does this use the new "bundle or chunk" method by default, with no configuration and/or flags required? I have 7TB to back up, and high level directories are moved and/or renamed often, so efficient deduplication is one of my top concerns.
Edit: Oh wait, it looks like this hasn't been merged into master. I doubt begging works, but I'm begging you to merge this! :-D Merging this could save me a huge pile of cash in B2 bills. Without it, I have to resume looking at other products. (In months of research I haven't found a better option, but there are two-step options that use a local intermediary--and I also back up to local anyway--that might work brilliantly.)
Also, this would allow the future addition of options to change the file order for backing up. The current alphabetical (or node-ID based?) order doesn't work for my particular use case, since A) I have 7TB [varies between 5 and 15TB] of data to back up, B) the older my data gets, the more ways it's already backed up both locally and in the cloud, and C) I literally cannot wait one or two years for my newest data to get backed up to the cloud. So I wind up having to do mount --bind and filter tricks to get Duplicacy to better serve my requirements... which also involves prioritizing photos as a secondary criterion after "newest first". (And mount --bind tricks further compound the "rename or move" problem that this PR addresses.)
This pull request has been mentioned on Duplicacy Forum. There might be relevant details there:
https://forum.duplicacy.com/t/change-order-of-files-backing-up/2621/6
https://forum.duplicacy.com/t/change-order-of-files-backing-up/2621/8
https://forum.duplicacy.com/t/change-order-of-files-backing-up/2621/1
https://forum.duplicacy.com/t/roadmap-for-the-cli/2709/1
https://forum.duplicacy.com/t/possibility-of-a-hybrid-chunker/6991/2
https://forum.duplicacy.com/t/how-to-add-webdav-storage-with-http-only-in-web-edition/7227/26