Deduping files on APFS results in data loss if the source file uses APFS compression
When using jdupes to deduplicate files on macOS with APFS, data loss can occur when a duplicate file is linked to a file that's using APFS compression.
After some brief experimentation, it seems that the issue specifically occurs when an uncompressed file is linked to (cloned from) a compressed file. When this is attempted, the uncompressed file is replaced with a file that appears to be empty. I believe the compressed contents are actually placed in the extended attributes of the file as a resource fork, but for all intents and purposes, I'd still consider this to be data loss since the file can't be read normally after the deduplication. (I also didn't check the contents of the fork to see if the file was intact.)
I didn't notice any issues when going the other direction (deduping a compressed file by cloning an uncompressed file). When doing that, the compressed file is just replaced with the uncompressed one, which is what I'd expect.
Things also seem to be okay when both files being deduplicated are compressed. However, I'm not sure whether this scenario will always work. (For example, if APFS supports multiple compression algorithms, I'm not sure what would happen when deduping two duplicate files that use different compression algorithms.)
Environment
- macOS: 11.6 (20G165)
- jdupes: 1.20.0 (2021-05-12) 64-bit (installed using Homebrew)
- afsctool: RJVB/afsctool built from source (main branch, currently commit e93b3c7)
Although I'll be using afsctool in the reproduction steps to create a compressed file, this is mainly just to simplify things. I actually ran into the issue when deduplicating two copies of Xcode, one of which wasn't compressed. If you'd like to try that instead of using afsctool, you can extract Xcode.xip, clone or move Xcode.app/Contents/Info.plist to a/hello.txt, copy (not clone) Xcode.app/Contents/Info.plist to b/hello.txt, and then try to deduplicate the two files as in the rest of my reproduction steps.
Reproduction Steps
- First, create a large, compressible text file. (I just ran find ~ > hello.txt for a few seconds.)

- Copy the large file to two directories:

  $ mkdir a b
  $ cp hello.txt a
  $ cp hello.txt b

- Obtain file hashes, disk usage, and character counts (used for future comparisons):

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
   43M    a/hello.txt
   43M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
   44699648 b/hello.txt
   44699648 hello.txt
   134098944 total

- Compress a/hello.txt:

  $ afsctool -c a/hello.txt

- Verify checksums remain unchanged and that a/hello.txt uses less disk space since it has been compressed:

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
  6.8M    a/hello.txt
   43M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
   44699648 b/hello.txt
   44699648 hello.txt
   134098944 total

- Dedupe the files using jdupes, using -O to ensure that a/hello.txt is used as the source for the APFS clone:

  $ jdupes --dedupe -r -O a b
  Scanning: 2 files, 2 items (in 2 specified)
  [SRC] a/hello.txt
     -##-> b/hello.txt

- Observe that although the file size of b/hello.txt now matches that of the compressed a/hello.txt file, the actual character count and hash are now incorrect:

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  d41d8cd98f00b204e9800998ecf8427e  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
  6.8M    a/hello.txt
  6.8M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
          0 b/hello.txt
   44699648 hello.txt
   89399296 total

Additionally, if you look at the extended attributes of the files, you can see that the file with apparent data loss has a resource fork with the compressed data. This explains why the disk usage for the file is still 6 MB, despite the file appearing to be completely empty. (I suppose this means the data wasn't technically lost, but I'd still consider this data loss from most perspectives.)

$ /bin/ls -al@ **/*.txt
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:19 a/hello.txt
-rw-r--r--@ 1 rpendleton  staff         0 Oct 15 23:19 b/hello.txt
        com.apple.ResourceFork  7102986
        com.apple.decmpfs            16
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:19 c/hello.txt
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:02 hello.txt
Expected Result
Ideally, after deduplicating a file using a compressed one as the source of the clone, I'd like both files to be compressed and fully intact. At the very least, though, I'd expect not to lose data.
Additional Notes
If you copy a compressed file with cp -c or Finder, you can verify that a normal APFS clone doesn't result in data loss. This would lead me to believe that the issue can be resolved in jdupes:
$ cp hello.txt hello-source.txt
$ afsctool -c hello-source.txt
$ cp -c hello-source.txt hello-dest1.txt
$ cp hello-source.txt hello-dest2.txt
$ md5sum hello-*
e34d0738b1d6b397a9e91d3df0eac358 hello-dest1.txt
e34d0738b1d6b397a9e91d3df0eac358 hello-dest2.txt
e34d0738b1d6b397a9e91d3df0eac358 hello-source.txt
$ du -hs hello-*
6.8M hello-dest1.txt
43M hello-dest2.txt
6.8M hello-source.txt
$ wc -c hello-*
44699648 hello-dest1.txt
44699648 hello-dest2.txt
44699648 hello-source.txt
134098944 total
I looked into this a bit more and think I know what's going on.
When HFS+ and APFS compress files, an extended attribute named com.apple.decmpfs is added to the file. The value of the attribute starts with a 16-byte header. For small files, the header is followed by the compressed (or even uncompressed) data. For larger files, the extended attribute only includes the header and the compressed data is stored as a resource fork instead. (This resource fork is actually exposed as an extended attribute as well, but it's separate from the com.apple.decmpfs attribute.)
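For illustration (this isn't jdupes code), here's a minimal sketch of how that header could be inspected on macOS. The field layout shown is the commonly documented decmpfs on-disk header, and XATTR_SHOWCOMPRESSION is passed because the attribute is normally hidden from xattr calls:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* The 16-byte header at the start of com.apple.decmpfs, following the
 * commonly documented layout (see afsctool / the XNU decmpfs headers). */
struct decmpfs_header {
    uint32_t magic;             /* 'cmpf' */
    uint32_t compression_type;  /* which compressor / storage scheme is used */
    uint64_t uncompressed_size; /* logical size of the decompressed file */
};

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct decmpfs_header hdr;
    /* XATTR_SHOWCOMPRESSION makes the normally hidden attribute visible. */
    ssize_t len = getxattr(argv[1], "com.apple.decmpfs", &hdr, sizeof(hdr),
                           0, XATTR_SHOWCOMPRESSION);
    if (len < (ssize_t)sizeof(hdr)) {
        printf("%s: no com.apple.decmpfs attribute\n", argv[1]);
        return 0;
    }
    printf("%s: compression type %u, uncompressed size %llu bytes\n",
           argv[1], hdr.compression_type,
           (unsigned long long)hdr.uncompressed_size);
    return 0;
}
```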
Although this kinda explains what the com.apple.decmpfs and com.apple.ResourceFork extended attributes are, it doesn't explain why the attributes only appear on the deduped file and not on the source file it was cloned from. For that, it turns out one of the flags a file can have is UF_COMPRESSED. This flag indicates that the file is using file-system compression, and when it's present, the com.apple.decmpfs extended attribute (and possibly the resource fork) should be respected.
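The flag itself can be checked with a plain lstat(); again, just a small sketch rather than jdupes code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if 'path' has UF_COMPRESSED set, i.e. the filesystem should
 * transparently decompress it via com.apple.decmpfs (and, for larger
 * files, the resource fork). */
static bool is_fs_compressed(const char *path)
{
    struct stat st;
    return lstat(path, &st) == 0 && (st.st_flags & UF_COMPRESSED) != 0;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s: %scompressed\n", argv[i],
               is_fs_compressed(argv[i]) ? "" : "not ");
    return 0;
}
```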
In jdupes' case, immediately after calling clonefile, the original dupfile's metadata is preserved by calling copyfile(..., COPYFILE_METADATA). It seems this call results in the UF_COMPRESSED flag being overridden, which means the extended attributes are no longer respected. (Although I find it interesting that the attributes aren't deleted. Maybe they'd be overwritten if they existed on the dupfile?)
Possible Solution
I made some rough changes (https://github.com/rpendleton/jdupes/commit/06f9e0e9421d9ef3fdc6e2d250daf94fb501c891) to test this theory. By saving the source file's UF_COMPRESSED flag and reapplying it after the clonefile and copyfile are complete, I was able to dedupe files using compressed sources without any issues.
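For reference, here's a simplified sketch of the idea, not the actual commit; the dedupe_pair helper, the temp-file suffix, and the minimal error handling are all made up for illustration. The sequence is: clone the file, let the metadata copy run, then reapply UF_COMPRESSED taken from the clone source.

```c
#include <copyfile.h>
#include <limits.h>
#include <stdio.h>
#include <sys/clonefile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative only: replace duplicate 'dst' with a clone of 'src',
 * restore the duplicate's original metadata, then make sure the clone
 * source's UF_COMPRESSED flag survives that metadata copy. */
static int dedupe_pair(const char *src, const char *dst)
{
    struct stat src_st, dst_st;
    char tmp[PATH_MAX];

    if (lstat(src, &src_st) != 0) return -1;

    /* Keep the original duplicate around so its metadata can be copied
     * back after the clone (the temp name is made up for this sketch). */
    snprintf(tmp, sizeof(tmp), "%s.dedupe_tmp", dst);
    if (rename(dst, tmp) != 0) return -1;

    /* Replace the duplicate with an APFS clone of the source. */
    if (clonefile(src, dst, 0) != 0) {
        rename(tmp, dst);
        return -1;
    }

    /* Copy the original duplicate's metadata onto the clone; per the
     * analysis above, this is what clears UF_COMPRESSED. */
    if (copyfile(tmp, dst, NULL, COPYFILE_METADATA) != 0) return -1;

    /* Reapply UF_COMPRESSED if the clone source was compressed, so the
     * decmpfs attribute / resource fork is honored again. */
    if (src_st.st_flags & UF_COMPRESSED) {
        if (lstat(dst, &dst_st) == 0 && !(dst_st.st_flags & UF_COMPRESSED))
            chflags(dst, dst_st.st_flags | UF_COMPRESSED);
    }

    return unlink(tmp);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source> <duplicate>\n", argv[0]);
        return 1;
    }
    return dedupe_pair(argv[1], argv[2]) == 0 ? 0 : 1;
}
```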
For now, I'm not planning on opening a PR for these changes. I haven't tested them that thoroughly so I'm not really confident in them. Additionally, I feel like there should be a better way to handle this. (Perhaps a better way of copying metadata?) I'm also somewhat concerned about the fact that the copyfile is replacing some of the metadata, but evidently not all of it.
Regardless of those concerns, I figured I'd post this update either way since it might lead to a more robust solution.
I concur.
This is an obvious bug in macOS. It doesn't mean that a workaround isn't justified (it'd still be necessary for OS revisions prior to a fix), but Apple's clonefile() call not doing what it's supposed to do is the actual problem here. Your workaround is probably a fine solution. I don't use Macs the vast majority of the time, so I haven't tested it myself.
I would strongly prefer that any solution not add to the global stat() cache. It would be better to pull those special flags on demand during final processing, even if it means making a few redundant calls. Everything added to that cache increases memory and CPU usage for every file examined, some people use this program to deduplicate millions of files at a time, and (so far) the flags are only needed at dedupe time.
I agree with all of those points.
In addition to moving away from the cache, I think there are a few more changes that should be made:
- It seems chflags results in the modified date being updated, so that should probably be fixed.
- The compression flag likely won't differ between most files, so it would be good to check whether the flags are actually different before doing unnecessary changes after the copyfile (see the sketch after this list).
- The unlinks in my error handling likely aren't necessary. I also wasn't sure which errors should be considered failures.
- If the unlinks are actually necessary, they should probably be changed to remove since that's portable.
- It looks like the function I modified is using static variables instead of local variables. If that's important, then the new variables I defined should likely be changed to statics.
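To make the first two points concrete, here's a rough sketch of a hypothetical helper that only calls chflags when the flag actually differs and then restores the timestamps it captured beforehand (it assumes utimensat is available, i.e. macOS 10.13 or later):

```c
#include <fcntl.h>    /* AT_FDCWD */
#include <sys/stat.h>

/* Hypothetical helper: set UF_COMPRESSED on 'path' only when it isn't
 * already set, then put the timestamps captured beforehand back so the
 * chflags call doesn't disturb them. */
static int reapply_compressed_flag(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0) return -1;

    /* Most files won't need any change, so skip the chflags entirely. */
    if (st.st_flags & UF_COMPRESSED) return 0;

    if (chflags(path, st.st_flags | UF_COMPRESSED) != 0) return -1;

    /* Restore the access/modification times saved before the flag change. */
    struct timespec times[2] = { st.st_atimespec, st.st_mtimespec };
    return utimensat(AT_FDCWD, path, times, 0);
}

int main(int argc, char **argv)
{
    return (argc == 2 && reapply_compressed_flag(argv[1]) == 0) ? 0 : 1;
}
```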
Thanks.
Would you please make a PR for me so I can bring this in?
I haven't opened a PR yet since I still have to fix the loss of timestamp metadata and actually test my most recent changes. Once I've done that though, I'll open a PR.
Have you had a chance to test that, @rpendleton? I was contemplating building a sample data test set for this, but if you already have, I'll hold off on that.
@tvwerkhoven has a script that protects metadata and seems to be getting some interest on Stack Exchange. It works (since it's using cp -c), but it's a little more roundabout than running jdupes' clone operation directly.
Does anyone have an update on this? I've noticed a lot of buzz in the @tvwerkhoven script comments but no pull requests from anyone to fix this yet.