Deduping files on APFS results in data loss if the source file uses APFS compression
When using jdupes to deduplicate files on macOS with APFS, data loss can occur when a duplicate file is linked to a file that's using APFS compression.
After some brief experimentation, it seems that the issue specifically occurs when an uncompressed file is linked to (cloned from) a compressed file. When this is attempted, the uncompressed file is replaced with a file that appears to be empty. I believe the compressed contents are actually placed in the extended attributes of the file as a resource fork, but for all intents and purposes, I'd still consider this to be data loss since the file can't be read normally after the deduplication. (I also didn't check the contents of the fork to see if the file was intact.)
I didn't notice any issues when going the other direction (deduping a compressed file by cloning an uncompressed file). When doing that, the compressed file is just replaced with the uncompressed one, which is what I'd expect.
Things also seem to be okay when both files being deduplicated are compressed. However, I'm not sure whether this scenario will always work. (For example, if APFS supports multiple compression algorithms, I'm not sure what would happen when deduping two duplicate files that use different compression algorithms.)
Environment
- macOS: 11.6 (20G165)
- jdupes: 1.20.0 (2021-05-12) 64-bit (installed using Homebrew)
- afsctool: RJVB/afsctool built from source (main branch, currently commit e93b3c7)
Although I'll be using afsctool in the reproduction steps to create a compressed file, this is mainly just to simplify things. I actually ran into the issue when deduplicating two copies of Xcode, one of which wasn't compressed. If you'd like to try that instead of using afsctool, you can extract Xcode.xip, clone or move Xcode.app/Contents/Info.plist to a/hello.txt, copy (not clone) Xcode.app/Contents/Info.plist to b/hello.txt, and then try to deduplicate the two files as in the rest of my reproduction steps.
Reproduction Steps
- First, create a large, compressible text file. (I just ran find ~ > hello.txt for a few seconds.)

- Copy the large file to two directories:

  $ mkdir a b
  $ cp hello.txt a
  $ cp hello.txt b

- Obtain file hashes, disk usage, and character counts (used for future comparisons):

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
   43M    a/hello.txt
   43M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
   44699648 b/hello.txt
   44699648 hello.txt
   134098944 total

- Compress a/hello.txt:

  $ afsctool -c a/hello.txt

- Verify checksums remain unchanged and that a/hello.txt uses less disk space since it has been compressed:

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
  6.8M    a/hello.txt
   43M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
   44699648 b/hello.txt
   44699648 hello.txt
   134098944 total

- Dedupe the files using jdupes, using -O to ensure that a/hello.txt is used as the source for the APFS clone:

  $ jdupes --dedupe -r -O a b
  Scanning: 2 files, 2 items (in 2 specified)
  [SRC] a/hello.txt
     -##-> b/hello.txt

- Observe that although the file size of b/hello.txt now matches that of the compressed a/hello.txt file, the actual character count and hash are now incorrect:

  $ md5sum **/*.txt
  e34d0738b1d6b397a9e91d3df0eac358  a/hello.txt
  d41d8cd98f00b204e9800998ecf8427e  b/hello.txt
  e34d0738b1d6b397a9e91d3df0eac358  hello.txt
  $ du -hs **/*.txt
  6.8M    a/hello.txt
  6.8M    b/hello.txt
   43M    hello.txt
  $ wc -c **/*.txt
   44699648 a/hello.txt
          0 b/hello.txt
   44699648 hello.txt
   89399296 total

Additionally, if you look at the extended attributes of the files, you can see that the file with apparent data loss has a resource fork with the compressed data. This explains why the disk usage for the file is still 6 MB, despite the file appearing to be completely empty. (I suppose this means the data wasn't technically lost, but I'd still consider this data loss from most perspectives.)

$ /bin/ls -al@ **/*.txt
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:19 a/hello.txt
-rw-r--r--@ 1 rpendleton  staff         0 Oct 15 23:19 b/hello.txt
        com.apple.ResourceFork  7102986
        com.apple.decmpfs            16
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:19 c/hello.txt
-rw-r--r--  1 rpendleton  staff  44699648 Oct 15 23:02 hello.txt
Expected Result
Ideally, after deduplicating a file using a compressed one as the source of the clone, I'd like both files to be compressed and fully intact. At the very least, though, I'd expect not to lose data.
Additional Notes
If you copy a compressed file with cp -c or Finder, you can verify that a normal APFS clone doesn't result in data loss. This would lead me to believe that the issue can be resolved in jdupes:
$ cp hello.txt hello-source.txt
$ afsctool -c hello-source.txt
$ cp -c hello-source.txt hello-dest1.txt
$ cp hello-source.txt hello-dest2.txt
$ md5sum hello-*
e34d0738b1d6b397a9e91d3df0eac358 hello-dest1.txt
e34d0738b1d6b397a9e91d3df0eac358 hello-dest2.txt
e34d0738b1d6b397a9e91d3df0eac358 hello-source.txt
$ du -hs hello-*
6.8M hello-dest1.txt
43M hello-dest2.txt
6.8M hello-source.txt
$ wc -c hello-*
44699648 hello-dest1.txt
44699648 hello-dest2.txt
44699648 hello-source.txt
134098944 total
I looked into this a bit more and think I know what's going on.
When HFS+ and APFS compress files, an extended attribute named com.apple.decmpfs is added to the file. The value of the attribute starts with a 16-byte header. For small files, the header is followed by the compressed (or even uncompressed) data. For larger files, the extended attribute only includes the header and the compressed data is stored as a resource fork instead. (This resource fork is actually exposed as an extended attribute as well, but it's separate from the com.apple.decmpfs attribute.)
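For illustration (this isn't jdupes code), here's a minimal sketch of how that header could be inspected on macOS. The field layout shown is the commonly documented decmpfs on-disk header, and XATTR_SHOWCOMPRESSION is passed because the attribute is normally hidden from xattr calls:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* The 16-byte header at the start of com.apple.decmpfs, following the
 * commonly documented layout (see afsctool / the XNU decmpfs headers). */
struct decmpfs_header {
    uint32_t magic;             /* 'cmpf' */
    uint32_t compression_type;  /* which compressor / storage scheme is used */
    uint64_t uncompressed_size; /* logical size of the decompressed file */
};

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct decmpfs_header hdr;
    /* XATTR_SHOWCOMPRESSION makes the normally hidden attribute visible. */
    ssize_t len = getxattr(argv[1], "com.apple.decmpfs", &hdr, sizeof(hdr),
                           0, XATTR_SHOWCOMPRESSION);
    if (len < (ssize_t)sizeof(hdr)) {
        printf("%s: no com.apple.decmpfs attribute\n", argv[1]);
        return 0;
    }
    printf("%s: compression type %u, uncompressed size %llu bytes\n",
           argv[1], hdr.compression_type,
           (unsigned long long)hdr.uncompressed_size);
    return 0;
}
```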
Although this kinda explains what the com.apple.decmpfs and com.apple.ResourceFork extended attributes are, it doesn't explain why the attributes only appear on the deduped file and not on the source file it was cloned from. For that, it turns out one of the flags a file can have is UF_COMPRESSED. This flag indicates that the file is using file-system compression, and when it's present, the com.apple.decmpfs extended attribute (and possibly the resource fork) should be respected.
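The flag itself can be checked with a plain lstat(); again, just a small sketch rather than jdupes code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if 'path' has UF_COMPRESSED set, i.e. the filesystem should
 * transparently decompress it via com.apple.decmpfs (and, for larger
 * files, the resource fork). */
static bool is_fs_compressed(const char *path)
{
    struct stat st;
    return lstat(path, &st) == 0 && (st.st_flags & UF_COMPRESSED) != 0;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s: %scompressed\n", argv[i],
               is_fs_compressed(argv[i]) ? "" : "not ");
    return 0;
}
```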
In jdupes' case, immediately after calling clonefile, the original dupfile's metadata is preserved by calling copyfile(..., COPYFILE_METADATA). It seems this call results in the UF_COMPRESSED flag being overridden, which means the extended attributes are no longer respected. (Although I find it interesting that the attributes aren't deleted. Maybe they'd be overwritten if they existed on the dupfile?)
Possible Solution
I made some rough changes (https://github.com/rpendleton/jdupes/commit/06f9e0e9421d9ef3fdc6e2d250daf94fb501c891) to test this theory. By saving the source file's UF_COMPRESSED flag and reapplying it after the clonefile and copyfile are complete, I was able to dedupe files using compressed sources without any issues.
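For reference, here's a simplified sketch of the idea, not the actual commit; the dedupe_pair helper, the temp-file suffix, and the minimal error handling are all made up for illustration. The sequence is: clone the file, let the metadata copy run, then reapply UF_COMPRESSED taken from the clone source.

```c
#include <copyfile.h>
#include <limits.h>
#include <stdio.h>
#include <sys/clonefile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative only: replace duplicate 'dst' with a clone of 'src',
 * restore the duplicate's original metadata, then make sure the clone
 * source's UF_COMPRESSED flag survives that metadata copy. */
static int dedupe_pair(const char *src, const char *dst)
{
    struct stat src_st, dst_st;
    char tmp[PATH_MAX];

    if (lstat(src, &src_st) != 0) return -1;

    /* Keep the original duplicate around so its metadata can be copied
     * back after the clone (the temp name is made up for this sketch). */
    snprintf(tmp, sizeof(tmp), "%s.dedupe_tmp", dst);
    if (rename(dst, tmp) != 0) return -1;

    /* Replace the duplicate with an APFS clone of the source. */
    if (clonefile(src, dst, 0) != 0) {
        rename(tmp, dst);
        return -1;
    }

    /* Copy the original duplicate's metadata onto the clone; per the
     * analysis above, this is what clears UF_COMPRESSED. */
    if (copyfile(tmp, dst, NULL, COPYFILE_METADATA) != 0) return -1;

    /* Reapply UF_COMPRESSED if the clone source was compressed, so the
     * decmpfs attribute / resource fork is honored again. */
    if (src_st.st_flags & UF_COMPRESSED) {
        if (lstat(dst, &dst_st) == 0 && !(dst_st.st_flags & UF_COMPRESSED))
            chflags(dst, dst_st.st_flags | UF_COMPRESSED);
    }

    return unlink(tmp);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source> <duplicate>\n", argv[0]);
        return 1;
    }
    return dedupe_pair(argv[1], argv[2]) == 0 ? 0 : 1;
}
```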
For now, I'm not planning on opening a PR for these changes. I haven't tested them that thoroughly so I'm not really confident in them. Additionally, I feel like there should be a better way to handle this. (Perhaps a better way of copying metadata?) I'm also somewhat concerned about the fact that the copyfile is replacing some of the metadata, but evidently not all of it.
Regardless of those concerns, I figured I'd post this update either way since it might lead to a more robust solution.
I concur.
This is an obvious bug in macOS. It doesn't mean that a workaround isn't justified (it'd still be necessary for OS revisions prior to a fix), but Apple's clonefile() call not doing what it's supposed to do is the actual problem here. Your workaround is probably a fine solution. I don't use Macs the vast majority of the time, so I haven't tested it myself.
I would strongly prefer that any solution not add to the global stat() cache. It would be better to pull those special flags on demand during final processing, even if it means making a few redundant calls. Everything added to that cache increases memory and CPU usage for every file examined, some people use this program to deduplicate millions of files at a time, and (so far) the flags are only needed at dedupe time.
I agree with all of those points.
In addition to moving away from the cache, I think there are a few more changes that should be made:
- It seems chflags results in the modified date being updated, so that should probably be fixed.
- The compression flag likely won't differ between most files, so it would be good to check whether the flags are actually different before doing unnecessary changes after the copyfile (see the sketch after this list).
- The unlinks in my error handling likely aren't necessary. I also wasn't sure which errors should be considered failures.
- If the unlinks are actually necessary, they should probably be changed to remove since that's portable.
- It looks like the function I modified is using static variables instead of local variables. If that's important, then the new variables I defined should likely be changed to statics.
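To make the first two points concrete, here's a rough sketch of a hypothetical helper that only calls chflags when the flag actually differs and then restores the timestamps it captured beforehand (it assumes utimensat is available, i.e. macOS 10.13 or later):

```c
#include <fcntl.h>    /* AT_FDCWD */
#include <sys/stat.h>

/* Hypothetical helper: set UF_COMPRESSED on 'path' only when it isn't
 * already set, then put the timestamps captured beforehand back so the
 * chflags call doesn't disturb them. */
static int reapply_compressed_flag(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0) return -1;

    /* Most files won't need any change, so skip the chflags entirely. */
    if (st.st_flags & UF_COMPRESSED) return 0;

    if (chflags(path, st.st_flags | UF_COMPRESSED) != 0) return -1;

    /* Restore the access/modification times saved before the flag change. */
    struct timespec times[2] = { st.st_atimespec, st.st_mtimespec };
    return utimensat(AT_FDCWD, path, times, 0);
}

int main(int argc, char **argv)
{
    return (argc == 2 && reapply_compressed_flag(argv[1]) == 0) ? 0 : 1;
}
```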
Thanks.
Would you please make a PR for me so I can bring this in?
I haven't opened a PR yet since I still have to fix the loss of timestamp metadata and actually test my most recent changes. Once I've done that though, I'll open a PR.
Have you had a chance to test that, @rpendleton? I was contemplating building a sample data test set for this, but if you already have, I'll hold off on that.
@tvwerkhoven has a script that protects metadata and seems to be getting some interest on Stack Exchange. It works (since it's using cp -c), but it's a little more roundabout than running jdupes' clone operation directly.
Does anyone have an update on this? I've noticed a lot of buzz in the @tvwerkhoven script comments but no pull requests from anyone to fix this yet.