rdfind
Provide option for deduplication subprocess
I've noticed that btrfs deduplication seems to be on the roadmap. I would suggest that it might be easier to provide an option for a subprocess which performs the deduplication on rdfind's behalf.
Then, regardless of system configuration, a programmer can plug in a tool that simply does whatever "deduplication" means on their filesystem.
:+1: Personally I would like to be able to use rdfind with the equivalent of `cp -c` on APFS for de-duplication.
Yep, this is definitely of high interest. I like delegating the tricky work of being compatible with different OSes and file systems by just invoking, for instance, GNU cp with the `--reflink=auto` flag. The performance of this approach is probably low, since one has to launch a process for each file, but the end result is nice.
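To make the idea concrete, here is a minimal sketch of what delegating to a user-defined command could look like. It is not rdfind code: the `{original}`/`{duplicate}` placeholders and both function names are made up for illustration, and rdfind itself has no such option today.

```python
import shlex
import subprocess


def build_command(template, original, duplicate):
    """Expand a user-supplied command template into an argv list.

    The template is split first and the placeholders are filled in
    afterwards, so paths containing spaces stay single arguments.
    The {original}/{duplicate} placeholder names are hypothetical.
    """
    return [
        part.format(original=original, duplicate=duplicate)
        for part in shlex.split(template)
    ]


def dedupe_with_subprocess(template, original, duplicate):
    """Run the external tool for one duplicate pair; return its exit code."""
    cmd = build_command(template, original, duplicate)
    # One process per file pair, which is the performance cost noted above.
    return subprocess.run(cmd, check=False).returncode
```

With GNU cp, the template could be something like `cp --reflink=auto -- {original} {duplicate}`. A real implementation would probably want to write to a temporary name and rename it over the duplicate, so a failed command does not leave a half-written file.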
I have just attempted a macOS variant using the platform-native `clonefile` in my fork. I could change it to invoke an external, user-defined process as you guys have suggested.
Testing it made me realize that there is an inherent problem with cloning as a de-duplication mechanism. It all works fine the first time. The second time you run it, it comes across the cloned files, thinks they are duplicates, and tries to clone them again, whatever cloning mechanism you use. Unlike a hard link, a clone has a different inode from the original. This effectively means that all clones are re-cloned on every run.
I am not sure if this is acceptable. Theoretically, re-cloning would not involve any data copying, but it would change file inodes on every run. That can lead to strange behavior with filesystem snapshots, or with file sync/backup software that depends on inode numbers to track filesystem differences.
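The inode difference is easy to see with a small experiment. This sketch uses a plain copy as a stand-in for a clone, on the assumption that a clone is likewise a separate file with its own inode, so it runs on filesystems without reflink support. The function name is made up.

```python
import os
import shutil
import tempfile


def inode_comparison():
    """Return (hardlink_shares_inode, copy_shares_inode) for a test file."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        hardlink = os.path.join(tmp, "hardlink")
        copy = os.path.join(tmp, "copy")
        with open(original, "w") as fh:
            fh.write("same content\n")
        os.link(original, hardlink)       # second name for the same inode
        shutil.copyfile(original, copy)   # separate file, stand-in for a clone
        ino = os.stat(original).st_ino
        return (os.stat(hardlink).st_ino == ino,
                os.stat(copy).st_ino == ino)
```

A tool that compares inode numbers can tell the hard link is already deduplicated, but has nothing comparable to go on for the second file.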
Do you guys have any idea what could be done about this?
The only thing I could think of is using extended attributes to track files that have been cloned by rdfind, so that it at least does not try to re-clone the same files over and over again. It would still mean that files cloned by other software will be re-cloned on the first run.
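A rough sketch of the extended-attribute idea, assuming Linux, where Python exposes `os.setxattr`/`os.getxattr` (macOS would need a different API). The attribute name `user.rdfind.cloned` and both helpers are hypothetical; rdfind defines no such marker.

```python
import os

# Hypothetical attribute name; rdfind does not define this marker.
MARKER = "user.rdfind.cloned"


def mark_cloned(path):
    """Tag path as already cloned; return False if xattrs are unavailable."""
    if not hasattr(os, "setxattr"):  # os.setxattr only exists on Linux
        return False
    try:
        os.setxattr(path, MARKER, b"1")
        return True
    except OSError:  # filesystem without user xattrs, permissions, ...
        return False


def is_marked(path):
    """Return True only if path carries the marker attribute."""
    if not hasattr(os, "getxattr"):
        return False
    try:
        return os.getxattr(path, MARKER) == b"1"
    except OSError:  # attribute missing, file missing, or unsupported
        return False
```

One caveat: the marker stays on the file if its contents are modified later, so a real implementation would likely need to store something like the size and mtime in the attribute value and re-check them.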
Deduplication is a different thing, since it's transparent to the filesystem, while the goal of rdfind is to explicitly clean the filesystem of duplicated files.
I usually do both (a combination of) things.
See also 'duperemove -vdr