rdfind
Better processing of duplicate inodes
As described in the README, identical inodes can currently be dealt with in one of two ways:
- with -removeidentinode true, one might need to run rdfind several times to merge two hardlinked groups of identical files (each run will do md5 computation on these two inodes)
- with -removeidentinode false, this can be done in one pass, at the expense of doing md5 computation on each of the hardlinks (I assume so).
I suggest improving this in such a way that instead of "collapsing a group of hardlinked files to a single entry" and dealing with that entry alone, at step 5 of the algorithm we keep a list of all hardlinks attached to each entry, and after step 12 of the algorithm (after checksum comparison), if the "main entry" is still in the list of duplicates, we add all of its hardlinks to the list too. A sketch of this bookkeeping follows.
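Below is a minimal C++ sketch of what I mean; the type and function names here (`Entry`, `collapseIdenticalInodes`, `expandDuplicates`) are hypothetical and do not correspond to rdfind's actual internals, they only illustrate remembering the sibling hardlinks so they can be re-attached after the checksum comparison.

```cpp
// Minimal sketch of the proposed bookkeeping; the struct names and fields
// are hypothetical, not rdfind's real types.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Entry {
  std::string path;      // representative path for this inode
  std::uint64_t dev = 0; // st_dev
  std::uint64_t ino = 0; // st_ino
};

// Step 5 (collapse): keep one representative per (dev, inode), but remember
// every other path pointing at the same inode instead of discarding it.
struct CollapseResult {
  std::vector<Entry> representatives;
  std::map<std::pair<std::uint64_t, std::uint64_t>, std::vector<std::string>>
      siblings; // extra hardlink paths per inode
};

CollapseResult collapseIdenticalInodes(const std::vector<Entry>& files) {
  CollapseResult out;
  std::map<std::pair<std::uint64_t, std::uint64_t>, std::size_t> seen;
  for (const Entry& f : files) {
    const auto key = std::make_pair(f.dev, f.ino);
    if (seen.find(key) == seen.end()) {
      seen.emplace(key, out.representatives.size());
      out.representatives.push_back(f);
    } else {
      out.siblings[key].push_back(f.path); // same inode, just another link
    }
  }
  return out;
}

// After step 12 (checksum comparison): if a representative survived as a
// duplicate, re-attach all of its hardlinks so they are processed too,
// without the inode ever being hashed more than once.
std::vector<std::string>
expandDuplicates(const std::vector<Entry>& duplicates,
                 const CollapseResult& collapsed) {
  std::vector<std::string> all;
  for (const Entry& d : duplicates) {
    all.push_back(d.path);
    const auto it = collapsed.siblings.find({d.dev, d.ino});
    if (it != collapsed.siblings.end()) {
      all.insert(all.end(), it->second.begin(), it->second.end());
    }
  }
  return all;
}
```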
This would perhaps work, but it complicates the most sensitive part of the deduplication and I am very scared of introducing bugs there. From the emails people send after having seen the reach-out in the manual page (I just love those messages!) I know rdfind is used on a lot of real systems, and I don't want to upset anyone by losing their files. If you want to help, please provide (or extend the existing) unit tests to cover this case and I will feel more comfortable changing this. Thanks, Paul
This issue affects me too. I can use -removeidentinode false to get around it, but it makes the process take a lot longer since it hashes the same file (dev/inode) for each link.
A couple of ideas I thought of that might help, and that may be simpler to implement than keeping a list of hardlinks for each file:
- If you use rdfind repeatedly over time to deduplicate files, you'll probably end up with one version of the file with many links, and then any new copies will have one link each. In this case, rdfind has to "choose" which copy is the link source and which copy is the link target. An enhancement to choose the copy with the most links as the one to "keep", and then link the one-link versions to it, would reduce the occurrence of multiple hardlinked groups of identical files (see the first sketch after this list).
- When scanning files to find duplicates (first byte, last byte, checksums), store a table of the hashes keyed by dev/inode, so that if rdfind encounters another hardlink to a file it has already hashed, it reuses the existing hash instead of re-hashing. This would make -removeidentinode false a lot faster when many hardlinked duplicates already exist, and it wouldn't affect the deduplication code (see the second sketch after this list).
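A rough sketch of the first idea, assuming plain POSIX lstat() rather than rdfind's own file metadata structures; pickKeeper is a hypothetical helper, not an existing rdfind function.

```cpp
// Hypothetical helper: prefer the path with the highest existing link count
// as the file to keep, so single-link copies get linked into the largest
// existing hardlink group. This is not rdfind's current ranking logic.
#include <sys/stat.h>
#include <sys/types.h>
#include <algorithm>
#include <string>
#include <vector>

nlink_t linkCount(const std::string& path) {
  struct stat st {};
  return (lstat(path.c_str(), &st) == 0) ? st.st_nlink : 0;
}

// Given a set of identical files, pick the one with the most hardlinks as
// the "original"; the remaining ones become link targets.
std::string pickKeeper(const std::vector<std::string>& identical) {
  if (identical.empty()) {
    return {};
  }
  return *std::max_element(identical.begin(), identical.end(),
                           [](const std::string& a, const std::string& b) {
                             return linkCount(a) < linkCount(b);
                           });
}
```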
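And a sketch of the second idea: a checksum cache keyed by (device, inode), so each inode is hashed at most once no matter how many links point to it. computeChecksum here is only a placeholder standing in for rdfind's real md5/sha routines.

```cpp
// Hypothetical checksum cache keyed by (device, inode): compute the hash
// once per inode and reuse it for every additional hardlink encountered.
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Placeholder standing in for rdfind's actual checksum routines; a real
// implementation would read and hash the file contents.
std::string computeChecksum(const std::string& path) {
  return "checksum-of:" + path;
}

class ChecksumCache {
public:
  const std::string& get(std::uint64_t dev, std::uint64_t ino,
                         const std::string& path) {
    auto [it, inserted] = m_cache.try_emplace(std::make_pair(dev, ino));
    if (inserted) {
      it->second = computeChecksum(path); // first time we see this inode
    }
    return it->second; // any further hardlink reuses the stored value
  }

private:
  std::map<std::pair<std::uint64_t, std::uint64_t>, std::string> m_cache;
};
```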
Hey @pauldreik. I agree with @Lex-2008, this would be very useful. My C++ skills aren't fantastic, but after quickly glancing over the code, the change Lex suggests feels small. Perhaps change it so that "-makehardlinks group" is a valid argument, keeping rdfind backwards-compatible. Not sure what you mean by a unit test for this, but I threw together a small bash script that tests what would be useful for me.
I also believe that if -removeidentinode false were more efficient and avoided multiple hashing, then -removeidentinode true would perhaps become useless. -removeidentinode false could then become the default, which would make rdfind behave as expected "by default" when hard-linked files exist in the input.