Output sort by hard link reference count
No error, but more of a question:
I plan to use jdupes to dedupe some rsnapshot backup trees by using hardlinks (rsnapshot uses hardlinks for unchanged files between backup generations, but does not catch moved or renamed files). I don't want to regularly scan all old backup generations, but rather only the current backup against the preceding one.
It is important that when a duplicate is found the file in the current backup is replaced by a hardlink to the file in the preceding backup (and not the other way round) because the file from the preceding backup might already be hardlinked multiple times into even older backup generations.
e.g. I want the two hardlink groups (gen0) (gen1 gen2 gen3) to become a single hardlink group (gen0 gen1 gen2 gen3) rather than two different groups (gen0 gen1) (gen2 gen3) when I instruct jdupes to only check gen0 and gen1.
Is there any kind of stable sort order regarding which file of a duplicate set is retained and which file is replaced by a hardlink?
Could this be achieved by using the --paramorder parameter?
The files are generally hard linked to the first file that is listed in the group when you print the matches, assuming all other options are the same. I'm going to call this a feature request for a way to have -L mode order files by hard link count in descending order. Does that sound like it will help?
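For illustration, a minimal sketch of that ordering, assuming each duplicate-set entry carries the st_nlink value captured at scan time; entry_t and its fields are invented names for this example, not jdupes' real structures:

```c
#include <stdlib.h>
#include <sys/types.h>

/* Illustrative entry: one filename in a duplicate set plus the st_nlink
 * value recorded when it was scanned. */
typedef struct {
    const char *path;
    nlink_t link_count;
} entry_t;

/* qsort comparator: highest hard link count first. */
static int cmp_nlink_desc(const void *a, const void *b)
{
    const entry_t *x = a, *y = b;
    if (x->link_count > y->link_count) return -1;
    if (x->link_count < y->link_count) return 1;
    return 0;
}

/* Usage: qsort(set, count, sizeof(entry_t), cmp_nlink_desc);
 * set[0] then has the most existing links and becomes the link source. */
```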
That sounds good! I've also thought about the hard link count and "always add to the bigger group". I'm not 100% sure that this will cover all cases one could encounter (I don't have a counter-example, it's more of a feeling in my gut), but for any case involving more than 2 files "always add to the bigger group" sounds way better than "link to one of possibly multiple files that lies somewhere within the directory given first on the commandline" :-)
Let's say you have a group of 5 hard links, one of 3, and one of 2. Some of the 5-group's members are in the duplicate results but not all; all of the 3-group and 2-group are included. If, by some chance outcome of the final ordering, we delete the scanned members of the 5-group and link them to a member of one of the other two groups, we detach them from the 5-group and add them to a new group together with the 3 and 2 that were previously separate. In this case we end up with two link groups (3 and 7 hard links, respectively) and therefore still two copies present.
Now let's examine what happens with the maximum reference count being used as the basis for selecting a source file. The 5-group members have a nlink of 5, the others 3 and 2, respectively. Because the 5-group has the highest nlink, the 3/2 groups will be linked to a member of the 5-group, merging them all into a single hard link group of 10 files.
Let's add another 5-group, so we have 5/5/3/2. Let's assume two files in each 5-group are picked up by the duplicate scan, so only 2+2+3+2 files will be linked. One 5-group's members will be absorbed into the other for 7 linked files, then the 3/2 groups, for a final count of 12 and 3 leftovers from the second 5-group that weren't part of the duplicate scan.
Let's instead take four 5-groups, 5/5/5/5, of which 3+2+2+1 files are in the scanned set. The first group will absorb the scanned members of the others, leaving four groups of 10, 3, 3, and 4 files with no real benefit. This is the worst-case scenario.
If files are not included in the duplicate scan, they can't be considered, so the worst case is not "fixable" without scanning the locations containing the missing files in the set. However, prioritizing the group with the highest nlink count as the one that absorbs the other files stands the best chance of reducing the total number of link groups and thus the number of copies of a file on disk.
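For concreteness, here is a throwaway simulation of the scenarios above under the "link everything to the highest-nlink group" policy; the struct, numbers, and merge routine are a toy model, not jdupes code.

```c
#include <stdio.h>

/* Toy model: one record per existing hard link group. nlink is its total
 * link count, scanned is how many of its names were in the duplicate scan. */
struct group { int nlink; int scanned; };

/* Re-link every scanned name to the group with the highest total nlink;
 * unscanned names stay behind in their original, now smaller, groups. */
static void merge_into_highest(struct group *g, int n)
{
    int target = 0;
    for (int i = 1; i < n; i++)
        if (g[i].nlink > g[target].nlink) target = i;
    for (int i = 0; i < n; i++) {
        if (i == target) continue;
        g[target].nlink += g[i].scanned;
        g[i].nlink -= g[i].scanned;
        g[i].scanned = 0;
    }
    printf("resulting group sizes:");
    for (int i = 0; i < n; i++)
        if (g[i].nlink > 0) printf(" %d", g[i].nlink);
    printf("\n");
}

int main(void)
{
    /* 5/3/2 case: part of the 5-group plus all of the 3- and 2-groups. */
    struct group a[] = { {5, 2}, {3, 3}, {2, 2} };
    merge_into_highest(a, 3);                    /* prints: 10 */

    /* 5/5/3/2 case with 2+2+3+2 names scanned. */
    struct group b[] = { {5, 2}, {5, 2}, {3, 3}, {2, 2} };
    merge_into_highest(b, 4);                    /* prints: 12 3 */

    /* 5/5/5/5 worst case with 3+2+2+1 names scanned. */
    struct group c[] = { {5, 3}, {5, 2}, {5, 2}, {5, 1} };
    merge_into_highest(c, 4);                    /* prints: 10 3 3 4 */
    return 0;
}
```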
I'm currently working on a proof-of-concept fix for the triangle problem. It will group all hardlinks together before doing anything else (so basically, file_t will represent an inode, while filenames will be a list referenced by file_t).
Doing that, for every inode (or group in your example above), we know:
- total link count (nlink)
- links we are processing (the number of filenames scanned for the inode)
- number of hardlinks outside of our scanned files (subtract the filename count from nlink)
After that it should just be a matter of finding the correct entry in the duplicate list (e.g. by looking for the highest nlink value) in act_linkfiles.c and swapping that entry with srcfile.
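To make the shape of that concrete, here is a rough sketch of such an inode-first layout under invented names; it is not the real jdupes file_t, only an illustration of the three quantities listed above.

```c
#include <sys/types.h>

/* One scanned filename belonging to an inode. */
struct name_entry {
    char *path;
    struct name_entry *next;
};

/* One record per inode (what file_t would represent in this scheme). */
struct inode_entry {
    dev_t device;
    ino_t inode;
    nlink_t nlink;              /* total link count from stat() */
    unsigned int scanned_names; /* how many filenames the scan found for it */
    struct name_entry *names;   /* list of those filenames */
};

/* Hard links that exist outside the scanned set. */
static unsigned int unscanned_links(const struct inode_entry *e)
{
    return (unsigned int)e->nlink - e->scanned_names;
}
```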
You're right about the worst-case scenario – perhaps a note or warning should be shown then: "jdupes does not know where they are, but there are still duplicates of your files and you still have separate sets."
That's not what the triangle problem is. It's when two files that shouldn't be grouped are indirectly grouped anyway by a third intermediate file that both separately group with.
I don't think all of the effort you are considering is going to be a better solution for the issue at hand. Linking to the highest-nlink files first provides the best results in every scenario I can come up with, and there is no practical solution for the worst case.
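As a minimal, self-contained sketch of that selection step (illustrative only, not jdupes' act_linkfiles code), assuming the stat() results for one duplicate set are already at hand:

```c
#include <stddef.h>
#include <sys/stat.h>

/* Return the index of the duplicate-set member with the highest st_nlink;
 * that file is kept and every other member is replaced by a link to it. */
static size_t pick_link_source(const struct stat *set, size_t count)
{
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (set[i].st_nlink > set[best].st_nlink)
            best = i;
    return best;
}
```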
I agree that there should be a warning when linking about missing link partners. It's too bad that no Unix filesystems have an easy way to find those names without a full filesystem scan.
I'm trying to do the same as the parent, cleaning up rsync trees.
The current version (v1.9) does not combine hardlink groups even if all the files are in the arguments. Say there are 3 sets: (gen0 gen1) (gen2 gen3 gen4) (gen5 gen6). Running jdupes -L gen0 gen1 gen2 gen3 gen4 gen5 gen6 results in: (gen0 gen1 gen2 gen5) (gen3 gen4) (gen6). My current solution is to run it multiple times, but this is a little inefficient.
I'm not sure the root cause is the hardlink sorting; I think it's a case of not all hardlinks in a set being moved to the new inode.
Any chance it can be changed so that an entire set is moved across, i.e. the result would be: (gen0 gen1 gen2 gen3 gen4 gen5 gen6)?
@GreasyMonk Have you tried it with the -H option as well? The algorithm omits hard linked pairs from duplicate pairing by default. I am curious whether you get different results, and I think that forcing -H on for -L would make sense.
No, -H does not do the trick. I created a test with 3 sets: (gena genb genc) (gend gene genf) (geng genh)
~/test$ stat --printf="%n %i %h\n" *
gena 92150354 3
genb 92150354 3
genc 92150354 3
gend 92150353 3
gene 92150353 3
genf 92150353 3
geng 92150355 2
genh 92150355 2
~/test$ jdupes -HL *
Scanning: 3 files, 3 items (in 7 specified)
[SRC] gena
----> gend
----> geng
~/test$ stat --printf="%n %i %h\n" *
gena 92150354 5
genb 92150354 5
genc 92150354 5
gend 92150354 5
gene 92150353 2
genf 92150353 2
geng 92150354 5
genh 92150355 1
~/test$ jdupes -v
jdupes 1.9 (2017-12-03) 64-bit
Compile-time extensions: none
Copyright (C) 2015-2017 by Jody Bruchon
Well, you found a bug introduced by my code that accepts files as arguments: it rejects hard links even if -H is passed. I'll fix that and try again.
I fixed that bug; using jdupes -HL on the same set configuration you just tried now produces 8 files all hard linked together. See commit f8af2397bfca5a72e6bce2027b4c6b1878dceef8