Make -m hard link aware and clarify what "occupying X bytes" means
While using jdupes to further deduplicate my rsnapshot backups, I got very high figures for occupied space with the -m parameter (e.g. 3 GB), but when I deduplicated the files with -L, only about 100 MB were freed.
How is the "occupied space" shown in the -m statistics calculated?
I've done this small test:
~/git/jdupes$ mkdir tdir
~/git/jdupes$ echo -n 0123456789 > tdir/A
~/git/jdupes$ echo -n 0123456789 > tdir/B
~/git/jdupes$ ln tdir/A tdir/A1
~/git/jdupes$ ln tdir/A tdir/A2
~/git/jdupes$ ln tdir/B tdir/B1
~/git/jdupes$ ln tdir/B tdir/B2
~/git/jdupes$ ln tdir/B tdir/B3
~/git/jdupes$ ls -lin tdir/
total 84
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A1
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A2
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B1
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B2
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B3
~/git/jdupes$ ./jdupes -m tdir/
Scanning: 7 files, 1 dirs (in 1 specified)
4 duplicate files (in 1 sets), occupying 40 bytes
I would have expected -m to show either 10 bytes, if the number means "size that can be freed", or 20 bytes, if it means the total occupied size.
Since only two inodes are in use, with 10 bytes each, the reported 40 bytes seems too high.
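To make the expected 10 and 20 bytes concrete, here is a minimal Python sketch of a hard-link-aware count. It is not jdupes code: the function name link_aware_sizes is hypothetical, it uses apparent file size (st_size) rather than allocated blocks, and it reads whole files into memory, which is only reasonable for a small test directory like tdir.

```python
import hashlib
import os
from collections import defaultdict


def link_aware_sizes(directory):
    """Return (occupied, freeable) apparent bytes for regular files in directory.

    Hypothetical helper for illustration; not part of jdupes.
    """
    # Keep one path per inode so hard links to the same data are counted once.
    inodes = {}
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            st = os.lstat(entry.path)
            inodes.setdefault((st.st_dev, st.st_ino), entry.path)

    # Group the distinct inodes by content (size plus SHA-256 digest).
    by_content = defaultdict(list)
    for path in inodes.values():
        with open(path, "rb") as fh:
            data = fh.read()
        key = (len(data), hashlib.sha256(data).hexdigest())
        by_content[key].append(len(data))

    # Occupied: every distinct inode counts once.
    occupied = sum(sum(sizes) for sizes in by_content.values())
    # Freeable: all but one inode in each group of identical content.
    freeable = sum(sum(sizes[1:]) for sizes in by_content.values())
    return occupied, freeable
```

For a directory laid out like tdir (two inodes of 10 bytes each with identical content), this returns 20 occupied bytes and 10 freeable bytes, the two values I would have expected.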
…Thinking about this: Is this an instance of the triangle problem?
If so, could you please still tell me which size -m prints?
(my guess: "size that can be freed")
My understanding is that the summary output is supposed to indicate how much excess space is used. I didn't write that code and it also wasn't updated to make it hard link aware. It should be made hard link aware.
I think this really is an occurrence of the triangle problem. The 4 duplicate files are:
(A A1 A2) --> B
(A A1 A2) --> B1
(A A1 A2) --> B2
(A A1 A2) --> B3
Based on this, the calculated 40 bytes are acceptable. I'm currently pondering the triangle problem and have some ideas – but as you already wrote, it would mean changing quite a lot of code.
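The arithmetic behind the 40 bytes can be sketched in Python as follows. This only reconstructs the counting described above and is not taken from the jdupes source: the function names per_name_summary and per_inode_summary are hypothetical, and the input is simply the file list from the ls output, with A as the reference file.

```python
def per_name_summary(files):
    """Count every file name that is not a hard link of the reference file."""
    ref_inode = files[0][1]
    dupes = [size for _name, inode, size in files if inode != ref_inode]
    return len(dupes), sum(dupes)


def per_inode_summary(files):
    """Same idea, but count each duplicate inode only once."""
    ref_inode = files[0][1]
    sizes = {inode: size for _name, inode, size in files if inode != ref_inode}
    return len(sizes), sum(sizes.values())


# (name, inode, size) taken from the ls -lin output in the test above.
files = [
    ("A", 1310944, 10), ("A1", 1310944, 10), ("A2", 1310944, 10),
    ("B", 1310945, 10), ("B1", 1310945, 10), ("B2", 1310945, 10),
    ("B3", 1310945, 10),
]

print(per_name_summary(files))   # (4, 40)
print(per_inode_summary(files))  # (1, 10)
```

Counting per name reproduces "4 duplicate files ... occupying 40 bytes", while counting per inode gives the 10 bytes that could actually be freed.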