fdupes: option to replace duplicates with hard links

Open sandrotosi opened this issue 8 years ago • 34 comments

From @sandrotosi on December 20, 2015 14:4

From matrixhasu on October 08, 2009 22:13:05

Debian bug #284274 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=284274 From: Rupert Levene [email protected]

It would be nice to have the option of telling fdupes to replace duplicate files with hard links. This would be a more symmetric behaviour than using symlinks.


From: Javier Fernández-Sanguino Peña [email protected]

Attached is a patch to the program sources (through the use of a dpatch patch in the Debian package) that adds a new -L / --linkhard option to fdupes. This option will replace all duplicate files with hardlinks which is useful in order to reduce space.

It has been tested only slightly, but the code looks (to me) about right.

Attachment: 284274_fdupes_hardlink_repace.diff

Original issue: http://code.google.com/p/fdupes/issues/detail?id=8

Copied from original issue: sandrotosi/fdupes-issues#5

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From [email protected] on July 28, 2010 03:42:27

There is a typo in the help output in this patch: after the --debug line, "each set of duplicates without prompting the user" is spuriously output again, repeated from the --linkhard output.

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From matrixhasu on August 01, 2010 06:34:25

I can see this (it's on the output of 'fdupes --help'); I'd just remove

  • printf(" \teach set of duplicates without prompting the user\n");

after --debug.

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From [email protected] on August 26, 2010 10:23:00

The manpage has a typo: the advertised option isn't "--hardlink", but "--linkhard".

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From [email protected] on December 03, 2010 22:06:27

I ran fdupes with this option. Unfortunately, I forgot that I was on a vfat partition. It looks like this basically deleted all files. Recovery in progress using photorec.

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From [email protected] on October 29, 2011 19:30:22

I have an independently developed patch to do the same thing. It's a bit more careful about creating the links, so it should fail safely on filesystems that don't support links. It also merges sets of previously hardlinked files correctly (see issue #22 ).

The patch includes some code cleanups and optimizations as well, which I split out separately:

0001-Whitespace-cleanups.patch

  • No functional changes, just whitespace changes to improve readability.

0002-Use-strdup-instead-of-malloc-strcpy-pairs.patch

  • Use strdup instead of separate malloc/strcpy calls.
  • Use calloc instead of individually clearing members
  • Avoid multiple passes over strings when concatenating dir and dirinfo->d_name

0003-Cache-stat-2-results.patch

  • Cache results from stat(2) calls for improved performance.
  • Reorganize grokdir() flow to avoid multiple free/continue blocks.

0004-Add-relink-support.patch

  • Rework relink function to start with a temp file in the right directory then rename it to the correct name. This ensures that any likely errors are detected before the file is lost.
  • relinkfiles correctly handles duplicate files across multiple filesystems, linking together those files that reside on the same filesystem.
  • Update sort to:
    • Prefer the file with more hardlinks
    • Fall back to a filename comparison if link counts and mtimes are the same to ensure stable results.
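The temp-file-then-rename flow described for 0004 can be sketched in shell. This is only an illustration of the idea, not the patch's code: `relink_via_temp` and the `.relink.XXXXXX` template are made-up names, and the sketch assumes `mktemp -d`, `ln`, and `mv` behave as in GNU coreutils.

```bash
#!/bin/bash
# Hypothetical helper: replace "$2" with a hard link to "$1" without ever
# leaving "$2" missing. The link is created under a temporary name first,
# then renamed over "$2".
relink_via_temp() {
    local src=$1 dst=$2 tmpdir
    # Already the same inode: nothing to do.
    [ "$src" -ef "$dst" ] && return 0
    # Scratch directory next to dst, so the final rename stays on one filesystem.
    tmpdir=$(mktemp -d "$(dirname -- "$dst")/.relink.XXXXXX") || return 1
    # If linking fails (cross-device, no hard link support), dst is untouched.
    if ! ln -- "$src" "$tmpdir/link"; then
        rmdir -- "$tmpdir"
        return 1
    fi
    # Rename the new link over dst; on failure, clean up and leave dst alone.
    if ! mv -f -- "$tmpdir/link" "$dst"; then
        rm -f -- "$tmpdir/link"
        rmdir -- "$tmpdir"
        return 1
    fi
    rmdir -- "$tmpdir"
}
```

Called as `relink_via_temp keep.txt dupe.txt`, it either leaves `dupe.txt` as a link to `keep.txt` or leaves it exactly as it was.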

Relink mode automatically enables --hardlinks, since it doesn't make much sense to ignore existing hardlinked files while creating new ones. It also skips empty files, because hard links save nothing in this case. If the user really wants to link the empty files, we provide a --relinkempty mode that will do so.

Attachment: 0001-Whitespace-cleanups.patch 0002-Use-strdup-instead-of-malloc-strcpy-pairs.patch 0003-Cache-stat-2-results.patch 0004-Add-relink-support.patch

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

From sandro.tosi on June 17, 2012 12:27:19

The patch as presented in the original post is buggy when some of the files are on two different filesystems, as reported at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=677419 :

By mistake, I have been using fdupes -rL on what I thought were directories, but one was a symbolic link to a directory on another file system. It found duplicates across file systems, ... and removed one end. For the little story, the end where it removed files was /lib, and I was deduplicating chroots, so it found plenty of duplicates... try running anything without ld.so...

Anyways, this can easily be reproduced. Assuming /mnt/a and /mnt/b are two different filesystems:

$ echo a > /mnt/a/a
$ echo a > /mnt/a/b
$ echo a > /mnt/b/a
$ fdupes -rL /mnt/a /mnt/b
[+] /mnt/a/a
[h] /mnt/a/b
-- unable to create a hardlink for the file: Invalid cross-device link
[!] /mnt/a/a
$ ls /mnt/b
$ <<<
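One way to avoid this failure is to compare device numbers before touching a duplicate. The following is a hedged sketch, not code from any of the patches: `same_filesystem` is a hypothetical helper, and it assumes GNU `stat`, where `-c %d` prints the device number (BSD `stat` spells this differently). A matching device number is necessary but may not be sufficient for `link` to succeed, so the result of the link call still has to be checked.

```bash
#!/bin/bash
# Hypothetical guard: succeed only when both paths report the same device.
# Returns 2 if either path cannot be examined, 1 if the devices differ.
same_filesystem() {
    local dev_a dev_b
    dev_a=$(stat -c %d -- "$1") || return 2
    dev_b=$(stat -c %d -- "$2") || return 2
    [ "$dev_a" = "$dev_b" ]
}
```

In a loop over duplicate pairs, a caller might run `same_filesystem "$keep" "$dupe" || continue` and skip any pair that fails the check, leaving both files in place.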

sandrotosi avatar Dec 20 '15 14:12 sandrotosi

+1 for this feature

xambroz avatar May 27 '16 15:05 xambroz

I think the user should be able to choose between hardlinks and symlinks. Hardlinks make more sense most of the time, but sometimes you might want to use symlinks just to make it more obvious that there is a link.

Harvie avatar Feb 09 '17 22:02 Harvie

Is there a reason this is not merged? There are many versions of fdupes out there, and most of them have this functionality, so it is weird that this one doesn't have it.

Furthermore, the incremental flag for the deleting could also apply for hardlinking. Then there would also be no problem with ignoring files linked to the same inode: While traversing, keep track of which inodes you linked to other inodes, and whenever you encounter that same old inode, you link it to the new inode. You can drop the old inode from cache as soon as its refcount is 0.

wmertens avatar Apr 10 '17 13:04 wmertens

I agree with @wmertens, this feature should be merged, I'm tired of doing deduplication with fslint-gui...

aviogit avatar Apr 15 '17 19:04 aviogit

Thanks, I'll give it a try.

aviogit avatar Apr 15 '17 19:04 aviogit

The reason it isn't merged is the question of what to do should the link operation fail. The program would be breaking the promise that files will be linked instead of deleted. Indeed, some files cannot be hardlinked at all, such as when they lie on different volumes.

adrianlopezroche avatar Apr 15 '17 20:04 adrianlopezroche

On Linux one can use stat (http://stackoverflow.com/a/8483435/1396334) to look at the inode number and the number of hard links for each file (before and after the hard link operation). As @jbruchon did, the file should only be renamed at first (like rsync does, for example) and if the hardlink fails, it should be safely renamed back to the original name. I know this will require a lot of testing, but it sounds feasible.
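The stat-based inspection might look roughly like the sketch below. Both function names are hypothetical, and the `-c '%i %h'` format (inode number, hard link count) assumes GNU `stat`. It is meant as an optional extra check after linking, not a replacement for checking the return value of the link call.

```bash
#!/bin/bash
# Print "inode linkcount" for a path (GNU stat format specifiers assumed).
inode_and_links() {
    stat -c '%i %h' -- "$1"
}

# Hypothetical post-check: both names refer to the same file and that file
# has at least two links.
verify_linked() {
    local links
    [ "$1" -ef "$2" ] || return 1
    links=$(stat -c %h -- "$1") || return 1
    [ "$links" -ge 2 ]
}
```

Running `inode_and_links` on a file before and after the operation shows whether the inode changed and whether the link count went up.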

aviogit avatar Apr 15 '17 20:04 aviogit

First link a temp name, and if that succeeds, remove the file and rename the new link. If it fails, keep and use the file as the new inode for linking identical files.

On Sat, Apr 15, 2017, 10:41 PM Jody Bruchon [email protected] wrote:

It should be sufficient to receive a return value indicating success from the link call. Checking the inode count would be redundant.

wmertens avatar Apr 16 '17 18:04 wmertens

There's another problem: making sure not to overwrite an existing file when the dupe is renamed. Easy with renameat2 on Linux, but I can't find a portable call that's guaranteed not to overwrite an existing file.
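One candidate worth noting: link(2) itself refuses to replace an existing name (it fails with EEXIST), so "link to the new name, then remove the old name" behaves like a no-overwrite rename wherever hard links are supported. The sketch below shows that idea in shell; `move_aside_noclobber` is a hypothetical name, and `ln -T` (treat the destination as a plain name, never as a directory to link into) is a GNU option. It does not help on filesystems without hard links.

```bash
#!/bin/bash
# Hypothetical no-clobber "move aside": ln without -f fails if the new name
# already exists, so an existing file is never overwritten.
move_aside_noclobber() {
    local path=$1 aside=$2
    # Give the file a second name; this fails if "$aside" is already taken.
    ln -T -- "$path" "$aside" || return 1
    # Only now drop the original name; the data stays reachable via "$aside".
    rm -- "$path"
}
```

If the first step fails, nothing has changed; if the second fails, the file simply has two names.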

On Sat, Apr 15, 2017, 4:34 PM Jody Bruchon [email protected] wrote:

That is a serious problem I ran into with that patch. I decided to rename the original file and on any sort of failure it would be renamed back, being deleted only once the hard link operation succeeds. Unfortunately, that also led to a lot of extra error checking code being added. It's not pretty.

adrianlopezroche avatar Apr 17 '17 19:04 adrianlopezroche

Err - the idea is to overwrite the existing file with the new link, no?

wmertens avatar Apr 17 '17 20:04 wmertens

That's the last step. First you move the dupe aside by renaming it. The choice of new name is critical.

adrianlopezroche avatar Apr 17 '17 21:04 adrianlopezroche

You can create the link anywhere on the same filesystem, for example in a temporary directory with no other files…

wmertens avatar Apr 17 '17 22:04 wmertens

What's wrong with ln -f src dest? If linking fails, dest is not touched. For the benefit of people wanting a quick solution, here's a bash script:

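#!/bin/bash
# Reads fdupes output on stdin: one path per line, with a blank line
# between sets of duplicates.

previous=""
# IFS= keeps leading and trailing whitespace in file names intact.
while IFS= read -r line; do
    # Link only when both lines are non-empty, i.e. within the same set.
    if [[ -n "$previous" && -n "$line" ]]; then
        echo "$line => $previous"
        # Replace "$line" with a hard link to "$previous".
        ln -f -- "$previous" "$line"
    fi
    previous=$line
done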

You can then use it with fdupes <args> <path> | pasted-script.sh

hsoft avatar Jul 06 '18 16:07 hsoft

@hsoft if you're going to do that, you can also just implement the whole thing in bash with md5sum, sort etc.

wmertens avatar Jul 06 '18 17:07 wmertens

@wmertens I don't understand what you mean. You mean not using fdupes? That would be needless repetition of work. It already takes care of inode checking and the size -> md5 -> byte-to-byte escalation. Why would I want to re-implement that?
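For readers unfamiliar with that escalation, a pairwise version can be sketched as follows. This is an illustration of the idea only, not how fdupes is implemented: `files_identical` is a hypothetical name, and `md5sum` is assumed to be available (it is part of GNU coreutils).

```bash
#!/bin/bash
# Hypothetical pairwise check: cheapest test first, most expensive last.
files_identical() {
    local a=$1 b=$2
    # 1. Files of different sizes can never be duplicates.
    [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ] || return 1
    # 2. Different MD5 digests rule the pair out.
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || return 1
    # 3. Matching digests are confirmed with a byte-for-byte comparison.
    cmp -s -- "$a" "$b"
}
```

Each step only runs if the previous one could not tell the files apart, which is why most non-duplicates are rejected cheaply.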

hsoft avatar Jul 06 '18 17:07 hsoft

Well, why do you want to re-implement this PR? :)

wmertens avatar Jul 06 '18 20:07 wmertens

Because it's not merged and it's quicker to use a bash script than to apply patches and recompile fdupes? Don't take it personally, I'm just sharing a solution to people's problems.

I stumbled on this issue by looking for a way to accomplish this. This was the first hit. I provide an immediate and quick solution.

hsoft avatar Jul 06 '18 21:07 hsoft

Sorry for coming across as salty. I appreciate your efforts, but it seems like a bit of a hack.

Of course, if this works 100% of the time in all edge cases, maybe it would be better to wrap fdupes in a script that does this. Less code is better code.

wmertens avatar Jul 06 '18 21:07 wmertens

I don't think this is so hard to implement in C. I know you can handle this in bash, but then you lose the power of community-reviewed and tested code. You might lose data if you reimplement it every time you need such functionality.

Is there a reason for not having this in fdupes, assuming somebody comes up with a pull request?

Harvie avatar Jul 08 '18 13:07 Harvie

thanks @hsoft it worked flawlessly in my case.

etique57 avatar Aug 05 '18 08:08 etique57

Bump; there's a 3+ year old fork that implements this, why is it not merged?

gsakkis avatar Aug 03 '20 13:08 gsakkis

@gsakkis I can't answer for @adrianlopezroche directly, but I can say that merging that hard link code is a bad decision. Just to demonstrate why, this is all the stuff my fork has to do to safely replace with links. The code in that fork you linked simply deletes the files and makes hard links to them, but in the event that anything goes wrong with hard linking, you've just deleted your files and that's the end of it. While the patch would be safe in success cases and in the case of a delete failure, it is not safe in the event of a link failure.

I've had incidents with the code I linked where my paranoia has resulted in not losing files. A good example is if you mistakenly try to hard link on a FAT filesystem which can't take hard links. Oops, your data is gone and there is no way to get it back!

The proper way to do it is to move the target to a temporary file name, do the link, then delete the temporary file name. Any failure = undo each operation performed until you are back at the original state. The worst-case there is that the file gets renamed but can't be renamed back to its original name, but I used a funky extension that's easily lopped off with a shell script or even manually renamed. At least you don't lose the duplicate file paths in that case.
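The three steps and their rollback can be sketched in shell as below. This is a simplified illustration, not the code from the fork: `relink_with_rollback` and the `.relink-tmp` suffix are made up here, and the existence check before the rename is not race-free.

```bash
#!/bin/bash
# Hypothetical helper: move the duplicate aside, link, then delete the
# set-aside copy; on failure, undo what has been done so far.
relink_with_rollback() {
    local src=$1 dup=$2 aside
    aside="$dup.relink-tmp"
    # Refuse to proceed if the temporary name is already in use
    # (another process could still create it after this check).
    if [ -e "$aside" ] || [ -L "$aside" ]; then
        return 1
    fi
    # Step 1: move the duplicate to the temporary name.
    mv -- "$dup" "$aside" || return 1
    # Step 2: create the hard link under the original name.
    if ! ln -- "$src" "$dup"; then
        # Undo step 1 so the duplicate is back under its original name.
        mv -- "$aside" "$dup"
        return 1
    fi
    # Step 3: the link exists, so the set-aside copy can be deleted.
    rm -- "$aside"
}
```

In the worst case described above, where the rename back fails, the data is still present under the `.relink-tmp` name.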

jbruchon avatar Aug 03 '20 14:08 jbruchon

@jbruchon thanks for the warning; I just found a link to the fork in another issue and didn't pay close attention to it, let alone review the code. I wasn't aware of jdupes; I'll definitely try it out :+1:

gsakkis avatar Aug 03 '20 15:08 gsakkis

@gsakkis I appreciate it, but it would be helpful if you could do something to help with getting this feature in fdupes, especially since 2.0 brings an ncurses interface that is not found with most other duplicate scanners.

jbruchon avatar Aug 03 '20 15:08 jbruchon