mergerfs-tools

RFC/WIP: moving hard links along with file

Open · kinghrothgar opened this issue 5 years ago • 5 comments

Explanation

I wanted to open this PR to see whether you would be willing to merge one that adds support for moving the hard links of a file along with the file itself, behind an option. If you are interested, I will work this PR into a fully featured one. I heavily use hard links in the various file sorting and organization systems I have, so without this modification I can't use mergerfs.balance: it duplicates the files and breaks the links.

Proposed solution:

If you enable the links option, it uses GNU find to locate hard-linked files via the -samefile test. You have already enabled rsync's hard-link handling with the -H option, so all I had to do was pass the other hard-linked paths along with the "original" as src files in the rsync command, and rsync handles the rest.
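
Something like the following sketch of the mechanism (illustrative only; the helper names are placeholders, not the tool's actual functions):

```python
import subprocess

def find_hardlinks(srcmount, filepath):
    # GNU find's -samefile test prints every path on the mount that
    # shares filepath's inode, including filepath itself.
    out = subprocess.check_output(['find', srcmount, '-samefile', filepath])
    return out.decode('utf-8', 'surrogateescape').splitlines()

def move_with_links(srcmount, tgtmount, filepath):
    links = find_hardlinks(srcmount, filepath)
    # rsync's "/./" marker makes -R (--relative) reproduce only the
    # mount-relative part of each path on the destination; -H recreates
    # hard links among the source arguments.
    srcs = [srcmount.rstrip('/') + '/./' + p[len(srcmount):].lstrip('/')
            for p in links]
    subprocess.check_call(['rsync', '-aHR', '--remove-source-files'] +
                          srcs + [tgtmount])
```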

Questions/Complications:

  1. Should the found hard links be subject to the various include and exclude options? I personally don't want the links filtered, but I could see a user expecting or wanting this. As such, maybe this should be configurable via an option?
  2. If we do filter the hard links, what do we do when a found hard link is excluded? Simply excluding it would result in duplicating the file, which doesn't seem like the behavior a user who has already told the program to preserve hard links would want. The two options I see are to either exit with a message or modify find_a_file() to support skipping seen files (a sketch of the skip approach follows this list).
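
For the second option, the skip logic could look roughly like this (a hypothetical sketch; find_a_file() in mergerfs.balance would need the equivalent change):

```python
import os

def iter_unseen(paths, seen):
    # Yield only paths whose inode hasn't been handled yet, so a file
    # already moved via one of its other links isn't picked up again.
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            yield path
```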

kinghrothgar avatar Jun 05 '19 23:06 kinghrothgar

The problem is that that solution is really expensive, since you'll end up scanning the filesystem N times rather than once. Preferably you'd scan the filesystem once at the beginning and then add the found links to the source list.
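
For example, something along these lines (a rough sketch, not the tool's code):

```python
import os
from collections import defaultdict

def build_link_map(srcmount):
    # One pass over the mount: map (st_dev, st_ino) -> list of paths,
    # keeping only inodes that actually have more than one link.
    links = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(srcmount):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_nlink > 1:
                links[(st.st_dev, st.st_ino)].append(path)
    return links
```

Finding a file's other names then becomes a dictionary lookup on its (st_dev, st_ino) rather than another full scan.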

trapexit avatar Jun 06 '19 00:06 trapexit

Are you suggesting I load the full filesystem file list, with inode info, for each srcmount at program start? The downside of that is the race condition it presents: a run of the program can take many hours, so there is a long time for the FS to change and diverge from the loaded state. This will lead to one of two things:

  1. The program will exit because of a failed rsync caused by a link no longer existing
  2. A link will be missed and the file will be duplicated

I can do it that way if that is what you prefer.

kinghrothgar avatar Jun 06 '19 00:06 kinghrothgar

There will always be a race condition. It's a matter of degree, not kind. Your current method is no different. Scanning the drive takes time: the first file it sees could be a link of the file to move, and so could the last file seen. If you had a ton of files, it could take an hour to scan the drive, and the first file could be removed before you even find the last one.

If you do it as you do now, you still have a race condition and you will ultimately do N * M file scans, where N = number of files on the drive and M = number of files being moved. That's a hell of a lot more expensive than 1 * M (with a million files on the drive and ten thousand files to move, that's 10^10 file stats versus a single pass over the million).

So yes. I'm fine with this feature so long as it's optional (given the overhead) but it shouldn't have an O(N*M) runtime.
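
For instance, gated behind an off-by-default flag (--hardlinks here is just a placeholder name):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag: off by default because the upfront inode scan
# is real overhead on large filesystems.
parser.add_argument('--hardlinks', action='store_true',
                    help='move all hard links of a file together '
                         '(scans each source mount once at startup)')
args = parser.parse_args()
```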

trapexit avatar Jun 06 '19 01:06 trapexit

Alright, I'll load the file list beforehand.

Do you care how I handle either of the two Questions in the original comment?

kinghrothgar avatar Jun 06 '19 02:06 kinghrothgar

If you can see either way being valid, make it optional.

trapexit avatar Jun 06 '19 03:06 trapexit