
Moving breaks hardlinks

Open sanderai opened this issue 1 year ago • 2 comments

On disk1 I have two folders full of files; one is a hardlink copy of the other and takes up no extra space. If I use the plugin to move these folders to another drive, they end up taking twice as much space, because the hardlinks break and both folders occupy actual disk space (I selected them both when doing the move action). Running the jdupes command on those folders in a separate console brings the disk usage back down to half, but that is a lengthy process: jdupes has to calculate the hashes for all files and has no prior knowledge of the move.

Unraid's own mover also takes up double the space during the move but still reinstates the hardlinks after it has finished (without recalculating the hashes, afaik), so no space is lost at the end of the move.

Is there a way to make this plugin also respect hardlinks?

sanderai avatar Sep 09 '24 21:09 sanderai

you can add a custom flag for the underlying rsync command, check the settings page

jbrodriguez avatar Sep 14 '24 02:09 jbrodriguez

I did a test run, and I guess it's harder than just adding a flag, because the plugin runs a separate rsync command for each item. When I selected only the two folders, each containing one file that is hardlinked to the other, the files were copied as separate files, and rsync has no way to link them back afterwards.

One option could be to check the selected files for hardlinks before each run: if any have more than one link, check whether any of the other files selected for transfer share the same inode (ls -i shows it, for example) and, if they do, move those files together in the same rsync command. The problem is that they could have different destination folders on the new drive, and I don't know whether that can be set up as a single rsync command.
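
A rough sketch of that kind of pre-scan (untested, the path is just a placeholder) could be:

# List files that have more than one hardlink, grouped by inode, so linked
# files could be assigned to the same rsync run. /mnt/disk1/share is a placeholder.
find /mnt/disk1/share -type f -links +1 -printf '%i\t%p\n' \
  | sort -n \
  | awk -F'\t' '{ paths[$1] = paths[$1] "\n  " $2 } END { for (i in paths) print "inode " i ":" paths[i] }'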

Another, hackier option I found was described in a level1techs forum post: they do something similar but first move the linked files into one temp folder, move that whole folder in a single rsync command with the -H flag, and afterwards move the files back to their respective positions on the new drive. Also kind of a hassle, though.

I wonder how the original Unraid mover does this - it seems to copy everything over and then add the hardlinks back as the last step of the move (if you somehow stop the mover before it finishes, the hardlinks stay broken).

So there might not be a simple answer to this; it would require quite a bit of extra work, both in setting up the commands pre-transfer and in some post-processing after the transfers.

sanderai avatar Sep 23 '24 23:09 sanderai

The issue with adding -H is that, e.g., I have a share 'tv' with the subfolders 'media' and 'downloads'. unbalanced runs each subfolder separately, so even though -H is passed through, the hardlinks are killed because the two names of the file are moved in separate rsync executions. I previously mentioned this as an issue on the Unraid forums.

The rsync commands would need to be run at the root of a given share on a given disk, instead of the current implementation, which seems to dig down into each subfolder and give it its own rsync process.

undaunt avatar Oct 30 '24 20:10 undaunt

@undaunt it actually runs from the root of the source folder

i figured hardlink support would be needed once i saw an article (or video?) about optimizing unraid performance, i also remember the *arr apps have that optimization suggestion as well

as @sanderai mentioned though, it does seem like additional logic is needed

don't have the bandwidth to take this on atm, if someone has ideas or even better a PR, i will consider merging

having said that, i don't think this is an easy task

jbrodriguez avatar Oct 30 '24 20:10 jbrodriguez

I have run into this problem just now. I wish I had known about it before I started the move/transfer/scatter, so perhaps I could have avoided the issue.

I am moving data off a 6TB drive onto a new 12TB drive, and so far I have transferred 8TB from the 6TB drive! And the transfer is not complete yet.

I have three issues/questions now:

  1. Is my 12TB drive going to be large enough?
  2. How am I going to restore all of my hardlinks and recover the drive space?
  3. How do I avoid this situation in the future?

Thank you.

Snake883 avatar Nov 06 '24 14:11 Snake883

  • Is my 12TB drive going to be large enough?

If each file only existed in two locations, then you might be good. It's possible for a file to have more than two hardlinks, in which case the growth multiplies further, depending on the count, but I doubt you had more than one extra copy.

  • How am I going to restore all of my hardlinks and recover the drive space?

Install the Nerdtools plugin and, from that, install/activate jdupes. Then in your console run something like

jdupes --recurse --link-hard /mnt/user/media/downloads/ /mnt/user/media/movies/

(or whatever libraries/files you have), or the shorthand:

jdupes -rL /mnt/user/media/downloads/ /mnt/user/media/movies/

This command will take a long time because it calculates hashes for all files, finds all matching entries, and relinks them; after it completes, your disk usage should halve again.

  • How do I avoid this situation in the future?

I haven't found a good solution for hardlinked files yet; it's better not to move them manually and to let the Unraid mover handle them (the mover keeps hardlinks).

sanderai avatar Nov 06 '24 15:11 sanderai

Great!...thank you for the path forward!

As a potential "solution"... perhaps jdupes could be integrated into Unbalanced. Or perhaps Unbalance could do a hardlink scan, display a warning, and provide additional information/guidance.

Wish the Unraid mover let me choose the drive to move to.

Snake883 avatar Nov 06 '24 17:11 Snake883

@jbrodriguez I'm not sure what you mean by it running at the root of the source folder. If I have a mount point at, for example, /mnt/disk1/movies, with subfolders /mnt/disk1/movies/downloads and /mnt/disk1/movies/plex, two separate rsync jobs appear to be kicked off, one for each subfolder. This is what breaks the hardlinks and creates extra data usage on the disk.

To @sanderai's point, yes, jdupes, fdupes, rdfind, czkawka, etc. can be used after the fact, but avoiding the break in the first place would speed up processing and avoid temporarily inflating disk usage.

I wound up manually running a custom bash function that wraps rsync to speed this process up from the command line. I called it 'rsafe' for no specific reason. It requires screen to be installed (via Nerdtools, I think, if it's not already included) and takes an input path of "diskx/sharename" and an output path of "disky". It names a screen session after the data being moved, to track things easily later with screen -ls.

E.g., to copy /mnt/disk1/movies to /mnt/disk6, you would run rsafe disk1/movies disk6 or similar. It will ensure a trailing slash exists on the destination, and not on the source, so that things land (in this example) in /mnt/disk6/movies.

rsafe() {
    local source="$1"
    local destination="$2"

    # Store original arguments for messaging
    local original_source="$source"
    local original_destination="$destination"

    # Remove trailing slash from source if it exists
    source="${source%/}"

    # Ensure trailing slash on destination if it's not present
    [[ "$destination" != */ ]] && destination="$destination/"

    # Get absolute paths
    local abs_source
    abs_source="$(readlink -f "$source")"
    local abs_destination
    abs_destination="$(readlink -f "$destination")"

    # Extract disk and folder names from the source and destination paths
    # Adjust field numbers if your path structure is different
    local source_disk
    source_disk="$(echo "$abs_source" | awk -F'/' '{print $3}')"
    local source_folder
    source_folder="$(echo "$abs_source" | awk -F'/' '{print $4}')"
    local destination_disk
    destination_disk="$(echo "$abs_destination" | awk -F'/' '{print $3}')"

    # Create a session name
    local session_name="rsafe_${source_disk}_${source_folder}_to_${destination_disk}"

    # Replace any spaces with underscores in session name
    session_name="${session_name// /_}"

    # Display the session name
    echo "Starting transfer from '$abs_source' to '$abs_destination' in screen session '$session_name'."

    # File to store exit status
    local exit_status_file="/tmp/rsafe_exit_status_${session_name}"

    # Run rsync and deletion inside screen session
    screen -dmS "$session_name" bash -c "
        rsync -aHAXvpP --info=progress2 '$abs_source' '$abs_destination'
        rsync_exit_status=\$?
        if [ \$rsync_exit_status -eq 0 ]; then
            find '$abs_source' -mindepth 1 -delete
            deletion_exit_status=\$?
            exit_status=\$((rsync_exit_status + deletion_exit_status))
            echo \$exit_status > '$exit_status_file'
            if [ \$deletion_exit_status -eq 0 ]; then
                echo 'Transfer from \"$original_source\" to \"$original_destination\" completed successfully. Source files deleted.'
            else
                echo 'Transfer completed, but failed to delete source files.' >&2
            fi
        else
            echo \$rsync_exit_status > '$exit_status_file'
            echo 'rsync from \"$original_source\" to \"$original_destination\" failed. Not deleting source files.' >&2
        fi
    "

    # Attach to the screen session (optional)
    screen -r "$session_name"

    # After detaching, wait for the screen session to finish
    while screen -list | grep -q "$session_name"; do
        sleep 10
    done

    # Read the exit status
    if [ -f "$exit_status_file" ]; then
        exit_status=$(cat "$exit_status_file")
        rm -f "$exit_status_file"
    else
        exit_status=1  # Assume failure if exit status file not found
    fi

    # Final message based on exit status
    if [ "$exit_status" -eq 0 ]; then
        echo "Process completed successfully."
    else
        echo "Process failed or was terminated."
    fi
}

However, for users who just want to run a simple command manually and watch the transfer, run:

rsync -aHAXvpP --info=progress2 /mnt/diskx/sharename /mnt/disky/

undaunt avatar Nov 06 '24 17:11 undaunt

3. How do I avoid this situation in the future?

not sure how to do that, but thankfully there are "solutions", as shown by @sanderai (nice stuff!)

@jbrodriguez I'm not sure what you mean by it running at the root of the source folder

using your example rsync -aHAXvpP --info=progress2 /mnt/diskx/sharename /mnt/disky/, unbalanced does

cd /mnt/diskx/
rsync -avPR -X "sharename" "/mnt/disky"

don't exactly remember why i implemented it like this, but this was WAYYYYY back, talking about ~2017

it's interesting that mover just works, iirc mover is a shell script

@undaunt's script also just uses an rsync command, i haven't read through the script code, but the fact that it handles hardlinks properly means i could potentially incorporate it into unbalanced, thanks for sharing it!

jbrodriguez avatar Nov 06 '24 17:11 jbrodriguez

it's interesting that mover just works, iirc mover is a shell script

I don't know exactly how the mover does this since I haven't looked at its code, but it's not creating the hardlinks on the fly per file. It's probably storing the hardlink info beforehand (the ls -i command shows that info, for example) and then relinking everything after the mover job completes. If you stop the process before it finishes, all the hardlinks stay broken. This is also evident from the destination drive usage: it grows to 2x the size until the end of the job, when it drops back to the "normal" hardlinked size.

But that last part is very fast, and they definitely don't recalculate anything after the job, just a simple re-apply, since they already know which files were linked to which before the whole move started. jdupes should only be needed if you have truly lost that information and want to recalculate it (with a few extra parameters it can also keep a hash log for faster subsequent runs over the same file tree).
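
If unbalance (or anyone doing this manually) wanted the same behaviour, a pre/post step could look roughly like this - an untested sketch, with SRC/DST as placeholder paths and assuming the copy preserves the relative layout:

# Untested sketch: record hardlink groups before the move, relink afterwards.
SRC=/mnt/disk1/share   # placeholder source
DST=/mnt/disk2/share   # placeholder destination

# 1. Before the move: save "inode <TAB> relative path" for every multiply-linked file
(cd "$SRC" && find . -type f -links +1 -printf '%i\t%P\n' | sort -n) > /tmp/hardlinks.tsv

# 2. ...copy the data over (rsync, unbalance, etc.)...

# 3. After the move: for each inode group, relink every extra path to the first one
awk -F'\t' '{ if ($1 in first) print first[$1] "\t" $2; else first[$1] = $2 }' /tmp/hardlinks.tsv \
  | while IFS=$'\t' read -r keep dupe; do
      ln -f "$DST/$keep" "$DST/$dupe"
    done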

sanderai avatar Nov 06 '24 20:11 sanderai

Why not add -H to rsync by default? It may not always work, but at least it might work for some scenarios and save time recreating the hardlinks.

Any reason not to use -H?

Snake883 avatar Nov 06 '24 20:11 Snake883

The -H flag only works if you transfer the linked files together in one rsync command. If you run separate rsync commands for the two names in their separate locations, the link is broken regardless of the flag. And currently unBalanced seems to create lots of small rsync transfers based on the selected folders and files, rather than transferring everything in one big command. This would need some bigger rewrites, I guess.
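
A quick way to see that behaviour (illustration only, the /tmp paths are throwaway):

# -H keeps the link when both names travel in the same rsync run,
# but not when each name is copied by its own invocation.
mkdir -p /tmp/src/a /tmp/src/b /tmp/one /tmp/two
echo data > /tmp/src/a/file
ln /tmp/src/a/file /tmp/src/b/file

# One command, both folders: link preserved (same inode at the destination)
rsync -aH /tmp/src/a /tmp/src/b /tmp/one/
ls -i /tmp/one/a/file /tmp/one/b/file

# Two commands, one folder each: link broken even with -H
rsync -aH /tmp/src/a /tmp/two/
rsync -aH /tmp/src/b /tmp/two/
ls -i /tmp/two/a/file /tmp/two/b/file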

I unfortunately don't have more time to test this or work towards a PR right now.

sanderai avatar Nov 06 '24 21:11 sanderai

@sanderai The mover enhanced plugin actually moves files and their hardlinks side by side, unlike the native mover, which does all the hardlinks at the end. So, for heavy hardlink workloads, the plugin may add value.

undaunt avatar Nov 07 '24 19:11 undaunt

Out of curiosity, wouldn't it be possible to combine all the small rsync commands into a single rsync command instead of using multiple ones? The approach @jbrodriguez described earlier, using cd /mnt/diskx/ followed by rsync -avPR -X "sharename" "/mnt/disky", won't work for that, but it could simply be changed to rsync -avPRH -X --no-relative "/mnt/diskx/sharename1" "/mnt/diskx/sharename2" "/mnt/diskx/sharename3" "/mnt/disky"

That should at least work for the gather function and preserve hardlinks without issues. For the scatter function I have no idea yet, but this at least doesn't seem too hard to implement. Sadly I am not even remotely familiar with php, typescript, javascript or whatever else this project uses, and I have no idea yet how to compile unbalance and swap the executable on my server to test it, but I guess maybe chatgpt or claude could help with that. Also, I am not sure if there is a specific reason why unbalance runs multiple rsync commands instead of a single one, but I assume there is or was a reason for doing it this way.

Also, as a side note, Unraid's mover and the mover used by the mover tuning plugin are both scripts, but as far as I know they both ultimately invoke "/usr/libexec/unraid/move", which is an executable by Limetech without open source code (at least to my knowledge). So move, not mover, does the hardlink preservation. Same for mover tuning; they just invoke move differently.

EDIT: Ok, I tried to look into this using Claude, but now I am stuck building it. It keeps complaining about a bunch of typescript errors. So @jbrodriguez, could you please provide some commands for how you set up your dev environment, so I can try to build it on my end? That would be helpful :)

Joly0 avatar Dec 14 '24 00:12 Joly0

Also, I am not sure if there is a specific reason why unbalance runs multiple rsync commands instead of a single one, but I assume there is or was a reason for doing it this way.

mostly because users can select multiple separate folders

/mnt/user/films/movie1 /mnt/user/films/movie2

that doesn't mean they want to move the entire /mnt/user/films folder
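
(For what it's worth, rsync can take several relative sources in one run; with -R each selected folder keeps its own relative destination, and -H can then see all the names together. A rough, untested sketch with placeholder names:

cd /mnt/diskx/
rsync -avPRH -X "films/movie1" "films/movie2" "/mnt/disky"

Just a sketch of the idea, not how unbalance currently works.)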

So move, not mover, does the hardlink preservation. Same for mover tuning; they just invoke move differently.

aa that's interesting, yea there might be some custom handling in that executable

So @jbrodriguez, could you please provide some commands for how you set up your dev environment, so I can try to build it on my end? That would be helpful :)

right, i think the easier thing to do would be to clone the repo locally, then

cd ui
npm install

then in the root folder run make release

which builds a linux binary to run on unraid

there are some alternatives for debugging, but i haven't implemented them because i haven't needed them :)

jbrodriguez avatar Dec 14 '24 12:12 jbrodriguez

Not on topic (I haven't had time to test this further and haven't really needed to move stuff manually lately, since I got my mover settings dialed in), but the last part about building and running locally would be good to add to the general README.md ;)

sanderai avatar Dec 20 '24 11:12 sanderai

but the last part about building and running locally would be good to add to the general README.md

you're right, i'll eventually get around to it, in general terms it's about

  • set up node/vite dev env
  • set up go dev env

make release creates the linux executable; rsync it to the server and debug there

i could set up a proxy in the vite config, but haven't felt hard pressed enough to do it

jbrodriguez avatar Dec 22 '24 19:12 jbrodriguez

Idea... integrate with something like jdupes' database. Once the move is complete, calculate the checksum, compare it with jdupes' database, and if there's a match, create a hardlink.

Snake883 avatar Jan 03 '25 22:01 Snake883

Idea 2: This is an idea for those who use Servarr (Radarr/Sonarr/*arr)...

Situation:

  1. The main situation I have is that the "Servarr" folder hardlinks into my "Torrents" folder.
  2. When I move files, because of the hardlinks between Servarr and Torrents, the total storage that gets moved is at least doubled.
  3. The problem is worse, multiplied by the number of cross-seeds that are hardlinked.
  4. I don't have double the storage (or more, multiplied by the cross-seed hardlinks) to do an entire move in a single move plan.

Manual Solution (multiple move plans, since I cannot do everything in a single move):

  1. I need to know how much I can transfer in a single move, based on the target's free space and the folders I select to move. Unbalance does not do this for me (or if it does, please let me know). So I use a file manager to pre-select the folders to move, calculate the storage required, and make sure the target has enough space for the move.

This gets a little tricky, because I'm only describing the Servarr/Torrents folders, which get duplicated when moved. Therefore I try to select the hardlinked movie folders in both Servarr and Torrents, which helps reduce the duplicated moves in a single move plan.

The other way I mitigate running out of storage on the target due to hardlink moves is to move only the movie or TV folder from both the Servarr and Torrents folders.

  2. After the move operation, I run jdupes on the Servarr movie folder and the Torrents movie folder to re-establish the hardlinks and gain 50% of the free storage back.

  3. Then I repeat the move plan, finding and selecting enough data to transfer to the target, while verifying I haven't selected too much (because of duplicated hardlink moves) and won't run out of target storage.

After going through this manual process a few times, some other ideas came to mind that might make this faster (though not simpler); perhaps a script could do this better.

Ideas:

  1. As part of the Unbalance plan, Unbalance could scan for all the hardlinked files and create a list of those hardlinks.
  2. Unbalance would then only need to move ONE file per hardlink group; it wouldn't need to move the other hardlinked names.

This solves three problems: A) breaking the hardlink; B) increasing the storage by the number of times the file was hardlinked; C) increasing the transfer time by the number of times the file was hardlinked.

  3. After the move operation, Unbalance would go back and recreate the hardlinks (from the hardlinks identified in the move plan).
  4. Re-hardlinking would be done after the move, and before the source is deleted.
  5. Unbalance would delete all the source files/links that were hardlinked (especially if cross-seeding was used with slightly different folder/file names). If this is not done, the files would still exist on the source and then become duplicates. This can multiply the problem, because the user probably won't know about it and will try to scatter those files again, so duplicates end up on other drives; the problem scales with the number of hardlinks, the number of drives, and the number of times the user tries to scatter. This is a huge problem that should be avoided.

Thank you.

Snake883 avatar Jan 05 '25 18:01 Snake883

As a workaround, transferring by inode would be beneficial.

For example, I have files in both "torrents" and "servarr" that are hardlinked (screenshot omitted).

But Unbalance organizes the move per folder, so the "torrents" folder is moved first, which breaks the hardlink. Then it copies the "servarr" folder without the hardlink, using 2x-plus the storage.

If Unbalance could move by inode instead, it would move every file/link of a given inode from whichever folders the links exist in. Then I could run jdupes to re-link. It would be great if Unbalance could run jdupes after each inode transfer for relinking.

Snake883 avatar Jan 16 '25 19:01 Snake883

Unbalance could move by inode instead

i don't recollect ever hearing about moving by inode, how can that be done ?

jbrodriguez avatar Jan 16 '25 19:01 jbrodriguez

Unbalance could move by inode instead

i don't recollect ever hearing about moving by inode, how can that be done ?

Perhaps Unbalance could group the file moves/transfers by inode.

For example:

  1. Select the drive.
  2. Select the folders.
  3. Search and group the files of the same inode together.
  4. Transfer the files within the same inode group.
  5. [Bonus] Run jdupes after each inode group transfer to relink/create hardlinks.

Snake883 avatar Jan 16 '25 19:01 Snake883

The other way I thought about doing this is to scan the folder(s), compute hashes, transfer files with the same hash together, and then recreate/relink the hardlinks.

Hashing can take a long time, so it would be ideal to create a hash cache.
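
For what it's worth, recent jdupes releases have a hash database option for exactly this, so repeat runs skip re-hashing unchanged files. The flag below is from memory (an assumption), so double-check it against jdupes --help on your install; the .db path is just a placeholder:

# Reuse a hash database between runs so unchanged files are not re-hashed
# (verify the exact flag name and minimum jdupes version on your system).
jdupes -rL -y /mnt/user/appdata/jdupes-hashes.db /mnt/user/media/downloads/ /mnt/user/media/movies/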

Snake883 avatar Jan 16 '25 19:01 Snake883