
Hardlinks support

Open rezib opened this issue 9 months ago • 4 comments

Dear mpifileutils developers,

This is my proposal to add hardlink support to several mpifileutils commands: dwalk, dcp, dcmp, dsync and dtar.

During a tree walk with details, regular files with more than one link (nlink > 1) are temporarily placed in a separate hardlinks flist. This flist is then globally sorted by path name and ranked to select one reference path per inode; all other paths to that inode are flagged as hardlinks. The sorted hardlinks flist is finally merged into the global flist with all other items. Sorting by path name ensures reproducibility between two similar trees, which keeps the differences seen by dcmp and dsync to a minimum.
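The selection step can be illustrated with a minimal serial sketch. It is not the code from this pull request, which performs the sort and ranking in parallel across MPI ranks. The function name `classify_hardlinks`, the tuple layout, the grouping key of (device, inode) and the choice of the lowest-sorted path as the reference are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_hardlinks(entries):
    """Map each path to its reference path, or None if it is not a hardlink.

    `entries` is an iterable of (path, st_dev, st_ino, st_nlink) tuples for
    regular files. This layout is hypothetical, not the mpifileutils flist.
    """
    groups = defaultdict(list)
    result = {}
    for path, dev, ino, nlink in entries:
        if nlink > 1:
            # Candidate hardlink: set aside, grouped by inode identity.
            groups[(dev, ino)].append(path)
        else:
            result[path] = None

    for paths in groups.values():
        # Sorting by name makes the choice independent of walk order.
        paths.sort()
        reference = paths[0]  # assumption: lowest-sorted path is the reference
        result[reference] = None
        for other in paths[1:]:
            result[other] = reference
    return result
```

Because the reference is picked from the sorted names rather than from the order in which the walk happened to visit the files, two trees with the same layout end up with the same reference paths.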

[!NOTE] You may find more implementation details in the respective commit messages.

The pull request introduces a cache format v5, which encodes each file's nlink and the reference paths of hardlinks.

This pull request also includes a functional test suite that relies on Python's standard unittest library. This suite is designed to be easy to execute:

  • Set two environment variables that define, respectively, the path to the mpifileutils binaries and the arguments passed to mpirun, e.g.:
$ export MFU_BIN=~/dev/bin
$ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"
  • And run all the tests:
$ python3 -m unittest discover -v test
  • Or:
$ pytest  # requires pytest
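As a rough sketch of how a test helper could turn these two variables into an mpirun command line: the helper name `build_command`, the empty default for `MFU_MPIRUN_ARGS` and the exact command layout are assumptions, and the helpers in the actual test suite may look different.

```python
import os
import shlex


def build_command(tool, tool_args, environ=None):
    """Build the argv list used to launch one mpifileutils tool under mpirun.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    environ = os.environ if environ is None else environ
    bin_dir = os.path.expanduser(environ["MFU_BIN"])
    # Split the argument string the way a POSIX shell would.
    mpirun_args = shlex.split(environ.get("MFU_MPIRUN_ARGS", ""))
    return ["mpirun", *mpirun_args, os.path.join(bin_dir, tool), *tool_args]
```

The resulting list could then be passed to `subprocess.run` from within a `unittest.TestCase` method.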

It is also designed to be easy to integrate into continuous integration systems. The pull request even provides a GitHub Actions workflow that executes this test suite on every pull request and on every merge into the main branch (example run).

For the record, this test suite has already helped detect and fix the following bugs:

  • ~https://github.com/hpc/mpifileutils/pull/625~ (merged)
  • ~https://github.com/hpc/mpifileutils/issues/628~ → ~https://github.com/hpc/mpifileutils/pull/629~ (merged)

Please let me know what you think! I can also remove the tests and GitHub actions workflow if you don't like the technical approach.

[!IMPORTANT] Note this feature does not work properly without this fix for a bug in DTCMP: https://github.com/LLNL/dtcmp/pull/20

[!IMPORTANT] There is one limitation with dcp/dsync --dereference when a symlink points to a path with more than one link. In this specific case, mpifileutils considers the symlink as one more path to the same inode and creates one more hardlink to this inode in the destination directory. For reference, this case is covered by the test test_dsync_symlink_dereference_target_nlinks.
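The following sketch shows, at the stat level only, why a dereferenced symlink is hard to tell apart from an extra hardlink. It uses plain Python and a temporary directory, not mpifileutils code; the function names and file names are made up for illustration.

```python
import os
import tempfile


def describe(path, dereference):
    """Return (inode, nlink) for path, following symlinks if requested."""
    st = os.stat(path) if dereference else os.lstat(path)
    return st.st_ino, st.st_nlink


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "file")
        hardlink = os.path.join(tmp, "hardlink")
        symlink = os.path.join(tmp, "symlink")
        with open(target, "w") as fh:
            fh.write("data")
        os.link(target, hardlink)    # second name for the same inode
        os.symlink(target, symlink)  # symlink pointing at that inode
        return {
            "target": describe(target, dereference=True),
            "symlink_followed": describe(symlink, dereference=True),
            "symlink_itself": describe(symlink, dereference=False),
        }
```

When the symlink is followed, it reports the same inode and the same nlink > 1 as the target, so a walk that dereferences symlinks sees three paths for an inode that has only two real links.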

[!NOTE] I would like to emphasize that this work is sponsored by @cea-hpc.

fix #417 #336

rezib, Mar 21 '25 15:03