
Rmlinting large separated repositories

Open gorn opened this issue 6 years ago • 6 comments

I am sorry if this already has an easy solution; I did not find it. I wonder if it is safe (or even possible) to solve this case, maybe with the --replay option.

There are two machines A and B far far away from each other. The task is to delete files from A which are already on B (and transfer the others). It is very impractical to copy the whole of A to B and run rmlint locally. Is there a way for rmlint to scan the files on machine B, save that knowledge to a file, and then use that file on A to do the job?

gorn avatar Jan 03 '19 22:01 gorn

I'm pretty sure it's implemented already (although a bit clunky) - see #199

SeeSpotRun avatar Jan 04 '19 05:01 SeeSpotRun

There are two machines A and B far far away from each other. The task is to delete files from A which are already on B (and transfer the others). It is very impractical to copy the whole of A to B and run rmlint locally. Is there a way for rmlint to scan the files on machine B, save that knowledge to a file, and then use that file on A to do the job?

I remember we had a similar discussion in the issue @SeeSpotRun linked. To me this sounds like you're asking for something like rsync. Is there anything that speaks against using it?
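For reference, a minimal sketch of that rsync workflow, run on machine A, assuming B is reachable over ssh and both trees share the same relative layout (rsync matches files by path, not by content):

$ rsync -av --remove-source-files /data/ B:/data/   # transfer anything missing to B, then delete each source file once it exists there

Note that content duplicates stored under different paths would not be caught this way.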

sahib avatar Jan 04 '19 10:01 sahib

Yes, obviously... it seems I'm starting to repeat myself :( Thanks

gorn avatar Jan 05 '19 10:01 gorn

Is there an FAQ entry or something that describes this usage?

I guess I need to run rmlint -c json:unique -mk // /mnt/here on machine 1, upload the file to machine 2, and run rmlint -mk /mnt/there // --replay rmlint.json?

Am I right?

ghost avatar Dec 21 '20 01:12 ghost

Is there an FAQ entry or something that describes this usage?

Probably worth adding something to https://rmlint.readthedocs.io/en/latest/tutorial.html.

SeeSpotRun avatar Mar 30 '21 07:03 SeeSpotRun

Actually this doesn't (yet) work, because rmlint --replay checks first that:

  • the files listed in the .json cache still exist and
  • still have the same mtime (illustrated in the sketch below).
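To illustrate the mtime check, a local sketch (paths hypothetical; rmlint writes rmlint.json by default):

$ rmlint /some/dir                        # writes rmlint.sh and rmlint.json
$ touch /some/dir/one_of_the_duplicates   # bump the mtime of a file listed in the cache
$ rmlint --replay rmlint.json /some/dir   # the touched entry fails the mtime check and is not replayed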

If the json cache is created on one machine and the duplicates are removed on a second, rmlint on the second machine can't verify whether the files still exist on the first. We would need some sort of option to skip this check, for example:

$ rmlint --no-verify-cache [--yes-I-am-really-sure] /path/to/local/files // --replay other_PC_cache.json -km

But it would be dangerous. For example, this would potentially generate a script to delete all your files:

$ rmlint /path/to/files --hash-uniques                                        # generates rmlint.json
$ rmlint --no-verify-cache /path/to/local/files // --replay rmlint.json -km   # everything will match!!

I'm not sure I want to go there.

Maybe a better alternative now that NFS supports xattr would be:

On remote machine:
$ rmlint --xattr -T df --hash-uniques /path/to/files      # generates xattr checksums locally on remote machine
On local machine:
$ sudo mount -t nfs remote:/path/to/files /mnt/remote
$ rmlint --xattr -T df --hash-uniques /local/path         # generates xattr checksums on local machine
$ rmlint --xattr -km /local/path // /mnt/remote           # should find xattr hash matches

So then the only NFS network traffic is searching for and stat-ing the remote files, checking their mtimes and reading their xattrs.
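As a sanity check, the stored checksums can be inspected with getfattr (path hypothetical; the exact attribute names depend on which hash rmlint used):

$ getfattr -d -m 'user.rmlint' /mnt/remote/some/file
# file: mnt/remote/some/file
user.rmlint.blake2b.cksum="…"
user.rmlint.blake2b.mtime="…"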

I think that's more aligned with the rmlint philosophy.

SeeSpotRun avatar Mar 30 '21 21:03 SeeSpotRun