
Compare "source" files/dirs against "destination" ones and only act on the source

nodecentral opened this issue 5 years ago • 41 comments

Hi,

I have just moved every photo I could find from various devices into one folder, /dump/photos2sort/, and I now want to compare that folder and all its contents against my main multimedia folder, /Multimedia/Pictures, and delete everything in the dump folder that already exists there.

Is this the way to run a report first...

jdupes -ASr /dump/photos2sort /Multimedia/Pictures > /share/Public/jdupes_photos2sort_already_exists.txt

and then to delete...

jdupes -ASrdN /dump/photos2sort /Multimedia/Pictures > /share/Public/jdupes_photos2sort_deleted.txt

nodecentral avatar Jan 10 '20 20:01 nodecentral

Please excuse all the edits; I’ve been trying to work out how best to frame this. I’ve looked at the various command options and also read through similar issues and enhancement requests that have been reported, but I can’t find a match.

As I have a house full of devices, with everyone sharing things around, I want to have a place (share/folder) where everyone can go to dump/back up their data, and then periodically I want to check that share/folder structure to see what already exists in the other shares/folders.

If anything already exists, I then want the option to report it or delete it.

nodecentral avatar Jan 11 '20 21:01 nodecentral

I'm thinking that what we need is a generic "the next parameter is read-only" option. Combined with the isolation option, it would allow for what you're requesting.

jbruchon avatar Jan 11 '20 22:01 jbruchon

Hi @jbruchon , thanks so much for responding...

Just to be clear: deletions should only occur against the source file/folder structure (the dump files/folders), not the destination (the rest of the NAS). I’ve modified the title to reflect that.

The desired process is to check whether anything being uploaded/dumped on the NAS already exists, and if it does, jdupes should report it, delete it, or maybe move it somewhere.

Just to confirm your response: are you suggesting this is possible to do today, or would this capability be an enhancement request?

If it’s possible today, great! Could you help me out with what the exact command line would look like?

nodecentral avatar Jan 11 '20 23:01 nodecentral

If you use -I (don't allow intra-parameter matching) and -O (always sort using parameter order) and put the folder that shouldn't be modified earlier in the command line, it should do what you want:

jdupes -rIO folder_of_stuff folder_with_things_to_be_deleted

jbruchon avatar Jan 12 '20 11:01 jbruchon

Huge thanks @jbruchon ,

/share/CACHEDEV1_DATA/Multimedia/Pictures = all the files and folders to keep
/share/Public/2sort/ = all the files/folders to be checked for presence in the above

I tried the following:

jdupes -ASrIO /share/CACHEDEV1_DATA/Multimedia/Pictures /share/Public/2sort/ > /share/Public/jdupes_2sort_comparrison.txt

Running it from the command line...

/] # jdupes -ASrIO /share/CACHEDEV1_DATA/Multimedia/Pictures /share/Public/2sort/ > /share/Public/jdupes_2sort_comparrison_v2.txt
Scanning: 65343 files, 1025 items (in 2 specified)

But it only found 7 duplicates, yet from other scans there should be many more.

nodecentral avatar Jan 12 '20 18:01 nodecentral

There are problems with the isolation option that are not easily fixed. I took a pull request that claimed to fix it, but if you're missing items, that pull doesn't seem to have done so. I won't be able to fix that quickly.

You can write a simple shell script to do what you want using the program's output, or in the case of "copy" files, just grep and pipe to a while-do rm-done loop.
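For the " - Copy" case, a minimal sketch might look like the following. The report file and its contents are fabricated for illustration, and the rm is guarded with echo so nothing is actually deleted:

```shell
# Fabricated jdupes-style report, one path per line (illustration only).
report=$(mktemp)
printf '%s\n' \
  '/photos/trip.jpg' \
  '/photos/trip - Copy.jpg' > "$report"

# grep -F matches the literal string; IFS= and read -r keep paths intact.
grep -F ' - Copy' "$report" | while IFS= read -r f; do
  echo rm -- "$f"          # drop the echo to actually delete
done
rm -f "$report"
```

With real jdupes output you would feed the saved report through the same grep/while pipeline, after checking that the match string cannot appear in files you want to keep.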

jbruchon avatar Jan 12 '20 21:01 jbruchon

Given the number of duplicates likely involved, I’m wondering if the ‘isolation option’ used here is only returning unique duplicates (by that I mean cases where only one instance of the duplicate exists in both locations). I may be way off, so I’ll continue testing; just thinking out loud :-)

nodecentral avatar Jan 13 '20 21:01 nodecentral

Isolation is supposed to prevent any matching between items within the same command-line parameter. If you type jdupes -Ir 1 2 3 then any duplicate pairs within 1 or 2 or 3 exclusively will not show up, but a match pair or set that spans 1/2, 1/3, 2/3, or 1/2/3 will show up. Yes, a side effect of this is only one item in each parameter showing in each match set. Look in the documentation and read about the "triangle problem" for more info.

jbruchon avatar Jan 13 '20 21:01 jbruchon

Hi @jbruchon

Would a workaround for this be to create a symlink to the ‘source’ at the lowest possible location/directory within the destination, e.g.

/Multimedia/Pictures/XXXXXX/XXXXX/XXXX/XXX/XX/X/<symlink to /dump/photos2sort/>

And then run ...

jdupes -ASr /Multimedia/Pictures > /share/Public/jdupes_photos2sort_already_exists.txt

It looks like the default sort order used by jdupes is alphabetical, based on the characters in the full path to the file (not the file name itself, unless the files are in the same location). Is that correct?

nodecentral avatar Jan 14 '20 10:01 nodecentral

That is correct. The sort is alphabetical across the full path. I think that the best way to handle this right now is to write a shell script which consumes the output and takes the desired actions. You'll need an outer while loop to handle match set changes (sets are separated by an empty line) and an inner while loop that checks the file/path name against your desired criteria and takes action as desired. There is an example I've written somewhere but I don't have it handy. It's pretty simple if you have some experience with shell scripting.
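A sketch of such a script (my own illustration, not the example mentioned above): it reads a jdupes report on stdin, treats blank lines as match-set separators, keeps the first path in each set, and prints the rest as deletion candidates:

```shell
# print_extras: read jdupes-style output on stdin; for each
# blank-line-separated match set, keep the first path and print the rest.
print_extras() {
  first_seen=0
  while IFS= read -r line; do
    if [ -z "$line" ]; then
      first_seen=0              # blank line: a new match set starts next
    elif [ "$first_seen" -eq 0 ]; then
      first_seen=1              # first file in the set: keep it
    else
      printf '%s\n' "$line"     # extra copy: candidate for deletion
    fi
  done
}

# Demo with a fabricated two-set report:
printf '%s\n' /keep/a.jpg /dump/a.jpg '' /keep/b.jpg /dump/b.jpg | print_extras
```

The keep-first rule is only a placeholder; the inner branch is where you would test each path against your own criteria (for example a case "$line" in /dump/photos2sort/*) pattern) before printing or deleting it.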

jbruchon avatar Jan 14 '20 17:01 jbruchon

Thanks @jbruchon

Sadly I have no knowledge of or experience with shell scripting, so I’m currently taking my duplicate clean-up very slowly (over 200GB recovered so far).

One strange observation about the alphabetical sorting is that it’s not perfect: a directory starting with the word xmas and one starting with a dash ‘-‘ have come out lower than my string of x’s...

/Multimedia/Pictures/Events/
/Multimedia/Pictures/XXXXXXX/
/Multimedia/Pictures/Xmas time/
/Multimedia/Pictures/- name -/

I’m going to have to assume a ‘-‘ (dash) comes later in the sort order, but I’m not sure how/why ‘Xm’ is placed after ‘XX’.

nodecentral avatar Jan 14 '20 19:01 nodecentral

The easiest alpha sort is a case-sensitive dumb one. That is an extremely simple matter of checking each character pair's ASCII values and sorting based on that mathematical comparison. jdupes uses this method with two big exceptions: there is extra code to detect numbers and sort them numerically correctly (otherwise 2 would be before 01, for example) and code to make some special characters sort later so that "xyz - Copy" or "xyz (1)" come after "xyz" does, thereby allowing easy automated deletion of accidental drag-and-drop copies. I have not bothered complicating the sort further (i.e. case-insensitivity) as no one has cared.
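The numeric-run behaviour can be approximated with GNU coreutils' version sort. This is only an illustration of the idea, not jdupes' actual code, and it does not reproduce the special-character rules:

```shell
# Dumb byte-wise (ASCII) sort: "img2" lands after "img10"
printf '%s\n' img2.jpg img01.jpg img10.jpg | LC_ALL=C sort
# img01.jpg
# img10.jpg
# img2.jpg

# Version sort (-V): digit runs are compared as numbers
printf '%s\n' img2.jpg img01.jpg img10.jpg | sort -V
# img01.jpg
# img2.jpg
# img10.jpg
```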

jbruchon avatar Jan 14 '20 19:01 jbruchon

That’s great, and anything ‘dumb’ works best for me :-) But I assumed it would look at the first letter and, if it’s a match, move on to the next?

If it’s ASCII, then that alters my expectation of normal alphabetical sorting, as ‘X’ = 88 and ‘m’ = 109; but a ‘-‘ (dash) wouldn’t come last, as that = 45, unless it’s 196?

Ok, I’ll do a quick bit of testing..

nodecentral avatar Jan 15 '20 08:01 nodecentral

Ok, yes, that looks to be it then. The sort order (using ASCII) is different from how I assumed it would work, and not something I would have considered alphabetical, as it results in lowercase letters appearing much lower in the list than their uppercase versions etc.


/share/CACHEDEV1_DATA/Web/jdupes_test/XXXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XXXx/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XXxX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XZXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XxXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/Xzxx/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/ZZZZ/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/abcd/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xXXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxZZ/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxxx/vera.php

While it may not be an important feature, it’s certainly a good thing to be aware of.

nodecentral avatar Jan 15 '20 09:01 nodecentral

I'm thinking that what we need is a generic "the next parameter is read-only" option. Combined with the isolation option, it would allow for what you're requesting.

Has a switch like this been implemented? I'm wondering since I don't quite understand the discussion here.

pxssy avatar Oct 07 '20 16:10 pxssy

Isolation is supposed to prevent any matching between items within the same command-line parameter. If you type jdupes -Ir 1 2 3 then any duplicate pairs within 1 or 2 or 3 exclusively will not show up, but a match pair or set that spans 1/2, 1/3, 2/3, or 1/2/3 will show up. Yes, a side effect of this is only one item in each parameter showing in each match set. Look in the documentation and read about the "triangle problem" for more info.

Am I the only one who thinks that using the word isolate is a bit misleading? I would expect it to isolate each command-line parameter such that matching can only happen between files in the same parameter.

upward4 avatar Mar 30 '21 13:03 upward4

What about something like

jdupes -rf [smaller folder] | while read a ; do rm "$a" ; done
jdupes -r [smaller folder] [bigger folder] | grep "^[smaller folder]" | while read a ; do rm "$a" ; done

(watch out for accidental regex in the [smaller folder] name, or use -F or something)

Now, my [bigger folder] has lots of duplicates already, which I can't do anything about, so this is super slow for me. I wrote https://github.com/jbruchon/jdupes/issues/181 (and https://github.com/jbruchon/jdupes/issues/182) as workarounds.

Where [smaller folder] is your /dump/photos2sort and [bigger folder] is your /Multimedia/Pictures:

If you run jdupes -rf [smaller folder] and delete everything it reports, this will dedup your source, as -f omits the first match.

Then jdupes -rFe [smaller folder] [bigger folder] will tell you everything that is in the smaller folder that is already in the bigger folder. -F only compares files that have one copy in the small folder (a huge time saving in my case), while -e only lists files in the smaller folder so you can delete them (that's probably laziness on my behalf, but it saves any question about the grep).

Finally, just move the remainder from your smaller folder over to your bigger folder.

jyukumite avatar Jun 08 '21 06:06 jyukumite

Ran this; would've liked a space-saved report at the end, but I'm not sure that's possible.

jdupes.exe --recurse --isolate --param-order --delete --no-prompt e:\keepall e:\delete_from

RollingStar avatar Jan 18 '22 07:01 RollingStar

Isolate doesn't work properly.

jbruchon avatar Jan 18 '22 15:01 jbruchon

Are there any recent updates on how to accomplish what the OP asked for the correct way? Or any recommendations?

Thanks!

MaxFranklin5 avatar May 02 '22 19:05 MaxFranklin5

I've been considering adding -X extfilters that let you set up criteria and make matches or non-matches "no-modify" so that no destructive operations are allowed on any items that are positively filtered. This would allow matching against those items while prohibiting actions being done upon them.

jbruchon avatar May 02 '22 20:05 jbruchon

That's awesome. Is this the option you're referring to (from jdupes man page)?

-X --extfilter=spec:info    exclude/filter files based on specified criteria; general format:

          jdupes -X filter[:value][size_suffix]

'nostr:text_string' excludes all paths containing the substring text_string. This scans the full file path, so it can be used to match directories: -X nostr:dir_name/

So something like:

jdupes -rS --delete --noprompt -X nostr:/DONT_DELETE_LOCATION /DONT_DELETE_LOCATION /DELETE_LOCATION | tee -a ~/jdupes_delete_log.txt

I also see this one open here; are these two open issues duplicates?

Thanks! Love jdupes btw :)

MaxFranklin5 avatar May 02 '22 20:05 MaxFranklin5

I'm referring to adding a future option. You CAN use nostr and onlystr to choose files (based on path substring absence or presence) for matching consideration, but that means they are never matched against at all if they end up excluded. What I'm talking about is an "action" rule instead of a "loading" rule: when it's time to -dN auto-delete files, the files that matched a no-modify rule would expressly NOT be deleted no matter what.

A hypothetical example: the directory video_clips/ has matching files food_eating.mp4 and eating_out.mp4 and dining_footage.mp4 that turn out to be identical clips. You want to protect all clips starting with food_ from deletion.

jdupes -rdN -X protect:'food_*.mp4' video_clips/

food_eating.mp4 would not be deleted no matter what, and in conjunction with other files that matched it, it'd be the single file that the -dN auto-delete skips over. If another file matched and started with food_ as well, it would also be skipped over.

Now that I'm thinking about it, I should probably change the format of rules so that e.g. nostr/onlystr are just str and use ! to invert the meaning...

jbruchon avatar May 02 '22 20:05 jbruchon

I see.

Is there any method in jdupes for scanning for duplicates in source and destination and deleting only in the destination? Or is that still an open issue?

MaxFranklin5 avatar May 02 '22 21:05 MaxFranklin5

No. What I'm talking about would make that possible. There is no concept of "source" and "destination" in the program; all parameters are combined into one unified list of things to scan and the list is acted on as one combined unit. I suppose I could also add a filter rule that would allow specifying "no-modify" by parameter order (as already used by -O) as well:

jdupes -rdN dir1/ dir2/ -X protectorder:1 would add a protect: rule for file/folder parameter 1 (dir1 in this case).

jbruchon avatar May 02 '22 21:05 jbruchon

Yeah, that would be great.

Ok, thanks so much for your help and your program. I really appreciate it.

MaxFranklin5 avatar May 02 '22 21:05 MaxFranklin5

Thanks @jbruchon

I’m still tracking this thread, would love to have something that allows me to only delete from one area/location (and protect the other) :)

nodecentral avatar May 03 '22 15:05 nodecentral

Hi,

Thank you Jody for the great piece of software.

My small contribution for this particular use case.

Since I cannot say in advance in which folder the duplicates will be deleted, I have written a little bash script to automate moving the files to the directory of my choice, for the simple case where many duplicates sit in just two folders.

  1. Run jdupes with the -l option (make relative symlinks for duplicates without prompting): jdupes -l /path/to/deleteFiles path/to/keepFiles
  2. Then run the following script: bash copyOriginalsToSymlinkLocation.sh path/to/keepFiles true

Here is the script (copyOriginalsToSymlinkLocation.sh)

#!/bin/bash
# Usage: copyOriginalsToSymlinkLocation.sh <keepFiles dir> [true]
DIR=$1
while read -r line
do
	echo ______________________________________
	echo "Start new file: $line"
	CUR_LINK_PATH="$(readlink "$line")"
	echo "File: $DIR$CUR_LINK_PATH -> move to $DIR"
	if [ -z "$2" ]
		then echo "Do not perform operation"
		elif [ "$2" = "true" ]
			then
			echo "OK, perform operation"
			rm "$line"                      # remove the symlink
			mv "$DIR$CUR_LINK_PATH" "$DIR"  # move the original back
	fi
done < <(find "$DIR" -type l)

This will move any file left in "deleteFiles" back to "keepFiles" and delete the symlinks. Running the script without the argument "true" at the end will show the operations to be performed but not alter anything. Hope this helps. Cheers, Ed

Ed-Ross avatar Jun 20 '22 19:06 Ed-Ross

The broken isolation feature has been removed in v1.21.0 released today.

jbruchon avatar Sep 03 '22 18:09 jbruchon

I found this thread having recently started using jdupes. I am coming from a Perl script I wrote decades ago that does much of the same thing jdupes does, but much more slowly. The missing feature in jdupes is, I believe, exactly what @jbruchon is proposing with -X protect:xxx, though I wonder if it wouldn't be easier to use if we adopted something like a --read-only DIR option. For example, my aforementioned Perl script allowed interposed options and paths, such that one could do something like:

finddupes --recursive --delete --no-prompt DIR1 --read-only DIR2 DIR3 --read-only DIR4...

In that case, all of DIR[1-4] would be scanned, but duplicates would only be removed from DIR1 and DIR3.

If there are technical or other reasons for not allowing that form of argument passing, the underlying functionality would be exceedingly useful.

Is there any progress currently being made towards this functionality? If not, I might make time to help out, though my C is rusty.

eengstrom avatar Oct 19 '22 15:10 eengstrom