jdupes
Compare "source" files/dirs against "destination" ones and only act on the source
Hi,
I have just moved every photo I could find from various devices into one folder, /dump/photos2sort/, and I now want to compare that folder and all its contents against my main multimedia folder, /Multimedia/Pictures, and delete everything in the dump folder that already exists there.
Is this how to run a report first...
jdupes -ASr /dump/photos2sort /Multimedia/Pictures > /share/Public/jdupes_photos2sort_already_exists.txt
and then to delete...
jdupes -ASrdN /dump/photos2sort /Multimedia/Pictures > /share/Public/jdupes_photos2sort_deleted.txt
Please excuse all the edits; I’ve been trying to work out how best to position this matter. I’ve looked at the various command options and also read through similar issues or enhancement requests that have been reported, but I can’t find a match.
As I have a house full of devices, with everyone sharing things around, I want to have a place (share/folder) where everyone can go to dump/back up their data, and then periodically I want to check that share/folder structure to see what already exists in the other shares/folders.
If anything exists already, then have the option to report or delete it.
I'm thinking that what we need is a generic "the next parameter is read-only" option. Combined with the isolation option, it would allow for what you're requesting.
Hi @jbruchon , thanks so much for responding...
Just to be clear: deletions should only occur against the source file/folder structure (which is the dump files/folders), not the destination (the rest of the NAS). I’ve modified the title to reflect that.
The desired process is to check if anything being uploaded/dumped on the NAS already exists; and if it does, then jdupes should report it, delete it, or maybe move it somewhere.
If I could just confirm your response: are you suggesting this is possible to do today, or would this capability be an enhancement request?
If it’s possible today, great! Please could you help me out with what the exact command line would look like?
If you use -I (don't allow intra-parameter matching) and -O (always sort using parameter order) and put the folder that shouldn't be modified earlier in the command line, it should do what you want:
jdupes -rIO folder_of_stuff folder_with_things_to_be_deleted
Huge thanks @jbruchon ,
/share/CACHEDEV1_DATA/Multimedia/Pictures = all the files and folders to keep
/share/Public/2sort/ = all the files/folders to be checked to see if they are present in the above
I tried the following
jdupes -ASrIO /share/CACHEDEV1_DATA/Multimedia/Pictures /share/Public/2sort/ > /share/Public/jdupes_2sort_comparrison.txt
Running it from the command line...
/] # jdupes -ASrIO /share/CACHEDEV1_DATA/Multimedia/Pictures /share/Public/2sort/ > /share/Public/jdupes_2sort_comparrison_v2.txt
Scanning: 65343 files, 1025 items (in 2 specified)
But it found just 7 duplicates, yet from other scans there should be many more..
There are problems with the isolated option that are not easily fixed. I took a pull request that claimed to fix it, but if you're missing items, that pull doesn't seem to have done so. I won't be able to fix that quickly.
You can write a simple shell script to do what you want using the program's output, or in the case of "copy" files, just grep and pipe to a while-do rm-done loop.
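That grep-and-rm loop might look something like the sketch below. The sandbox paths and the simulated dupes.txt are made up so the demo is self-contained; in real use, dupes.txt would be saved jdupes output (one path per line, blank lines between match sets).

```shell
#!/bin/sh
# Sketch of the "grep and pipe to a while/rm loop" approach.
# Everything under /tmp/jdupes_grep_demo is fabricated for the demo.
set -e
demo=/tmp/jdupes_grep_demo        # hypothetical sandbox location
rm -rf "$demo"
mkdir -p "$demo/dump" "$demo/keep"
echo same > "$demo/dump/a.jpg"
echo same > "$demo/keep/a.jpg"

# Simulated jdupes output for one match set (keep copy first, dump copy
# second, trailing blank line as the set separator):
printf '%s\n' "$demo/keep/a.jpg" "$demo/dump/a.jpg" "" > "$demo/dupes.txt"

# Keep only the lines under the dump folder, then delete those files.
# IFS= and read -r preserve leading spaces and backslashes in names.
grep "^$demo/dump/" "$demo/dupes.txt" | while IFS= read -r f; do
    rm -- "$f"
done
```

After this runs, the dump copy is gone and the keep copy is untouched.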
Due to the number of duplicates likely involved, I’m wondering if the ‘isolated option’ used here is only returning unique duplicates (by that I mean where only one instance of the duplicate exists in both locations). I may be way off, so I’ll continue testing; just thinking out loud :-)
Isolation is supposed to prevent any matching between items within the same command-line parameter. If you type jdupes -Ir 1 2 3, then any duplicate pairs within 1 or 2 or 3 exclusively will not show up, but a match pair or set that spans 1/2, 1/3, 2/3, or 1/2/3 will show up. Yes, a side effect of this is only one item in each parameter showing in each match set. Look in the documentation and read about the "triangle problem" for more info.
Hi @jbruchon
Would a workaround for this be to create a symlink to the ‘source’ at the lowest possible location/directory within the destination, e.g.
/Multimedia/Pictures/XXXXXX/XXXXX/XXXX/XXX/XX/X/<symlink to /dump/photos2sort/>
And then run ...
jdupes -ASr /Multimedia/Pictures > /share/Public/jdupes_photos2sort_already_exists.txt
It looks like the default sorting order used by jdupes is alphabetical, based on the characters in the full path to the file (not the file name itself, unless they are in the same location). Is that correct?
That is correct. The sort is alphabetical across the full path. I think that the best way to handle this right now is to write a shell script which consumes the output and takes the desired actions. You'll need an outer while loop to handle match set changes (sets are separated by an empty line) and an inner while loop that checks the file/path name against your desired criteria and takes action as desired. There is an example I've written somewhere but I don't have it handy. It's pretty simple if you have some experience with shell scripting.
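A sketch of that two-loop consumer, run against a here-doc that stands in for real jdupes output (the /dump/ and /Multimedia/ paths are just the examples from this thread, and it only prints what it would do rather than deleting anything):

```shell
#!/bin/sh
# Blank lines in jdupes output separate match sets; within each set,
# act only on paths matching a chosen criterion (a /dump/ prefix here).
set_no=0
in_set=0
delete_list=""
while IFS= read -r line; do
    if [ -z "$line" ]; then      # blank line: current match set ended
        in_set=0
        continue
    fi
    if [ "$in_set" -eq 0 ]; then # first path of a new match set
        set_no=$((set_no + 1))
        in_set=1
    fi
    case $line in
        /dump/*) echo "set $set_no: would delete $line"
                 delete_list="$delete_list $line" ;;
        *)       echo "set $set_no: keeping $line" ;;
    esac
done <<EOF
/Multimedia/Pictures/a.jpg
/dump/photos2sort/a.jpg

/Multimedia/Pictures/b.jpg
/dump/photos2sort/b.jpg
EOF
```

Swapping the echo for an actual rm (after careful testing) would turn this into the deletion script described above.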
Thanks @jbruchon
Sadly I have no knowledge/experience with shell scripting, I’m currently taking my duplicate clean up very slowly (over 200GB recovered so far).
One strange observation with the alphabetical sorting is that it’s not perfect: a directory starting with the word xmas and one starting with a dash ‘-’ have come out lower than my string of X’s...
/Multimedia/Pictures/Events/
/Multimedia/Pictures/XXXXXXX/
/Multimedia/Pictures/Xmas time/
/Multimedia/Pictures/- name -/
I’m going to have to assume a ‘-‘ (dash) comes later in the sort order - but I’m not sure how/why ‘Xm’ is placed after ‘XX’
The easiest alpha sort is a case-sensitive dumb one. That is an extremely simple matter of checking each character pair's ASCII values and sorting based on that mathematical comparison. jdupes uses this method with two big exceptions: there is extra code to detect numbers and sort them numerically correctly (otherwise 2 would be before 01, for example) and code to make some special characters sort later so that "xyz - Copy" or "xyz (1)" come after "xyz" does, thereby allowing easy automated deletion of accidental drag-and-drop copies. I have not bothered complicating the sort further (i.e. case-insensitivity) as no one has cared.
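The raw byte values behind that "dumb" comparison can be checked from any shell; printf's "'c" notation prints the numeric code of a character:

```shell
#!/bin/sh
# Print the ASCII codes of the characters discussed in this thread.
# printf's "'c" argument form yields the numeric value of character c.
for c in X m x -; do
    printf "%s = %d\n" "$c" "'$c"
done
# Output:
# X = 88
# m = 109
# x = 120
# - = 45
```

So under a pure byte comparison a dash (45) would actually sort first, not last; it is jdupes' extra special-character rule described above, not the raw code, that pushes "- name -" to the end.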
That’s great, and anything ‘dumb’ works best for me :-) But I assumed it would look at the first letter and, if it’s a match, move on to the next?
If it’s ASCII, then that alters my expectation of normal alphabetical sorting, as X = 88 and m = 109; but a ‘-’ (dash) wouldn’t come last, as that = 45, unless it’s 196?
Ok, I’ll do a quick bit of testing..
Ok, yes, that looks to be it then. The sort order (using ASCII) is different from how I assumed it would work, and not something I would have considered alphabetical, as it results in lower-case letters appearing much lower in the list than their upper-case versions, etc.
/share/CACHEDEV1_DATA/Web/jdupes_test/XXXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XXXx/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XXxX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XZXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/XxXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/Xzxx/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/ZZZZ/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/abcd/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xXXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxXX/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxZZ/vera.php
/share/CACHEDEV1_DATA/Web/jdupes_test/xxxx/vera.php
While it may not be an important feature, it’s certainly a good thing to be aware of..
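For reference, the ordering in that test is exactly what a plain byte-wise (C-locale) sort produces, which can be reproduced without jdupes (bearing in mind jdupes layers numeric and special-character rules on top of the byte comparison, so the two won't always agree):

```shell
#!/bin/sh
# Reproduce the byte-wise ordering from the jdupes test above using
# sort in the C locale: all uppercase letters (65-90) sort before all
# lowercase letters (97-122).
printf '%s\n' xxxx XXXX abcd XxXX XXXx ZZZZ xXXX XZXX Xzxx XXxX xxZZ xxXX \
    | LC_ALL=C sort
```

The output matches the directory order listed above: XXXX, XXXx, XXxX, XZXX, XxXX, Xzxx, ZZZZ, abcd, xXXX, xxXX, xxZZ, xxxx.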
I'm thinking that what we need is a generic "the next parameter is read-only" option. Combined with the isolation option, it would allow for what you're requesting.
Has a switch like this been implemented? I'm wondering since I don't quite understand the discussion here.
Isolation is supposed to prevent any matching between items within the same command-line parameter. If you type
jdupes -Ir 1 2 3, then any duplicate pairs within 1 or 2 or 3 exclusively will not show up, but a match pair or set that spans 1/2, 1/3, 2/3, or 1/2/3 will show up. Yes, a side effect of this is only one item in each parameter showing in each match set. Look in the documentation and read about the "triangle problem" for more info.
Am I the only one to think that using the word isolate is a bit misleading? I would expect it to isolate each command line parameter such that matching can only happen between files in the same parameter.
What about something like
jdupes -rf [smaller folder] | while read a ; do rm "$a" ; done
jdupes -r [smaller folder] [bigger folder] | grep "^[smaller folder]" | while read a ; do rm "$a" ; done
(watch out for accidental regex in the [smaller folder] name, or use -F or something)
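A slightly more defensive variant of that loop, shown here against a self-contained sandbox instead of live jdupes output: it skips the blank lines that separate match sets, uses a fixed-string prefix test instead of grep (sidestepping the accidental-regex problem), and quotes every expansion.

```shell
#!/bin/sh
# Defensive version of the "delete small-folder copies" one-liner.
# The sandbox and simulated output below are made up for the demo.
set -e
sandbox=/tmp/jdupes_loop_demo        # hypothetical demo location
rm -rf "$sandbox"
mkdir -p "$sandbox/small" "$sandbox/big"
echo same > "$sandbox/small/pic.jpg"
echo same > "$sandbox/big/pic.jpg"

# Simulated "jdupes -r small big" output (trailing blank line = set end):
printf '%s\n' "$sandbox/big/pic.jpg" "$sandbox/small/pic.jpg" "" \
| while IFS= read -r f; do
    if [ -z "$f" ]; then
        continue                     # skip match-set separators
    fi
    case $f in
        "$sandbox/small/"*) rm -- "$f" ;;   # delete only small-folder copies
    esac
done
```

The case pattern does a literal prefix match on the quoted folder name, so no characters in the folder path are treated as a regex.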
Now, my [bigger folder] has lots of duplicates already, which I can't do anything about, so this is super slow for me - I wrote https://github.com/jbruchon/jdupes/issues/181 (and https://github.com/jbruchon/jdupes/issues/182) as a work around.
Where [smaller folder] is your /dump/photos2sort and [bigger folder] is your /Multimedia/Pictures :
First, run jdupes -rf [smaller folder] and delete everything it reports. This will dedup your source, as -f omits the first match.
Then jdupes -rFe [smaller folder] [bigger folder] will tell you everything that is in the smaller folder that is already in the bigger folder. -F only compares files that have one copy in the small folder, a huge time saving in my case, while -e only lists files in the smaller folder so you can delete them (that's probably laziness on my part, but it saves any question about the grep).
Finally, just move the remainder from your smaller folder over to your bigger folder.
Ran this, would've liked a space report at the end but not sure it's possible.
jdupes.exe --recurse --isolate --param-order --delete --no-prompt e:\keepall e:\delete_from
Isolate doesn't work properly.
Are there any recent updates on how to perform the OP the correct way? Or any recommendations?
Thanks!
I've been considering adding -X extfilters that let you set up criteria and make matches or non-matches "no-modify" so that no destructive operations are allowed on any items that are positively filtered. This would allow matching against those items while prohibiting actions being done upon them.
That's awesome. Is this the option you're referring to (from jdupes man page)?
-X --extfilter=spec:info exclude/filter files based on specified criteria; general format:
jdupes -X filter[:value][size_suffix]
'nostr:text_string' exclude all paths containing the substring text_string. This scans the full file path, so it can be used to match directories: -X nostr:dir_name/
So something like:
jdupes -rS --delete --noprompt -X nostr:/DONT_DELETE_LOCATION /DONT_DELETE_LOCATION /DELETE_LOCATION | tee -a ~/jdupes_delete_log.txt
I also see this open here; are these two opens duplicates?
Thanks! Love jdupes btw :)
I'm referring to adding a future option. You CAN use nostr and onlystr to choose files (based on path substring absence or presence) for matching consideration, but that means they are never matched against at all if they end up excluded. What I'm talking about is an "action" rule instead of a "loading" rule: when it's time to -dN auto-delete files, the files that matched a no-modify rule would expressly NOT be deleted no matter what.
A hypothetical example: the directory video_clips/ has matching files food_eating.mp4 and eating_out.mp4 and dining_footage.mp4 that turn out to be identical clips. You want to protect all clips starting with food_ from deletion.
jdupes -rdN -X protect:'food_*.mp4' video_clips/
food_eating.mp4 would not be deleted no matter what, and in conjunction with other files that matched it, it'd be the single file that the -dN auto-delete skips over. If another file matched and started with food_ as well, it would also be skipped over.
Now that I'm thinking about it, I should probably change the format of rules so that e.g. nostr/onlystr are just str and use ! to invert the meaning...
I see.
Is there any method in jdupes for scanning for duplicates in source and destination and deleting only in destination? Or is that still an open?
No. What I'm talking about would make that possible. There is no concept of "source" and "destination" in the program; all parameters are combined into one unified list of things to scan and the list is acted on as one combined unit. I suppose I could also add a filter rule that would allow specifying "no-modify" by parameter order (as already used by -O) as well:
jdupes -rdN dir1/ dir2/ -X protectorder:1 would add a protect: rule for file/folder parameter 1 (dir1 in this case).
Yeah, that would be great.
Ok, thanks so much for your help and your program. I really appreciate it.
Thanks @jbruchon
I’m still tracking this thread, would love to have something that allows me to only delete from one area/location (and protect the other) :)
Hi,
Thank you Jody for the great piece of software.
My small contribution for this particular use case.
Since I cannot say in advance in which folder the duplicates will be deleted, I have written a little bash script to automate moving the files to the directory of my choice for the simple case where many duplicates sit in just two folders.
- Run jdupes with the -l option (make relative symlinks for duplicates w/o prompting):
jdupes -l /path/to/deleteFiles path/to/keepFiles
- Then run the following script:
bash copyOriginalsToSymlinkLocation.sh path/to/keepFiles true
Here is the script (copyOriginalsToSymlinkLocation.sh)
#!/bin/bash
# For every symlink under $DIR, move the link's target file into $DIR
# and delete the symlink itself.
DIR="$1"

while read -r line
do
    echo ______________________________________
    echo "Start new file: $line"
    CUR_LINK_PATH="$(readlink "$line")"
    # A relative link target is resolved against the symlink's own
    # directory, not against $DIR.
    TARGET="$(dirname "$line")/$CUR_LINK_PATH"
    echo "File: $TARGET -> move to $DIR"
    if [ -z "$2" ]
    then
        echo "Do not perform operation"
    elif [ "$2" = "true" ]
    then
        echo "OK, perform operation"
        rm "$line"
        mv "$TARGET" "$DIR"
    fi
done < <(find "$DIR" -type l)
This will move any file left in "deleteFiles" back to "keepFiles" and delete the symlinks. Running the script without the argument "true" at the end will show the operations to be performed but not alter anything. Hope this helps, Cheers, Ed
The broken isolation feature has been removed in v1.21.0 released today.
I found this thread having recently started using jdupes. I am coming from a Perl script I wrote decades ago that does much of the same thing jdupes does, but much more slowly. The missing feature in jdupes is, I believe, exactly what @jbruchon is promoting in the -X protect:xxx idea, though I wonder if it wouldn't be easier to use if we adopted something like a --read-only DIR option. For example, my aforementioned Perl script allowed interposed options and paths, so that one could do something like:
finddupes --recursive --delete --no-prompt DIR1 --read-only DIR2 DIR3 --read-only DIR4...
In that case, all of DIR[1-4] would be scanned, but duplicates would only be removed from DIR1 and DIR3.
If there are technical or other reasons for not allowing that form of argument passing, the underlying functionality would be exceedingly useful.
Is there any progress currently being made towards this functionality? If not, I might make time to help out, though my C is rusty.