dupeguru
Option to only compare reference entries to normal entries, ignore normal vs normal
Is your feature request related to a problem? Please describe. I often run dupeguru with reference folders against quite a few "normal" folders and am only interested in removing files that have a match in a reference folder. Comparisons between dupes in normal folders that have no equivalent in a reference folder can take quite a long time during the scanning process.
Describe the solution you'd like An option toggle to prevent comparisons between items in folders marked "normal".
Describe alternatives you've considered Another option would be some sort of label other than "normal" that does the same thing.
I don't think the scanning process can be shortened. After all, it has to check every file on the normal side to see if it has a match on the ref side. The fact that they might then also match other files on the normal side is a side effect. I'd suspect (without looking at the code) that a checksum or signature is generated for every file (using whatever comparison method), regardless of location, and the results are then compared by signature before being sorted into ref vs normal. So the time taken is always a function of the total number of files.
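To illustrate the guess above, here is a rough sketch of "sign every file, then group by signature". It is not dupeguru's actual code: `signature`, `group_by_signature` and the in-memory sample files are all made up for the example, and MD5 merely stands in for whatever comparison method is selected.

```python
import hashlib
from collections import defaultdict


def signature(data: bytes) -> str:
    # Stand-in signature; the real comparison method may differ.
    return hashlib.md5(data).hexdigest()


def group_by_signature(files):
    """files: iterable of (path, is_ref, content_bytes) tuples."""
    groups = defaultdict(list)
    for path, is_ref, data in files:
        # Every file is signed once, whether it is ref or normal.
        groups[signature(data)].append((path, is_ref))
    # Only signatures shared by two or more files are duplicate groups.
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical sample data: one ref file, four normal files.
files = [
    ("ref/a.txt", True, b"alpha"),
    ("normal1/a_copy.txt", False, b"alpha"),
    ("normal1/b.txt", False, b"beta"),
    ("normal2/b_copy.txt", False, b"beta"),
    ("normal2/c.txt", False, b"gamma"),
]
dupe_groups = group_by_signature(files)
```

In this sketch the signing pass touches all five files regardless of their ref/normal state, which is the reason for expecting the scan time to depend on the total file count.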
However, your issue sounds like another variation on the logical ANDing of top-level folders that I raised in #692, although coming at it from a different angle. I'm currently trying out a suggestion in https://github.com/arsenetar/dupeguru/issues/386#issuecomment-244274941 which involves sorting the results by location before selecting.
It seems that having a reference folder set up is important, as that ensures it will return only matches between reference AND normal, and between normal AND normal. Matches between reference AND reference are not returned.
I don't think this will work reliably for more than one reference folder and one normal folder though, as the results may be a bit complicated to interpret without making mistakes. So in my testing I'm proceeding with caution, keeping the top-level folder selection simple and not intermixing reference and normal sub-folders. Excluding sub-folders such as trash etc. isn't an issue.
Then sort the results by folder (which acts on the locked reference lines), so that the reference AND normal matches rise to the top. You can be sure that the deletable files are then from your normal top-level folder. The normal AND normal matches would then be at the bottom of the list and can be ignored or removed from the results.
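The manual sort-and-ignore workaround amounts to splitting the duplicate groups by whether they contain a reference file. A minimal sketch of that idea follows, using made-up helper names (`has_reference`, `split_groups`, `deletable`) and made-up sample groups; none of this is dupeguru's API.

```python
def has_reference(group):
    # A group is a list of (path, is_ref) tuples.
    return any(is_ref for _path, is_ref in group)


def split_groups(groups):
    """Separate ref-vs-normal groups from normal-only groups."""
    with_ref = [g for g in groups if has_reference(g)]
    normal_only = [g for g in groups if not has_reference(g)]
    return with_ref, normal_only


def deletable(group):
    # Only non-reference members are candidates for deletion.
    return [path for path, is_ref in group if not is_ref]


# Hypothetical duplicate groups, as a scan might report them.
groups = [
    [("normal1/b.txt", False), ("normal2/b_copy.txt", False)],
    [("ref/a.txt", True), ("normal1/a_copy.txt", False)],
]
with_ref, normal_only = split_groups(groups)
to_delete = [path for g in with_ref for path in deletable(g)]
```

Here `to_delete` holds only normal files that match a reference file, and the normal-only groups are set aside untouched, which is what sorting by folder achieves by hand.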
That's disappointing to hear. Obviously all of the files in every folder have to be scanned. To be clear, I am referring to what seems like the last step in the process, when the UI is showing "X Matches Found". Based on how long that step takes and how the progress bar proceeds, I swear there is a point where it shifts from comparing the reference folders to the normal folders and starts comparing the normal folders to the normal folders. Let's say that the search finds tens of thousands of matches, but fewer than 100 are matches between reference and normal. The progress bar rapidly fills to something like 80% and 100ish matches, but then slows down enormously and counts up tens of thousands more matches before eventually hitting 100% 10 minutes later.
#692 does sound somewhat similar. Dupeguru is an amazing piece of software that I get much use out of, but being able to apply some more logical options would be amazing. Another request I have wanted to make is to be able to have dupeguru scan folders and their MANY subfolders, but only return duplicates that exist in the same folder. The only workaround I have now is to scan fewer folders at a time and rely on the visual indication of whether the folders are different when "Delta Values" is enabled, but that takes much longer and gets tough when there are more than 2 matching files per unique file.
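To make the "same folder only" request concrete, here is a small sketch of the filtering I have in mind. It assumes each file already has some signature; `same_folder_dupes` and the sample paths are invented for illustration and are not anything dupeguru provides.

```python
import posixpath
from collections import defaultdict


def same_folder_dupes(entries):
    """entries: iterable of (path, signature) tuples."""
    buckets = defaultdict(list)
    for path, sig in entries:
        # Key on (containing folder, signature) so that matches
        # across different folders never land in the same bucket.
        buckets[(posixpath.dirname(path), sig)].append(path)
    return [paths for paths in buckets.values() if len(paths) > 1]


# Hypothetical scan output: three files share signature "s1".
entries = [
    ("photos/2020/a.jpg", "s1"),
    ("photos/2020/a (1).jpg", "s1"),
    ("photos/2021/a.jpg", "s1"),
    ("docs/x.pdf", "s2"),
]
result = same_folder_dupes(entries)
```

With this sample, only the two copies inside `photos/2020` are grouped; the identical file in `photos/2021` is left out because it sits in a different folder.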
I do in fact rely on the sorting aspect you are talking about currently, and that helps quite a bit, but it still eats up a bunch of time on my end.
This would be a very useful improvement! The case I've just hit is comparing a reference folder to a normal folder which had some saved web pages. All the common images, JavaScript files and other resources were marked to be deleted as duplicates, when my intention was to leave them, as they're in different folders. I also saw this with data from other programs but can't remember which.
FYI, this feature is implemented in another program; it should have been added to dupeguru.