Find duplicates by name only

Open nodecentral opened this issue 5 years ago • 1 comments

Hi,

I’ve got a load of large video files (some 20 GB+) that I’d like to clean up, but when I run jdupes it gets stuck on 1% and seems to go into a loop on the hashing. I can only assume it’s down to the size of the files/folders?

If that’s the case, is there a way I can just check for duplicate file names first?

nodecentral avatar Apr 09 '20 20:04 nodecentral

The same algorithmic problem is blocking every feature that doesn't hash the file contents. The binary search tree "sorts" by size first, then makes its sorting decision based on the hash for files of the same size. No hash, no proper sort. The problem with using something else as the sort key is that the algorithm can end up failing to detect duplicates. I haven't yet decided how I want to tackle this problem. The existing algorithm is extremely fast for the most common use case, but once a match type that doesn't hash anything gets involved, it's no longer a usable algorithm. Making it more flexible will require one of two approaches: write a more general algorithm that is slower but much easier to add non-hash matches to, or use different algorithms entirely based on the requested behavior.
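To illustrate the size-then-hash ordering described above, here is a minimal Python sketch. It is not jdupes's actual code: it uses dictionaries in place of the binary search tree, SHA-256 as a stand-in hash, and an in-memory `files` mapping (hypothetical paths to contents) instead of real files. The point it shows is that files of the same size are only told apart by their hash, so a match type with no hash has nothing to order them by.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    # Stand-in hash for illustration only; assumed, not what jdupes uses.
    return hashlib.sha256(data).hexdigest()


def group_by_size_then_hash(files: dict) -> list:
    """Group duplicate candidates by size first, then by content hash.

    `files` maps a hypothetical path to its contents (bytes).
    Returns a sorted list of sorted groups, each holding 2+ paths.
    """
    by_size = defaultdict(list)
    for path, data in files.items():
        by_size[len(data)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in paths:
            # Within one size, the hash is the only key that separates files.
            by_hash[content_hash(files[path])].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

For example, calling `group_by_size_then_hash` on four files where two have identical contents returns a single group containing those two paths; the file with a unique size is never hashed at all.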

Here's where it gets ugly: one of the major enhancements needed is to get multiple behaviors working without special-casing them. For example, there is no reason why you shouldn't be able to summarize and hard link, but the program won't let you because everything was grafted on without any consideration for choosing multiple final actions. A more generally friendly algorithm (that is also slower) would open the door to a ton of added flexibility in the actions, too; as it is, the tree is built once, duplicates are chained together, and nothing else can be done because files may be modified and the final actions don't update the tree in any way.

For now, I'll see if I can come up with a good way to hack in the name finding capability reliably. I may just write a whole separate algorithm for that alone.
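As a rough idea of what a separate name-only pass could look like, here is a minimal Python sketch. It is an assumption for illustration, not a jdupes feature or its implementation: the function name `group_by_name` is hypothetical, it compares base names only, and it never opens or hashes a file, so files sharing a name may still differ in content.

```python
import os
from collections import defaultdict


def group_by_name(paths: list) -> list:
    """Group paths that share the same base file name.

    No file is opened or hashed; this only compares names, so the
    groups are candidates, not confirmed duplicates.
    """
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    # Keep only names that occur more than once.
    return sorted(sorted(g) for g in by_name.values() if len(g) > 1)
```

Because it is a single pass over the path list with a dictionary lookup per file, the cost does not depend on file size, which is what matters for 20 GB+ videos.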

jbruchon avatar Apr 09 '20 21:04 jbruchon