
Show non-duplicates

Open · hsoft opened this issue 11 years ago · 11 comments

(strangely, although this is a commonly recurring feature request, I had never added a ticket for it)

from email:

The idea here is that I have lots of varied backups on different drives, from different dates, on different computers, and I want to work through all the data to create a new reference volume. What happens is that the user's volumes and backups end up with a few variations stored on them, and I want to be able to easily find the variants and then copy them into the reference directory.

Since my users tend to make partial backups on different days and then run out of space, I need to clean up computers and resync backup volumes.

The functionality I am looking for is the ability to make the duplicates invisible and then show all non-duplicates on the non-reference volume (this would be similar to your "Dupes Only" feature). I do not need to see the files in the reference volume; all of them can stay invisible. I need to see only the non-matches, or non-duplicates, on the non-reference volume. Then I want to be able to select some or all of those and copy them into the reference volume. After a final scan, once everything I want copied has been copied to the reference volume, I want to make sure that I have everything backed up so I can safely delete it, or confirm that the second backup is correct and complete.

Perhaps I should just trust the situation and use your software to delete the files then review the remainder, but before I do that I am looking for a non-destructive alternative that will allow me to organize and sort things out so that I can maintain the computer and secondary backup drives.

The idea is to have an option to show non-duplicates instead of duplicates. For that to work, we need at least one reference folder. The biggest problem is figuring out an elegant way to present the feature, UI-wise.

hsoft avatar Jun 22 '13 03:06 hsoft

This is handy if one has the impression that most pictures in a folder are duplicates. With this option you can check whether it's safe to delete the complete folder, which might be faster than marking all the images as duplicates.

wedi avatar Oct 23 '14 09:10 wedi

Big +1 vote for this!

prohtex avatar Sep 02 '16 09:09 prohtex

Basically the feature is "Show unique files." Still would really love to see this. My current workflow involves some crazy stuff to accomplish this.

prohtex avatar Oct 04 '17 04:10 prohtex

Hi all,

I've just been trying to work out how to do this and I can't find a good way to do it. I can't even find another tool which I think could do it as elegantly as DupeGuru (I looked at WinMerge, but I'm really only interested in comparing checksums, not line-by-line comparisons of text files).

Even if this issue is going nowhere, if anyone has found a way to do this then please share.

Thanks 🙂

madb1lly avatar May 14 '20 19:05 madb1lly

@madb1lly If you just want to find duplicates using checksums I'd suggest you go the command-line tools route. I'm on Linux/Mac so I cannot provide in-depth support, but googling "powershell find duplicates checksum" turned up this tutorial as the second result: http://www.readonlymaio.org/rom/2017/10/09/finding-duplicated-identical-files-with-powershell-the-fast-way/
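
If you would rather stay in Python, a rough equivalent (an untested sketch; the "backups" folder name is just a placeholder) is to hash every file and group paths by digest:

# Untested sketch, not from the linked tutorial: group files under a folder by
# SHA-256 digest and print every digest that more than one file shares.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    """Return {digest: [paths]} for every digest shared by two or more files under root."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates("backups").items():  # "backups" is a placeholder
        print(digest)
        for p in paths:
            print("   ", p)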

wedi avatar May 18 '20 09:05 wedi

Hi @wedi,

I've been using DupeGuru to find duplicate files without problems. The problem I'm looking for a solution to is finding unique files, not duplicate ones. I suppose I can use that PowerShell script as a basis and adapt it to my needs.

However, I'm sure there must be a way to do this with DupeGuru; after all, it's really just an inversion of the results list, i.e. list what is not in the results list.

Cheers 🙂

madb1lly avatar May 18 '20 12:05 madb1lly

+1 for this.

@hsoft A suggestion for the UI is to just add another checkbox next to 'Dupes Only', saying 'Different Only' (or 'Uniques Only'), and we get a list with the same UI as the dupes-only list.

This would probably be easy to do, since all the information is already available from dupeGuru's processing?

pjfsilva avatar Sep 25 '20 18:09 pjfsilva

Hello, I also vote for this. In fact I was so sure the function was available that I launched a duplicate search on very large folders, with network shares, that took several hours to complete, in order to find files that would be unique to one folder... and was frustrated not to be able to show these files... (great work on dupeGuru by the way, a great and powerful tool)

Obilolo avatar Feb 07 '21 15:02 Obilolo

I would also like for this feature to exist. I have some Python experience and am willing to do a couple hours of work on it. By any chance, have any of you already looked into implementing this and have an idea of what parts of the code base are involved? I'm not worried about the GUI aspect at this point, just the under-the-hood functionality.

Edit 1: I found the developer guide (for a recent version), here. I'll start with this.

Edit 2: I just realized the developer guide was written by @hsoft (according to its commit history). Thank you! It has been helpful to me so far, in my attempt to learn how dupeGuru works under the hood.

Edit 3: Note for future reference (since I will be out for the rest of the day): based upon what I've read in the developer documentation so far, I think either updating core.engine.getmatches() / core.engine.getmatches_by_contents() to optionally return a list of "non-matches" instead of a list of matches, or implementing an alternative function (e.g. core.engine.getnonmatches()), would be a step in the right direction. I'll return to this issue when I get another chance.

gitname avatar Jul 17 '22 23:07 gitname

Here is a rough, untested version of what I have in mind (for the getmatches_by_contents function):

# File: core/engine.py

def getmatches_by_contents(files, bigsize=0, j=job.nulljob):
    """Returns a list of :class:`Match` within ``files`` if their contents is the same.

    :param bigsize: The size in bytes over which we consider files big enough to
                    justify taking samples of the file for hashing. If 0, compute digest as usual.
    :param j: A :ref:`job progress instance <jobs>`.
    """
+
+    # Preserve a reference to "the list of all input files."
+    original_files = files
+
    size2files = defaultdict(set)
    for f in files:
        size2files[f.size].add(f)
    del files

    # ...

+
+    # Get the difference between "the list of all input files" and "the list of
+    # input files that have duplicates." The difference will be "the list of
+    # input files that do not have duplicates."
+    dupeless_files = []
+    for f in original_files:
+        file_is_in_result = False
+        for r in result:
+            if f == r.first or f == r.second:
+                file_is_in_result = True
+                break
+        if not file_is_in_result:
+            dupeless_files.append(f)
+
+    # TODO: Show the contents of `dupeless_files` to the user.
+
    return result
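
A possible simplification (untested): instead of scanning result once per file, build a set of every file that appears in a match and keep the files outside it. The file objects should be hashable, since they are already stored in sets earlier in this function, and each Match exposes .first and .second:

# Untested alternative to the nested loop above: collect every matched file into
# a set, then keep the input files that never appear in any match.
matched_files = set()
for m in result:
    matched_files.add(m.first)
    matched_files.add(m.second)
dupeless_files = [f for f in original_files if f not in matched_files]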

I like the way @madb1lly summarized the concept above:

an inversion of the results list, i.e. list what is not in the results list.

Anyway, in the short term, I think it will be quicker for me to use dupeGuru as-is to export a scan result to CSV (via File > Export to CSV in the GUI), then write an independent, single-purpose Python script that compares a set of file paths (i.e. "the list of all input files") to the set of file paths in the CSV file (i.e. "the list of input files that have duplicates") and returns the difference. If I do write such a Python script, I'll share it here.
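
Roughly, such a script might look like this (untested sketch; the CSV column names are guesses and the folder/CSV paths are placeholders):

# Untested sketch of the planned comparison script. Assumes the dupeGuru CSV
# export includes the folder and file name of every duplicate; the column names
# "Folder" and "Filename" are assumptions and may need adjusting.
import csv
from pathlib import Path

def paths_in_csv(csv_path):
    """Set of full paths listed in the exported scan results."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        return {str(Path(row["Folder"]) / row["Filename"]) for row in reader}

def paths_on_disk(root):
    """Set of full paths of every file under root."""
    return {str(p) for p in Path(root).rglob("*") if p.is_file()}

def unique_files(root, csv_path):
    """Files under root that do not appear anywhere in the scan results."""
    return sorted(paths_on_disk(root) - paths_in_csv(csv_path))

if __name__ == "__main__":
    for path in unique_files("backups", "dupeguru_results.csv"):  # placeholders
        print(path)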

gitname avatar Jul 18 '22 19:07 gitname

Hi everyone, I wrote a Python script people can use to compare a folder on their computer to dupeGuru scan results (a CSV file exported from dupeGuru after performing a "filename" or "contents" type of scan), in order to identify non-duplicate files.

Here's a link: https://github.com/gitname/dupeguru-post-processor

I think this is my first time making a Python script for public consumption. I'm interested in any feedback you have. I hope some of you find the script helpful. Also, although I implemented the script as something separate from dupeGuru, I would be happy to see the functionality added to dupeGuru itself, in case anyone reading this wants to take that on.

Edit: I added a graphical launcher to facilitate the entry of folder/file paths.

[Screenshot: graphical launcher for entering folder/file paths]

gitname avatar Jul 29 '22 03:07 gitname