
"Almost done! Fiddling with results..." can take hours, days, infinitely long.

Open a-raccoon opened this issue 5 years ago • 32 comments

It seems that DupeGuru has a pretty hard ceiling when it comes to loading a lot of results. It is fully capable of finding all duplicates and reaching 100%, but it will never get as far as displaying them all if there are too many. I estimate that the processing becomes exponentially slower with every result that DupeGuru has to process and add to the GUI list. It is not uncommon for Windows UI elements to behave this way.

I wonder if this can be solved by only displaying 10,000 to 100,000 results at a time, with a button to load more results. That way the user can work through the results in smaller batches with a fully responsive UI.

I might also suggest that this be done dynamically, by "giving up" after 10 minutes of processing, and the "Load More Results" button would process for another 10 minutes only. This could reduce total processing of results to 30 minutes instead of 30 days.
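Something along these lines is what I have in mind (a rough Python sketch only; ResultBatcher is not anything in dupeGuru's codebase, just an illustration of handing out results in batches):

    # Hypothetical sketch of batched result loading; not dupeGuru's actual API.
    from itertools import islice
    from typing import Iterable, Iterator, List


    class ResultBatcher:
        """Hand out results in fixed-size batches so the UI stays responsive."""

        def __init__(self, results: Iterable, batch_size: int = 10_000):
            self._iter: Iterator = iter(results)
            self.batch_size = batch_size

        def next_batch(self) -> List:
            """Return up to batch_size results; an empty list means we're done."""
            return list(islice(self._iter, self.batch_size))

    # A "Load More Results" button handler would call next_batch() and append the
    # returned items to the list model instead of loading everything at once.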

a-raccoon avatar May 20 '19 05:05 a-raccoon

How many results are you talking about? I tested it with about 200 000 results and it didn't take very long to display.

Dobatymo avatar May 29 '19 03:05 Dobatymo

I believe it was scarcely over 100,000, but I can't really tell because the screen it gets stuck on (fiddling with results) removes the count total from view. The total number of files scanned was somewhere between 3 and 6 million.

a-raccoon avatar May 29 '19 11:05 a-raccoon

Sure. Let's say my computer is slow. It's still not a linear function, and it gets exponentially slower regardless of your computer's speed. Unfortunately I'm too unfamiliar with the underpinnings of dupeGuru to comment on how it could be improved, but given the behavior exhibited, it doesn't seem like an appropriate method is being used. Perhaps a Bogosort slipped in?

If the 'fiddling' dialog were more verbose with a progress bar, I could comment further. And it shouldn't eliminate the 'matches found' information that precedes it.

a-raccoon avatar Jun 20 '19 00:06 a-raccoon

I have the same problem. On the first run, Ubuntu 18.04 also froze, because I ran out of RAM I think. Now I'm trying again with extra swap.

Mannshoch avatar Jul 13 '19 12:07 Mannshoch

I'm noticing this problem too with about 631,000 matches on Windows 10 with 32 GB RAM, with the images stored across 4 HDDs @ 7200 RPM using DrivePool (very fast read speeds). It took all night just to complete the 'all done' phase.

I think that 'show 10k (or whatever the user sets in their config) and click to load another 10k' would be a good idea.

Looking through https://github.com/arsenetar/dupeguru/blob/master/core/scanner.py it says:


        # In removing what we call here "false matches", we first want to remove, if we scan by
        # folders, we want to remove folder matches for which the parent is also in a match (they're
        # "duplicated duplicates if you will). Then, we also don't want mixed file kinds if the
        # option isn't enabled, we want matches for which both files exist and, lastly, we don't
        # want matches with both files as ref.

I think the issue is just that loading that sheer volume is such a giant task. It should really be broken down into chunks: here is the first 10k, here is another, etc. Or at least show this process happening so we know it's not stuck.
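Something like this is what I mean (purely illustrative; `keep` and `progress_cb` are made-up stand-ins, not dupeGuru's actual internals):

    # Illustrative only: filter matches in chunks and report progress after each
    # chunk, so a long post-scan pass never looks stuck.
    def filter_matches_in_chunks(matches, keep, progress_cb, chunk_size=10_000):
        """Apply the `keep` predicate to every match, reporting progress per chunk."""
        kept = []
        total = len(matches)
        for start in range(0, total, chunk_size):
            chunk = matches[start:start + chunk_size]
            kept.extend(m for m in chunk if keep(m))
            progress_cb(min(start + chunk_size, total), total)
        return kept

    # e.g. filter_matches_in_chunks(all_matches, keep=is_valid_match,
    #          progress_cb=lambda done, total: print(f"{done}/{total} filtered"))
    # where is_valid_match is whatever predicate decides a match should be kept.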

DankMemeGuy avatar Sep 28 '20 12:09 DankMemeGuy

I have a huge amount of data, though it shouldn't count as large (under 600 GB), but it's been days now. I attach a screenshot. I'm using a new Mac Studio with an 8 TB hard drive, the fastest CPU, and 128 GB RAM.

Screenshot 2023-02-21 at 13 16 13

UkMediaOnline avatar Feb 21 '23 13:02 UkMediaOnline

More than two days thus far!

UkMediaOnline avatar Feb 21 '23 15:02 UkMediaOnline

@UkMediaOnline Have you checked the memory usage of dupeGuru during this operation?

arsenetar avatar Feb 21 '23 16:02 arsenetar

@UkMediaOnline Have you checked the memory usage of dupeGuru during this operation?

It has nothing to do with memory usage (or lack of system memory). It has to do with triple-nested loops, and probably some bug in the sorting routine that craps out when there are too many items to sort. This software is unusable until all the nested looping has been eliminated and replaced by a series of unnested, single-pass operations.

a-raccoon avatar Feb 21 '23 17:02 a-raccoon

Ideally, this software would be rewritten to use hash digests for each file, storing the digests in ADS (alternate data streams) or an .sha256 database; then it becomes a simple matter of sorting and comparing SHA checksums.
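Roughly this, sketched in Python (plain SHA-256 over file contents; the ADS/.sha256 caching part is left out, and none of this is actual dupeGuru code):

    # Rough sketch of the digest-based approach: hash every file once, then
    # group paths by digest in a single dict pass instead of pairwise comparison.
    import hashlib
    import os
    from collections import defaultdict


    def sha256_of(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()


    def duplicate_groups(root):
        by_digest = defaultdict(list)
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                by_digest[sha256_of(path)].append(path)
        # Only digests seen more than once are duplicate groups.
        return [paths for paths in by_digest.values() if len(paths) > 1]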

I've discontinued using DupeGuru and have switched to VoidTool's Everything 1.5

a-raccoon avatar Feb 21 '23 17:02 a-raccoon

@UkMediaOnline Have you checked the memory usage of dupeGuru during this operation?

I did check the memory usage, arsenetar; it's hardly using anything, still 126 GB of memory free... thanks anyway for your help :-)

UkMediaOnline avatar Feb 21 '23 17:02 UkMediaOnline

Ideally, this software would be rewritten to use hash digests for each file, storing the digests in ADS (alternate data streams) or an .sha256 database; then it becomes a simple matter of sorting and comparing SHA checksums.

I've discontinued using DupeGuru and have switched to VoidTool's Everything 1.5

I will try the piece of software you are using a-racoon and thanks for your help, youre a gentleman :-)

UkMediaOnline avatar Feb 21 '23 17:02 UkMediaOnline

@UkMediaOnline Have you checked the memory usage of dupeGuru during this operation?

@arsenetar I've scanned my NAS and the scanning took about 20-30 hours, but the fiddling with the results has been going on for 3 days now. The RAM usage has changed over the days, up to 3.5 GB, then down to 2.4, and right now it's about 0.9 GB. I have the same problem as everyone else: 0-1% CPU usage and 0 network usage. The computer has basically been on for almost a week, and 'Fiddling with results' took the better part of it with no CPU usage whatsoever. Is there a way I can help you investigate this?

Fethbita avatar Jun 27 '23 18:06 Fethbita

The RAM usage is increasing by 4 KB each second or so.

Fethbita avatar Jun 27 '23 19:06 Fethbita

I have a MacBook Pro with a Max chip, a 68-core GPU, and 96 GB RAM. It shouldn't break a sweat.

UkMediaOnline avatar Jun 27 '23 21:06 UkMediaOnline

@Fethbita There are a few operations that take place after the main scan. These operations run in a single thread, so CPU usage will likely not be very high. For a NAS scan, it's quite likely that the existence checks that happen after the main scan has finished could slow things down.

The current development code in master has a new option to skip this check. The existence check is mainly there to clean up files that might have been removed since the scan started, to prevent later errors when interacting with the results. If you are not removing files during the scan, this check is likely unnecessary. If you were able to build the latest development code and test with this option (under Advanced) disabling the check, it would help confirm whether that is the primary source of the slowdown for your scan, and identify whether there are further items causing slowdown at this stage.

Providing a profiled scan result would also be helpful (this can be a scan with the existence check disabled, to determine where time is being spent outside of it, or a scan with it enabled). The scan does need to complete to get a good profile result. Profiling can be enabled in the debug options, and the resulting profile files are saved in the same location as the logs, which is linked in the dialog.
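For context, this is roughly what that kind of profiling surfaces (a generic cProfile example, not the exact mechanism dupeGuru uses; profile_phase is a made-up helper):

    # Generic cProfile illustration, not dupeGuru's actual debug option: wrap the
    # slow phase and print where the cumulative time is going.
    import cProfile
    import pstats


    def profile_phase(phase_callable, out_path="fiddling.prof"):
        profiler = cProfile.Profile()
        profiler.enable()
        phase_callable()          # e.g. the post-scan result processing
        profiler.disable()
        profiler.dump_stats(out_path)
        # Show the 20 call sites with the most cumulative time.
        pstats.Stats(out_path).sort_stats("cumulative").print_stats(20)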

arsenetar avatar Jun 28 '23 08:06 arsenetar

@arsenetar I am quite sure that it's not the existing-file check. There hasn't been network activity from the process, so it hasn't been connecting to the NAS. It's neither CPU intensive nor disk intensive; it is doing something that causes the RAM usage to go up by about 4 KB each second, but I am not sure what that is.

It's still going and I don't want to cancel the scan right now since it has taken so long. I'll try to wait for a couple more days and see how it goes.

Fethbita avatar Jun 28 '23 14:06 Fethbita

Well, I would say it used to work much faster. I'll try scanning smaller and smaller sets, but by the time I'm done I'll have spent FAR too much time manually selecting all those tiny filesets, and then I have to scan THOSE sets into progressively wider and wider sets.

Now, if I could send the UI pairs of directories from the command line and then manipulate the results in the UI, it could be quicker. But having to navigate through sub-trees in the UI, and then select subdirectories to exclude/include (because that's how you end up trying to get it down to a file set it can handle) is sorely inefficient for the way you end up having to use it.

jelabarre59 avatar Jun 29 '23 23:06 jelabarre59

I've discontinued using DupeGuru and have switched to VoidTool's Everything 1.5

Nice, but it's an MS Windows application (I'm running Linux).

jelabarre59 avatar Jun 29 '23 23:06 jelabarre59

Yesterday I got a blue screen after around 4 days of 'Fiddling with results'. I will try to build it from source using the master branch and try the debugging options to see what happens. I am kind of exhausted by the week-long scan, though.

Unrelated to the issue here, I was looking for options to install it on a Synology NAS so I could do the scan on the NAS instead of over the LAN, but it seemed like there is only an unofficial Docker option and nothing else, so I will skip that.

Fethbita avatar Jul 01 '23 09:07 Fethbita

Same here: it eats up all the RAM and then freezes. Can't it write its working data to temporary files? RAM is not reliable for big comparisons.
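Something like this would already beat holding it all in RAM (a rough sketch spilling digests to a temporary SQLite database; the table layout is invented for illustration and is not how dupeGuru stores anything):

    # Sketch of keeping digests in a temporary SQLite database instead of RAM;
    # table and column names are made up for illustration.
    import os
    import sqlite3
    import tempfile


    def open_digest_store():
        fd, db_path = tempfile.mkstemp(suffix=".sqlite")
        os.close(fd)
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE files (digest TEXT, path TEXT)")
        con.execute("CREATE INDEX idx_digest ON files (digest)")
        return con


    def record(con, digest, path):
        con.execute("INSERT INTO files VALUES (?, ?)", (digest, path))


    def grouped_duplicates(con):
        # Duplicates are simply digests that occur more than once.
        return con.execute(
            "SELECT digest, group_concat(path, ';') FROM files"
            " GROUP BY digest HAVING COUNT(*) > 1"
        ).fetchall()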

bphd avatar Sep 27 '23 09:09 bphd

I had thought I'd figured out why dupeGuru would get stuck in a perpetual, endless scan, since it would work better on an ext4 or btrfs filesystem than NTFS (scanning from Linux). But now I have a partition formatted as ext4, and the number of files scanned for a particular directory is in the millions, yet a "find -type f | wc -l" says there are only 47441 files.

This is a recent Ryzen 5 system, 32 GB RAM, Fedora 38, running the scan on an external drive attached over USB 3. The speed of the system or bus really shouldn't matter anyway, as it's apparently scanning for WAY more files than actually exist in the directory or on the disk, so somewhere it's not even reading the filesystem correctly.

jelabarre59 avatar Oct 27 '23 22:10 jelabarre59

It's still there on Windows 11, with plenty of free RAM. In Task Manager the program only takes some 20% of the CPU and does nothing on the SSD. The bottleneck might be the RAM speed.

Anyway, adding buttons like Pause / Abort to the progress dialog could help.

nkomarov avatar Dec 08 '23 13:12 nkomarov

I'm at "Scanning for duplicates" with "4733895 matches found from 142642 groups" with no disk access anymore and slowly growing memory usage. Progress bar is showing 96%. Hasn't moved in a while. Overall process is taking days. 5.4G memory usage so far. CPU is 10% or less for that process (I have 16 cores (32 virtual)). Plenty of free memory, so will wait a bit longer. ... and it finished. The "Fiddling" part did not take much time at all. 625,285 duplicates found.

jtkouz avatar Jan 31 '24 17:01 jtkouz

@jtkouz I'm amazed it ever finished. How many days?

I've switched to using Everything v1.5 by Voidtools, to manage my files, find duplicates and perform bulk operations (rename, delete, etc). There's no excuse for dupeguru to take so long, and it doesn't even cache file hashes between scans.

a-raccoon avatar Feb 03 '24 13:02 a-raccoon

@a-raccoon It was 2-3 days. I didn't really note when I started it, sorry. I tried to select just two drives, but it ended up scanning more of my disks because I didn't realize that in the interface I had to choose "Excluded" for the "State". I instead followed the prompt "Select folders to scan and press 'Scan'.", which I did. :)

Looks like dupeGuru isn't listing my RAID disks.

I use Everything also but haven't relied on its dupe checking... YET.

jtkouz avatar Feb 04 '24 00:02 jtkouz

Well, Voidtools wouldn't be much use, as it's an MS Windows application.

jelabarre59 avatar Feb 04 '24 18:02 jelabarre59

Maybe this will help.

PJDude avatar Feb 04 '24 20:02 PJDude

It seems that DupeGuru has a pretty hard ceiling when it comes to loading a lot of results. It is fully capable of finding all duplicates and reaching 100%, but it will never get as far as displaying them all if there are too many. I estimate that the processing becomes exponentially slower with every result that DupeGuru has to process and add to the GUI list. It is not uncommon for Windows UI elements to behave this way.

I wonder if this can be solved by only displaying 10,000 to 100,000 results at a time, with a button to load more results. That way the user can work through the results in smaller batches with a fully responsive UI.

I might also suggest that this be done dynamically, by "giving up" after 10 minutes of processing, and the "Load More Results" button would process for another 10 minutes only. This could reduce total processing of results to 30 minutes instead of 30 days.

@a-raccoon Please edit your post to add those as workarounds. The devs have been asleep for ages, so we'd better carry on ourselves.

bphd avatar Mar 21 '24 07:03 bphd

Agreed; I would prefer the program come back and say the sample was too large rather than hang indefinitely. It probably needs a way to segment the files to be scanned into batches; if a file's duplicate is in a different batch, all that means is it won't get found until the program starts consolidating batches. So in the end you should still find your duplicates.
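Roughly like this (an illustrative sketch using content digests, not dupeGuru code; hashing makes the batch merge trivial):

    # Illustrative batch-then-consolidate scan: hash each batch independently,
    # then merge the per-batch maps so cross-batch duplicates still meet.
    import hashlib
    from collections import defaultdict


    def _digest(path):
        # Whole-file read keeps the sketch short; a real scan would read in chunks.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()


    def scan_batch(paths):
        """Group one batch of paths by content digest."""
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[_digest(path)].append(path)
        return by_digest


    def consolidate(batch_maps):
        """Merge per-batch maps; duplicates split across batches still surface."""
        merged = defaultdict(list)
        for batch in batch_maps:
            for digest, paths in batch.items():
                merged[digest].extend(paths)
        return [paths for paths in merged.values() if len(paths) > 1]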

jelabarre59 avatar Mar 21 '24 10:03 jelabarre59