dupeGuru 4.0.4 RC x64 crashes on picture mode, a lot of files

Open o-l-a-v opened this issue 3 years ago • 9 comments

Describe the bug

Using dupeGuru 4.0.4 RC x64 in Picture mode, it crashes after a while without any warning or error message, and I can't find any logs.

This is on a folder containing results from an HDD recovery using "PhotoRec" (https://www.cgsecurity.org/wiki/PhotoRec), so it contains A LOT of files, including many duplicates.

image

  • 8 597 sub folders
  • 244 376 files

Settings

image

image

System

  • Dell XPS 9560
    • i7-7700HQ CPU
    • 32 GB 2600 MHz HyperX Impact DDR4 RAM
    • 512 GB 970 Pro SSD
    • Windows 10 Pro for Workstation 20H2, 19042.685

Files

dupeGuru says it's scanning through a total of 178 675 pictures.

image

Count and total size per file extension

 Count Extension         Size
 ----- ---------         ----
 15343 jpg       24 027.02 MB
132991 png       10 255.05 MB
  3203 mov        6 086.72 MB
    53 swf        3 141.84 MB
  7197 icns       2 786.16 MB
    69 bz2        2 726.91 MB
 19317 tif        2 499.13 MB
  4119 caf        2 204.74 MB
  6896 pdf        1 823.68 MB
   431 zip        1 390.22 MB
   736 m4p        1 144.69 MB
    54 tar        1 086.74 MB
   334 ds_store     897.21 MB
    18 avi          853.70 MB
   840 class        786.08 MB
   862 jar          771.13 MB
  1912 docx         563.45 MB
   100 pyc          518.96 MB
  5545 cab          487.30 MB
   480 xls          464.00 MB
     9 mp4          462.71 MB
   110 xar          368.51 MB
   327 ttf          364.81 MB
   769 mp3          358.88 MB
    25 jp2          343.47 MB
  2504 h            284.99 MB
   142 pptx         266.92 MB
   105 elf          239.24 MB
   360 sxw          237.65 MB
 11463 gz           223.85 MB
  1608 ps           223.76 MB
   471 wav          216.76 MB
   219 swc          205.71 MB
  1116 doc          196.41 MB
   415 <none>       188.46 MB
     1 wmv          156.25 MB
    32 wma          154.51 MB
   144 ppt          134.98 MB
   123 dll          117.23 MB
   713 aif          107.96 MB
    58 psd          101.55 MB
   348 ico           95.46 MB
   404 myi           87.94 MB
   391 eps           62.56 MB
   477 xlsx          56.89 MB
     1 gpg           54.16 MB
  1054 rtf           52.87 MB
     1 wmf           50.00 MB
   276 icc           43.01 MB
  1537 java          31.85 MB
     7 pct           24.77 MB
 10789 gif           24.57 MB
  1812 tex           21.36 MB
   235 bmp           18.61 MB
     1 nds           11.03 MB
     1 lzo           10.73 MB
   124 ai             9.85 MB
   493 mbox           8.18 MB
    39 a              8.06 MB
  1082 c              6.48 MB
  1087 svg            5.70 MB
   238 emf            5.46 MB
    27 exe            5.38 MB
    42 dvi            4.84 MB
   185 pl             3.53 MB
     3 ods            3.12 MB
     3 mat            2.90 MB
     6 odt            2.73 MB
   176 py             2.66 MB
   130 f              2.51 MB
   161 frm            2.44 MB
    19 jks            1.87 MB
   547 sh             1.69 MB
     2 sit            1.11 MB
     3 chm            0.90 MB
     5 dat            0.81 MB
     3 xpi            0.75 MB
     1 cp_            0.68 MB
    31 au             0.59 MB
     3 wps            0.58 MB
     6 dbx            0.54 MB
    83 asp            0.53 MB
    27 acb            0.53 MB
    88 adr            0.53 MB
   251 ini            0.52 MB
     1 pdb            0.52 MB
    24 woff           0.50 MB
    29 rb             0.40 MB
     2 res            0.39 MB
   174 csv            0.39 MB
     1 wtv            0.38 MB
     9 mpg            0.36 MB
     2 tz             0.34 MB
    42 jsonlz4        0.34 MB
     1 accdb          0.32 MB
    60 info           0.29 MB
    10 lyx            0.29 MB
     1 pst            0.26 MB
     1 reg            0.25 MB
    18 pm             0.25 MB
   165 ics            0.20 MB
    45 xmp            0.19 MB
   129 win            0.18 MB
    63 json           0.16 MB
     1 dwg            0.09 MB
     2 hdr            0.06 MB
    39 bat            0.05 MB
     2 db             0.05 MB
    32 xpt            0.03 MB
     3 mid            0.01 MB
     4 ly             0.01 MB
     5 php            0.01 MB
     3 wpl            0.01 MB
     1 vfb            0.01 MB
    36 url            0.00 MB
     1 amr            0.00 MB
     1 pcx            0.00 MB
    20 vcf            0.00 MB
    23 jsp            0.00 MB
     1 ifo            0.00 MB
     1 pfx            0.00 MB
     1 ogg            0.00 MB
     1 cue            0.00 MB
     1 pub            0.00 MB
     2 ram            0.00 MB
     1 lnk            0.00 MB
     3 inf            0.00 MB
     2 dif            0.00 MB

PowerShell for above list

$([array](Get-ChildItem -Recurse -Force -File | Select-Object -Property 'Extension','Length' | Group-Object -Property 'Extension')).ForEach{
    [PSCustomObject]@{
        'Count'     = [uint32] $_.'Count'
        'Extension' = [string] $_.'Name'
        'Size'      = [double](
            [uint64]$(
                $([uint64[]]($_.'Group'.'Length')) | Measure-Object -Sum | Select-Object -ExpandProperty 'Sum'
            ) / 1MB
        )
    }
} | Sort-Object -Property 'Size' -Descending | Format-Table -Property 'Count','Extension',@{'Name'='Size';'Expression'={'{0} MB' -f $_.'Size'.Tostring('N')};'Align'='Right'}

Expected behavior

  • Scan should complete and I should be presented with a result.
  • If something goes wrong, I should get an error message, and preferably some logs or debug info to share with you guys.

o-l-a-v avatar Dec 19 '20 14:12 o-l-a-v

That's interesting. Can you monitor RAM usage while doing such big scan? Perhaps it is running out of memory.

glubsy avatar Dec 19 '20 16:12 glubsy

It has come much further 2nd time (still running).

Using 20+ GB of RAM and maxing out the CPU.

image

image

Never mind, there it crashed. At least it showed an error message this time. :)

image

Application Name: dupeGuru
Version: 4.0.4 RC

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "multiprocessing\pool.py", line 121, in worker
  File "core\pe\matchblock.py", line 141, in async_compare
MemoryError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "core\pe\matchblock.py", line 199, in getmatches
  File "core\pe\matchblock.py", line 159, in collect_results
  File "multiprocessing\pool.py", line 657, in get
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hscommon\gui\progress_window.py", line 101, in pulse
  File "core\app.py", line 323, in _job_error
  File "hscommon\jobprogress\performer.py", line 43, in _async_run
  File "core\app.py", line 780, in do
  File "core\scanner.py", line 137, in get_dupe_groups
  File "core\pe\scanner.py", line 31, in _getmatches
  File "core\pe\matchblock.py", line 208, in getmatches
MemoryError

%localappdata%\Hardcoded Software\dupeGuru\debug.log had one line saying:

2020-12-19 17:16:04,515 - WARNING - Ran out of memory when scanning! We had 175739229 matches.

175 million matches? Seems unlikely.
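
For a rough sense of scale, the match count can be compared against the number of possible picture pairs. This is only a back-of-the-envelope sketch: the picture and match counts come from the post above, while the 100 bytes per stored match is an assumption for illustration, not a value measured from dupeGuru.

```python
# Rough sanity check of the match count reported in debug.log.
pictures = 178_675       # picture count reported by dupeGuru above
matches = 175_739_229    # match count from the debug.log line above

# Every unordered pair of pictures is a potential match.
possible_pairs = pictures * (pictures - 1) // 2
fraction = matches / possible_pairs

# ASSUMPTION: ~100 bytes per stored match; not measured from dupeGuru.
assumed_bytes_per_match = 100
estimated_gib = matches * assumed_bytes_per_match / 2**30

print(f"possible pairs:   {possible_pairs:,}")
print(f"fraction matched: {fraction:.2%}")
print(f"estimated memory: {estimated_gib:.1f} GiB")
```

With these inputs there are about 16 billion possible pairs, so 175 million matches is roughly 1.1% of them. Under the assumed 100 bytes per match, that would be about 16 GiB of match data.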

o-l-a-v avatar Dec 19 '20 16:12 o-l-a-v

Trying Standard mode now. Seems like there is a hefty memory leak, RAM is skyrocketing now. We're talking +100MB every 5 seconds. No, it actually seems to be exponential.

image

Getting "not responding" a lot.

image

Progress barely moving, goes slower and slower.

o-l-a-v avatar Dec 19 '20 17:12 o-l-a-v

It crashed in the end:

Capture

Application Name: dupeGuru
Version: 4.0.4 RC

Traceback (most recent call last):
  File "hscommon\gui\progress_window.py", line 101, in pulse
  File "core\app.py", line 323, in _job_error
  File "hscommon\jobprogress\performer.py", line 43, in _async_run
  File "core\app.py", line 780, in do
  File "core\scanner.py", line 137, in get_dupe_groups
  File "core\scanner.py", line 82, in _getmatches
  File "core\engine.py", line 269, in getmatches_by_contents
MemoryError

dupeGuru wouldn't close after I dismissed the error message and then closed the program. The process stayed open and held on to all the RAM, so other applications (Chrome, for instance) misbehaved because the system was out of memory.

Capture2

Think I'll try using jdupes first (https://github.com/jbruchon/jdupes/releases), then try this thing again for pictures not matching on checksum.
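
The "exact duplicates first" pass can be sketched in plain Python. This is not how jdupes or dupeGuru work internally; it is a minimal illustration of grouping by file size and then by SHA-256 digest, and the function names are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root that share size and SHA-256 digest."""
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have an exact duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Memory use here grows with the number of files rather than the number of file pairs, which is why a checksum pass is a cheap way to shrink the set before a fuzzy picture comparison.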

Feature requests:

  • Ability to look for one or multiple given file extensions at a time. Like, only JPG.
  • Proper handling of memory when a lot of files and duplicates, for both Standard and Picture mode.
    • RAM usage that grows exponentially while progress keeps slowing down suggests that something is not right here.
  • Use other open source freely available tools for resource heavy operations, like checksum and similar, like jdupes.
  • Make it possible to move the window while dupeGuru is scanning. The "scanning for duplicates" in the foreground prohibits it.
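
Until an extension filter exists in the application, the first request can be approximated outside dupeGuru by collecting only the wanted files up front. A minimal sketch, assuming a hypothetical helper name and a case-insensitive match on the file suffix:

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_files_with_extensions(root: Path, extensions: Iterable[str]) -> Iterator[Path]:
    """Yield files under root whose extension is in extensions (case-insensitive)."""
    # Accept both "jpg" and ".jpg" by normalising to a leading dot.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

For example, `list(iter_files_with_extensions(Path("recovered"), ["jpg"]))` (where `recovered` is a placeholder folder name) would select only the JPG files, which the table above puts at 15 343 out of the 178 675 pictures scanned.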

o-l-a-v avatar Dec 19 '20 17:12 o-l-a-v

Oh, here are the settings I had with Standard mode:

image

o-l-a-v avatar Dec 19 '20 17:12 o-l-a-v

"Ability to look for one or multiple given file extensions at a time. Like, only JPG."

This should be part of 4.1.0 once this patch has been merged. You can test it out from this dev branch.

As for the memory issue, right now it's best to reduce the number of files to scan. But it's definitely something to improve in the future.

glubsy avatar Dec 19 '20 20:12 glubsy

Any updates on this? I have 16 GB of RAM and dupeGuru crashes for the same reason when de-duping >100k photos: out of memory.

flamaest avatar Apr 14 '21 21:04 flamaest

@o-l-a-v, @flamaest it might be worth taking a look at the most recent version (4.3.1). I have made some changes that improved the speed of scans and may also improve memory usage in some situations (although the changes were targeted at speed). I have not noticed high usage when scanning ~50k items in normal content mode; it stayed below 200 MB of RAM. I don't really have 100k+ datasets to scan for testing right now.

arsenetar avatar Jul 09 '22 01:07 arsenetar

Same here: it eats up all the RAM and then freezes. Can't it write its working data to temporary files? RAM alone is not reliable for comparisons this large.
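
One way the temporary-file idea could look is a match store that spills to a SQLite file instead of keeping every match in a Python list. This is a hypothetical sketch, not dupeGuru code: the class name, the `(path_a, path_b, percent)` shape of a match, and the batch size are all assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical match shape: (first path, second path, similarity percent).
Match = Tuple[str, str, int]


class DiskMatchStore:
    """Append-only match store backed by a temporary SQLite file."""

    def __init__(self, batch_size: int = 10_000) -> None:
        self._dir = tempfile.TemporaryDirectory()
        self._db = sqlite3.connect(str(Path(self._dir.name) / "matches.sqlite"))
        self._db.execute(
            "CREATE TABLE matches (path_a TEXT, path_b TEXT, percent INTEGER)"
        )
        self._batch = []
        self._batch_size = batch_size

    def add(self, match: Match) -> None:
        """Buffer a match; write the buffer to disk once it is full."""
        self._batch.append(match)
        if len(self._batch) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write buffered matches to the database and empty the buffer."""
        self._db.executemany("INSERT INTO matches VALUES (?, ?, ?)", self._batch)
        self._db.commit()
        self._batch.clear()

    def __len__(self) -> int:
        self.flush()
        return self._db.execute("SELECT COUNT(*) FROM matches").fetchone()[0]

    def __iter__(self):
        self.flush()
        return iter(
            self._db.execute(
                "SELECT path_a, path_b, percent FROM matches ORDER BY rowid"
            )
        )

    def close(self) -> None:
        """Close the database, then delete the temporary directory."""
        self._db.close()
        self._dir.cleanup()
```

Only one batch of matches is held in memory at a time, so RAM use stays roughly constant while disk use grows with the number of matches. The trade-off is slower access when the results are later grouped.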

bphd avatar Sep 27 '23 09:09 bphd