dupeGuru 4.0.4 RC x64 crashes in Picture mode with a lot of files
Describe the bug
Using dupeGuru 4.0.4 RC x64 in Picture mode, it crashes after a while without any warning or error message, and I can't find any logs.
This is on a folder containing results from an HDD recovery using "PhotoRec" (https://www.cgsecurity.org/wiki/PhotoRec), so it contains A LOT of files, many of them duplicates.
- 8 597 sub folders
- 244 376 files
Settings
System
- Dell XPS 9560
- i7-7700HQ CPU
- 32 GB 2600 MHz HyperX Impact DDR4 RAM
- 512 GB 970 Pro SSD
- Windows 10 Pro for Workstation 20H2, 19042.685
Files
dupeGuru says it's scanning through a total of 178 675 pictures.
Count and total size per file extension
Count Extension Size
----- --------- ----
15343 jpg 24 027.02 MB
132991 png 10 255.05 MB
3203 mov 6 086.72 MB
53 swf 3 141.84 MB
7197 icns 2 786.16 MB
69 bz2 2 726.91 MB
19317 tif 2 499.13 MB
4119 caf 2 204.74 MB
6896 pdf 1 823.68 MB
431 zip 1 390.22 MB
736 m4p 1 144.69 MB
54 tar 1 086.74 MB
334 ds_store 897.21 MB
18 avi 853.70 MB
840 class 786.08 MB
862 jar 771.13 MB
1912 docx 563.45 MB
100 pyc 518.96 MB
5545 cab 487.30 MB
480 xls 464.00 MB
9 mp4 462.71 MB
110 xar 368.51 MB
327 ttf 364.81 MB
769 mp3 358.88 MB
25 jp2 343.47 MB
2504 h 284.99 MB
142 pptx 266.92 MB
105 elf 239.24 MB
360 sxw 237.65 MB
11463 gz 223.85 MB
1608 ps 223.76 MB
471 wav 216.76 MB
219 swc 205.71 MB
1116 doc 196.41 MB
415 <none> 188.46 MB
1 wmv 156.25 MB
32 wma 154.51 MB
144 ppt 134.98 MB
123 dll 117.23 MB
713 aif 107.96 MB
58 psd 101.55 MB
348 ico 95.46 MB
404 myi 87.94 MB
391 eps 62.56 MB
477 xlsx 56.89 MB
1 gpg 54.16 MB
1054 rtf 52.87 MB
1 wmf 50.00 MB
276 icc 43.01 MB
1537 java 31.85 MB
7 pct 24.77 MB
10789 gif 24.57 MB
1812 tex 21.36 MB
235 bmp 18.61 MB
1 nds 11.03 MB
1 lzo 10.73 MB
124 ai 9.85 MB
493 mbox 8.18 MB
39 a 8.06 MB
1082 c 6.48 MB
1087 svg 5.70 MB
238 emf 5.46 MB
27 exe 5.38 MB
42 dvi 4.84 MB
185 pl 3.53 MB
3 ods 3.12 MB
3 mat 2.90 MB
6 odt 2.73 MB
176 py 2.66 MB
130 f 2.51 MB
161 frm 2.44 MB
19 jks 1.87 MB
547 sh 1.69 MB
2 sit 1.11 MB
3 chm 0.90 MB
5 dat 0.81 MB
3 xpi 0.75 MB
1 cp_ 0.68 MB
31 au 0.59 MB
3 wps 0.58 MB
6 dbx 0.54 MB
83 asp 0.53 MB
27 acb 0.53 MB
88 adr 0.53 MB
251 ini 0.52 MB
1 pdb 0.52 MB
24 woff 0.50 MB
29 rb 0.40 MB
2 res 0.39 MB
174 csv 0.39 MB
1 wtv 0.38 MB
9 mpg 0.36 MB
2 tz 0.34 MB
42 jsonlz4 0.34 MB
1 accdb 0.32 MB
60 info 0.29 MB
10 lyx 0.29 MB
1 pst 0.26 MB
1 reg 0.25 MB
18 pm 0.25 MB
165 ics 0.20 MB
45 xmp 0.19 MB
129 win 0.18 MB
63 json 0.16 MB
1 dwg 0.09 MB
2 hdr 0.06 MB
39 bat 0.05 MB
2 db 0.05 MB
32 xpt 0.03 MB
3 mid 0.01 MB
4 ly 0.01 MB
5 php 0.01 MB
3 wpl 0.01 MB
1 vfb 0.01 MB
36 url 0.00 MB
1 amr 0.00 MB
1 pcx 0.00 MB
20 vcf 0.00 MB
23 jsp 0.00 MB
1 ifo 0.00 MB
1 pfx 0.00 MB
1 ogg 0.00 MB
1 cue 0.00 MB
1 pub 0.00 MB
2 ram 0.00 MB
1 lnk 0.00 MB
3 inf 0.00 MB
2 dif 0.00 MB
PowerShell for above list
# Group all files (including hidden ones) by extension, then total the size of each group
$([array](Get-ChildItem -Recurse -Force -File | Select-Object -Property 'Extension','Length' | Group-Object -Property 'Extension')).ForEach{
    [PSCustomObject]@{
        'Count'     = [uint32] $_.'Count'
        'Extension' = [string] $_.'Name'
        # Sum of the file lengths in this group, converted from bytes to MB
        'Size'      = [double](
            [uint64]$(
                $([uint64[]]($_.'Group'.'Length')) | Measure-Object -Sum | Select-Object -ExpandProperty 'Sum'
            ) / 1MB
        )
    }
} | Sort-Object -Property 'Size' -Descending | Format-Table -Property 'Count','Extension',@{'Name'='Size';'Expression'={'{0} MB' -f $_.'Size'.ToString('N')};'Align'='Right'}
Expected behavior
- Scan should complete and I should be presented with a result.
- If something goes wrong, I should get an error message, and preferably some logs or debug info to share with you guys.
That's interesting. Can you monitor RAM usage while doing such a big scan? Perhaps it is running out of memory.
It got much further the second time (still running).
Using 20+ GB of RAM and maxing out the CPU.
Never mind, there it crashed. At least it showed an error message this time. :)
Application Name: dupeGuru
Version: 4.0.4 RC
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "multiprocessing\pool.py", line 121, in worker
File "core\pe\matchblock.py", line 141, in async_compare
MemoryError
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "core\pe\matchblock.py", line 199, in getmatches
File "core\pe\matchblock.py", line 159, in collect_results
File "multiprocessing\pool.py", line 657, in get
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "hscommon\gui\progress_window.py", line 101, in pulse
File "core\app.py", line 323, in _job_error
File "hscommon\jobprogress\performer.py", line 43, in _async_run
File "core\app.py", line 780, in do
File "core\scanner.py", line 137, in get_dupe_groups
File "core\pe\scanner.py", line 31, in _getmatches
File "core\pe\matchblock.py", line 208, in getmatches
MemoryError
%localappdata%\Hardcoded Software\dupeGuru\debug.log
had one line saying:
2020-12-19 17:16:04,515 - WARNING - Ran out of memory when scanning! We had 175739229 matches.
175 million matches? Seems unlikely.
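For a sense of scale, the reported match count can be compared with the number of picture pairs that exist at all. The sketch below is plain arithmetic on the two figures quoted in this thread (178 675 pictures, 175 739 229 matches). The helper names are made up for illustration, and the last step assumes that every pair inside a cluster of mutually similar pictures is stored as a separate match, which the thread does not confirm for dupeGuru.

```python
import math

pictures = 178_675      # pictures dupeGuru reported scanning
matches = 175_739_229   # match count from the debug.log warning

# Distinct unordered pairs among n items: n * (n - 1) / 2
possible_pairs = math.comb(pictures, 2)
fraction = matches / possible_pairs

print(f"possible pairs:            {possible_pairs:,}")
print(f"share reported as matches: {fraction:.2%}")


def pairs_in_group(k: int) -> int:
    """Pairwise matches produced by one cluster of k mutually similar files."""
    return math.comb(k, 2)


def smallest_group_for(pair_count: int) -> int:
    """Smallest cluster size whose internal pairs alone reach pair_count."""
    k = math.isqrt(2 * pair_count)
    while pairs_in_group(k) < pair_count:
        k += 1
    return k


# ASSUMPTION: all pairs within a cluster are kept as individual matches.
print(f"one cluster of {smallest_group_for(matches):,} similar pictures would be enough")
```

The matches come to roughly 1% of all possible pairs. Because pairs grow quadratically with cluster size, a single cluster of a few tens of thousands of near-identical recovered images would be enough to produce a count of this size, so the figure is large but not impossible.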
Trying Standard mode now. Seems like there is a hefty memory leak, RAM is skyrocketing now. We're talking +100MB every 5 seconds. No, it actually seems to be exponential.
Getting "not responding" a lot.
Progress barely moving, goes slower and slower.
It crashed in the end:
Application Name: dupeGuru
Version: 4.0.4 RC
Traceback (most recent call last):
File "hscommon\gui\progress_window.py", line 101, in pulse
File "core\app.py", line 323, in _job_error
File "hscommon\jobprogress\performer.py", line 43, in _async_run
File "core\app.py", line 780, in do
File "core\scanner.py", line 137, in get_dupe_groups
File "core\scanner.py", line 82, in _getmatches
File "core\engine.py", line 269, in getmatches_by_contents
MemoryError
The process wouldn't exit after I dismissed the error message and closed the program; it stayed running and held on to all the RAM, so Chrome, for instance, misbehaved because the system was out of memory.
Think I'll try using jdupes first (https://github.com/jbruchon/jdupes/releases), then try this thing again for pictures not matching on checksum.
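The checksum-first pass described above can be sketched in a few lines of Python. This is not how jdupes or dupeGuru is implemented; it only illustrates the general technique (group by size, then hash only the files that share a size), and `file_digest` and `exact_duplicates` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of byte-identical files below root."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have an exact duplicate
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return groups
```

Storing one group per digest, rather than one record per pair, keeps memory roughly proportional to the number of files, even when thousands of copies of the same file exist.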
Feature requests:
- Ability to look for one or multiple given file extensions at a time. Like, only JPG.
- Proper handling of memory when there are a lot of files and duplicates, for both Standard and Picture mode.
- RAM usage growing exponentially while progress gets slower and slower suggests that something is not right here.
- Use other freely available open source tools, like jdupes, for resource-heavy operations such as checksumming.
- Make it possible to move the window while dupeGuru is scanning. The "scanning for duplicates" in the foreground prohibits it.
Oh, here are the settings I had with Standard mode
Ability to look for one or multiple given file extensions at a time. Like, only JPG.
This should be part of 4.1.0 once this patch has been merged. You can test it out from this dev branch.
As for the memory issue, right now it's best to reduce the number of files to scan. But it's definitely something to improve in the future.
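Until extension filtering is available in the application, one way to reduce the number of files is to pick them out by extension beforehand. The following is a minimal sketch that does not use any dupeGuru API; `files_with_extensions` is a hypothetical helper, and what to do with the selected paths (for example, moving them into a separate folder and scanning only that) is left open.

```python
import os
from pathlib import Path
from typing import Iterable, Iterator


def files_with_extensions(root: Path, extensions: Iterable[str]) -> Iterator[Path]:
    """Yield files below root whose extension is in extensions (case-insensitive)."""
    # Accept both "jpg" and ".jpg" and normalise to a lowercase ".jpg" form
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if Path(name).suffix.lower() in wanted:
                yield Path(folder) / name
```

For example, `list(files_with_extensions(Path("recovered"), ["jpg", "jpeg"]))` would collect only the JPEG files under a folder named `recovered` (a placeholder path).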
Any updates on this? I have 16 GB of RAM and it crashes for the same reason when de-duping >100k photos: out of memory.
@o-l-a-v, @flamaest it might be worth taking a look at the most recent version (4.3.1). I have made some changes which improved the speed of scans and may also improve memory performance in some situations (although the changes were targeted at speed). I have not noticed high usage when scanning ~50k items in normal content mode; it stayed below 200 MB of RAM usage. I don't really have 100k+ datasets to scan for testing right now.
Same here: it eats up all the RAM, then freezes. Can't it write its working data to temporary files? RAM is not reliable for comparisons this big.
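The idea of keeping working data in temporary files could look roughly like the sketch below, which buffers matches in small batches and writes them to an SQLite file instead of holding every pair in RAM. This is only an illustration of the suggestion, not dupeGuru's code: `MatchStore`, its table layout, and the `(path_a, path_b, percent)` tuple shape are all assumptions.

```python
import sqlite3
from typing import Iterator, List, Tuple

Match = Tuple[str, str, int]  # (first path, second path, similarity percentage)


class MatchStore:
    """Hypothetical on-disk store for match pairs, backed by SQLite."""

    def __init__(self, db_path: str, batch_size: int = 10_000) -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS matches "
            "(path_a TEXT, path_b TEXT, percent INTEGER)"
        )
        self._batch_size = batch_size
        self._pending: List[Match] = []

    def add(self, match: Match) -> None:
        """Buffer one match; write the buffer to disk once it is full."""
        self._pending.append(match)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write all buffered matches to the database and empty the buffer."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.executemany(
                "INSERT INTO matches VALUES (?, ?, ?)", self._pending
            )
        self._pending.clear()

    def __iter__(self) -> Iterator[Match]:
        """Yield stored matches row by row, best matches first."""
        self.flush()
        yield from self._conn.execute(
            "SELECT path_a, path_b, percent FROM matches ORDER BY percent DESC"
        )

    def close(self) -> None:
        self.flush()
        self._conn.close()
```

With this shape, at most `batch_size` matches are held in memory at a time. The trade-off is slower access and extra disk usage, and whether this would fit into dupeGuru's scanning pipeline is an open question.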