Very slow file collection

Open fotico opened this issue 3 years ago • 13 comments

Describe the bug File collection is very slow.

To Reproduce

  1. Add an entire drive (>1M files)
  2. Start a normal scan
  3. "Collecting files to scan" step takes 3 hours

Expected behavior Scanning the drive should be done in under 20 minutes. A full scan of the same drive with WinDirStat, which also collects file sizes, takes about 10 minutes in total.

Desktop

  • OS: Windows 8.1

fotico avatar Jan 04 '22 12:01 fotico

second that.

I intentionally ran 3 programs at the same time on the same folder, hosted on a Synology share with 2 million files:

  1. WizTree running on Windows, completed in about 8 minutes.
  2. CZKawka running on a Linux box, completed in about 10 minutes.
  3. Dupeguru running on the same Linux box, completed in about 18 minutes.

Somehow Dupeguru is slower at collecting files for the scan.

chchia avatar Jan 19 '22 10:01 chchia

os.walk should be replaced with os.scandir, and the file size collected at scan time, to save one syscall per file on Windows.
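A minimal sketch of that suggestion, not dupeGuru's actual code: the helper name `collect_files` is hypothetical. It walks a tree with os.scandir and reads each file's size from the os.DirEntry object instead of calling os.stat on the path separately.

```python
import os


def collect_files(root):
    """Yield (path, size) for every regular file under root.

    Hypothetical helper for illustration; not part of dupeGuru.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            # Descend later; symlinked directories are skipped.
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            # On Windows, DirEntry.stat() reuses data cached by
                            # the directory listing for non-symlink entries,
                            # so no extra system call is made per file.
                            size = entry.stat(follow_symlinks=False).st_size
                            yield entry.path, size
                    except OSError:
                        # Entry vanished or is unreadable; skip it.
                        continue
        except OSError:
            # Directory unreadable (permissions, removed mid-scan); skip it.
            continue
```

On Unix, DirEntry.stat() still needs one system call per file, so the saving described here is mainly a Windows effect.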

Dobatymo avatar Feb 09 '22 02:02 Dobatymo

Several things impact performance here. os.listdir is still used in some places, and the file and folder classes implement functionality that may now be better left to Python's built-in classes and methods. Replacing the one os.walk call with os.scandir (which os.walk uses internally) does yield an improvement, but from some local testing it seems other parts of the file and folder collection need to be reworked as well before there is any drastic improvement.

arsenetar avatar Feb 09 '22 06:02 arsenetar

I have confirmed with some initial testing that rewriting the underlying file and folder classes around os.scandir and the resulting os.DirEntry objects can give significant performance improvements. Depending on the particular operation, I saw a 4x to 10x improvement in speed. It will take a while to pull these changes in, since other updates are needed to make sure all existing functionality still works (my testing focused on the collection and scan portion only).

arsenetar avatar Feb 10 '22 07:02 arsenetar

I am available to test if you need.

chchia avatar Feb 10 '22 09:02 chchia

Just linking this related feature request https://github.com/arsenetar/dupeguru/issues/959

Dobatymo avatar Feb 11 '22 05:02 Dobatymo

@fotico, @Dobatymo, and @chchia, if you are interested in building from source, the latest commit https://github.com/arsenetar/dupeguru/commit/efd500ecc1eb604918da3fc01512c502912771d8 has several improvements to the file collection. In testing I am seeing some good improvements in speed, and it still seems to work as expected. There is still a bit more that could be done, but this seems to be much better.

arsenetar avatar Mar 30 '22 04:03 arsenetar

I confirm the latest source has a much faster scan speed! Thanks! It is a huge improvement.

chchia avatar Mar 30 '22 10:03 chchia

@fotico, @Dobatymo, and @chchia, I pushed another update in https://github.com/arsenetar/dupeguru/commit/c5818b1d1f78be9201c5e3164177361fea0bf629 that adds a preference for profiling scan operations. It logs the number of calls and the time spent within functions when running a scan, and these logs can be used to determine where time is being spent. Based on my testing, I don't think there is much left to speed up beyond going to multiple threads (which I am going to put off for now), so I added the ability to get these logs to see what users are experiencing and guide further optimization.

arsenetar avatar Mar 31 '22 05:03 arsenetar

Thank you. How do I read the contents of the .profile file?

chchia avatar Mar 31 '22 06:03 chchia

@chchia, sorry, I probably should have provided some information on that. The logs are created by Python's cProfile profiler, so there are probably several ways to read them. I normally use https://jiffyclub.github.io/snakeviz/ to view them. I'll also note that the profile captures two top-level functions: get_dupe_groups() from scanner.py, and then either get_files() or get_folders() from directories.py, depending on the scan type.
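As one alternative to a graphical viewer, a file written by cProfile can also be read with the standard library's pstats module. This is only a sketch: the helper name `summarize_profile` and the file name in the usage comment are hypothetical, and the actual name and location of the log depend on the dupeGuru installation.

```python
import pstats


def summarize_profile(path, limit=20):
    """Print the `limit` entries with the highest cumulative time.

    Hypothetical helper for illustration; `path` must point to a file
    produced by cProfile.
    """
    stats = pstats.Stats(path)
    # strip_dirs() shortens file paths; "cumulative" sorts by the time spent
    # in a function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


# Example usage (placeholder file name):
# summarize_profile("scan.profile")
```

Sorting by cumulative time puts top-level entry points near the top of the listing, followed by the calls that account for most of their time.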

arsenetar avatar Mar 31 '22 16:03 arsenetar

@chchia @Dobatymo @fotico the latest version should be faster under most circumstances. Let me know if you find otherwise.

arsenetar avatar Jul 09 '22 00:07 arsenetar

I confirm the latest version is much faster than the previous one.

chchia avatar Jul 21 '22 03:07 chchia