fastdupes Look into optimizations for the initial "gather paths to analyze" phase

Open ssokolow opened this issue 9 years ago • 0 comments

Unlike the other steps, I've done practically nothing to optimize the initial recursive tree traversal phase.

I'll want to do some cost-benefit research on the following as well as identifying other potential improvements:

Look into the performance effect of checking whether excludes contain meta-characters and using simple string matching if they don't.
As I understand it, fnmatch.fnmatch uses regexes internally and doesn't cache them. Given how many times it gets called, I should try using re.compile with fnmatch.translate instead.
I should also look into the performance effect of programmatically combining multiple fnmatch.translate outputs so the ignore check can be handled in a single pass.
Look into the memory-I/O trade-offs inherent in doing one stat call for each file and then caching it so it can be used both for sizeClassifier and for things like inode-based hardlink detection.
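
As a starting point for benchmarking, here is a minimal sketch of how the ideas above might fit together. It is not fastdupes' actual code: `build_exclude_check` and `gather_paths` are hypothetical names, and it assumes excludes are matched against base names and case-sensitively (like `fnmatch.fnmatchcase`, not `fnmatch.fnmatch`). It also assumes a Python version where `fnmatch.translate` returns a self-contained pattern (3.6 or later), so the outputs can be joined with `|`. CPython's `fnmatch` does keep a small internal cache of compiled patterns, so the gain from pre-compiling should be measured rather than assumed.

```python
import fnmatch
import os
import re
import stat

# Characters that make fnmatch treat a pattern as a glob rather than a literal.
_GLOB_CHARS = frozenset('*?[')


def build_exclude_check(patterns):
    """Return a function name -> bool built from glob-style excludes.

    Hypothetical helper: literal patterns go into a set for plain string
    comparison, and the remaining globs are merged into one compiled regex.
    """
    literals = frozenset(p for p in patterns
                         if not _GLOB_CHARS.intersection(p))
    globs = [p for p in patterns if _GLOB_CHARS.intersection(p)]

    regex = None
    if globs:
        # Each translated pattern is anchored at the end of the string, and
        # re.match anchors at the start, so a plain alternation is enough.
        regex = re.compile(
            '|'.join('(?:%s)' % fnmatch.translate(p) for p in globs))

    def is_excluded(name):
        if name in literals:
            return True
        return regex is not None and regex.match(name) is not None

    return is_excluded


def gather_paths(roots, excludes=()):
    """Walk the roots once and return {path: os.stat_result}.

    Hypothetical helper: each file is stat'ed exactly once, and the result
    is kept so later phases can reuse it.
    """
    is_excluded = build_exclude_check(excludes)
    stats = {}
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk never descends into excluded dirs.
            dirnames[:] = [d for d in dirnames if not is_excluded(d)]
            for name in filenames:
                if is_excluded(name):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)  # lstat: do not follow symlinks
                except OSError:
                    continue  # vanished or unreadable; skip it
                if stat.S_ISREG(st.st_mode):
                    stats[path] = st
    return stats
```

With the cached results, grouping by size can read `st.st_size` and hardlink detection can key on `(st.st_dev, st.st_ino)` without touching the disk again. The cost is holding one `stat_result` per file in memory, which is the trade-off that needs measuring.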

Aug 21 '14 00:08 ssokolow