fastdupes
Look into optimizations for the initial "gather paths to analyze" phase
Unlike the other steps, I've done practically nothing to optimize the initial recursive tree traversal phase.
I'll want to do some cost-benefit research on the following, and to identify other potential improvements:
- Look into the performance effect of checking whether excludes contain meta-characters and using simple string matching if they don't.
- As I understand it, `fnmatch.fnmatch` uses regexes internally and doesn't cache them. Given how many times it gets called, I should try using `re.compile` with `fnmatch.translate` instead.
- I should also look into what the performance effect is of programmatically combining multiple `fnmatch.translate` outputs so the ignore check can be handled in a single pass.
- Look into the memory-I/O trade-offs inherent in doing one stat call for each file and then caching it so it can be used both for `sizeClassifier` and for things like inode-based hardlink detection.
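
As a starting point for benchmarking, here is a rough Python 3 sketch of how the ideas above might fit together. It is not fastdupes' actual code: `META_CHARS`, `compile_excludes`, and `gather_paths` are hypothetical names, the use of `os.lstat` is an assumption, and the matching is case-sensitive (like `fnmatch.fnmatchcase`), which differs from `fnmatch.fnmatch` on case-insensitive platforms.

```python
import fnmatch
import os
import re

# Characters that give a pattern special meaning in fnmatch syntax.
META_CHARS = frozenset("*?[")


def compile_excludes(patterns):
    """Return a predicate that tests a name against all exclude patterns.

    Hypothetical helper: literal patterns go into a set for exact-match
    lookup; the rest are merged into one precompiled regex.
    """
    literals = frozenset(p for p in patterns if not META_CHARS.intersection(p))
    globs = [p for p in patterns if META_CHARS.intersection(p)]

    combined = None
    if globs:
        # Wrap each translated pattern in a non-capturing group so the
        # alternation cannot interfere with the per-pattern end anchors.
        combined = re.compile(
            "|".join("(?:%s)" % fnmatch.translate(p) for p in globs))

    def is_excluded(name):
        return name in literals or (
            combined is not None and combined.match(name) is not None)

    return is_excluded


def gather_paths(root, patterns):
    """Walk root, skipping excluded names, and cache one stat per file.

    Returns a dict mapping path -> os.stat_result so later phases can
    reuse st_size, st_ino, and st_dev without touching the disk again.
    """
    is_excluded = compile_excludes(patterns)
    stat_cache = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends.
        dirnames[:] = [d for d in dirnames if not is_excluded(d)]
        for name in filenames:
            if is_excluded(name):
                continue
            path = os.path.join(dirpath, name)
            try:
                stat_cache[path] = os.lstat(path)
            except OSError:
                continue  # Unreadable or vanished file: skip it.
    return stat_cache
```

The memory side of the trade-off is visible here: `stat_cache` holds one `os.stat_result` per file for the whole run, so any timing comparison should also measure peak memory on large trees.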