find-duplicates
Find duplicate files quickly.
For large files (say, over 1 MB), we can compare a few sampled regions of each file to quickly rule out non-duplicates without reading the entire file. A possible...
Hi Tom, Thanks for the talk explaining the tool last night. I remembered the tool I was thinking of for benchmarking on the CLI (with the limitations on controlling the...
On a large-ish filesystem, find-duplicates failed with:

```
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw({0x52ac68?, 0x472aa0?})
	/home/twp/sdk/go1.21.5/src/runtime/panic.go:1077 +0x5c fp=0x7ff2478c9c38 sp=0x7ff2478c9c08 pc=0x43c9dc
runtime.checkmcount()
	/home/twp/sdk/go1.21.5/src/runtime/proc.go:802 +0x8e fp=0x7ff2478c9c60...
```
When running `find-duplicates ./dir ./dir/subdir`, `find-duplicates` walks `./dir` and `./dir/subdir` separately, even though one is a subdirectory of the other. This duplicate work should be avoided. Fixing #3 will...
@florianl suggested using inode numbers to detect hard links to the same content, which allows us to detect duplicates without having to open the files or read their contents.
The `find` command seems to perform much better than Go's `filepath.WalkDir`. stapelberg indicated that bradfitz (no mentions, to avoid spamming) investigated this as part of `goimports` and was able...
@gnoack suggested using incremental hashing to help detect duplicates: hashing should proceed incrementally, and can stop as soon as we know that files are not duplicates. @rciurlea suggested an exponential...