find-duplicates
Find duplicate files quickly.
For large files (say, over 1 MB), we can compare a few sampled regions of each file to quickly rule out non-duplicates without reading the entire file. A possible...
Hi Tom, Thanks for the talk explaining the tool last night. I remembered the tool I was thinking of for benchmarking on the CLI (with the limitations on controlling the...
On a large-ish filesystem, find-duplicates failed with:

```
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw({0x52ac68?, 0x472aa0?})
	/home/twp/sdk/go1.21.5/src/runtime/panic.go:1077 +0x5c fp=0x7ff2478c9c38 sp=0x7ff2478c9c08 pc=0x43c9dc
runtime.checkmcount()
	/home/twp/sdk/go1.21.5/src/runtime/proc.go:802 +0x8e fp=0x7ff2478c9c60...
```
When running `find-duplicates ./dir ./dir/subdir`, `find-duplicates` walks `./dir` and `./dir/subdir` separately, even though one is a subdirectory of the other. This duplicate work should be avoided. Fixing #3 will...
@florianl suggested using inode numbers to detect hard links to the same content, which allows us to detect duplicates without having to open the files or read their contents.
The `find` command seems to perform much better than Go's `filepath.WalkDir`. stapelberg indicated that bradfitz (no mentions, to avoid spamming) investigated this as part of `goimports` and was able...
@gnoack suggested using incremental hashing to help detect duplicates: hashing should proceed incrementally, and can stop as soon as we know that files are not duplicates. @rciurlea suggested an exponential...