
Feature Request: Add more hashing algorithms to mergerfs.dedup

Open donmor opened this issue 1 year ago • 4 comments

Add an option, -H, --hashing-algorithm=, to be used along with -i, --ignore. That way we can use faster algorithms like CRC32, safer ones like SHA256, or multiple algorithms in turn (skipping the later ones if an earlier one already differs).
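To illustrate the "multiple algorithms in turn" idea, here is a minimal sketch, not the code from any PR. The helper names (crc32_file, hashlib_file, file_digest, same_by_hashes) are hypothetical and only use the standard zlib and hashlib modules; how mergerfs.dedup would actually wire this up is an assumption.

```python
import hashlib
import zlib


def crc32_file(path, chunk_size=1 << 20):
    """Full-file CRC32 as an 8-digit hex string (hypothetical helper)."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")


def hashlib_file(name, path, chunk_size=1 << 20):
    """Full-file digest using any algorithm name hashlib knows, e.g. 'sha256'."""
    h = hashlib.new(name)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def file_digest(algorithm, path):
    """Dispatch on the algorithm name; crc32 is not part of hashlib."""
    if algorithm == "crc32":
        return crc32_file(path)
    return hashlib_file(algorithm, path)


def same_by_hashes(paths, algorithms=("crc32", "sha256")):
    """True only if all paths agree under every algorithm, checked in order.

    As soon as one algorithm reports a difference, the remaining
    (presumably slower) algorithms are skipped.
    """
    for algorithm in algorithms:
        digests = {file_digest(algorithm, p) for p in paths}
        if len(digests) > 1:
            return False
    return True
```

With this shape, same_by_hashes([a, b], ("crc32", "sha256")) only pays for SHA256 when CRC32 already matches.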

donmor avatar Apr 23 '24 07:04 donmor

#148 is an implementation.

donmor avatar Apr 23 '24 09:04 donmor

The speed of a hash function is rarely an issue. The tool is IO bound most of the time. Have you done any benchmarking?

trapexit avatar Apr 23 '24 11:04 trapexit

I'll do it later.

donmor avatar Apr 23 '24 12:04 donmor

Made some modifications to #148 that make same-hash much faster by calling short_hashes_all before hashing each file.

Before:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m14.265s
user    0m13.363s
sys     0m0.900s

After:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m6.724s
user    0m6.286s
sys     0m0.432s

MD5 and SHA1 are considered unsafe, so SHA256 can be used instead (slower):

$ time mergerfs.dedup -v --ignore=same-hash --hash=sha256 /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m16.079s
user    0m15.569s
sys     0m0.500s

Sometimes only a few bits in a file are corrupted, so the damage can slip past the random sampling done by short_hash_file. A --hash=crc32 can be specified before --hash=sha256 to speed things up.
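A rough sketch of why a sampled hash can miss this, under stated assumptions: sample_offsets and sampled_hash below are hypothetical stand-ins and not the real short_hash_file (its block count, block size, and hash are not shown in this thread), and full_crc32 is likewise just an illustration of a cheap whole-file check.

```python
import hashlib
import os
import random
import zlib


def sample_offsets(size, blocks=4, block_size=4096, seed=0):
    """Pick block offsets pseudo-randomly; same size and seed give same offsets."""
    rng = random.Random(seed)
    last = max(size - block_size, 0)
    return sorted(rng.randint(0, last) for _ in range(blocks))


def sampled_hash(path, blocks=4, block_size=4096, seed=0):
    """Hash only a few sampled blocks (assumed stand-in for a short hash)."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for offset in sample_offsets(size, blocks, block_size, seed):
            f.seek(offset)
            h.update(f.read(block_size))
    return h.hexdigest()


def full_crc32(path, chunk_size=1 << 20):
    """CRC32 over every byte of the file, so any single flipped bit changes it."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

If the flipped bit lies outside every sampled block, sampled_hash reports the two files as equal while full_crc32 does not, which is the case a full-file CRC32 pass before SHA256 would catch cheaply.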

donmor avatar Apr 24 '24 02:04 donmor