duff
memory optimization
Hi, I'm doing some work on duff because I found it useful when fixing broken rsnapshot repositories (I will make some pull requests in a few days). Unfortunately, such repositories are a bit unusual: millions of files, mostly hardlinked in groups of 30-50.
It seems I'm running into a problem with large buckets (long lists): because each sampled file allocates 4KB of data that is only freed at the end of bucket processing, I'm getting "out of memory" errors at around 3GB of allocated memory (the box is a light 32-bit Atom-based system).
As sizeof(FileList) == 12, I see no problem with increasing HASH_BITS to 16 (~800KB) or even 20 (~13MB). I wonder what you think: would it be a good idea to add an option to make it runtime-configurable?
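For illustration, a minimal sketch of what a runtime-sized bucket table might look like; the FileList layout and the function names here are hypothetical (only meant to match the 12-byte size mentioned above), not duff's actual code:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical list head, assumed to total 12 bytes on a 32-bit build. */
typedef struct
{
    void    *head;       /* first file in the bucket      */
    uint32_t count;      /* entries currently in the list */
    uint32_t allocated;  /* entries allocated             */
} FileList;

static FileList *buckets = NULL;
static size_t bucket_count = 0;

/* Allocate the bucket table from a runtime "hash bits" option
 * instead of a compile-time HASH_BITS constant. */
static int init_buckets(unsigned int hash_bits)
{
    bucket_count = (size_t) 1 << hash_bits;
    buckets = calloc(bucket_count, sizeof(FileList));
    return buckets != NULL ? 0 : -1;
}
```

With 16 bits that is 65,536 * 12 bytes (~768KB), and with 20 bits about 12MB, roughly the figures quoted above.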
Another idea is to (optionally?) replace the sample with some simple, fast running checksum (CRC64?).
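As a rough illustration of that idea (a sketch only, not a patch against duff), a bitwise CRC-64 using the ECMA-182 polynomial; a table-driven variant would be much faster, but even this shows the per-file state shrinking from a 4KB sample to a single 8-byte running value:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-64, ECMA-182 polynomial, no reflection, no final XOR.
 * Feed it chunks as they are read; only the 8-byte running crc is kept. */
static uint64_t crc64_update(uint64_t crc, const unsigned char *data, size_t size)
{
    const uint64_t poly = 0x42f0e1eba9ea3693ULL;

    for (size_t i = 0;  i < size;  i++)
    {
        crc ^= (uint64_t) data[i] << 56;

        for (int bit = 0;  bit < 8;  bit++)
            crc = (crc & (1ULL << 63)) ? (crc << 1) ^ poly : (crc << 1);
    }

    return crc;
}
```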
You're not the only one running duff on millions of files, which is something I hadn't imagined when I was writing it. It's past time for a 0.6 release anyway. I will look into this.
You're completely right - the numbers seem to exceed all expectations. I just hit the limit of the inode reference counter (the number of possible hardlinks for a given file), which is around 32K on FreeBSD/UFS2... I will fix my deduplication script and provide it as an alternative to join-duplicates.sh.
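In case it helps anyone else hitting the same wall, a minimal sketch of the guard I have in mind: check the existing file's link count before replacing a duplicate with another hardlink, and fall back to a fresh copy once the counter approaches the filesystem's limit. The function name and the exact threshold are illustrative, not taken from join-duplicates.sh:

```c
#include <sys/stat.h>

/* Returns 1 if `path` can safely take one more hardlink, 0 otherwise.
 * The caller passes a threshold comfortably below the filesystem limit
 * (around 32K on FreeBSD/UFS2, as mentioned above). */
static int can_add_hardlink(const char *path, nlink_t max_links)
{
    struct stat sb;

    if (lstat(path, &sb) != 0)
        return 0;

    return sb.st_nlink < max_links;
}

/* Usage: if (can_add_hardlink(master, 32000)) link(master, duplicate); */
```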