
memory optimization

Open marcin-gryszkalis opened this issue 9 years ago • 2 comments

Hi, I'm doing some work on duff because I found it useful for fixing broken rsnapshot repositories (I will make some pull requests in a few days). Unfortunately, such repositories are a bit unusual (millions of files, mostly hardlinked in groups of 30-50).

It seems that I'm having a problem with large buckets (long lists): because each sampled file allocates 4KB of data that is only freed at the end of bucket processing, I'm getting "out of memory" errors at around 3GB of memory allocated (the box is a lightweight 32-bit Atom-based system).
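For reference, a back-of-envelope check of those numbers (the 4KB-per-sample and ~3GB figures are taken from this report; duff's actual allocation pattern may differ):

```c
#include <stdio.h>

/* Rough arithmetic: how many 4 KB sample buffers fit in the ~3 GB of address
 * space a 32-bit process can realistically use before running out of memory. */
int main(void)
{
    const unsigned long long limit  = 3ULL * 1024 * 1024 * 1024;  /* ~3 GB */
    const unsigned long long sample = 4ULL * 1024;                /* 4 KB  */

    printf("samples held at once before OOM: ~%llu\n", limit / sample);
    return 0;
}
```

So on the order of 750K-800K files sampled in one bucket pass is enough to exhaust the address space, which is well within "millions of files".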

As sizeof(FileList) == 12, I see no problem with increasing HASH_BITS to 16 (~800KB) or even 20 (~13MB). What do you think - would it be a good idea to add an option to make it runtime-configurable? A rough sizing sketch follows below.
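A minimal sketch of that sizing arithmetic with a hypothetical runtime option (HASH_BITS and sizeof(FileList) == 12 come from the message above; the option handling and code structure are assumptions, not duff's actual implementation):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: take the bucket count from a runtime value instead of
 * a compile-time HASH_BITS, and report what the bucket headers would cost. */
int main(int argc, char **argv)
{
    unsigned int hash_bits = 16;                  /* assumed default            */
    if (argc > 1)
        hash_bits = (unsigned int) atoi(argv[1]); /* e.g. from a --hash-bits    */

    const size_t bucket_size  = 12;               /* sizeof(FileList) as above  */
    const size_t bucket_count = (size_t) 1 << hash_bits;

    printf("%u bits -> %zu buckets -> %zu bytes of bucket headers\n",
           hash_bits, bucket_count, bucket_count * bucket_size);
    return 0;
}
```

With 16 bits that is 786432 bytes (~800KB) and with 20 bits 12582912 bytes (~13MB), matching the figures above.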

Another idea is to (optionally?) replace the sample with some simple, fast running checksum (CRC64?).
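For the checksum idea, a minimal bitwise CRC-64 sketch (using the CRC-64/ECMA-182 parameters; this is just one possible choice, not something duff currently implements):

```c
#include <stdint.h>
#include <stddef.h>

#define CRC64_ECMA_POLY 0x42F0E1EBA9EA3693ULL

/* Bitwise CRC-64/ECMA-182 (MSB-first, init 0, no final xor). A table-driven
 * variant would be faster, but either way the per-file state is only 8 bytes
 * instead of a 4 KB sample buffer. */
uint64_t crc64_update(uint64_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    while (len--)
    {
        crc ^= (uint64_t) *p++ << 56;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000000000000000ULL)
                ? (crc << 1) ^ CRC64_ECMA_POLY
                : (crc << 1);
    }
    return crc;
}
```

A file would be read in chunks, each fed through crc64_update, and only the final 64-bit value kept per file.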

marcin-gryszkalis · Dec 07 '15 14:12

You're not the only one running duff on millions of files, which is something I hadn't imagined when I was writing it. It's past time for a 0.6 release anyway. I will look into this.

elmindreda · Dec 07 '15 16:12

You're completely right - the numbers seem to go beyond any expectations. I just hit the limit on the inode reference counter (the number of possible hardlinks for a given file), which is around 32K on FreeBSD/UFS2... I will fix my deduplication script and provide it as an alternative to join-duplicates.sh.
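As an aside, a minimal sketch of how a deduplication script could check that limit before adding another hardlink (the helper name and structure are hypothetical, not part of join-duplicates.sh):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Return nonzero if "original" can still take one more hardlink, i.e. its
 * current link count is below the filesystem's LINK_MAX (~32K on UFS2). */
int can_add_hardlink(const char *original)
{
    struct stat st;
    if (stat(original, &st) != 0)
        return 0;                               /* stat failed; play it safe */

    long link_max = pathconf(original, _PC_LINK_MAX);
    if (link_max < 0)
        return 1;                               /* limit unknown; assume ok  */

    return (long) st.st_nlink < link_max;       /* room for one more link?   */
}

int main(int argc, char **argv)
{
    if (argc > 1)
        printf("%s: %s\n", argv[1],
               can_add_hardlink(argv[1]) ? "can hardlink" : "link limit reached");
    return 0;
}
```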

marcin-gryszkalis · Dec 31 '15 15:12