rabin

Incoherent default values for parameters min and bits

Open green-coder opened this issue 8 years ago • 2 comments

By default, the mask uses 12 bits. This means that if the fingerprint values are uniformly distributed, there is a 1 in 2^12 = 4096 chance that a fingerprint ends with 12 zero bits.
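To make the 1-in-4096 figure concrete, here is a minimal sketch that enumerates every 16-bit value and counts how many end with 12 zero bits. It assumes the boundary test is a plain low-bits mask; `isBoundary` is a hypothetical helper for illustration, not a function from rabin.

```javascript
// Hypothetical boundary test (an assumption, not rabin's actual code):
// true when the low `bits` bits of the fingerprint are all zero.
function isBoundary(fingerprint, bits) {
  const mask = (1 << bits) - 1;
  return (fingerprint & mask) === 0;
}

// Enumerate every 16-bit value and count how many pass with bits = 12.
const maskBits = 12;
const totalValues = 1 << 16;
let hits = 0;
for (let value = 0; value < totalValues; value++) {
  if (isBoundary(value, maskBits)) hits++;
}

// Ratio of all values to matching values.
const oneIn = totalValues / hits;
console.log(`${hits} of ${totalValues} values match (1 in ${oneIn})`);
```

The count is exact here because the values are enumerated rather than sampled; real fingerprints only approximate this if they are close to uniformly distributed.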

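By default, min is set to 8192. Since a delimiter shows up every 4096 bytes on average, many of them fall inside the first 8192 bytes of a chunk and are ignored. Statistically, you will miss about half of the delimiters, which means that your de-duplication algorithm won't be efficient.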

I think you want to change the default value of either min or bits.

Give this command line a try. You will see that if you effectively remove the min, the average chunk size comes out close to 4096 + 64 (64 is WINSIZE, the size of your sliding window):

node cli.js ~/Downloads/GitEye-1.11.0-linux.x86_64.zip --min=64
...
> average 4134
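The 4096 + 64 estimate can also be checked without any input file. The sketch below is an idealized model, not rabin's implementation: it assumes that once `min` bytes have passed, each further byte ends the chunk with probability 1 / 2^bits, and it ignores `max`. The names `mulberry32` and `simulateAverageChunkSize` are illustrative only.

```javascript
// Deterministic PRNG (mulberry32) so repeated runs give the same numbers.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Idealized model (assumption): after `min` bytes, every byte ends the
// chunk with probability 1 / 2^bits. The `max` limit is ignored.
function simulateAverageChunkSize(min, bits, chunks, seed) {
  const rand = mulberry32(seed);
  const p = 1 / 2 ** bits;
  let totalBytes = 0;
  for (let i = 0; i < chunks; i++) {
    // Inverse-transform sample of a geometric waiting time (mean 1/p).
    const wait = Math.floor(Math.log(1 - rand()) / Math.log(1 - p)) + 1;
    totalBytes += min + wait;
  }
  return totalBytes / chunks;
}

// Compare min = 64 with the default min = 8192, both with bits = 12.
const avgSmallMin = simulateAverageChunkSize(64, 12, 200000, 1);
const avgDefaultMin = simulateAverageChunkSize(8192, 12, 200000, 1);
console.log("min=64   average", Math.round(avgSmallMin));
console.log("min=8192 average", Math.round(avgDefaultMin));
```

Under this model the expected average is min + 2^bits, so roughly 4160 for min=64 and roughly 12288 for min=8192. The second figure is three times the 4096 that bits=12 suggests, which is the mismatch between the two defaults.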

I also recommend not setting min and max too close to the desired average size. That maximizes the cases where chunk boundaries are defined by the content, which makes de-duplication more robust when the content shifts.

green-coder Jan 01 '16 16:01