rabin icon indicating copy to clipboard operation
rabin copied to clipboard

Incoherent default values for parameter min and bits

Open green-coder opened this issue 9 years ago • 2 comments

By default, the number of bits used by the mask is 12. It means that if the fingerprint gives homogeneously distributed values, there is 1 chance over 2^12=4096 that it finishes with 12 zero-bits.

By default, the min is set to 8192. Statistically, you will miss about half of the delimiters, which means that your de-duplication algorithm won't be efficient.

I think that you want to either change the min or the bits default value.

Give a try to that command line, you will see that if you remove the min, your average chunks will have an average size close to 4096 + 64 (64 is WINSIZE, your sliding window's size):

node cli.js ~/Downloads/GitEye-1.11.0-linux.x86_64.zip --min=64
...
> average 4134

I also recommend you not to set the min and max too close to the desired average size, in order to maximize the cases where the chunks are defined by their content, to get a more robust de-duplication in case of content shift.

green-coder avatar Jan 01 '16 16:01 green-coder

Also:

node cli.js ~/Downloads/GitEye-1.11.0-linux.x86_64.zip --min=64 --max=100000 --bits=14
...
average 16361

green-coder avatar Jan 01 '16 17:01 green-coder

I have also been doing experiments and think this is incorrect. Maybe drop the min and max settings, or set them from the bits.

bits, average chunk length (bytes)
8, 193.01641684511063
9, 269.8508137432188
10, 401.496875
11, 621.6129374337222
12, 1012.9902200488998
13, 1952.9852398523985
14, 2964.357894736842
15, 5166.117903930131
16, 9204.31111111111

that is by setting min and max to 1 << bits - 3 and 1 << bits + 3 respectively.

input is plain text.

hmm, this gives a somewhat lower size that @green-coder is getting.

dominictarr avatar Feb 27 '16 02:02 dominictarr