borg icon indicating copy to clipboard operation
borg copied to clipboard

experiment with a new chunking algorithm?

Open bdklahn opened this issue 2 years ago • 2 comments

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes, a little.

Is this a BUG / ISSUE report or a QUESTION?

QUESTION

I got interested in chunking algorithms, and was looking around. I came across . . .

Lightweight hash-based de-duplication system using the self detection of most repeated patterns as chunks divisors

I thought of Borgbackup. I was thinking about test implementing something in Julia, to try it out, but . . . I never seem to find time to do side projects.

I had to read the paper a few times, to get where I think I understand the gist. It looks, to me, like the two key things are . . .

  1. It scans files to find an optimal pattern for a chunk break point (saved with frequency; uses most frequent).
  2. It seems to use a more space (and speed?) efficient way to store (and use?) the chunk index, based on a triple hash, with smaller length, rather than a single large one.

The paper seems to show good performance, but I haven't tested to reproduce.

One thing I wonder about is any extra processing time to collect and sort the cut points by frequency. But maybe that ends up being not significant, after the first files are in. Idk.

I just wanted to document this, because it seems like I never end up having time for such things. So thought I'd at least share this somewhere potentially useful.

bdklahn avatar Sep 14 '23 01:09 bdklahn

I had a (rather short) look:

  • they work with a ton of text files (linux kernel src) and very small chunk sizes (like some hundred bytes or few kBs) - borg currently can not efficiently deal with a huge amount of tiny chunks (and maybe this never could be efficient due to overheads of doing this)
  • they do not have results for larger chunks (borg uses 2MiB default target chunk size)
  • they use a newly invented hash algorithm resulting in a 48bit value (made from 3 16bit values) - good to save memory, but unsure about issues like collisions or novelty / issues of that algorithm
  • found it strange that they compare to sha1 and md5, which raise red flags considering the age of these algorithms and their bad collision resistance (if one applies cryptographic standards). there are newer/faster cryptographic and also non-cryptographic better proven hash algorithms...

I didn't fully read the paper, so maybe this misses something. But for usage in current borg, guess we would need to know about larger chunks.

ThomasWaldmann avatar Sep 14 '23 11:09 ThomasWaldmann

I had a (rather short) look:

  • they work with a ton of text files (linux kernel src) and very small chunk sizes (like some hundred bytes or few kBs) - borg currently can not efficiently deal with a huge amount of tiny chunks (and maybe this never could be efficient due to overheads of doing this)
  • they do not have results for larger chunks (borg uses 2MiB default target chunk size)

Gotcha. Yeah, those seemed small. Maybe that was just one/the presented test set? Maybe they could also set a minimum chunk size? Anyway, maybe doing so would negate the benefit of their approach. E.g., I'm guessing having very small chunks ends in having a small dataset, no matter what algorithm. But then index/memory size/use gets big.

  • they use a newly invented hash algorithm resulting in a 48bit value (made from 3 16bit values) - good to save memory, but unsure about issues like collisions or novelty / issues of that algorithm

I don't really know, either. But I got the impression that maybe the "triple" part of it is what gets it to the same level as traditional algorithms, in terms of acceptable collisions. But maybe that only helps speed lookups (by subsetting on two levels, first???). But I do think they noted the collisions were not worse . . . Some of the phrasing took me a few passes to understand.

  • found it strange that they compare to sha1 and md5, which raise red flags considering the age of these algorithms and their bad collision resistance (if one applies cryptographic standards). there are newer/faster cryptographic and also non-cryptographic better proven hash algorithms...

Yeah, I was a little leery about seeing that "md5" in there, as well. But other parts of the paper looked like they maybe mistakenly used the wrong term. Anyways, sometimes "academic research-grade" is not production-ready. I'd suspect one could swap in sha256, or whatever, but . . . Idk.

I didn't fully read the paper, so maybe this misses something. But for usage in current borg, guess we would need to know about larger chunks.

Right. I suppose I/one could contact them. -maybe get one of them to create a test implementation on a borg branch. :-) Unfortunately, many research projects end up with a paper as the final "product".

I partly just wanted to pass this on in one place, rather than have that be a time wasted on a thread I may not have time to further pursue. :-)

Feel free to close, if there's not enough info or value or bandwidth for this. Thanks.

bdklahn avatar Sep 14 '23 17:09 bdklahn