
improve files cache memory usage even further

MichaelDietzel opened this issue 11 months ago

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

None of the above. Probably a feature request.

Your borg version (borg -V).

2.0.0b14

Describe the problem you're observing.

In #8511 the memory usage of the files cache was already greatly reduced by using the IDs provided by the new borghash. If I understand everything correctly, a file's chunk list probably often contains runs of consecutive IDs. Memory usage could therefore be reduced further by storing, with a given chunk ID, an additional number indicating how many chunks with consecutive IDs follow it.
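To illustrate the grouping step, here is a minimal sketch (not borg code; it assumes the chunk IDs are the small integer indices handed out by borghash):

```python
def runs(ids):
    """Yield (first_id, extra) pairs, where extra counts how many
    chunks with consecutive IDs follow first_id."""
    it = iter(ids)
    try:
        first = prev = next(it)
    except StopIteration:
        return  # empty chunk list
    for i in it:
        if i == prev + 1:
            prev = i  # run continues
        else:
            yield first, prev - first  # run ended, emit it
            first = prev = i
    yield first, prev - first

# e.g. [7, 8, 9, 42, 100, 101] -> [(7, 2), (42, 0), (100, 1)]
print(list(runs([7, 8, 9, 42, 100, 101])))
```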

disclaimer

I am not familiar with how to use msgpack efficiently, so here is just one random idea on how to implement this. There are probably better ideas, but I wanted to be able to get a rough estimate of what this could mean for memory usage. (I could find barely any documentation about the inner workings of msgpack, and I did not read the source code to find out what exactly happens. If anyone knows good documentation, I'd be happy about a hint.)

optimization for consecutive chunks

Increase the size of the ID from 4 bytes to 5 bytes. The added byte contains how many consecutive chunks there are after the given ID, so there can be 0 to 255 following consecutive chunks. If there are more consecutive chunks, another ID would have to be encoded. This would hurt a little, but at that point the reduction in memory usage should already be pretty good. Some estimates of the change in memory usage, based on the estimate given in #8511 of X + N * 5 bytes, with N being the number of chunks used by a file (a sketch of this encoding follows after the list):

  • Worst case: X + N * 6 bytes. Probably mostly for small files, where X is big compared to N * 6.
  • Best case: X + N * ~0.02 bytes.
  • When each chunk ID has exactly one consecutive following ID: X + N * 3 bytes.

This assumes that msgpack does not align the data it stores; if it aligned 5-byte values at 8-byte addresses, each entry would carry 3 bytes of overhead. In that case maybe separate lists (one for the IDs, one for the counts) would help.
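One way to sidestep any per-entry alignment or framing overhead would be to pack all entries into a flat byte string and store that as a single msgpack bin value. A rough sketch under that assumption (the entry layout and helper names are made up, not borg's actual format):

```python
import struct

# Hypothetical 5-byte entry: 4-byte little-endian chunk index plus
# 1 byte for the number of consecutive chunks that follow (0..255).
ENTRY = struct.Struct('<IB')  # size == 5, no padding

def pack_entries(runs):
    """Pack (first_id, extra) pairs into a flat bytes object.
    Runs longer than 256 chunks are split across several entries."""
    out = bytearray()
    for first, extra in runs:
        while extra > 255:
            out += ENTRY.pack(first, 255)  # covers first..first+255
            first += 256
            extra -= 256
        out += ENTRY.pack(first, extra)
    return bytes(out)

def unpack_ids(data):
    """Expand the packed entries back into the full chunk index list."""
    ids = []
    for first, extra in ENTRY.iter_unpack(data):
        ids.extend(range(first, first + extra + 1))
    return ids

packed = pack_entries([(7, 2), (42, 0), (100, 1)])
assert len(packed) == 3 * 5  # exactly 5 bytes per entry
assert unpack_ids(packed) == [7, 8, 9, 42, 100, 101]
```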

additional optimization for repeated chunks

I assume that besides consecutive IDs there could also be repeated IDs, for example when a disk image with long runs of zeroes (empty space) is stored. So maybe having a special case for this could be worthwhile as well, either by storing this information separately or by "taking some values away" from the additional byte and using them to indicate repeated chunks instead. Example: 255 means 1 additional repetition, 254 means 2 repetitions, 253 means 4 repetitions, 252 means 8 repetitions, and so on up to some good value. That way the best case from above would only get slightly worse while still allowing significant gains for long runs of repeated values.
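A hypothetical decoder for such reserved count values, assuming (arbitrarily) that counts of 248 and above are repetition codes:

```python
# Made-up cutoff: counts >= REP_BASE encode repetitions, doubling per step:
# 255 -> 1 extra repetition, 254 -> 2, 253 -> 4, 252 -> 8, ...
REP_BASE = 248

def expand(first_id, count):
    """Expand one (id, count) entry into its chunk index list."""
    if count >= REP_BASE:
        reps = 1 << (255 - count)       # 255 -> 1, 254 -> 2, 253 -> 4, ...
        return [first_id] * (1 + reps)  # the ID itself plus `reps` repeats
    return list(range(first_id, first_id + count + 1))

assert expand(7, 2) == [7, 8, 9]
assert expand(5, 255) == [5, 5]
assert expand(5, 253) == [5, 5, 5, 5, 5]
```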

So I hope my assumptions are correct and this leads to some more nice improvements.

MichaelDietzel · Jan 02 '25

Not sure how frequent consecutive runs of the same chunk are in practice.

I guess there is one "popular" case: when long runs of all-zero ranges get chunked. But I guess that's all, isn't it? In the case of sparse files, such ranges might already get some special treatment, depending on the chunker (only the fixed chunker does that).

I guess we need to avoid "optimizing" for a not-that-frequent case; otherwise there might be more overhead from that than we save by it.

ThomasWaldmann · Jan 04 '25

Sorry for my delayed reply.

I was mostly thinking about the consecutive chunks case, which I think should be more relevant; the repeated chunks case was only an afterthought.

I will try to generate some statistics on this to get a somewhat realistic estimate of the gains that could be achieved.

MichaelDietzel · Mar 14 '25

So I finally managed to do a test implementation, and it was only a very limited success: I estimated the expected space savings for the files cache to be about 10%. However, one of my assumptions turned out to be wrong: I assumed borghash would give the first chunk to be processed an index of 0 (or maybe 1), the second chunk the next index, and so on. So I would first have to modify borghash to behave that way before these gains can be realized. I'll have to think about this. Thanks for your patience and for reading all my messages, especially now that there still are no good results.
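An estimate of this kind could be computed by scanning the chunk index lists and counting where runs start; the following is a made-up sketch (it ignores the 255-chunk cap per entry for simplicity):

```python
def estimate(chunk_lists):
    """Compare one 4-byte entry per chunk against 5-byte run entries."""
    plain = sum(len(ids) for ids in chunk_lists) * 4
    entries = 0
    for ids in chunk_lists:
        prev = None
        for i in ids:
            if prev is None or i != prev + 1:
                entries += 1  # a new run starts here
            prev = i
    rle = entries * 5
    return plain, rle, 1 - rle / plain

print(estimate([[7, 8, 9, 42, 100, 101]]))  # (24, 15, 0.375)
```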

MichaelDietzel · Jun 03 '25

IIRC it is the hashtable array index that borghash returns with that special call.

Thanks for your efforts!

ThomasWaldmann · Jun 04 '25