using compression for files cache?
the borg files cache can be rather large, because it keeps some information about all files that have been processed recently.
lz4 is a very fast compression / decompression algorithm, so we could try to use it to lower the in-memory footprint of the files cache entries.
before implementing this, we should check how big the savings typically are - to determine whether it is worth doing.
the files cache dictionary maps H(fullpath) --> msgpack(fileinfo).
msgpack already lowers the storage requirements a bit, e.g. by encoding integers with only as many bytes as needed. it also serializes the python data structure (serialization is not strictly necessary for how we use the cache now, but it would be a prerequisite for compressing entries).
with compression, it could work like H(fullpath) --> compress(msgpack(fileinfo)).
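to make the idea concrete, here is a minimal sketch of that scheme. msgpack and lz4 are available to borg, but the fileinfo fields below are illustrative, not borg's exact cache entry layout:

```python
import msgpack
import lz4.block

# illustrative entry; real borg entries differ in fields and types
fileinfo = {
    "age": 0,
    "inode": 123456,
    "size": 4096,
    "mtime": 1600000000000000000,
    "chunk_ids": [b"\x00" * 32],
}

packed = msgpack.packb(fileinfo)         # serialize (variable-length ints etc.)
compressed = lz4.block.compress(packed)  # compress this single serialized entry

# files_cache[H(fullpath)] = compressed
# on lookup, reverse both steps:
fileinfo2 = msgpack.unpackb(lz4.block.decompress(compressed))
```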
but we first need some statistics about the overall size of the files cache entries with and without compression.
because msgpacking already removes some of the redundant information, it is unclear how much compressing its output can reduce the size. of course, we need to compress each cache entry individually, so the amount of data per compression call is relatively small.
note: theoretically, we could also use other combinations of serialization algorithm and compression algorithm, if they give a better overall result (compressed size and decompression speed).
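a rough sketch for gathering such statistics: total the per-entry msgpacked size vs. the size after compressing each entry individually. the `load_files_cache_entries` helper is hypothetical - loading the real cache is omitted here:

```python
import msgpack
import lz4.block

def measure_entry_compression(entries):
    """Total msgpacked size vs. per-entry lz4-compressed size.

    `entries` is any iterable of fileinfo dicts.
    Note: lz4.block.compress prepends a 4-byte size header per call
    by default, which is noticeable overhead for small entries.
    """
    packed_total = compressed_total = 0
    for fileinfo in entries:
        packed = msgpack.packb(fileinfo)
        packed_total += len(packed)
        compressed_total += len(lz4.block.compress(packed))
    return packed_total, compressed_total

# usage (loader is hypothetical):
# packed, compressed = measure_entry_compression(load_files_cache_entries())
# print(f"saving: {1 - compressed / packed:.1%}")
```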
What about splitting the full path?
Not sure what you mean...
I mean: Do you store the full path as a complete string? Then it contains a lot of redundant information that could instead be stored as a tree...
No, I simplified a bit: it stores somehash(fullpath)
Just running gzip and xz on my files cache with borg 1.1.15:
files gzip: 203304634 -> 173423651 (-15%)
files xz: 203304634 -> 151031764 (-26%)
In comparison, the chunks file was much more compressible:
chunks gzip: 264346166 -> 139486190 (-47%)
chunks xz: 264346166 -> 132012032 (-50%)
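A whole-file measurement like this can be reproduced with the Python stdlib alone; the path below is a placeholder for the on-disk cache file:

```python
import gzip
import lzma
from pathlib import Path

data = Path("files").read_bytes()  # path to your on-disk files cache
for name, comp in (("gzip", gzip.compress), ("xz", lzma.compress)):
    out = comp(data)
    print(f"files {name}: {len(data)} -> {len(out)} ({len(out) / len(data) - 1:+.0%})")
```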
@nadalle what i meant in the top post:
- the RAM requirement (in-memory footprint), not on-disk size
- for good speed, rather lz4 than gzip/xz
- compressing each mapping value separately, not the whole mapping.
could be that the on-disk compressibility you determined is an upper bound for what I wanted to know (compressing each small entry separately can't exploit redundancy across entries), so it doesn't look like we should implement that.
Just did a small experiment and compressed the on-disk files cache of borg2:
files.c6c3fd94f9b1c899 : 79.76% ( 47.1 KiB => 37.5 KiB, files.c6c3fd94f9b1c899.zst)
Maybe not worth the effort.