using compression for files cache?
the borg files cache can be rather large, because it keeps some information about all files that have been processed recently.
lz4 is a very fast compression / decompression algorithm, so we could try to use it to lower the in-memory footprint of the files cache entries.
before implementing this, we should check how big the savings typically are - to determine whether it is worth doing.
the files cache dictionary maps H(fullpath) --> msgpack(fileinfo).
msgpack already lowers the storage requirements a bit, e.g. by encoding integers with only as many bytes as needed. it also serializes the python data structure (serialization is not strictly necessary for how we use the cache now, but it would be a prerequisite for compressing entries).
with compression, it could work like H(fullpath) --> compress(msgpack(fileinfo)).
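to make the idea concrete, here is a minimal sketch of that scheme. msgpack and lz4 are available to borg, but the fileinfo fields below are illustrative, not borg's exact cache entry layout:

```python
import msgpack
import lz4.block

# illustrative entry; real borg entries differ in fields and types
fileinfo = {
    "age": 0,
    "inode": 123456,
    "size": 4096,
    "mtime": 1600000000000000000,
    "chunk_ids": [b"\x00" * 32],
}

packed = msgpack.packb(fileinfo)         # serialize (variable-length ints etc.)
compressed = lz4.block.compress(packed)  # compress this single serialized entry

# files_cache[H(fullpath)] = compressed
# on lookup, reverse both steps:
fileinfo2 = msgpack.unpackb(lz4.block.decompress(compressed))
```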
but we first need some statistics about the overall size of the files cache entries with and without compression.
because msgpacking already removes some of the redundant information, it is unclear how much compressing its output can reduce the size. of course, we need to compress each cache entry individually, so the amount of data per compression call is relatively small.
note: theoretically, we could also use other combinations of serialization algorithm and compression algorithm, if they give a better overall result (compressed size and decompression speed).
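a rough sketch for gathering such statistics: total the per-entry msgpacked size vs. the size after compressing each entry individually. the `load_files_cache_entries` helper is hypothetical - loading the real cache is omitted here:

```python
import msgpack
import lz4.block

def measure_entry_compression(entries):
    """Total msgpacked size vs. per-entry lz4-compressed size.

    `entries` is any iterable of fileinfo dicts.
    Note: lz4.block.compress prepends a 4-byte size header per call
    by default, which is noticeable overhead for small entries.
    """
    packed_total = compressed_total = 0
    for fileinfo in entries:
        packed = msgpack.packb(fileinfo)
        packed_total += len(packed)
        compressed_total += len(lz4.block.compress(packed))
    return packed_total, compressed_total

# usage (loader is hypothetical):
# packed, compressed = measure_entry_compression(load_files_cache_entries())
# print(f"saving: {1 - compressed / packed:.1%}")
```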
What about splitting the full path?
Not sure what you mean...
I mean: Do you store the full path as a complete string? Then it contains a lot of redundant information that could instead be stored as a tree...
No, I simplified a bit: it stores somehash(fullpath)
Just running gzip and xz on my files cache with borg 1.1.15:
files gzip: 203304634 -> 173423651 (-15%)
files xz: 203304634 -> 151031764 (-26%)
In comparison, the chunks file was much more compressible:
chunks gzip: 264346166 -> 139486190 (-47%)
chunks xz: 264346166 -> 132012032 (-50%)
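A whole-file measurement like this can be reproduced with the Python stdlib alone; the path below is a placeholder for the on-disk cache file:

```python
import gzip
import lzma
from pathlib import Path

data = Path("files").read_bytes()  # path to your on-disk files cache
for name, comp in (("gzip", gzip.compress), ("xz", lzma.compress)):
    out = comp(data)
    print(f"files {name}: {len(data)} -> {len(out)} ({len(out) / len(data) - 1:+.0%})")
```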
@nadalle what i meant in the top post:
- the RAM requirement (in-memory footprint), not on-disk size
- for good speed, rather lz4 than gzip/xz
- compressing each mapping value separately, not the whole mapping.
could be that the on-disk compressibility you determined is an upper bound for what I wanted to know (compressing each small entry separately can't exploit redundancy across entries), so it doesn't look like we should implement that.
Just did a small experiment and compressed the on-disk files cache of borg2:
files.c6c3fd94f9b1c899 : 79.76% ( 47.1 KiB => 37.5 KiB, files.c6c3fd94f9b1c899.zst)
Maybe not worth the effort.