Could borg make use of zstd dictionary training?
As far as I understand, borg compresses at the chunk level and therefore cannot exploit zstd's full potential. I've read that zstd is able to train explicit dictionaries and use them later, so I wondered whether borg could use this feature to improve compression. Imagine some recompress / optimization routine that first trains a dictionary based on all the chunks in the repo and then recompresses everything with this pre-trained dictionary; it might also be used for all further compression. It's just an idea which might be discussed.
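As a rough illustration of the idea, here is a minimal sketch using the python-zstandard bindings; the function names and the chunks list are my own illustration, not borg code:

```python
# Minimal sketch, assuming python-zstandard is installed and `chunks` is a list
# of raw chunk payloads (bytes). Extracting chunks from a borg repo is out of
# scope here.
import zstandard

def train_and_compress(chunks, dict_size=112640, level=3):
    # Train a dictionary of roughly dict_size bytes from the sample chunks.
    zdict = zstandard.train_dictionary(dict_size, chunks)
    cctx = zstandard.ZstdCompressor(level=level, dict_data=zdict)
    compressed = [cctx.compress(c) for c in chunks]
    # The dictionary itself has to be stored as well, otherwise the chunks
    # cannot be decompressed later.
    return zdict.as_bytes(), compressed

def decompress_all(zdict_bytes, compressed):
    zdict = zstandard.ZstdCompressionDict(zdict_bytes)
    dctx = zstandard.ZstdDecompressor(dict_data=zdict)
    return [dctx.decompress(c) for c in compressed]
```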
Theoretically, that would be possible to implement.
But not sure whether it would be worth it - storage is often cheap and recompressing a whole repo takes quite some time.
Maybe you could do an experiment: compare how much space a repo takes with normal zstd,X compression, then recompress it as you described and check space usage again.
If that is more than a few percent improvement, it might be worth implementing that.
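A rough way to run that measurement, assuming the chunks have already been exported to plain files (the helper below is purely hypothetical, not something borg provides):

```python
# Measurement sketch: total per-chunk compressed size with and without a
# trained dictionary. `chunk_files` is a list of paths to exported chunk
# payloads (hypothetical).
import zstandard

def total_sizes(chunk_files, dict_size=1024 * 1024, level=19):
    samples = [open(p, "rb").read() for p in chunk_files]
    plain = zstandard.ZstdCompressor(level=level)
    zdict = zstandard.train_dictionary(dict_size, samples)
    with_dict = zstandard.ZstdCompressor(level=level, dict_data=zdict)
    plain_total = sum(len(plain.compress(s)) for s in samples)
    # Count the dictionary against the dict variant, since it must be stored too.
    dict_total = sum(len(with_dict.compress(s)) for s in samples) + len(zdict.as_bytes())
    return plain_total, dict_total

# Usage (illustrative):
#   plain_total, dict_total = total_sizes(sorted(glob.glob("chunks/*")))
#   print(f"improvement: {100 * (1 - dict_total / plain_total):.1f}%")
```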
A related, not-yet-implemented feature would be to keep the compression dicts between different chunks instead of starting from scratch each time.
I guess if subsequent chunks are from the same file / file type, maybe even directly following chunks (initial backup run), that might improve compression a bit. Not sure whether it could also have a negative impact if the to-be-compressed data is of a very different kind, like when switching from a binary file's chunks to a text file's chunks.
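One way to approximate that at the zstd API level would be to hand the previous chunk to the compressor as a raw-content dictionary; a purely illustrative sketch with python-zstandard:

```python
# Sketch of carrying context from chunk to chunk by using the previous chunk
# as a raw-content dictionary. Illustrative only; not how borg compresses today.
import zstandard

def compress_chunk_sequence(chunks, level=3):
    compressed = []
    prev = None
    for chunk in chunks:
        if prev is None:
            cctx = zstandard.ZstdCompressor(level=level)
        else:
            raw_dict = zstandard.ZstdCompressionDict(
                prev, dict_type=zstandard.DICT_TYPE_RAWCONTENT)
            cctx = zstandard.ZstdCompressor(level=level, dict_data=raw_dict)
        compressed.append(cctx.compress(chunk))
        prev = chunk
    return compressed
```

One catch with this approach: decompressing a chunk would then also require the previous chunk, which does not fit well with deduplicated chunks being read individually.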
I could use an uncompressed repo for that and then compress the whole repo with an external zstd tool, but to really try compressing the individual chunks I would need to access them separately. Is there any way to export all chunks to individual files?
This is just a very preliminary comparison, because I do not know how to extract all individual chunks.
REPO_UNCOMPRESSED 1.7 GB >> external compression of whole repo with ZSTD 22 >> 575 MB
REPO_SETTINGZSTD22 818 MB >> external compression of whole repo with ZSTD 22 >> 647 MB
Nearly the same size (575 MB) is obtained when training a 1 MB dict on the full repo directory and then using this dict to compress the whole repo at level 22.
Having the individual chunks as separate files would provide a more reasonable scenario!
You could try to borg init a new test repo, then edit the repo config and use a very small segment file size (default: 500 MB, try 1 kB or so). I guess it will then start a new segment file for each chunk, but I never tried that.
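For reference, that setting lives in the repo's config file; an illustrative excerpt (only max_segment_size changed, value in bytes):

```ini
[repository]
max_segment_size = 1024
```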
I tried that. For 4352 unique chunks I got 3880 segment files. Compressing those as individual files without a dictionary, with dictionaries of various sizes (dXXX = dictionary size in kB), or as one tar containing all files, results in the following file sizes:
531M ./tar.zstd -- not individual
712M ./compressed_d3000k
720M ./compressed_d2000k
725M ./compressed_d5000k
726M ./compressed_d1500k
726M ./compressed_d4000k
727M ./compressed_d1250k
730M ./compressed_d10000k
731M ./compressed_d1000k
733M ./compressed_d750k
736M ./compressed_d500k
745M ./compressed_d200k
750M ./compressed_d100k
755M ./compressed_nodict
1,6G ./orig
I've not changed any dictionary training settings. This does not look as promising as expected. I don't know whether keeping the compression dicts between chunks would result in performance similar to tar.zstd.
You wrote about a related, not-yet-implemented feature: keeping compression dicts between different chunks instead of starting from scratch each time. Is this under active consideration?
I am not currently working on that, but feel free to try it if you want.