
Could borg make use of zstd dictionary training?

Open Beiri22 opened this issue 4 years ago • 8 comments

As far as I understand, borg applies compression at the chunk level and therefore cannot exploit zstd's full potential. I've read that zstd is able to train and later use explicit dictionaries, so I wondered whether borg could use this feature to improve compression. Imagine some recompress/optimization routine that first trains a dictionary on all the chunks in the repo and then recompresses everything with this pre-trained dictionary; the dictionary could also be used for all further compression. It's just an idea that might be worth discussing.
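
For illustration, a minimal sketch of that idea with the python-zstandard bindings; how the decompressed chunk payloads are read from and written back to the repo is left abstract, and `train_and_recompress` is just a made-up helper name:

```python
# A sketch only: "chunks" is assumed to be an iterable of decompressed chunk
# payloads (bytes); the repo I/O is the part borg would have to provide.
import zstandard

def train_and_recompress(chunks, dict_size=110 * 1024, level=19):
    samples = list(chunks)
    # Train one shared dictionary on (a sample of) the chunk payloads.
    zdict = zstandard.train_dictionary(dict_size, samples)
    # Recompress every chunk with the trained dictionary.
    cctx = zstandard.ZstdCompressor(level=level, dict_data=zdict)
    recompressed = [cctx.compress(c) for c in samples]
    # The dictionary must be kept: decompression needs the same dictionary,
    # e.g. zstandard.ZstdDecompressor(dict_data=zdict).
    return zdict.as_bytes(), recompressed
```

Note that the trained dictionary would itself have to be stored in the repo, since every later decompression of those chunks needs it.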

Beiri22 avatar Jan 17 '21 11:01 Beiri22

Theoretically, that would be possible to implement.

But I'm not sure whether it would be worth it: storage is often cheap, and recompressing a whole repo takes quite some time.

Maybe you could do an experiment and compare how much space a repo takes with normal zstd,X compression and then recompress it as you described and check space usage again.

If that gives more than a few percent improvement, it might be worth implementing.

ThomasWaldmann avatar Jan 17 '21 16:01 ThomasWaldmann

A related, not-yet-implemented feature would be to keep the compression dictionary/state between different chunks instead of starting from scratch each time.

I guess if subsequent chunks are from the same file / file type, maybe even directly consecutive chunks (initial backup run), that might improve compression a bit. Not sure whether it could also have a negative impact if the to-be-compressed data is of a very different kind, e.g. when switching from a binary file's chunks to a text file's chunks.
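
A rough sketch of what "not starting from scratch" could look like with the python-zstandard streaming API; the names are from that package, and the borg side (when to reset the state, how to store the pieces) is omitted:

```python
# One long-lived compressor whose window carries over between chunks: chunk
# boundaries are only block-flushed, not finalized, so earlier chunks act as
# context for later ones.
import zstandard

def compress_chunk_sequence(chunks, level=3):
    cctx = zstandard.ZstdCompressor(level=level)
    cobj = cctx.compressobj()
    pieces = []
    for chunk in chunks:
        data = cobj.compress(chunk)
        # Emit everything buffered for this chunk, but keep the history
        # for the next chunk instead of starting from scratch.
        data += cobj.flush(zstandard.COMPRESSOBJ_FLUSH_BLOCK)
        pieces.append(data)
    pieces.append(cobj.flush(zstandard.COMPRESSOBJ_FLUSH_FINISH))
    return pieces
```

The obvious catch is that such pieces are no longer independently decompressible, which would clash with borg's random access to individual chunks.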

ThomasWaldmann avatar Jan 17 '21 16:01 ThomasWaldmann

I could use an uncompressed repo for that and then compress the whole repo with an external zstd tool, but to really try compressing the individual chunks I would need to access them separately. Is there any way to export all chunks to individual files?

Beiri22 avatar Jan 18 '21 08:01 Beiri22

This is just a very preliminary comparison, because I do not know how to extract all individual chunks.

REPO_UNCOMPRESSED 1.7 GB >> externally compress the whole repo with zstd -22 >> 575 MB
REPO_SETTINGZSTD22 818 MB >> externally compress the whole repo with zstd -22 >> 647 MB

Nearly the same size (575 MB) is obtained when training a 1 MB dictionary on the full repo directory and then using this dictionary to compress the whole repo at level 22.

Having the individual chunks as separate files would provide a more realistic scenario!

Beiri22 avatar Jan 18 '21 17:01 Beiri22

You could try to `borg init` a new test repo, then edit the repo config and use a very small segment file size (default: 500 MB, try 1 kB or so). I guess it will then start a new segment file for each chunk, but I never tried that.
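
If you want to script that, a sketch using only the Python standard library could look like this; the repo path and the 1 kB value are examples, and `max_segment_size` is the key used in borg 1.x repository configs:

```python
# Minimal sketch: shrink max_segment_size in the config of a freshly
# initialized test repo so that (almost) every chunk gets its own segment file.
import configparser

repo_config = "/path/to/testrepo/config"  # example path to the repo's config file

cfg = configparser.ConfigParser()
cfg.read(repo_config)
cfg["repository"]["max_segment_size"] = str(1024)  # bytes; default is roughly 500 MB
with open(repo_config, "w") as f:
    cfg.write(f)
```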

ThomasWaldmann avatar Jan 18 '21 17:01 ThomasWaldmann

I tried that. For 4352 unique chunks I got 3880 segment files. Compressing those as individual files without a dictionary, with dictionaries of various sizes (dXXXk = dictionary size in kB), or as a single tar of all files, results in the following sizes:

531M ./tar.zstd -- not individual
712M ./compressed_d3000k
720M ./compressed_d2000k
725M ./compressed_d5000k
726M ./compressed_d1500k
726M ./compressed_d4000k
727M ./compressed_d1250k
730M ./compressed_d10000k
731M ./compressed_d1000k
733M ./compressed_d750k
736M ./compressed_d500k
745M ./compressed_d200k
750M ./compressed_d100k
755M ./compressed_nodict
1.6G ./orig

I've not changed any dictionary training settings. This does not look as promising as expected. I don't know whether keeping the compression dictionary/state between chunks would get close to the tar.zstd result?
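
For reference, a sweep like the one behind these numbers could be reproduced roughly as follows with the python-zstandard bindings; the directory name, the compression level and the dictionary sizes here are examples, not the exact settings used above:

```python
# Train dictionaries of various sizes on the per-chunk files and compare
# the total compressed size against a no-dictionary baseline.
import os
import zstandard

segment_dir = "./orig"  # one file per chunk/segment (example)
samples = []
for name in sorted(os.listdir(segment_dir)):
    with open(os.path.join(segment_dir, name), "rb") as f:
        samples.append(f.read())

for dict_kb in (0, 100, 500, 1000, 3000, 10000):
    if dict_kb:
        zdict = zstandard.train_dictionary(dict_kb * 1024, samples)
        cctx = zstandard.ZstdCompressor(level=22, dict_data=zdict)
    else:
        cctx = zstandard.ZstdCompressor(level=22)  # "nodict" baseline
    total = sum(len(cctx.compress(s)) for s in samples)
    print(f"dict={dict_kb:>6}k  total={total / 2**20:8.1f} MiB")
```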

Beiri22 avatar Jan 18 '21 22:01 Beiri22

You wrote about a related not-yet-implemented feature: keeping compression dicts between different chunks and not starting from scratch each time. Is this under active consideration?

Beiri22 avatar Feb 16 '21 20:02 Beiri22

I am not currently working on that, but feel free to give it a try if you want.

ThomasWaldmann avatar Feb 16 '21 20:02 ThomasWaldmann