Feature request: prevent blob fragmentation by saving files in separate packs

Open lucatrv opened this issue 1 year ago • 4 comments

Hi @aawsome, in case you don't recall, I'm the one who first proposed rewriting restic in Rust. I see that a lot has happened since then, thanks for your efforts! I'm looking forward to using rustic in production once it gets stable.

I'd like to suggest a new feature that could be added to the list of improvements implemented in rustic to make backups quicker (in particular with SSDs) and reduce blob fragmentation. Reading several files concurrently reduces backup time, especially on modern SSD drives, as explained here. However, with restic this also increases blob fragmentation, because blobs coming from different files are saved in the same pack, see this restic issue.

To improve performance and reduce blob fragmentation, rustic should automatically read several files concurrently when an SSD drive is detected, but save the blobs coming from each file in a separate pack. Only when a file is finished, if the corresponding pack is not full yet, should it be continued with blobs coming from the next file to read. So for instance if 8 files are read concurrently, then 8 packs should be written concurrently, each one containing blobs from one of the files.
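To make the idea more concrete, here is a rough sketch of what I have in mind. Everything here (the `Pack` type, `chunk_file`, `write_pack`, the 30 MB target) is purely illustrative and not rustic's actual code; it just shows "one worker per file stream, one pack per worker, continue the pack across file boundaries":

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical blob/pack types; not rustic's real data structures.
struct Pack {
    blobs: Vec<Vec<u8>>,
    size: usize,
}

impl Pack {
    fn new() -> Self {
        Pack { blobs: Vec::new(), size: 0 }
    }
    fn add(&mut self, blob: Vec<u8>) {
        self.size += blob.len();
        self.blobs.push(blob);
    }
    fn is_full(&self, target: usize) -> bool {
        self.size >= target
    }
}

const TARGET_PACK_SIZE: usize = 30 * 1024 * 1024; // illustrative target, not rustic's default

fn chunk_file(path: &str) -> Vec<Vec<u8>> {
    // Placeholder for content-defined chunking; here each file is a single "blob".
    vec![std::fs::read(path).unwrap_or_default()]
}

fn write_pack(pack: Pack) {
    // Placeholder for compressing/encrypting and uploading the pack to the backend.
    println!("writing pack with {} blobs ({} bytes)", pack.blobs.len(), pack.size);
}

fn main() {
    let files = vec!["a.bin".to_string(), "b.bin".to_string(), "c.bin".to_string()];
    let queue = Arc::new(Mutex::new(files));
    let workers = 8;

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                // Each worker owns its own pack, so blobs of one file stay together.
                let mut pack = Pack::new();
                loop {
                    let next = queue.lock().unwrap().pop();
                    let Some(path) = next else { break };
                    for blob in chunk_file(&path) {
                        pack.add(blob);
                        if pack.is_full(TARGET_PACK_SIZE) {
                            write_pack(std::mem::replace(&mut pack, Pack::new()));
                        }
                    }
                    // A file boundary does not force a flush: the next file
                    // continues in the same pack if there is room left.
                }
                if pack.size > 0 {
                    write_pack(pack);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```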

lucatrv avatar Nov 26 '23 23:11 lucatrv

Thanks a lot @lucatrv for opening this feature request.

Before going into detail, I have a general question: what is the reason you want to prevent blob fragmentation? I mean, from a very theoretical point of view, fragmentation can always occur. For a large file it could be that every second blob is already contained in the repository, which results in a strongly fragmented storage...

I thought a bit about the proposal. The current behavior is that rustic keeps one packfile in memory and adds new blobs to it until it is ready to be written to the backend. The writing happens in parallel with filling a new in-memory packfile, so that writing and processing of new blobs are parallelized. This already means that rustic can use up to twice the pack size as memory requirement just for the packer, and as rustic allows very large packfiles, we have to take memory usage into account here.

Another thing is preventing double-saving of the same blob. The packer currently checks the index before backing up, a list of blobs already saved during the backup, and the list of blobs-in-progress. With more than one packer, the last check would need additional synchronization.

TL;DR: Using more than one packer would result in a larger refactoring of the whole packer code - nothing that could be done quickly. On the other hand, I agree that using multiple packers could also increase performance...
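To illustrate what I mean by the current behavior, here is a very rough sketch; the names and structure are only illustrative, not the actual packer code:

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

type BlobId = [u8; 32];

struct Packer {
    current: Vec<(BlobId, Vec<u8>)>, // the in-memory packfile currently being filled
    current_size: usize,
    indexed: HashSet<BlobId>,   // blobs already in the repository index
    saved: HashSet<BlobId>,     // blobs confirmed as saved during this backup run
    in_flight: HashSet<BlobId>, // blobs in packs handed to the writer but not yet confirmed;
                                // this is the set that would need extra synchronization
                                // if there were more than one packer
    tx: mpsc::Sender<Vec<(BlobId, Vec<u8>)>>,
    target_size: usize,
}

impl Packer {
    fn add(&mut self, id: BlobId, data: Vec<u8>) {
        // Deduplication: skip the blob if it is already known anywhere.
        if self.indexed.contains(&id) || self.saved.contains(&id) || self.in_flight.contains(&id) {
            return;
        }
        self.current_size += data.len();
        self.current.push((id, data));
        if self.current_size >= self.target_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.current.is_empty() {
            return;
        }
        // Hand the finished pack to the writer thread and start a new one,
        // so up to two pack sizes of memory can be in use at once.
        // (In the real flow, blobs would move from in_flight to saved once
        // the backend write is confirmed; not shown here.)
        let pack = std::mem::take(&mut self.current);
        self.current_size = 0;
        for (id, _) in &pack {
            self.in_flight.insert(*id);
        }
        self.tx.send(pack).ok();
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<(BlobId, Vec<u8>)>>();
    let writer = thread::spawn(move || {
        for pack in rx {
            // Placeholder for writing the packfile to the backend.
            println!("wrote pack with {} blobs", pack.len());
        }
    });

    let mut packer = Packer {
        current: Vec::new(),
        current_size: 0,
        indexed: HashSet::new(),
        saved: HashSet::new(),
        in_flight: HashSet::new(),
        tx,
        target_size: 4 * 1024 * 1024,
    };

    packer.add([1; 32], vec![0u8; 3 * 1024 * 1024]);
    packer.add([2; 32], vec![0u8; 2 * 1024 * 1024]); // crosses the target size, triggers a flush

    drop(packer); // closes the channel so the writer thread finishes
    writer.join().unwrap();
}
```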

I came up with an additional idea - we could identify the origin of the blobs-in-flight and, before saving the packfile to the backend, re-sort it so that the blobs within the packfile are not fragmented (or better: fragmented as little as possible). But this is also a topic for a packer rewrite...
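A minimal sketch of that re-sorting step, again with purely illustrative names: each blob queued for the next packfile carries an identifier of the file it came from, and the pack is sorted by that origin (keeping the original order within a file) just before it is written.

```rust
#[derive(Debug)]
struct QueuedBlob {
    origin: u64,  // e.g. an id of the source file
    seq: u64,     // position of the blob within that file
    data: Vec<u8>,
}

fn sort_pack_by_origin(pack: &mut Vec<QueuedBlob>) {
    // Blobs of the same file become contiguous in the pack and keep their
    // original order, so restoring one file stays as sequential as possible.
    pack.sort_by_key(|b| (b.origin, b.seq));
}

fn main() {
    let mut pack = vec![
        QueuedBlob { origin: 2, seq: 0, data: vec![0; 4] },
        QueuedBlob { origin: 1, seq: 0, data: vec![1; 4] },
        QueuedBlob { origin: 2, seq: 1, data: vec![2; 4] },
        QueuedBlob { origin: 1, seq: 1, data: vec![3; 4] },
    ];
    sort_pack_by_origin(&mut pack);
    for b in &pack {
        println!("origin {} seq {} ({} bytes)", b.origin, b.seq, b.data.len());
    }
}
```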

aawsome avatar Dec 03 '23 09:12 aawsome

Hi @aawsome, sorry for the late answer. IMHO the following situation should be avoided: consider backing up a volume with some very large and frequently modified files (for instance virtual machines or databases) and many other smaller files. If reading and processing are carried out in parallel and the processed data is then saved concurrently to the same packs, the blobs would end up highly fragmented. Then, if the large files change every day, many packs would need to be updated, packs that also contain data of other files that may never change. I guess you agree that this would be an unwanted situation.

To avoid that, at least as much as practically possible, I can think of only two ways:

  1. Read each file serially, so that processed blobs are saved in order to the same pack. Clearly some fragmentation would still happen, because for instance a pack could start with data from the last portion of a large file and then be finished with data from other smaller files, but the end result would still be much less fragmented than the situation described above. This is a possible option, but it would not optimally use all available resources on current multi-core processors, unless the compression algorithm is the bottleneck and can fill all cores.
  2. Read files in parallel (maybe just a limited number, for instance restic defaults to 2, but maybe a better value would be 4 or 8), but then save the blobs of each file separately in different packs, at least until each file is finished and the next file is read. The current average pack size for rustic, at least for my use case, is about 40 MB (compressed), and I am not even sure that this is an optimal value, because for the same use case restic uses smaller packs (less than 18 MB compressed) and I noticed that it is quicker when updating them. So if for instance a smaller default pack size were chosen, and assuming 18 MB compressed corresponds to about 30 MB uncompressed, reading 4 files in parallel would require about 150 MB of memory (four in-memory packs plus one pack being written to disk before a new one is started), which does not seem like a large value for a backup program nowadays. To prevent double-saving the same blob, I guess each reading / compressing process should send a message with a reference to the incoming data to a single process which checks already processed data; once that single process gives the go-ahead, the compressing process should proceed and confirm at the end (see the sketch after this list).
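To make option 2 more concrete, here is a rough sketch of the dedup coordination I have in mind: every reader/compressor thread asks a single coordinator whether a blob id is new before adding it to its own per-file pack. The types, channel layout, and the fake blob ids are illustrative assumptions, not rustic's actual design.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

type BlobId = u64;

enum Request {
    // "May I store this blob?" - the coordinator answers true exactly once
    // per id, so only one worker ever packs a given blob.
    Claim(BlobId, mpsc::Sender<bool>),
}

fn main() {
    let (req_tx, req_rx) = mpsc::channel::<Request>();

    // The single coordinator owning the set of known blob ids.
    let coordinator = thread::spawn(move || {
        let mut known: HashSet<BlobId> = HashSet::new();
        while let Ok(Request::Claim(id, reply)) = req_rx.recv() {
            reply.send(known.insert(id)).ok();
        }
    });

    // Each worker reads "its" files and keeps its own pack, so blobs coming
    // from different files are not interleaved within one pack.
    let workers: Vec<_> = (0..4)
        .map(|w| {
            let req_tx = req_tx.clone();
            thread::spawn(move || {
                let mut pack: Vec<BlobId> = Vec::new();
                // Fake blob ids with some duplicates to exercise the dedup path.
                for blob_id in (0..10).map(|i| (i % 7) as BlobId) {
                    let (reply_tx, reply_rx) = mpsc::channel();
                    req_tx.send(Request::Claim(blob_id, reply_tx)).ok();
                    if reply_rx.recv().unwrap_or(false) {
                        pack.push(blob_id);
                    }
                }
                println!("worker {} packed {} blobs", w, pack.len());
            })
        })
        .collect();

    for h in workers {
        h.join().unwrap();
    }
    drop(req_tx); // closes the request channel so the coordinator exits
    coordinator.join().unwrap();
}
```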

The defragmenting packer is also a good idea, but ideally it would come in addition to the above solution, because otherwise I think almost all packs would be written to disk twice.

lucatrv avatar Dec 21 '23 21:12 lucatrv