
[Feature] Compression Migration Tool

Open PrivatePuffin opened this issue 5 years ago • 16 comments

Describe the problem you're observing

Currently one can change the compression setting on a dataset and this will compress new blocks using the new algorithm. This works perfectly fine for many people during normal use.

However, there are three scenarios where we would want an easy way to recompress a complete dataset:

  • If one wants to change decompression speed for currently stored write-once-read-many data
  • If one wants to increase the compression ratio of currently compressed data
  • If we remove (or deprecate) a compression algorithm

While it's perfectly possible to send data to a new dataset and thus trigger a recompression, this has a few downsides:

  • It's not very accessible for the simplest of users, for example (future) FreeNAS home users
  • It might mean downtime
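
For reference, the send/receive recompression workaround mentioned above looks roughly like this (a minimal sketch; the dataset names and the zstd level are placeholders, and the final rename is the downtime window):

    # a plain (non -c / non -w) send decompresses blocks, so the receiving
    # side recompresses everything with its own compression setting
    zfs snapshot tank/data@recompress
    zfs send tank/data@recompress | zfs receive -o compression=zstd tank/data_new

    # cut over once the copy is complete, then clean up the old dataset
    zfs rename tank/data tank/data_old
    zfs rename tank/data_new tank/data

Incremental snapshots and mounted consumers make this messier in practice, which is part of why a built-in, background mechanism would help.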

A preferred way to handle this would be a feature that recompresses current data on the drive, "in the background", just like a scrub or resilver. This also has the added benefit of letting us force a recompression if we deprecate/replace/remove an algorithm.

This feature would enable us to go beyond the requested deprecation in #9761.

PrivatePuffin avatar Dec 21 '19 14:12 PrivatePuffin

How do you plan to implement this? What is to happen to snapshots? Recv already does this.

scineram avatar Dec 21 '19 15:12 scineram

@scineram Snapshots would indeed be a problem. I don't have a "plan" to implement this, otherwise I wouldn't have filed an issue ;)

How do you suggest we handle future removal of compression algorithms and zero-downtime changes of on-disk compression otherwise? I don't think recv covers this use case, or does it?

If so: where is the documentation about using recv in this way? It would have very low downtime, of course...

PrivatePuffin avatar Dec 21 '19 15:12 PrivatePuffin

this requires block pointer rewrite

richardelling avatar Dec 21 '19 23:12 richardelling

@richardelling Precisely, I didn't say it was going to be easy ;)

PrivatePuffin avatar Dec 22 '19 11:12 PrivatePuffin

this requires block pointer rewrite

I personally would be fine if this feature initially behaved like (or leveraged) an auto-resumed local send/receive plus some clone/upgrade-like switcheroo (and obeyed the same constraints, if unavoidable even temporarily using twice the storage of the dataset being 'transformed') in the background, with the user interface of a scrub (i.e. triggered through a zfs subcommand, appears in zfs(?)/zpool status, gets resumed after reboots, can be paused, stopped, etc.).
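
The resumable half of that already exists as a manual building block; a rough sketch of doing the local send/receive by hand today (dataset names hypothetical):

    # start a local copy; -s keeps resumable state if it's interrupted
    zfs snapshot tank/data@xform
    zfs send tank/data@xform | zfs receive -s -o compression=zstd tank/data_xform

    # after a reboot/crash, fetch the token and resume where it stopped
    token=$(zfs get -H -o value receive_resume_token tank/data_xform)
    zfs send -t "$token" | zfs receive -s tank/data_xform

What's missing is exactly the scrub-like packaging: a single subcommand, status reporting, and the final in-place swap.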

The applications for this go beyond just applying a different compression algorithm:

  • AFAIK this also applies to checksum algorithms.
  • Shouldn't this also convert xattrs to the sa format?
  • If there's sufficient free space on the pool, this can also be a form of defragmentation, right?

One could hack something like this together using zfs send/recv (it'd probably involve a clone receive and some upgrade shenanigans), but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX. In particular, it would somewhat cleanly resolve some "please unshoot my foot" situations that inexperienced and/or sleep-deprived users might get themselves into, for example choosing the wrong compression algorithm/level a year before realizing it, without the need to figure out and possibly script (recursive) zfs send and receive. Also, zfs is probably in a better position to do a much cleaner in-place swap of the two versions of the dataset when the 'rewrite' is done, probably like a snapshot rollback, and will most likely not forget to delete the old version afterwards, unlike my hacky scripts, which break all the time. 😉

Future future work ideas:

  • Defrag mode: only rewrite fragmented datasets, for some definition of fragmented. (Without knowing implementation details, it sounds like it could be a two-phase process like scrubs?)
  • -o encryption=on (from off) might be a useful thing to support, now that we allow unencrypted children. A future³ PR might add a way to migrate between crypto ciphers.

InsanePrawn avatar Dec 22 '19 14:12 InsanePrawn

@InsanePrawn

  • Good point, it should
  • Could be interesting
  • Considering all data would get read and re-written sequentially, it would defrag the drive, yes.

One could hack something like this together using zfs send/recv, it'd probably involve a clone receive and some upgrade shenanigans, but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX

Yes, that's mostly the point... I think more advanced users can do things that get pretty close (and pretty hacky), but making it "as easy as possible" for the median user was the goal of my feature request...

PrivatePuffin avatar Dec 25 '19 12:12 PrivatePuffin

@InsanePrawn, given enough space, yes, a transparent ZFS send/receive would be a way to go. All new writes go to the new dataset, and any read not yet available in the new dataset would fall back to the old dataset. Once the entire dataset is received, the old dataset is destroyed.

Theoretically, we could almost do it without enough space for the whole dataset. Once a file is entirely copied to the new dataset, it could be deleted from the source dataset.

If something like this were implemented, resuming after a zpool export would also have to be part of the work. Otherwise, the pool would remain in a partially migrated state.

This does have the advantage of re-striping the data. Simple example: you have 1 vDev, and when it gets fullish you add a second vDev. The data from the first (if not changed) remains only on the first vDev, and even newly written data may favor the second vDev, as it has the most free space. Something like the suggestion above can help balance data, even if we don't need to change checksum, compression or encryption algorithms.
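
A quick way to observe that effect with existing tools (not part of the proposal, just an illustration): compare per-vdev allocation before and after a full rewrite of the data, e.g.

    zpool list -v tank        # per-vdev ALLOC/FREE/CAP before and after the rewrite
    zpool iostat -v tank 5    # watch new writes favoring the emptier vdev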

Back to reality: snapshots, and possibly even bookmarks, would be a problem. Even clones of snapshots that reference the old dataset would still reference the old data & metadata (be it compression, checksum or encryption changes).

Lady-Galadriel avatar Jan 01 '20 21:01 Lady-Galadriel

I think a simple file/dir "reseat" interface would be the most practical, i.e. an operation that did this transparently:

    cp A TMP
    rm A
    mv TMP A

Perhaps not the easiest to implement. Lustre has a similar feature called "migrate", which is more about re-striping data.

Snapshots etc should just keep referencing the old data.
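
For comparison, a purely userspace approximation of such a reseat could look roughly like this (a hedged sketch, assuming bash and GNU coreutils; it is not atomic with respect to concurrent writers, breaks hard links, and the rewritten copies consume extra space until older snapshots are destroyed):

    find /tank/data -type f -print0 |
    while IFS= read -r -d '' f; do
        # copy preserving metadata, then atomically replace the original
        cp -a -- "$f" "$f.reseat.tmp" && mv -- "$f.reseat.tmp" "$f"
    done

The point of a native, ZFS-aware version would be to avoid exactly this kind of fragile scripting.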

hhhappe avatar Jan 13 '20 10:01 hhhappe

My interest is in copies= changes as per the above-mentioned ticket (2 to 1, in particular). In that specific sub-case it feels like it should be something like:

    file handle -> list of pointers to the data (and properties thereof), with duplicates for every block if copies=2

So in my mind (and perhaps not in the source code) it would be as simple as looking at that structure, choosing one of each pair of duplicates to release, removing that duplicate from the list, and freeing that region of the block device for reuse. And in reverse: iterating over the single set of blocks, writing a new copy of each, and adding them to the list/set.

I could see the encryption and compression being more difficult as you'd have to decode the existing block and write it again with a new algorithm and then swap the entire block set out for the file in question, somehow atomically. I'm not sure if there's a layer there that would allow two sets of blocks under the hood and the file handle to switch pointers from one to the other.

fredcooke avatar May 16 '21 12:05 fredcooke

@fredcooke Metadata is checksummed too, so we can't easily change those structs. But you're right, the copies case may have room for some hacks to ignore a wrong (freed and already reused) copy; the question is how ugly and how cheap that would be.

gmelikov avatar May 16 '21 12:05 gmelikov

Surely a checksum can be recalculated and rewritten too, just as if the file itself is modified, no?

What both of these tickets need is a champion who is expert in the guts of this beast to come up with a coherent thorough file-re-write-in-place plan and then delegate the work out to mere mortals like me :-D

fredcooke avatar May 16 '21 12:05 fredcooke

Surely a checksum can be recalculated and rewritten too, just as if the file itself is modified, no?

Aaand you need to recalculate checksums for all blocks in the Merkle tree (if you try to change existing blocks in place, which we shouldn't do in the ZFS CoW paradigm). For a general solution you might want to look at the "block pointer rewrite" idea, which is hard to implement: https://github.com/openzfs/zfs/issues/3582#issuecomment-123901505

Not that I want to demotivate you, it would be really great to have bprewrite at last!

gmelikov avatar May 16 '21 14:05 gmelikov

So I watched Matt Ahrens' 1.5-hour 2013 OpenZFS talk on YT, and BP rewrite as he describes it is something that would modify the past, not just the present, and is therefore risky and difficult, as detailed in that video and elsewhere. I don't want that.

I want snapshots to remain immutable and honest; I think anything else is harmful. I do see the uses of BP rewrite for defrag, device removal, rebalance, etc., but perhaps those will always be pipe dreams in order to keep the project moving, or perhaps an offline-only variant would be fine for those.

However, something lower-level than cp/mv and less painful than send/receive would be nice to have for rewriting files, without doing it in userspace, without doing it globally, and without making a snapshot a lie.

I'd be happy enough with something ZFS-aware that could rewrite a tree of files as needed so that the latest settings stick, with the understanding that the rewritten data would exist in addition to what prior snapshots reference and thus require some snapshot cycling to reclaim the space (normal). Then I could gradually migrate a sub-dir at a time or a dataset at a time, and once the earlier snapshots were all gone, the space would naturally be freed and there'd be room to do the next one, etc.

Might be time to start poking around the source instead of talking hypothetically at a high level about something I know nothing about :-D

fredcooke avatar May 17 '21 03:05 fredcooke

Dear all, I would like to support this question and request: I mostly use lzo or lz4 for compression, and for some kinds of storage I would afterwards like to switch to zstd with maximum compression or zstd-fast.

The weird thing is that (I know ZFS is not btrfs) BTRFS can recompress its files, e.g.: https://askubuntu.com/questions/129063/will-btrfs-automatically-compress-existing-files-when-compression-is-enabled https://wiki.ubuntuusers.de/Btrfs-Mountoptionen/

I am still wondering why it has not been implemented in ZFS.

djdomi avatar Aug 30 '21 12:08 djdomi

Possible duplicate of #3013? But I still hope it will come someday.

Konrni avatar Jun 09 '22 16:06 Konrni

@djdomi

While I'm not an expert on ZFS by any means, I do know that changing the compression on btrfs is accomplished through the "defragment" mechanism... but this has a number of pitfalls (which, as of now, AFAIK, haven't been solved), notably that it removes deduplication. (Deduplication in btrfs is completely different from ZFS: offline, i.e. applied after data is written, similar to NTFS deduplication, vs. online/on-the-fly...)
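
For reference, the btrfs mechanism mentioned here is roughly the following (mount point is a placeholder):

    # rewrite existing extents, recompressing them with zstd
    btrfs filesystem defragment -r -czstd /mnt/data

with the caveat described above: rewriting the extents un-shares reflinked/deduplicated data.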

  • If one wants to increase the compression ratio of currently compressed data

For the most part, decompression speeds are really fast with both ZSTD and LZ4. In fact, ZSTD is somewhat unique in that decompression speed is pretty constant regardless of the compression level; this article from the FreeBSD Journal has an excellent analysis of this very thing... (Their results indicate that in some cases ZSTD decompression can be even faster than LZ4, not to mention faster than no compression...) Given that they're both fast and pretty constant, I would suggest that there isn't much to be gained by changing compression in order to improve decompression speed (unless you're not using ZSTD/LZ4).

With that in mind, the problem comes down to: tuning the speed of new writes (which can easily be done with zfs set compress=<whatever> pool/vol) and possibly "upgrading" existing data to ZSTD/LZ4 (which, while not perfect, can be done for the most part with a zfs create and rsync).
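
A rough sketch of that second, imperfect route (pool/dataset names, the zstd level and the mountpoints are placeholders; snapshots and shares still need separate handling):

    # new writes only: just change the property
    zfs set compression=zstd-3 tank/vol

    # existing data: copy into a freshly created dataset
    zfs create -o compression=zstd-3 tank/vol_new
    rsync -aHAX /tank/vol/ /tank/vol_new/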

Given that decompression speed is always (or nearly always?) faster with ZSTD/LZ4 compared to no compression, I can't imagine a scenario where you'd want to remove it...? (And if it isn't faster on your hardware, that's something that should be tested/benchmarked/discovered before putting a system into production.)

danieldjewell avatar Jul 31 '22 23:07 danieldjewell