For discussion: "slack" compression option
Motivation and Context
We have a customer that deals in very large, generally incompressible data. They run with compression=off and lately have been considering recordsize=16M. In testing, we found that this presented a real bottleneck for files that only fill a small amount of the last record, e.g. a 17M file.
The main reason for this is simply that we're doing a lot more write IO than we need to - we write all those trailing zeroes that we don't even need. There's also substantial extra CPU load, e.g. a lot of time spent checksumming.
So, we’ve been experimenting with various ways of ignoring the “slack space” of trailing zeroes at the end of each record. We call this “slack compression”.
There was some interest in this at the last OpenZFS Leadership call, including in maybe doing it outside of the compression system entirely. So, I'm posting this PR as a starting point for discussion!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Method
The method is pretty simple no matter how it's hooked up:
- In `zio_write_compress()`, call a function that will search the data buffer to find the last non-zero byte in the file. This value is the `psize` for the write.
- In `zio_read_bp_init()`, push a transform that will set `psize` to the `lsize`.
Yes, this is basically describing compression :smile:
The slack_compress() function shows a fairly simple implementation of the search. It runs backwards through the data buffer as an array of uint64_t. It exits at the first non-zero value it finds. This implementation is loosely modeled on zio_compress_zeroed_cb() (indeed, our earliest experiments were simply exposing ZIO_COMPRESS_EMPTY). There are obvious optimisation opportunities available here, including vectorisation and letting it work on a scatter ABD to avoid the need to linearise the incoming buffer.
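As a rough, standalone illustration of the idea (this is not the PR's actual slack_compress(), and the names are mine), the backwards search might look something like this:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only: walk the record backwards as an array of
 * uint64_t and stop at the first non-zero word. The returned length
 * (the end of the last non-zero word) would become psize for the write;
 * an all-zero record returns 0. Assumes lsize is a multiple of 8.
 */
static size_t
find_data_end(const void *buf, size_t lsize)
{
	const uint64_t *words = buf;
	size_t n = lsize / sizeof (uint64_t);

	while (n > 0 && words[n - 1] == 0)
		n--;

	return (n * sizeof (uint64_t));
}
```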
The interesting part is how and when to use this functionality.
compression=slack (this PR)
This PR shows the simplest form: adding a compression=slack option. This works just fine as a substitute for compression=empty, and the ARC gets the benefit as well.
Early on, I plumbed this method directly into zio_write_compress rather than hooking it up to zio_compress_table, because we're not actually doing any compression, so we don't need a destination buffer or a copy; the source buffer can be reused. As it turns out, it only made a small difference to performance because the search function still requires a linear buffer.
This works just fine. It could be merged today (if we had any spare DMU_BACKUP_FEATURE_* bits). But it's not really how I would like to do it.
compression=none and force psize to lsize
I am probably wrong about this, but as far as I could tell, there is only one situation outside of compression where we store psize < lsize, and that is when the ZIL is stored on a non-raidz vdev. Before issuing IO, zio_shrink() is called to set psize as small as possible. On read, the data is read into an lsize'd buffer, so it naturally falls out right. I haven't tried hard to find the code, but my intuition is that this doesn't work for raidz because the block gets sliced up further, and that goes wrong if the sizes aren't right coming in.
If that’s right, that should mean that we could set a transform in zio_read_bp_init() for any uncompressed block with psize < lsize and it would “just work”, without requiring a separate compression option. It would still need a read feature flag as older software won’t understand it (perhaps only active if written to a raidz vdev), but possibly won’t require a stream feature flag.
We wouldn’t have to, but we could also teach the ARC about this, so that even though compression is “off” it could still keep only the real data portion in memory rather than larger buffers of zeroes.
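Either way, the read side conceptually only has to put the short on-disk data at the front of an lsize'd buffer and re-create the slack as zeroes. A standalone sketch of that step (not the actual zio transform; names are mine):

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the read-side expansion: psize bytes came off disk, and the
 * remaining (lsize - psize) bytes of the logical block are known to be
 * zero, so copy the data to the front and zero-fill the tail.
 */
static void
slack_expand(void *lsize_buf, size_t lsize, const void *psize_buf, size_t psize)
{
	assert(psize <= lsize);
	memcpy(lsize_buf, psize_buf, psize);
	memset((char *)lsize_buf + psize, 0, lsize - psize);
}
```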
Slack with content compression?
We didn’t try this, but it might be possible to use slack with data compression. This might be worthwhile, as collapsing zeroes could perhaps be made faster than a general compressor and may still have less overhead.
I've not thought hard about this, but I think the method should be as simple as collapsing zeroes before invoking the real compressor and, when reading back, resetting psize after decompression. However, I think it would require that compression algorithms are able to know their decompressed size from the compressed state, since we can't tell them ahead of time.
If that's no good, then I can see taking a spare bit from somewhere in the BP (maybe bit 7 of the checksum type?) and using that as a signal that a further transform is necessary after decompression. Hmm, or maybe not, because where is the size? Do we even need the final size for read? OK, I would really need to think about that more.
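Purely to make the write-side half of this concrete, here is a hedged sketch of how the two stages might chain together (compress_fn_t is a stand-in for whatever real compressor is configured; none of these names are from the PR):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a real compressor: returns the compressed length. */
typedef size_t (*compress_fn_t)(const void *src, size_t srclen,
    void *dst, size_t dstlen);

/*
 * Hypothetical two-stage write path: trim trailing zeroes first, then hand
 * only the non-zero prefix to the real compressor. The open question above
 * remains: on read, the decompressor has to be able to work out the
 * (collapsed) size it should produce.
 */
static size_t
slack_then_compress(const uint8_t *src, size_t lsize,
    void *dst, size_t dstlen, compress_fn_t compress)
{
	size_t datalen = lsize;

	while (datalen > 0 && src[datalen - 1] == 0)
		datalen--;

	return (compress(src, datalen, dst, dstlen));
}
```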
Slack compress the last block only?
We also spent some time trying to see if we could trigger slack compression only for the last block in an object, for when we know that the "inner" blocks are full and so pointless to search, but it seems to be a non-starter. If a file is written in order from start to finish and the ZIL isn't used, then it's usually possible to have a clear notion of "last block", which can be plumbed down through the write policy. If the block is written out to the ZIL though (and later hooked up with WR_INDIRECT), then there's no way to know that it was the "last" block. Since our customer has an fsync()-heavy workload, most writes were of this type. For workloads that aren't calling fsync() a lot it can be better, but there are also problems if holes or zero-fills are introduced into "inner" blocks: those blocks could benefit from slack compression but are now unable to. Taken together it ends up being a very situational gain, so we haven't pursued it further.
Make ZLE faster?
We considered that perhaps we could avoid all this by optimising ZLE. The main difference is that ZLE is a true compression algorithm, and so copies the data. It could certainly be optimised, but regardless, it would require the entire block to be scanned, so we'd be going from a full scan plus checksum to a full scan, some amount of copying, and then another (shorter) scan for the checksum. We think the best case for slack will be no copies, scanning backwards to find the end of data, and then a forward scan to compute the checksum, so at the end, the entire block has only been seen once and no copies made.
(That said, ZLE is ripe for some optimising work, though I doubt many people are using it directly these days)
Performance
We have a test scaffold available that runs a representative simulation of the customer’s workload: open(), write(), fsync(), close(). We use this to test the effects of config changes, patches, etc.
Below are results for varying combinations of recordsize and compression, at different file sizes. The test is done on three 12-wide raidz3 pools, each with 10 writing threads, for 10 minutes. Note that there is a lot of other tuning involved (standard for the customer workload), so these are useful for ballpark comparison only.
The "throughput" column shows the "real" amount of file data, not the total written, which helps compare the loss due to trailing zeroes. The "fsync" column is the average time for the fsync() call to return, which is a useful proxy for the total IO time (and an important number for our customer, as their clients are blocked until this returns).
These results have mostly given us confidence that even this quite naive version helps bring us closer to where we were with none, and can even rival lz4 some of the time. Our next steps are likely to see what we can gain by removing or minimising the number of allocations and copies required, and by improving the performance of the searching function.
| recordsize | compression | file size | files written | throughput (MB/s) | fsync (ms) |
|---|---|---|---|---|---|
| 128k | none | 15m | 84892 | 707 | 203.000 |
| | | 16m | 84054 | 747 | 202.667 |
| | | 17m | 77221 | 729 | 223.000 |
| | | 24m | 58366 | 778 | 295.000 |
| | | 31m | 47749 | 822 | 360.333 |
| | lz4 | 15m | 84937 | 707 | 203.000 |
| | | 16m | 83768 | 745 | 203.333 |
| | | 17m | 77605 | 733 | 222.333 |
| | | 24m | 57284 | 764 | 300.667 |
| | | 31m | 46755 | 805 | 368.000 |
| | zle | 15m | 82360 | 686 | 209.000 |
| | | 16m | 82078 | 729 | 207.000 |
| | | 17m | 75475 | 713 | 228.333 |
| | | 24m | 54590 | 728 | 315.333 |
| | | 31m | 43946 | 757 | 391.333 |
| | slack | 15m | 85562 | 713 | 201.000 |
| | | 16m | 84296 | 749 | 202.000 |
| | | 17m | 77982 | 736 | 220.333 |
| | | 24m | 58677 | 782 | 293.333 |
| | | 31m | 47734 | 822 | 359.333 |
| 16m | none | 15m | 85691 | 714 | 176.333 |
| | | 16m | 89569 | 796 | 159.333 |
| | | 17m | 47338 | 447 | 306.333 |
| | | 24m | 47172 | 629 | 304.667 |
| | | 31m | 46861 | 807 | 304.333 |
| | lz4 | 15m | 92176 | 768 | 158.667 |
| | | 16m | 88380 | 785 | 161.000 |
| | | 17m | 66540 | 628 | 187.667 |
| | | 24m | 58349 | 778 | 225.333 |
| | | 31m | 45889 | 790 | 299.333 |
| | zle | 15m | 86731 | 723 | 165.000 |
| | | 16m | 86580 | 769 | 163.000 |
| | | 17m | 61562 | 581 | 196.333 |
| | | 24m | 57622 | 768 | 232.000 |
| | | 31m | 46719 | 804 | 301.333 |
| | slack | 15m | 90687 | 756 | 157.667 |
| | | 16m | 88439 | 786 | 161.333 |
| | | 17m | 62187 | 587 | 190.667 |
| | | 24m | 56066 | 747 | 216.667 |
| | | 31m | 46534 | 801 | 306.000 |
Feedback
We’ll keep working on this, but we’re interested in feedback on any or all of the above. If you’ve got workloads you can test this on, or might like us to try, let us know. If you have thoughts about what this might mean if it wasn’t a compression option but “always on”, great! (especially in “weird” IO situations, like gang blocks, or metadata, or whatever). I’d like to hear more about what it all might mean for block pointer and IO pipeline changes. And I dunno what else; if you were interested in the dev call, please throw some thoughts down too!
@rincebrain Have you had a look at this?
For what it is worth, my opinion is that optimizing ZLE should be preferred. After all, if LZ4 is so fast, why should zero elimination be slower? I understand the concerns about buffer copies - but if the end result is faster/simpler, I cannot see any real issue.
Last push just rebasing to master in preparation for more work, nothing to see here.
Experimental work from yesterday and today pushed.
First, zfs_slack_compress is converted to operate on ABDs directly, rather than linearising. In simple tests this makes very little difference, as the IO buffers are usually LINEAR_PAGE, so it's just one chunk. More chunks are going to be slower, because it has to walk forward through the chunks. It's a step though.
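As a toy model of what that chunk walk does (plain C, not the ABD code itself; the real version maps each chunk and the details differ), a forward pass can just remember the end offset of the last non-zero word it has seen:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a scatter ABD: an ordered list of chunks. */
struct chunk {
	const uint64_t	*words;
	size_t		 nwords;
};

/*
 * Walk the chunks in order, tracking the end offset (in bytes) of the last
 * non-zero word seen anywhere. Unlike a backwards scan of a linear buffer,
 * a forward walk can't stop early, which is part of why more chunks cost more.
 */
static size_t
find_data_end_chunks(const struct chunk *chunks, size_t nchunks)
{
	size_t off = 0, end = 0;

	for (size_t c = 0; c < nchunks; c++) {
		for (size_t i = 0; i < chunks[c].nwords; i++) {
			if (chunks[c].words[i] != 0)
				end = off + (i + 1) * sizeof (uint64_t);
		}
		off += chunks[c].nwords * sizeof (uint64_t);
	}
	return (end);
}
```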
Next is in-place compression. This allows a compression method to be "in-place" or "zero-copy", where it operates on the source ABD only. I've changed the zio_compress_data return semantics a little, so that now you find out whether compression happened by comparing the compressed length to the source length. If compression happened, the output is in the dest buffer if the caller passed one, in a dest buffer if one needed to be allocated, or in the source buffer if not. See the comment. The point is, the caller can indicate what they want to happen and understand what did happen wrt copies.
Finally, I added a very, very early prototype reverse iterator for ABD. About all I can say is that it works. It's not too bad, but inefficient on the "advance" step for gang ABDs. Not that those are really tested; it's all pretty theoretical at this point.
Very, very early numbers are promising. These are short fio runs overwriting the 16-17M range of a 17M file with recordsize=16M, on a memory-backed pool. The only tuning is primarycache=metadata, and this is not really a good machine for performance work, but it's what I have right now (I'm working on getting access back to the machine & workload I ran the original numbers on).
95b5cb1ba rob.nor.. 2024-11-18 compress: add "slack" compression option
compression=off:
WRITE: bw=238MiB/s (249MB/s), 238MiB/s-238MiB/s (249MB/s-249MB/s), io=10.0GiB (10.8GB), run=43161-43161msec
compression=slack:
WRITE: bw=518MiB/s (543MB/s), 518MiB/s-518MiB/s (543MB/s-543MB/s), io=18.3GiB (19.7GB), run=36258-36258msec
compression=lz4
WRITE: bw=561MiB/s (588MB/s), 561MiB/s-561MiB/s (588MB/s-588MB/s), io=19.7GiB (21.1GB), run=35872-35872msec
87b4315c8 rob.nor.. 2024-11-18 WIP: slack: compress in-place
compression=slack:
WRITE: bw=636MiB/s (667MB/s), 636MiB/s-636MiB/s (667MB/s-667MB/s), io=21.7GiB (23.3GB), run=34931-34931msec
458fcefbc rob.nor.. 2024-11-19 WIP: slack: use reverse iterator
compression=slack:
WRITE: bw=628MiB/s (658MB/s), 628MiB/s-628MiB/s (658MB/s-658MB/s), io=21.3GiB (22.9GB), run=34787-34787msec
compression=lz4
WRITE: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=19.4GiB (20.8GB), run=35994-35994msec
So roughly what we saw before, but reducing copy overhead helps a lot.
With regards to true upstreaming, at this point I'm thinking to offer something very close to this: in-place/zero-copy compress, much improved reverse iterator, and that's about all as a first cut.
I thought about doing in-place for decompress too, but so far I don't think it will be very useful just because of how it's used. Most of the time, it's going into the ARC, and we want to keep that storing the "compressed" copy, so "decompressing" into the same buffer, even if it was big enough, probably isn't going to come up very often.
Though, that's also the kind of place where I feel like compress=slack perhaps shouldn't be expressed as just another compression option. If the ARC knew, then when "decompressing" its copy, it could instead hand out an ABD that is a view over the "compressed" ABD, plus a bunch of zero pages at the back. No copies, no double allocation. But that would of course be a special case just for this compression method. Which maybe is fine, but maybe also says it's not really compression, and should be something else.
In another world, maybe I'd name it after zio_shrink() and call it "shrinkwrap". But regardless, we can't store something extra without a signal in the block pointer that some post-processing is required.
Other things I want to play with that seem adjacent to this:
- optimise the zero search. AVX-based population counts are pretty easy to put together (a rough sketch follows this list)
- "lazy" ABDs, where we don't allocate the buffer/chunks until first use (borrowed, mapped, iterated), then zero regions can just be empty chunk slots and/or points to zero_page, and most of the zero scan could just be skipping the empty chunks. Of course, that makes LINEAR_PAGE interesting; maybe scatter ends up being non-contiguous that way.
That's it. Have at it, I'll be back tomorrow :)
Would it be possible to add a new partial-block-only compression algorithm that stores full blocks (power of two, ≥ 1<<shift, ≤ record size) uncompressed (thus avoiding the copy implied by even ZLE compression/decompression), and only attempts to compress partial blocks, when they compress enough to reduce disk space (and bandwidth)?
@Crest All blocks in ZFS are full blocks of record size, unless it is a file of only one block, in which case it might be any multiple of 1<<ashift up to record size.
@robn - what do you think, could you rebase the thing for rechecking the changes via ZTS?
Edit: the 4 times XXX is a nice hint for reviewing, thanks
Edit2:
What do you think about using zstd with level 1 (or even lower, when more speed is needed)?
| raw zeros | lz4 | zstd -1 |
|---|---|---|
| 1M | 4.1K | 55 |
| 100M | 402K | 3.5k |
| 1G | 4.1M | 36k |
Zstandard compresses raw zeros really well.