zio_compress: introduce max_size threshold
Created a brand new PR because of problems with rebasing the old one (#9311), and updated the description.
Motivation and Context
The required minimum compression to 87.5% of the original size is way too strict: on 1TB with recordsize=16M this artificial limitation may lead to up to 125GB of wasted space in the worst case (but you must be very, very unlucky).
So I propose to check only against the pool's max ashift; on large recordsizes this lets blocks stay compressed even at compression ratios of nearly 99%.
And I think that if you enable a slow compression algorithm on a dataset nowadays, you prefer a better compression ratio over avoiding decompression on reads.
So, pros:
- better compressratio
- better I/O if decompression is faster than the disks
- we're already compressing everything
- it can be tuned for less compressible data via a module parameter
Cons:
- data that the previous code stored uncompressed (because it missed the threshold) will now be written compressed, so you'll see a better compressratio at the price of more CPU load
Description
Now the default compression is lz4, which bails out of the compression process by itself on incompressible data, so we only need to check the result against the pool's common sector size.
How Has This Been Tested?
Generate a 12% compressible file (just enough to miss the 87.5% threshold) with fio:
```
# cat fio.conf
[workload]
bs=128k
ioengine=libaio
iodepth=1
numjobs=1
direct=1
runtime=10
size=100m
filename=bigfile
rw=read
thread
unified_rw_reporting=1
group_reporting=1
buffer_compress_percentage=12
refill_buffers
buffer_pattern=0xdeadbeef
```
```
zfs set compression=lz4 rpool/test
```
| version | recordsize | ashift | written | diff (from original) |
|---|---|---|---|---|
| 0.7.13 | 1M | 12 | 100M | |
| master_patched | 1M | 12 | 88.8M | ~11.2% |
So there can be up to ~12.5% of additional compression gain for barely compressible files.
Tested manually in a VM; I haven't run a real system on this code yet.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the ZFS on Linux code style requirements.
- [x] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
Codecov Report
Merging #9416 into master will increase coverage by <.01%. The diff coverage is 100%.
```
@@ Coverage Diff @@
## master #9416 +/- ##
==========================================
+ Coverage 79.15% 79.15% +<.01%
==========================================
  Files      416    416
  Lines   123655 123658    +3
==========================================
+ Hits     97876  97880    +4
+ Misses   25779  25778    -1
```
| Flag | Coverage Δ | |
|---|---|---|
| #kernel | 79.75% <85.71%> (+0.01%) | :arrow_up: |
| #user | 66.64% <85.71%> (-0.54%) | :arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 685abd5...39d1434. Read the comment docs.
@gmelikov Thanks for pointing me here. I can at least confirm that 12% compressible writes still don't get compressed on current master, so this is not just an issue with 0.7.13.
Good work on this, LGTM, this PR deserves some more love! 👍
edit: To be clearer, this issue isn't limited to extremely large record sizes; with 1M record sizes this can already contribute significant savings!
I would prefer to eliminate the module parameter, and instead "use sector size of this pool as threshold". I.e. always make the threshold 1 sector (spa_max_ashift).
@ahrens hmm, I agree with you, we don't need a larger threshold here with lz4 and the future zstd.
Will get back to this PR in a week.
Codecov Report
Merging #9416 into master will decrease coverage by 0.00%. The diff coverage is 100.00%.
```
@@ Coverage Diff @@
## master #9416 +/- ##
==========================================
- Coverage 79.38% 79.38% -0.01%
==========================================
  Files      388    388
  Lines   123392 123396    +4
==========================================
  Hits     97953  97953
- Misses   25439  25443    +4
```
| Flag | Coverage Δ | |
|---|---|---|
| #kernel | 79.83% <87.50%> (-0.10%) | :arrow_down: |
| #user | 66.07% <75.00%> (+0.56%) | :arrow_up: |
| Impacted Files | Coverage Δ | |
|---|---|---|
| module/zfs/arc.c | 85.65% <100.00%> (+0.44%) | :arrow_up: |
| module/zfs/zio.c | 88.76% <100.00%> (+0.25%) | :arrow_up: |
| module/zfs/zio_compress.c | 92.85% <100.00%> (+0.54%) | :arrow_up: |
| module/os/linux/spl/spl-zlib.c | 55.35% <0.00%> (-28.58%) | :arrow_down: |
| module/os/linux/spl/spl-kmem-cache.c | 75.58% <0.00%> (-8.14%) | :arrow_down: |
| module/zfs/dsl_synctask.c | 92.40% <0.00%> (-2.54%) | :arrow_down: |
| module/zfs/zil.c | 91.25% <0.00%> (-1.87%) | :arrow_down: |
| module/zfs/vdev_indirect.c | 73.83% <0.00%> (-0.84%) | :arrow_down: |
| module/os/linux/zfs/vdev_disk.c | 83.27% <0.00%> (-0.73%) | :arrow_down: |
| module/zfs/vdev_trim.c | 95.66% <0.00%> (-0.73%) | :arrow_down: |
| ... and 51 more | | |
Powered by Codecov. Last update a7929f3...f7523c3. Read the comment docs.
Updated:
- I've removed the module parameter and now use the pool's max_ashift;
- tested interoperability with old code: manually imported a pool with the new code, created data with 89% effective compression, imported it with the old code - scrub/read/write is ok; rewrote the data, imported it with the new code again - scrub/read/write ok;
- nightly ztest is almost all green, I got just one `ASSERT at ztest.c:6050:ztest_scrub()/home/debian/zfs/cmd/ztest/.libs/ztest(+0x9444)[0x564c51a51444]`, but it looks unrelated to this patch. I've found only one place in the code where we recompress data to checksum it - https://github.com/openzfs/zfs/pull/9416/files#diff-163982413ed6b6861e08633c89a77395R1759 - so it looks like an encrypted dataset with an uncompressed ARC may trigger a checksum error on a pool with 89+%-compressed data under the old code. @tcaputi could you please help me here and either confirm this or give some information? I can't reproduce this case, so it looks like I'm wrong;
- I looked through all `zio_compress_data()` usages and can conclude that if data wasn't compressed, it is saved as `ZIO_COMPRESS_OFF`, so if we really need to recompress data without ashift info and check anything, we can just compress it with a zero threshold;
- the old code (apart from the uncompressed ARC plus encryption case) doesn't recompress data for checksums, so once that case is resolved we can use this patch without a feature flag!
@gmelikov Without looking at it too closely, I think I might know what's going on. I think there are two things at play which combine to make the ARC's "recompress before checksumming" work ok, but it is a pretty brittle dependency.
For some quick background, we need to store data in the L2ARC exactly the same as it is on disk in order for the decryption to work. This is why we do the re-compression here in the ARC layer.
When the data gets recompressed, the ARC uses the compression algorithm stored in the ARC header. This info comes from the bp when the block first enters the ARC. This patch doesn't change the way compression is done (AFAICT), it just changes the threshold at which we decide the compression is "worth it". However, no matter what that threshold is, if ZFS decides not to compress the block it sets the compression algorithm to `ZIO_COMPRESS_OFF`. So the encryption code will only recompress a block that was actually compressed. This code does NOT have its own check for whether the compression was worth it; it simply relies on the value from the bp, so it ends up doing the right thing in all these cases.
That being said, this discussion is making me wonder if QAT compression support breaks the L2ARC with encryption + no compression. I suspect it does, but that is a very niche bug with very minor repercussions: the data simply fails to read from the L2ARC and is fetched from the main pool instead. Probably a problem for another time.
Anyway, hope that helps.
@tcaputi Thank you for the insight, you're right: if it was compressed, we can use a zero threshold here too, so I was wrong (and glad to be)!
So, this patch is ready to review, I've rebased it.
(as an aside, the comments here talk about a >1MB recordsize a few times - I thought 1MB was the upper limit?)
@adamdmoss by default it is, but technically 16MB is the upper limit now; you can just change this module parameter and voilà: https://github.com/openzfs/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_max_recordsize
It's useful if you have a really large-block workload with rare reads.
It's still questionable whether this is correct, because your threshold is then a sort of per-sector alignment, and you are using this shift as the threshold in zio_write_compress too, which is unrelated to your L2ARC case. A standard recordsize-oriented block has a different alignment. So just using a check like destsize < srcsize should be enough, from my point of view.
@BrainSlayer I don't agree: the sector size is used for the threshold because we can only save space on disk if we save at least one full sector, not less.
But a recordsize of 1M (or 128KB by default) will span multiple sectors. So doing it on sector alignment is okay, but using the sector size as the maximum boundary makes no sense.
@BrainSlayer I can't understand what you mean. With `d_len = s_len - compress_threshold;`, for a 128K block and a 4K sector size we set it to "the compressed result must be less than 124K, or write it as is"; you can see it here: https://github.com/openzfs/zfs/pull/9416/files#diff-cddcbcd27b593866d4d3a2ffbc06ff4bR134
Besides, you can add a comment to the exact line of code to be more precise.
From my understanding, the physical sector size is much smaller than the block size, so I would solve it this way: targetsize = inputsize - (inputsize % sectorsize). That ensures the result is sector aligned, no matter how big the block size is.
But an uncompressed block will still ignore this rule, of course. So the question is whether aligning really brings any benefit, because an uncompressed block will still cross the limit.
Regarding your formula: 128KB - 4K = 124K, but 128K is already sector aligned, so there is no need to subtract anything.
@BrainSlayer It has nothing to do with alignment, though - it's a threshold to check whether the compression actually saves space. In that scenario, if the result is more than 124KB we're writing 32 4KB sectors anyway and can just write it uncompressed; if it's less, we've compressed enough to save a sector. The original code does the same calculation, just with a higher threshold:

```
/* Compress at least 12.5% */
d_len = s_len - (s_len >> 3);
```
@BrainSlayer ZFS stores data in 2^ashift-sized blocks on disk; this is distinct from both the sector size the drives expose on their external interface (SAS, SATA, whatever) and the volblocksize/recordsize configured for the dataset. Storing data compressed only makes sense when it needs fewer on-disk (ashift-sized) blocks than the uncompressed data would, as the overhead of having to decompress the data on access is then at least offset by the reduced amount of data that needs to be read from the pool.
Apart from that: this isn't IRC or Discord, so please try to write one comment instead of several, as each comment sends a mail to every participant in this issue.
@thulle Yes, but the 12.5% is an arbitrary value. I mean, the ratio is only known after compression, so the cost of doing the compression has already been paid; even a small ratio still makes sense to keep. With slow-decompressing algorithms like gzip such a hard ratio makes sense, but high-performance algorithms like LZ4 or ZSTD don't need it.
@GregorKopka Yes, but a simple compressedsize < sourcesize check would already be enough; we don't need to subtract another sector size on top. In our example with a 4K sector size the limit is 124K: if just 123K is stored, 1K is wasted, but the waste would be the same if we didn't subtract anything and 127K was stored. We only limit the efficiency. Performance-wise it makes no difference, since algorithms like LZ4 or ZSTD are faster than physical drive access in most use cases.
@BrainSlayer Which is the reason for this pull request? To get rid of that 12.5% "wild value".
And if we use compressedsize < sourcesize, then a 128K record compressed to 127K would still use 128K on disk, and we'd just waste some time decompressing it again on every read. We might as well save it uncompressed for slightly better performance.
@thulle I cannot answer this question; read the topic of this pull request - it's not mine.
Yes, 127KB will still waste 1KB, but it's the same waste as if the block were stored uncompressed, and it makes no difference on reads. But if we require less than 124KB, the probability of storing 128K blocks uncompressed is higher, which also doesn't make any sense. And sector sizes can also vary a lot; SSDs, for instance, typically have a 512-byte physical addressing size no matter how the filesystem is configured.
@BrainSlayer With the caveat that I'm not familiar with the code, but wanting to spare the actual devs from explaining what I perceive as misunderstandings on your end:
It does make a difference. Comparing reading 128KB versus reading 128KB and running it through a decompression algorithm, the decompression pass - however fast it is - would just be wasted CPU cycles. Thus it only makes sense to write things compressed if we actually save disk space by doing so. And since we have to write in ashift-sized blocks, this pull request changes the threshold from the arbitrary 12.5% compression requirement to "does this compression save at least one ashift-sized block of space?" To accomplish this, we compare against a threshold equivalent to (number of ashift-sized blocks compressed) < (number of ashift-sized blocks uncompressed). If ashift is 12, a 128KB record has to compress to 124KB (128KB minus one saved 4KB block) to be saved compressed. With ashift 14 the threshold becomes 112KB (128KB minus one saved 16KB block).
In my earlier reply I adapted to your use of "sectors" since I thought you used it to describe an ashift-sized block, but that was a misinterpretation on my end. The 512/4096-byte sector size of the drive doesn't matter here.
I agree that at least one sector size would make sense as the minimum compression threshold, since anything more would not create any benefit for physical read/write effort. But consider that these days sector sizes are more or less virtual since the introduction of SSDs, which will become more and more important in the future. And decompression with most modern algorithms is much faster than reading any data from storage. So the question is whether it creates any performance benefit to go this way and make the code more complex.
@BrainSlayer I get your point and generally agree with it, but I think that one simple check in this exact case is still cheaper than decompressing a block without any space gain. What's more, we still want to check that compression didn't make the block larger than the original, so this check is nearly free already, IMHO.
You don't need to check if it's getting worse; the < sourcesize check already catches that: if the compression result is bigger than or equal to the source size, the block is stored uncompressed. That case is already handled; no need to do anything.
Sorry, I have to correct myself: I handled this case in my zstd PR code, but it's not handled by the upstream code. So the zstd PR handles the result-bigger-than-source case for all available compression algorithms.
@ahrens thanks, agree with everything.
@BrainSlayer we have to deal with all compression types here, unfortunately, but we may reconsider this logic in the future without additional feature flags.
@gmelikov I just think a compression-threshold limiter parameter is a logical consequence of this development. Why should we decide what a user wants? Give them the choice; black-box concepts are never good. We can have sane defaults, like threshold 0 = auto (your way), and any other value = a percentage budget of the source size.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 79.49%. Comparing base (161ed82) to head (3268d3c). Report is 2455 commits behind head on master.
:exclamation: Current head 3268d3c differs from the pull request's most recent head 268a9a7. Consider uploading reports for the commit 268a9a7 to get more accurate results.
Additional details and impacted files
```
@@ Coverage Diff @@
## master #9416 +/- ##
==========================================
+ Coverage 75.17% 79.49% +4.31%
==========================================
  Files      402    390     -12
  Lines   128071 123354   -4717
==========================================
+ Hits     96283  98055   +1772
+ Misses   31788  25299   -6489
```
| Flag | Coverage Δ | |
|---|---|---|
| kernel | 79.88% <87.50%> (+1.11%) | :arrow_up: |
| user | 65.20% <75.00%> (+17.78%) | :arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.