
zio_compress: introduce max_size threshold

Open gmelikov opened this issue 4 years ago • 71 comments

Created a brand new PR because of problems rebasing the old one (#9311), plus an updated description.

Motivation and Context

The 87.5% minimum compressratio threshold is way too large: on 1TB with recordsize 16M, such an artificial limitation may waste up to 125GB of space in the worst case (but you must be very, very unlucky).
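To spell out where that number comes from (a back-of-the-envelope estimate, not a figure from the patch itself): the old rule only keeps a block compressed if it shrinks by at least 12.5%, so in the worst case every record misses that cut by a hair and the entire potential saving is lost:

$$
s\_len \gg 3 = \frac{16\,\mathrm{MiB}}{8} = 2\,\mathrm{MiB}\ \text{per record},
\qquad
0.125 \times 1\,\mathrm{TB} \approx 125\,\mathrm{GB}\ \text{in total}.
$$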

So I propose to check only against the pool's max ashift; on large recordsizes that lets you keep nearly all (99%+) of the achievable compression ratio.

And I think that if you enable a slow compression algorithm on a dataset nowadays, you prefer a better compression ratio over skipping decompression on read.

So, pros:

  • better compressratio
  • better IO if decompression is faster than the disks
  • we're already compressing everything
  • it can be tuned for not-so-compressible data via a module parameter

Cons:

  • if you have data that was previously written uncompressed due to the old code, you will see a better compressratio at the price of more CPU load

Description

The default compression is now lz4, which can stop the compression process by itself on incompressible data, so we only need to check against the common sector size.
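A minimal sketch of the idea (illustrative only, not the exact patch code; the helper name and signature are made up, the real change lives around zio_compress_data()):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decide whether a compressed buffer is worth keeping.
 * Old rule: the result had to be at least 12.5% smaller.
 * New rule: saving a single 2^ashift allocation block is enough.
 */
static int
compression_worth_it(size_t s_len, size_t c_len, uint64_t ashift)
{
	/* size_t d_len = s_len - (s_len >> 3);      old 87.5% threshold */
	size_t d_len = s_len - ((size_t)1 << ashift); /* new threshold */

	return (c_len <= d_len);
}
```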

How Has This Been Tested?

Generate a 12% compressible file with fio (12% savings falls just short of the old 12.5% requirement, so the old code stores it uncompressed):

# cat fio.conf
[workload]
bs=128k
ioengine=libaio
iodepth=1
numjobs=1
direct=1
runtime=10
size=100m
filename=bigfile
rw=read
thread
unified_rw_reporting=1
group_reporting=1
buffer_compress_percentage=12
refill_buffers
buffer_pattern=0xdeadbeef
zfs set compression=lz4 rpool/test

| version | recordsize | ashift | written | diff (from original) |
|---|---|---|---|---|
| 0.7.13 | 1M | 12 | 100M | |
| master_patched | 1M | 12 | 88.8M | ~11.2% |

So there may be up to ~12.5% of compression gain for barely compressible files.

Tested manually in a VM; I haven't run an actual system on this code yet.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the ZFS on Linux code style requirements.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

gmelikov avatar Oct 05 '19 19:10 gmelikov

Codecov Report

Merging #9416 into master will increase coverage by <.01%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #9416      +/-   ##
==========================================
+ Coverage   79.15%   79.15%   +<.01%     
==========================================
  Files         416      416              
  Lines      123655   123658       +3     
==========================================
+ Hits        97876    97880       +4     
+ Misses      25779    25778       -1
| Flag | Coverage Δ |
|---|---|
| #kernel | 79.75% <85.71%> (+0.01%) :arrow_up: |
| #user | 66.64% <85.71%> (-0.54%) :arrow_down: |


codecov[bot] avatar Oct 06 '19 04:10 codecov[bot]

@gmelikov Thanks for pointing me here. I can at least confirm that 12% compressible writes still don't get compressed on current master, so this is not just an issue with 0.7.13.

Good work on this, LGTM, this PR deserves some more love! 👍

Edit: to be clearer, this issue isn't limited to extremely large record sizes; with 1M record sizes this can already contribute to significant savings!

Ornias1993 avatar Dec 25 '19 20:12 Ornias1993

I would prefer to eliminate the module parameter, and instead "use sector size of this pool as threshold". I.e. always make the threshold 1 sector (spa_max_ashift).

ahrens avatar Apr 14 '20 15:04 ahrens

@ahrens hmm, I agree with you, we don't need a larger threshold here with lz4 and the future zstd.

Will get back to this PR in a week.

gmelikov avatar Apr 14 '20 17:04 gmelikov

Codecov Report

Merging #9416 into master will decrease coverage by 0.00%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #9416      +/-   ##
==========================================
- Coverage   79.38%   79.38%   -0.01%     
==========================================
  Files         388      388              
  Lines      123392   123396       +4     
==========================================
  Hits        97953    97953              
- Misses      25439    25443       +4     
| Flag | Coverage Δ |
|---|---|
| #kernel | 79.83% <87.50%> (-0.10%) :arrow_down: |
| #user | 66.07% <75.00%> (+0.56%) :arrow_up: |

| Impacted Files | Coverage Δ |
|---|---|
| module/zfs/arc.c | 85.65% <100.00%> (+0.44%) :arrow_up: |
| module/zfs/zio.c | 88.76% <100.00%> (+0.25%) :arrow_up: |
| module/zfs/zio_compress.c | 92.85% <100.00%> (+0.54%) :arrow_up: |
| module/os/linux/spl/spl-zlib.c | 55.35% <0.00%> (-28.58%) :arrow_down: |
| module/os/linux/spl/spl-kmem-cache.c | 75.58% <0.00%> (-8.14%) :arrow_down: |
| module/zfs/dsl_synctask.c | 92.40% <0.00%> (-2.54%) :arrow_down: |
| module/zfs/zil.c | 91.25% <0.00%> (-1.87%) :arrow_down: |
| module/zfs/vdev_indirect.c | 73.83% <0.00%> (-0.84%) :arrow_down: |
| module/os/linux/zfs/vdev_disk.c | 83.27% <0.00%> (-0.73%) :arrow_down: |
| module/zfs/vdev_trim.c | 95.66% <0.00%> (-0.73%) :arrow_down: |
| ... and 51 more | |


codecov-io avatar Apr 18 '20 02:04 codecov-io

Updated:

  • I've removed the module parameter and now use the pool's max_ashift;
  • tested interoperability with the old code: manually imported a pool with the new code, created data with 89% effective compression, imported it with the old code (scrub/read/write OK), rewrote the data, then imported it with the new code again (scrub/read/write OK);
  • nightly ztest is almost all green; I got just one ASSERT at ztest.c:6050:ztest_scrub()/home/debian/zfs/cmd/ztest/.libs/ztest(+0x9444)[0x564c51a51444], but it looks unrelated to this patch;
  • I've found only one place in the code where we recompress data in order to checksum it - https://github.com/openzfs/zfs/pull/9416/files#diff-163982413ed6b6861e08633c89a77395R1759 - so it looks like an encrypted dataset with an uncompressed ARC might trigger a checksum error on a pool with 89+% compressed data under the old code. @tcaputi could you please help me here and confirm or refute this? I can't reproduce the case, so it looks like I'm wrong;
  • I looked through all zio_compress_data() usages and can conclude that if data wasn't compressed it is saved as ZIO_COMPRESS_OFF, so if we ever really need to recompress data without ashift info and compare anything, we can just compress it with a zero threshold;
  • the old code (apart from the uncompressed-ARC-plus-encryption case) doesn't recompress data for checksums, so if that case is resolved we can use this patch without a feature flag!

gmelikov avatar Apr 18 '20 11:04 gmelikov

@gmelikov Without looking at it too closely, I think I might know what's going on. I think there are two things at play which combine to make the ARC's "recompress before checksumming" work OK, but it is a pretty brittle dependency.

For some quick background, we need to store data in the L2ARC exactly the same as it is on disk in order for the decryption to work. This is why we do the re-compression here in the ARC layer.

When the data gets recompressed, the ARC uses the compression algorithm that it stored in the ARC header. This info comes from the bp when the block first enters the ARC. This patch doesn't change the way compression is done (AFAICT), it just changes the threshold at which we decide the compression is "worth it". However, no matter what that threshold is, if ZFS decides not to compress the block it sets the compression algorithm to ZIO_COMPRESS_OFF. So the encryption code will only recompress it if it was actually compressed. This code does NOT have the same check for whether or not the compression was worth it. It simply relies on the value from the bp, so it ends up doing the right thing in all these cases.
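Put differently (a hypothetical distillation of the logic above, not the actual arc.c code; the enum and helper below are made up), the L2ARC path only cares about what the bp recorded:

```c
/* Stand-in for ZFS's zio_compress values, for illustration only. */
enum compress_alg { COMPRESS_OFF, COMPRESS_LZ4, COMPRESS_ZSTD };

/*
 * Data headed to the L2ARC must match the on-disk bytes exactly, so it
 * is recompressed only when the block pointer said the block was stored
 * compressed.  No "worth it" threshold is applied here: that decision
 * was already made at the original write and is merely reproduced.
 */
static int
l2arc_needs_recompress(enum compress_alg bp_compress)
{
	return (bp_compress != COMPRESS_OFF);
}
```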

That being said, this discussion is making me wonder if QAT compression support breaks L2ARC with encryption + no compression. I suspect it does, but that is a very niche bug with very minor repercussions (the data simply fails to read from the L2ARC and is read from the main pool instead). Probably a problem for another time.

Anyway, hope that helps.

tcaputi avatar May 07 '20 10:05 tcaputi

@tcaputi Thank you for the insight, you're right: if it was compressed, we can use a zero threshold there too, so I was wrong (and glad to be)!

So this patch is ready for review; I've rebased it.

gmelikov avatar May 07 '20 10:05 gmelikov

(as an aside, the comments here talk about a >1MB recordsize a few times - I thought 1MB was the upper limit?)

adamdmoss avatar May 08 '20 22:05 adamdmoss

@adamdmoss by default it is, but technically 16MB is the upper limit now; you can just change this module parameter and voilà: https://github.com/openzfs/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_max_recordsize

It's useful if you have a really large-block workload with rare reads.

gmelikov avatar May 10 '20 21:05 gmelikov

It's still questionable whether this is correct, because your threshold is then some sort of per-sector alignment, and you are using this shift as the threshold in zio_write_compress too, which is unrelated to your L2ARC case. A standard recordsize-oriented block has a different alignment. So just using a check like destsize < srcsize should be enough, from my point of view.

@BrainSlayer I don't agree: the sector size is used as the threshold because we only save space on disk if compression saves at least one full sector, not less.

gmelikov avatar May 18 '20 09:05 gmelikov

But a recordsize of 1M (or 128KB by default) will span multiple sectors, so doing it on sector alignment is okay, but using the sector size as the maximum boundary makes no sense.

BrainSlayer avatar May 18 '20 09:05 BrainSlayer

@BrainSlayer I can't understand what you mean. With d_len = s_len - compress_threshold, for a 128K block and a 4K sector size we set it to "the compressed result must be less than 124K, or the block is written as-is"; you can see it here: https://github.com/openzfs/zfs/pull/9416/files#diff-cddcbcd27b593866d4d3a2ffbc06ff4bR134

Besides, you can add a comment on the exact line of code to be more precise.

gmelikov avatar May 18 '20 09:05 gmelikov

From my understanding a physical sector size is much smaller than a block size, so I would solve this the following way: targetsize = inputsize - (inputsize % sectorsize). That ensures it is sector-aligned, no matter how big the block size is.

BrainSlayer avatar May 18 '20 09:05 BrainSlayer

But an uncompressed block will still ignore this rule, of course. So the question is whether aligning it really brings any benefit, because if it stays uncompressed it will still cross the limit.

BrainSlayer avatar May 18 '20 09:05 BrainSlayer

Regarding your formula: 128KB - 4K = 124K, but 128K is already sector-aligned, so there is no need to subtract anything.

BrainSlayer avatar May 18 '20 09:05 BrainSlayer

@BrainSlayer It has nothing to do with alignment, though; it's a threshold to see whether the compression actually saves space. In that scenario, if the result is more than 124KB we're writing 32 4KB sectors anyway and can just write it uncompressed; if it's less, we've compressed enough to save at least one sector. The original code does the same calculation but with a higher threshold:

/* Compress at least 12.5% */
	d_len = s_len - (s_len >> 3);
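For concreteness (illustrative arithmetic, not code from the patch), the two thresholds for a 128K record on an ashift=12 (4K) pool work out as:

```c
/*
 *   old: d_len = s_len - (s_len >> 3) = 131072 - 16384 = 114688 (112K)
 *   new: d_len = s_len - (1 << 12)    = 131072 -  4096 = 126976 (124K)
 *
 * A record that compresses to, say, 120K was stored uncompressed under
 * the old rule but is stored compressed under the new one.
 */
```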

thulle avatar May 18 '20 10:05 thulle

@BrainSlayer ZFS stores data in 2^ashift-sized blocks on disk; this is distinct from both the sector size the drives expose on their external interface (SAS, SATA, whatever) and the volblock-/recordsize configured for the dataset. Storing data compressed only makes sense when it needs fewer on-disk (ashift) blocks than the uncompressed data would, as the overhead of having to decompress data on access is then at least offset by a reduced amount of data that needs to be read from the pool.
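That accounting can be illustrated with a small helper (a sketch assuming simple round-up allocation, not actual ZFS code):

```c
#include <stdint.h>

/* Number of 2^ashift on-disk blocks needed to store `bytes`. */
static uint64_t
blocks_needed(uint64_t bytes, uint64_t ashift)
{
	/*
	 * Storing compressed only pays off when
	 * blocks_needed(c_len, ashift) < blocks_needed(s_len, ashift).
	 */
	return ((bytes + (1ULL << ashift) - 1) >> ashift);
}
```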

Apart from that: this isn't IRC or Discord; please try to put your thoughts into one comment instead of several, as each comment sends a mail to every participant of this issue.

GregorKopka avatar May 18 '20 10:05 GregorKopka

@thulle Yes, but the 12.5% is a wild value. I mean, the ratio is only known after compression, so the cost of doing the compression is already paid; even a small ratio therefore still makes sense to keep. With algorithms like gzip, whose decompression is slower, such a ratio makes sense, but high-performance algorithms like LZ4 or ZSTD don't need such a hard ratio.

BrainSlayer avatar May 18 '20 11:05 BrainSlayer

@GregorKopka Yes, but a simple compressedsize < sourcesize check would already be enough; we don't need to subtract another sector size as a ratio. In our example with a 4K sector size the limit is 124K: if just 123K is stored, 1K is wasted, but that would be the same if we didn't subtract anything and just 127K were stored. We only limit the efficiency. Performance-wise it wouldn't make any difference, since algorithms like LZ4 or ZSTD are faster than physical drive access in most use cases.

BrainSlayer avatar May 18 '20 11:05 BrainSlayer

@BrainSlayer Which is exactly the reason for this pull request: to get rid of that 12.5% "wild value".

And if we use compressedsize < sourcesize, then a 128K record compressed to 127K would still use 128K on disk and we'd just waste some minor time decompressing it again on every read. We might as well save it uncompressed for slightly better performance.

thulle avatar May 18 '20 11:05 thulle

@thulle I cannot answer that question; read the description of this pull request, it's not mine.

Yes, 127KB will still waste 1KB, but that's the same waste as storing it uncompressed and it makes no difference when reading. But if we only store blocks compressed below 124KB, the probability of storing 128K blocks uncompressed is higher, which also doesn't make any sense. And sector sizes can be very different; SSDs typically have a 512-byte physical addressing size no matter how the filesystem is configured.

BrainSlayer avatar May 18 '20 12:05 BrainSlayer

@BrainSlayer With the caveat that I'm not familiar with the code, but wanting to spare the actual devs from explaining what I perceive as misunderstandings on your end:

It does make a difference. Comparing reading 128KB versus reading 128KB and running it through a decompression algorithm, the pass through decompression - however fast it is - would just be wasted CPU cycles. Thus it does make sense to only write things compressed if we actually save disk space by doing so. And since we have to write things in ashift-sized blocks, this pull request changes the threshold from that arbitrary 12.5% compression ratio to "does this compression save at least one ashift-sized block of space?" To accomplish this we compare against a threshold that's the equivalent of if (number of ashift-sized blocks compressed) < (number of ashift-sized blocks uncompressed). If ashift is 12, a 128KB record has to compress to 124KB (128KB minus one single saved 4KB block) to be stored compressed. With ashift 14 the threshold becomes 112KB (128KB minus one single saved 16KB block).

In my earlier reply I adapted to your use of "sectors" since I thought you used it to describe "ashift-sized block" - but that was a misinterpretation on my end. The 512/4096-byte sector size of the drive doesn't matter here.

thulle avatar May 18 '20 12:05 thulle

I agree that at least one sector size would make sense as the minimum compression threshold, since anything more would not create any benefit for the physical read/write effort. But consider that these days sector sizes are more or less virtual since the introduction of SSDs, which will become more and more important in the future. And decompression with most modern algorithms is much faster than reading any data from storage, so the question is whether it creates any performance benefit to go this way and make the code more complex.

BrainSlayer avatar May 18 '20 13:05 BrainSlayer

@BrainSlayer I've got your point and generally agree with it, but I think that one simple check in this exact case is still cheaper than decompressing a block without any space gain. On top of that, we still want to check that compression didn't make the block even bigger than the original, so this check is nearly free already, IMHO.

gmelikov avatar May 18 '20 13:05 gmelikov

> @BrainSlayer I've got your point and generally agree with it, but I think that one simple check in this exact case is still cheaper than decompressing a block without any space gain. On top of that, we still want to check that compression didn't make the block even bigger than the original, so this check is nearly free already, IMHO.

You don't need to check whether it's getting worse; the < sourcesize case is already caught. If the compression result is bigger than or equal to the source size, the block is stored uncompressed. That case is already handled; no need to do anything.

BrainSlayer avatar May 18 '20 19:05 BrainSlayer

Sorry, I have to correct myself: I handled this case in my zstd PR code, but it's not handled by the upstream code. So the zstd PR does handle the case where the result is bigger than the source, for all available compression algorithms.

BrainSlayer avatar May 18 '20 19:05 BrainSlayer

@ahrens thanks, agree with everything.

@BrainSlayer we should deal with all compression types here, unfortunately, but we may reconsider this logic in the future without additional feature flags.

gmelikov avatar May 18 '20 20:05 gmelikov

@gmelikov I just think a compression threshold limiter parameter is a logical consequence of this development. Why should we decide what a user wants? Give them the choice; black-box concepts are never good. We can make sane default values, like threshold 0 = auto (your way) and any other value = a percentage budget of the source size.
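A rough sketch of what such a tunable might look like (purely hypothetical, not part of this PR; the parameter and helper names are made up):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 0  -> auto: keep the compressed result only if it saves at least one
 *       2^ashift allocation block (the behaviour proposed in this PR)
 * N  -> keep the result only if it saves at least N% of the source size
 */
static unsigned int zfs_compress_threshold_pct = 0;

static size_t
compress_threshold(size_t s_len, uint64_t ashift)
{
	if (zfs_compress_threshold_pct == 0)
		return ((size_t)1 << ashift);
	return (s_len * zfs_compress_threshold_pct / 100);
}
```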

BrainSlayer avatar May 18 '20 22:05 BrainSlayer

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 79.49%. Comparing base (161ed82) to head (3268d3c). Report is 2455 commits behind head on master.

:exclamation: Current head 3268d3c differs from pull request most recent head 268a9a7. Consider uploading reports for the commit 268a9a7 to get more accurate results

@@            Coverage Diff             @@
##           master    #9416      +/-   ##
==========================================
+ Coverage   75.17%   79.49%   +4.31%     
==========================================
  Files         402      390      -12     
  Lines      128071   123354    -4717     
==========================================
+ Hits        96283    98055    +1772     
+ Misses      31788    25299    -6489     
| Flag | Coverage Δ |
|---|---|
| kernel | 79.88% <87.50%> (+1.11%) :arrow_up: |
| user | 65.20% <75.00%> (+17.78%) :arrow_up: |



codecov-commenter avatar May 19 '20 02:05 codecov-commenter