Should compressed ARC be mandatory?
Problem Statement:
Since the introduction of the Compressed ARC feature (6950 ARC should cache compressed data), it has been possible to disable the feature using the tunable: compressed_arc_enabled=0
It is unclear how many users operate systems with the compressed ARC feature disabled; however, it clearly gets much less testing than the default case. Over time, the assumption that the ARC is compressed, and dealing with the corner cases when it is not, has increased the complexity of the code base. It often stands in the way of additional new features.
A number of developers have expressed a desire to retire the ability to disable the Compressed ARC.
Pathological Behaviour when Compressed ARC is Disabled:
- On illumos and FreeBSD: Every read from the L2ARC requires the data be re-compressed to validate the checksum
- On Linux: Every write to the L2ARC requires the data be re-compressed so that its checksum will match when it is later read back
- On Linux: Every read from the L2ARC requires the data to be decompressed
- With Native Crypto: Authenticating the data requires re-compression to verify the MAC
- With Intel QuickAssist, and other offload cards, the implementation of GZIP is "decompress compatible", meaning the software gzip implementation can read data compressed with QAT-gzip, but the compressed output is very often not bit-for-bit identical. If a pool is moved to a system without QAT, or if QAT is temporarily disabled, this will result in L2ARC checksum errors when blocks are recompressed and the resulting checksum does not match the one in the block pointer.
- A similar case exists with the forthcoming ZSTD compression feature. In the future, if the version of the ZSTD algorithm is upgraded to take advantage of improvements in the compression ratio, the L2ARC recompression will result in a checksum mismatch.
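The QAT and ZSTD cases both come down to the same fact: two correct implementations (or versions) of a compressor can emit different bytes for the same input. A minimal Python sketch, using two zlib compression levels as a stand-in for two gzip implementations (illustrative only, not how ZFS invokes its compressors):

```python
import zlib

data = b"zfs compressed arc " * 512  # highly compressible sample block

# Two conforming deflate encoders; here, different levels stand in for
# software gzip vs. a hardware-offload (QAT-style) implementation.
a = zlib.compress(data, 1)
b = zlib.compress(data, 9)

# Both decompress to identical logical data...
assert zlib.decompress(a) == zlib.decompress(b) == data

# ...but the compressed bytes differ, so a checksum taken over the
# compressed representation (as the ZFS block pointer's is) will not
# match after recompression by the other implementation.
assert a != b
print("compressed sizes:", len(a), len(b))
```

This is why recompressing on the L2ARC write path is only safe when the exact same compressor implementation and settings are available.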
Latent bugs caused by the compressed-ARC-disabled case being under-tested:
- 9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value (receiving a compressed stream on a system with compressed_arc disabled)
Counter Arguments:
- A large working set of frequently accessed blocks will overwhelm the dbuf cache and spend a lot of time decompressing blocks cached in the ARC. Expanding the size of the dbuf cache results in a lot of double buffering (both the compressed and uncompressed version of the block).
Further Considerations:
We likely need to follow the not-yet-established OpenZFS Deprecation Policy, to give users warning that this feature is going away, and to give those with use cases for disabling the compressed ARC a chance to make those use cases known to us.
- See discussion of the deprecation policy in: Deprecate dedup send/receive
I've been mulling this over for a couple days, I think your "pathological" reasons are quite compelling. I honestly can't think of a reason for turning it off, unless you're not using compression for the entire pool since we're going to be spending time decompressing blocks at some point along the chain no matter what.
I also just realized that we never put compressed_arc_enabled into the man pages.
I think compressed_arc_enabled=0 has valid use cases (and performance impacts), thus should not be removed.
ARC compression trades CPU for RAM and as every read will eventually lead to the data being decompressed it can very well reduce performance (the TPS metric) in some scenarios, caused by the overhead for repeated decompression of one and the same data block - which might well offset any benefit gained from being able to fit more blocks into the available RAM.
The 1st issue (illumos and FreeBSD) could be avoided by compressing the data, when evicting it to L2ARC, with the same algorithm it was originally stored with on the pool - which is, IIRC, what ZoL is doing in the 2nd issue (leading to more data fitting into L2ARC).
The 3rd issue is IMHO a non-issue, as the data has to be decompressed anyway: it was requested for a read, else it wouldn't be fetched from L2ARC in the first place.
The 4th issue states in the linked comment:
The performance overhead of this will be relatively low [...]
Regarding offload cards (and improved software implementations) arriving at a compressed representation that differs from the corresponding/old software implementation: this breaks at least dedup and nop-write (in the sense of not finding a match when an accelerator is added/removed - not in terms of data loss), and should get a clear warning in the documentation.
Bottom line
The listed issues could all (possibly except the encryption case, haven't wrapped my head around that yet) be addressed by tracking the original on-disk block checksum through the ARC and giving the L2ARC header an on-L2-disk checksum (so reads can be verified against the raw data coming from the cache drive).
This should decouple ARC compression from on-disk (allowing to compress even blocks stored with compression=off inside ARC, and the other way around). That might increase the size of the (L2)ARC headers a bit, but should remove any problems related to de-/recompression while avoiding the potential performance penalty from repeated decompression of MRU/MFU blocks (as ARC could decompress these once and drop the compressed version, so it won't need to hold onto both).
This would make it completely irrelevant whether a clean block in ARC has been decompressed, (re-)compressed (either for a trip through L2ARC or just to save some space when the block doesn't get accessed that often anymore) or has been sitting there as on-disk (compressed or not) from the very beginning.
And possibly enable us to add knobs to tune ARC compression behaviour per dataset, later.
I think compressed_arc_enabled=0 has valid use cases (and performance impacts), thus should not be removed. ARC compression trades CPU for RAM and as every read will eventually lead to the data being decompressed it can very well reduce performance (the TPS metric) in some scenarios, caused by the overhead for repeated decompression of one and the same data block - which might well offset any benefit gained from being able to fit more blocks into the available RAM.
This is not exactly true. There is a dbuf cache that avoids decompressing the same block repeatedly if it is accessed frequently enough. Depending on the compression algorithm, the cost to decompress can be quite low.
However, yes, as I mentioned in the original post, if you have a very large working set, it could still be useful, as expanding the dbuf cache to compensate results in excessive double caching.
The 1st issue (illumos and FreeBSD) could be avoided by compressing the data, when evicting it to L2ARC, with the same algorithm it was originally stored with on the pool - which is, IIRC, what ZoL is doing in the 2nd issue (leading to more data fitting into L2ARC).
The only reason ZoL is different is that the way the L2ARC works was changed for the ZFS Native Crypto work, which has not been ported to FreeBSD and illumos yet.
The L2ARC is supposed to boost performance; having to use CPU to compress blocks before writing them to the L2ARC works against that, and compression is much more expensive than decompression. Neither solution is ideal.
The same algorithm may not be available, as the point about QAT (Intel QuickAssist crypto/compression accelerator) shows.
The 3rd issue is IMHO a non-issue, as the data has to be decompressed anyway: it was requested for a read, else it wouldn't be fetched from L2ARC in the first place.
An argument could be made that the L2ARC should be a cache of the in-memory representation, not the on-disk representation. In the case where the compressed_arc is enabled, they are the same. When it is disabled, doing extra work in one or both directions reduces the usefulness of the cache.
The 4th issue states in the linked comment:
The performance overhead of this will be relatively low [...]
Regarding offload cards (and improved software implementations) arriving at a compressed representation that differs from the corresponding/old software implementation: this breaks at least dedup and nop-write (in the sense of not finding a match when an accelerator is added/removed - not in terms of data loss), and should get a clear warning in the documentation.
Bottom line
The listed issues could all (possibly except the encryption case, haven't wrapped my head around that yet) be addressed by tracking the original on-disk block checksum through the ARC and giving the L2ARC header an on-L2-disk checksum (so reads can be verified against the raw data coming from the cache drive).
Expanding the size of every ARC header by 32 bytes would be a huge cost. Storing the checksum with the data on the L2ARC risks the checksum being corrupted along with the data, although maybe that is less of a concern, since the probability of a checksum being corrupted in a way that still matches the data is low.
The thought of removing this tunable came out of a discussion of how to extend the L2 header in the ARC to contain the uncompressed data checksum, since the blockpointer checksum is of the compressed version. There was a strong desire not to increase the size of the L2 header, since there may be a very large number of them in memory if the L2ARC device is large.
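The scale of that header cost is easy to underestimate. A back-of-envelope sketch, with illustrative numbers I am assuming for the example (a 1 TiB cache device, an 8 KiB average cached block, a 32-byte checksum; these are not actual OpenZFS struct sizes):

```python
# Rough cost of adding a 32-byte checksum to every in-memory L2ARC
# header (all sizes below are assumptions for illustration).
l2arc_size = 1 << 40          # assume a 1 TiB L2ARC device
avg_block  = 8 * 1024         # assume 8 KiB average cached block
checksum   = 32               # bytes added per header

headers   = l2arc_size // avg_block   # number of L2-only headers
extra_ram = headers * checksum        # extra RAM consumed, in bytes

print(f"{headers:,} headers -> {extra_ram / 2**30:.0f} GiB extra RAM")
# -> 134,217,728 headers -> 4 GiB extra RAM
```

Several gigabytes of RAM spent purely on checksums helps explain the strong desire not to grow the L2 header.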
This should decouple ARC compression from on-disk (allowing to compress even blocks stored with compression=off inside ARC, and the other way around). That might increase the size of the (L2)ARC headers a bit, but should remove any problems related to de-/recompression while avoiding the potential performance penalty from repeated decompression of MRU/MFU blocks (as ARC could decompress these once and drop the compressed version, so it won't need to hold onto both).
This doesn't really make sense to me. I definitely would not want the ARC trying to compress blocks read from disk that were not compressed on disk, because of the latency, and because the most likely reason a block on disk is not compressed is that compression failed to yield sufficient gains.
This would make it completely irrelevant whether a clean block in ARC has been decompressed, (re-)compressed (either for a trip through L2ARC or just to save some space when the block doesn't get accessed that often anymore) or has been sitting there as on-disk (compressed or not) from the very beginning.
One of the things that makes the compressed ARC better than most other memory compression features out there, is that it does not spend time and memory trying to compress data at the worst possible time, when there is demand for memory.
And possibly enable us to add knobs to tune ARC compression behaviour per dataset, later.
That would be a big change.
The L2ARC is supposed to boost performance; having to use CPU to compress blocks before writing them to the L2ARC works against that, and compression is much more expensive than decompression. Neither solution is ideal.
L2ARC is intended to boost read performance, yes.
An argument could be made that the L2ARC should be a cache of the in-memory representation, not the on-disk representation. In the case where the compressed_arc is enabled, they are the same. When it is disabled, doing extra work in one or both directions reduces the usefulness of the cache.
I see the usefulness of L2ARC in being backed by a device with far lower latency (more read IOPS) than the average pool drive, so fetching data from it has lower latency than from the main pool - decompression cost should play no big role in comparison (and it likely also applies when fetching the block from the pool). Plus, being able to fetch from L2 doesn't impact the IOPS budget of the pool.
Expanding the size of every ARC header by 32 bytes would be a huge cost. Storing the checksum with the data on the L2ARC risks the checksum being corrupted along with the data, although maybe that is less of a concern, since the probability of a checksum being corrupted in a way that still matches the data is low. The thought of removing this tunable came out of a discussion of how to extend the L2 header in the ARC to contain the uncompressed data checksum, since the blockpointer checksum is of the compressed version. There was a strong desire not to increase the size of the L2 header, since there may be a very large number of them in memory if the L2ARC device is large.
I can follow the argument about header size, please disregard my comment in that direction as it wasn't well thought out.
Regarding self-checksumming: it's OK for other stuff (among others: uberblocks), so it should be fine to store the checksum with the data - possibly enriched with some more information like the original DVA and birth TXG (so one can't accidentally read a stale block with an intact self-checksum instead of the one we actually want, but I have no idea if such extra checks would actually be needed).
This doesn't really make sense to me. I definitely would not want the ARC trying to compress blocks read from disk that were not compressed on disk, because of the latency, and because the most likely reason a block on disk is not compressed is that compression failed to yield sufficient gains.
That depends, I guess. My line of thought was the following:
When a block is written to disk (with compression!=off) it is, according to https://github.com/zfsonlinux/zfs/blob/cc99f275a28c43fe450a66a7544f73c4935f7361/module/zfs/zio.c#L1589, only stored compressed if the result needs fewer physical blocks (of the smallest-ashift device, with the tail zeroed) than the uncompressed version, and according to https://github.com/zfsonlinux/zfs/blob/c3bd3fb4ac49705819666055ff1206a9fa3d1b9e/module/zfs/zio_compress.c#L123 only when the compression yields at least a 12.5% saving.
So looking at an ashift=12 pool and one 8k block of a zvol that could compress to, say, 4.1k - this block would be stored uncompressed on-disk and thus need 8k space when read back into ARC.
Now, if ARC could compress that buffer when it isn't used frequently enough, it could (if I've understood correctly) shrink the uncompressed 8k buffer to 4.5k (9 blocks of SPA_MINBLOCKSIZE, the granularity at which ARC seems to cut its data buffers) - it wouldn't need to abide by the rules for on-disk compression, as that buffer would never be written back to disk (thanks to CoW, unless I'm missing something about how resilver works). A 43% reduction in ARC space use (granted: I constructed a quite optimal case to make the point) could be an interesting saving, don't you think?
Same goes for evicting into L2, even more so as L2 is byte-addressed and written wrap-around (not free-space-mapped and block-addressed like the rest of the pool), so the worker thread that evicts to L2 could well batch several writes together (removing any real alignment need for all but the first block of the write). The compressed 4.5k ARC buffer would then only need the actually used 4.1k (plus a little for the self-checksum discussed above) to go onto the L2 drive - gaining us another ~9% (compared to storing the compressed buffer, or ~48% compared to evicting the uncompressed block to L2).
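The arithmetic in this example can be checked with a short sketch (sizes and the sector-rounding rule are taken from the example above; SPA_MINBLOCKSIZE = 512 as stated, and ~4.1k is taken as 4200 bytes to keep the numbers whole):

```python
import math

# Illustrative sizes from the example above (not measured values).
lsize = 8 * 1024          # logical 8k zvol block
csize = 4200              # compresses to ~4.1k
ashift = 12               # 4k on-disk sectors
SPA_MINBLOCKSIZE = 512    # granularity the ARC cuts buffers at

def round_up(n, align):
    return math.ceil(n / align) * align

# On-disk rule: store compressed only if it needs fewer 4k sectors.
# 4.1k still occupies two sectors, so the block stays uncompressed (8k).
on_disk = csize if round_up(csize, 1 << ashift) < lsize else lsize

# In-ARC, compressed at 512-byte granularity: 9 * 512 = 4608 bytes.
arc_compressed = round_up(csize, SPA_MINBLOCKSIZE)

print(f"ARC saving vs uncompressed buffer: {1 - arc_compressed / lsize:.0%}")
print(f"L2 saving vs compressed buffer:    {1 - csize / arc_compressed:.1%}")
print(f"L2 saving vs uncompressed block:   {1 - csize / lsize:.1%}")
```

This reproduces the ~43-44%, ~9%, and ~48% figures quoted in the two paragraphs above.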
Currently we have a background thread that feeds the L2ARC - couldn't we have something like it compress ARC buffers that have fallen out of grace (instead of directly evicting them, as the L2 feeder does, or dropping them)? And should we be able to do that, couldn't the ARC keep frequently used buffers decompressed (dropping the compressed copy) to avoid double caching via dbufs (in case I got that correctly)?
One of the things that makes the compressed ARC better than most other memory compression features out there, is that it does not spend time and memory trying to compress data at the worst possible time, when there is demand for memory.
The tricky part could be to get good behaviour in case of memory pressure, with that I agree.
But this all is, for the moment, a rough idea I just got from digging through the source. I haven't wrapped my head around all of it yet, so please correct me if I took a wrong turn somewhere.
But should I be right: even with the CPU overhead of background-compressing ARC buffers, the space saving it might give could make it well worth spending a currently free bit in arc_flags_t on decoupling ARC compression from on-disk.
That would be a big change.
Possibly. But maybe worth it.
ISTM you're advocating removing an ARC feature because it interferes with L2ARC. However, relatively few people use L2ARC, while everyone uses ARC.
As for recompressing, AIUI, the blocks are uncompressed when used, but the compressed blocks remain, so there is no recompression.
The theoretical arguments both ways are interesting. However, practical experience can help us weigh the importance of the pros and cons.
@GregorKopka and @richardelling, are you using compressed_arc_enabled=0? If so could you elaborate on what makes this compelling for your use case?
@richardelling
relatively few people use L2ARC, while everyone uses ARC.
But (we think) almost nobody sets compressed_arc_enabled=0. Certainly fewer folks than are using L2ARC.
As for recompressing, AIUI, the blocks are uncompressed when used, but the compressed blocks remain, so there is no recompression.
With compressed_arc_enabled=0, the blocks are stored in memory uncompressed, so they are not uncompressed when used. The compressed version is not available in memory, which is why it needs to be recompressed when writing it to the L2ARC.
I agree that almost nobody sets compressed_arc_enabled=0. Based on many studies of tunables over the years, few people actually tune any of them. When they do tune, it is because of what we used to call "/etc/system viruses" or "I read it on the internet".
To frame the discussion, is the following table correct?
| compress_arc_enabled | ARC contents | ARC efficiency | L2ARC impact |
|---|---|---|---|
| 0 | only uncompressed block in ARC | reducing ARC efficiency | blocks must be recompressed to send to L2ARC, reducing L2ARC efficiency |
| 1 (default) | both compressed and uncompressed blocks in ARC | possibly reducing number of blocks eligible to be in the ARC and therefore reducing ARC efficiency | easy to pass compressed block to L2ARC, improving L2ARC efficiency |
possibly reducing number of blocks eligible to be in the ARC and therefore reducing ARC efficiency
Having compressed_arc turned on does not change any of the eligibility criteria. It just means that compressed blocks take less space than if they were stored in the ARC uncompressed, so you can fit more data in the ARC, increasing its efficiency. There may, however, be a small performance impact from decompressing the block repeatedly when reading it from the ARC, rather than just once as it is loaded into the ARC.
@richardelling I'm not sure exactly what you mean by "ARC efficiency". All blocks are eligible to be in the ARC, regardless of compress_arc_enabled, though fewer may fit in the ARC with compress_arc_enabled=0. Here's an updated table that matches my understanding, with "efficiency" a proxy for "ARC hit rate". Also note that the impacts only apply to data that's compressed on disk. Data that's stored uncompressed on disk is handled the same regardless of compress_arc_enabled (no compression / decompression).
| compress_arc_enabled | ARC contents | ARC efficiency | L2ARC impact |
|---|---|---|---|
| 0 | uncompressed in ARC | reduced ARC efficiency (ARC stores less blocks because they are bigger in memory) | blocks must be recompressed to send to L2ARC, increasing L2ARC CPU usage |
| 1 (default) | matches what's on disk (compressed or uncompressed) | good ARC efficiency (ARC stores maximum number of blocks) | easy to pass compressed block to L2ARC, negligible L2ARC CPU usage |
yeah, I was afraid the "efficiency" word would cause confusion. What I mean is the ARC is a constrained resource containing a limited number of bytes. With compressed ARC, each compressed block consumes its lsize + (compressed) psize, for some period of time. Clearly, for higher compression ratios, the efficiency is better. But my argument is a rathole... don't go there now.
For the expected common case where compression ratios are high, the new framing is better.
With compressed ARC, each compressed block consumes its lsize + (compressed) psize, for some period of time
I think you're talking about the need to store the uncompressed version in memory while it's being accessed. compress_arc_enabled=0 has no impact on this. Even if the data is stored uncompressed in the ARC, an additional in-memory copy is made while it is being accessed, due to ABD. Additionally, this memory may continue to be used after the access completes, due to the dbuf cache. We think of this space as being owned by the dbuf cache, not the ARC, because the dbuf layer controls how much memory is used by it, and the eviction policy.
Currently running no production systems with compressed_arc_enabled=0, though I had experimented with it and vaguely remember having seen slightly better performance when booting multiple diskless clients (backed by cloned zvols exported over iSCSI) in parallel - basically the 'repeated decompress' case. For reasons long forgotten by now, it hasn't been kept disabled.
I still suspect though (assuming I understood the code correctly, and my view - that on-disk data is read and stored in ARC in on-disk 2^ashift-sized blocks (of the vdevs it comes from) - is somewhat correct) that (re-)compressing the data (after being delivered to the DMU, for which it needs to be decompressed anyway) at SPA_MINBLOCKSIZE granularity would lead to a more effective ARC, as it could store more data compared to storing the verbatim on-disk representation as it comes from the drives. Especially for data from vdevs with higher ashift (12 or 13) and/or lower record-/volblocksize.
I still suspect though (assuming I understood the code correctly, and my view - that on-disk data is read and stored in ARC in on-disk 2^ashift-sized blocks (of the vdevs it comes from) - is somewhat correct) that (re-)compressing the data (after being delivered to the DMU, for which it needs to be decompressed anyway) at SPA_MINBLOCKSIZE granularity would lead to a more effective ARC, as it could store more data compared to storing the verbatim on-disk representation as it comes from the drives. Especially for data from vdevs with higher ashift (12 or 13) and/or lower record-/volblocksize.
In the case where we need to re-compress before writing to the L2ARC, it must be compressed in exactly the same way as the on-disk version, or the checksum will not match. The L2ARC used to have its own separate checksum, since it was usually compressed while the in-ARC version was not, but this was removed to make the L2ARC use a lot less RAM per block that is cached there.
L2 data could self-checksum on-disk - or is the checksum needed after the data has been read back into RAM?
- Data is compressed on disk
- It is read from disk, and the checksum is compared
- It is then stored in the ARC (compressed or uncompressed, based on the setting)
- If it nears the tail end of the cache, it is written to the L2ARC (if the ARC copy is uncompressed, it is recompressed with the same settings, so the checksum will match later)
- When it is read back from the L2ARC, the checksum is compared again, against the version in the ARC header, which is from the block pointer (the checksum of the original on-disk version)
The L2ARC does not store its own checksum (it used to, but this was a waste of memory).
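The steps above can be sketched as a toy model (sha256 stands in for the block-pointer checksum, zlib for the pool's compressor; none of these names correspond to real ZFS structures):

```python
import hashlib
import zlib

def bp_checksum(buf: bytes) -> bytes:
    # Stand-in for the block-pointer checksum, taken over on-disk bytes.
    return hashlib.sha256(buf).digest()

# Write path: the block is compressed on disk; the BP checksum covers
# those compressed bytes.
logical = b"some very compressible block contents " * 64
on_disk = zlib.compress(logical)
checksum_in_bp = bp_checksum(on_disk)

# With compressed_arc_enabled=0, the ARC holds only the uncompressed bytes.
arc_buf = zlib.decompress(on_disk)

# Eviction to L2ARC must therefore recompress with the exact same
# compressor and settings, so the bytes (and thus the checksum) match
# when the block is read back and verified against the ARC header.
l2_bytes = zlib.compress(arc_buf)
assert bp_checksum(l2_bytes) == checksum_in_bp

# A different compressor implementation (the QAT case earlier in the
# thread) would produce valid but different bytes here, and this
# verification would fail on read-back.
```

The check succeeds here only because the same zlib library and default level are used on both paths, which is exactly the constraint the recompression scheme imposes.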
I know that it currently works that way; that doesn't answer my question of whether reading back from L2 needs to compare against the checksum in the L1 header, or whether the data being read couldn't be verified through an L2 on-disk header (self-checksum, DVA, TXG) that would only 'waste' space on the L2 drives. That header could well be discarded (from RAM) after the read is verified and the retrieved payload is decompressed (which needs to be done anyway, else the L2 read wouldn't have happened in the first place).
This was while thinking about the 'counter arguments' part: an ARC behaviour where the hot data is (and stays) uncompressed to avoid constant decompression, while the cooling data gets compressed (using otherwise idle CPU time) to squeeze as much as possible into the available ARC space, could be way more effective than the current approach. Should that be a derail... sorry.
It appears that utilizing L2ARC with zfs_compressed_arc_enabled = 0 results in a kernel panic - #8454
For what it's worth, seeing how a not-often-used (and so not well tested) codepath (i.e. compressed ARC off) caused a kernel panic, I agree with the premise of this issue: compressed ARC should be mandatory. This would simplify code management, significantly reducing the possibility of messing up when making other changes.
This was discussed at the Feb 26 OpenZFS meeting (link below), and the input was positive. So I think we should move forward with this proposal. @allanjude would you like to open a PR? https://www.youtube.com/watch?v=EXstK9ckcZQ
I'll get started on it this week.
Compressed ARC has a huge (4x) performance penalty in some cases: https://blog.lexa.ru/2019/05/10/zfs_vfszfscompressed_arc_enabled0.html Please don't remove control of this!
On Fri, May 10, 2019 at 06:57:40AM -0700, ptx0 wrote:
better to open an issue and resolve your performance than it is to make them leave compressed arc tunable in and not get zstd compression. disabling compressed arc should not be a requirement for performance.
Are you kidding? Or are you really ready to resolve this issue by donating powerful hardware?
@slw Compressed ARC should not have a huge performance impact compared to uncompressed ARC. It sounds like you have a workload where that is not the case. We would like to investigate and fix that. Could you open a separate issue describing the problem you're having with compressed ARC?
On Fri, May 10, 2019 at 12:43:58PM -0700, Matthew Ahrens wrote:
@slw Compressed ARC should not have a huge performance impact compared to uncompressed ARC. It sounds like you have a workload where that is not the case. We would like to investigate and fix that. Could you open a separate issue describing the problem you're having with compressed ARC?
This is not my setup; it is the setup of Alex Tutubalin. Can you contact [email protected] directly? English is OK.
@ahrens I don't have a case with a major performance difference, but I can see a reproducible ~10% difference on my Intel i5-5200U with fio's buffer_compress_percentage=50:
- zfs_compressed_arc_enabled=1 : ~30k IOPS
- zfs_compressed_arc_enabled=0 : ~34k IOPS
I didn't think it would be a huge difference, but after my brief tests I'm against mandatory compression.
Reproducer:
# zfs get compression,recordsize,primarycache rpool/home/gmelikov/fio
NAME PROPERTY VALUE SOURCE
rpool/home/gmelikov/fio compression lz4 inherited from rpool
rpool/home/gmelikov/fio recordsize 128K default
rpool/home/gmelikov/fio primarycache all default
# echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
$ fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=read --bs=128k --direct=0 --size=512M --numjobs=2 --runtime=48 --group_reporting -time_based --buffer_compress_percentage=50
$ rm ./*.0
# echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
$ fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=read --bs=128k --direct=0 --size=512M --numjobs=2 --runtime=48 --group_reporting -time_based --buffer_compress_percentage=50
- zfs_compressed_arc_enabled=1 fio output:
randwrite: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1
...
fio-2.16
Starting 2 processes
Jobs: 2 (f=2): [R(2)] [100.0% done] [3712MB/0KB/0KB /s] [29.7K/0/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=2): err= 0: pid=20250: Fri May 10 23:43:22 2019
read : io=177351MB, bw=3694.8MB/s, iops=29557, runt= 48001msec
slat (usec): min=44, max=5843, avg=65.92, stdev=14.33
clat (usec): min=0, max=1988, avg= 1.10, stdev= 1.84
lat (usec): min=45, max=5872, avg=67.03, stdev=14.57
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 2],
| 99.00th=[ 2], 99.50th=[ 3], 99.90th=[ 5], 99.95th=[ 6],
| 99.99th=[ 20]
lat (usec) : 2=90.41%, 4=9.14%, 10=0.41%, 20=0.03%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
lat (msec) : 2=0.01%
cpu : usr=2.42%, sys=97.39%, ctx=1097, majf=0, minf=79
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1418807/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=177351MB, aggrb=3694.8MB/s, minb=3694.8MB/s, maxb=3694.8MB/s, mint=48001msec, maxt=48001msec
- zfs_compressed_arc_enabled=0 fio output:
randwrite: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1
...
fio-2.16
Starting 2 processes
Jobs: 2 (f=2): [R(2)] [100.0% done] [4242MB/0KB/0KB /s] [33.1K/0/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=2): err= 0: pid=15436: Fri May 10 23:39:59 2019
read : io=204526MB, bw=4260.9MB/s, iops=34086, runt= 48001msec
slat (usec): min=36, max=10262, avg=56.93, stdev=21.23
clat (usec): min=0, max=2163, avg= 1.11, stdev= 1.83
lat (usec): min=37, max=10266, avg=58.04, stdev=21.38
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 3], 99.50th=[ 4], 99.90th=[ 5], 99.95th=[ 6],
| 99.99th=[ 19]
lat (usec) : 2=88.87%, 4=10.61%, 10=0.48%, 20=0.03%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
lat (msec) : 4=0.01%
cpu : usr=2.70%, sys=97.01%, ctx=1233, majf=0, minf=82
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1636208/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=204526MB, aggrb=4260.9MB/s, minb=4260.9MB/s, maxb=4260.9MB/s, mint=48001msec, maxt=48001msec
Enabled (!) compression can be devastating for performance on some systems. Here is an example (sorry, it is a Russian blog): https://blog.lexa.ru/2019/05/10/zfs_vfszfscompressed_arc_enabled0.html
Bottom line of that post: on FreeBSD with ARC compression enabled (the default), some operations (backup checksum verification) run at 600-800 Mbit/s, while with compression disabled (and on old versions of FreeBSD) they run at 2-2.5 Gbit/s.
If you want to make compression always-on, you should fix such pathological cases first.
In my own experience (on FreeBSD again), compression on a server with media files (effectively uncompressible) makes ARC efficiency significantly lower in terms of hit rate, as the difference between Wired memory and ARC becomes larger and the ARC becomes effectively SMALLER for the same amount of physical RAM.
Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary?
On Sat, May 11, 2019 at 07:36:40AM -0700, Richard Elling wrote:
Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary?
Metadata is still compressible. High metadata demand plus a low-end CPU can cause a high performance impact.
On Sat, May 11, 2019 at 07:36:40AM -0700, Richard Elling wrote: Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary? Metadata is still compressible. High metadata demand plus a low-end CPU can cause a high performance impact.
People understand that "compressed ARC" never actually compresses anything, right?
It just defers decompressing data that is already compressed on disk, until each time it is actually read from the ARC, so it can store the compressed version in the ARC and maintain a higher cache hit ratio.