
ZVOL write IO merging not sufficient

Open samuelxhu opened this issue 6 years ago • 67 comments

System information

Type Version/Name
Distribution Name ZFS on Linux
Distribution Version CentOS 7
Linux Kernel 3.10
Architecture x86
ZFS Version 0.6.5.X, 0.7.X, 0.8.X
SPL Version 0.6.5.X, 0.7.X, 0.8.X

Describe the problem you're observing

Before 0.6.5.X (e.g. 0.6.3-1.3 or 0.6.4.2), ZoL used the standard Linux block device layer for ZVOLs, so one could use a scheduler, deadline or others, to merge incoming IO requests. Even with the simplest noop scheduler, contiguous IO requests could still be merged if they were sequential.

Things changed from 0.6.5.X on: ryao rewrote the ZVOL block layer and disabled request merging at the ZVOL layer, on the grounds that the DMU does IO merging. However, it seems that DMU IO merging either does not work properly or is not sufficient from a performance point of view.

The problem is as follows. ZVOL has a volblocksize setting, and in many cases, e.g. for hosting VMs, it is set to 32KB or so. When an IO request is smaller than the volblocksize, read-modify-writes (RMW) will occur, leading to performance degradation. A scheduler, such as deadline, is capable of sorting and merging IO requests, thus reducing the chance of RMW.

Describe how to reproduce the problem

Create a not-so-big ZVOL with a volblocksize of 32KB and use FIO to issue a single-threaded sequential write workload with a 4KB block size. After a while (once the ZVOL is filled with some data), either "iostat -mx 1 10" or "zpool iostat 1 10" shows a lot of read-modify-writes. Note that at the beginning of the writes there will be little or no RMW, because the ZVOL is almost empty and ZFS can intelligently skip reading zeros.

In contrast, use FIO to issue a sequential write workload of size 32KB, 64KB, or even larger: no matter how long you run the workload, there is no RMW.
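
A minimal reproduction sketch along these lines (pool name, zvol name, and sizes are illustrative; the fio parameters can of course be varied):

# create a test zvol with a 32KB volblocksize
zfs create -V 10G -o volblocksize=32k tank/rmwtest

# single-threaded sequential 4KB writes; once the zvol contains data, reads start to appear
fio --name=seq4k --filename=/dev/zvol/tank/rmwtest --rw=write --bs=4k --direct=1 --ioengine=libaio --iodepth=1 --runtime=120 --time_based

# in another terminal: read traffic during a pure-write workload is the RMW signature
zpool iostat -v tank 1

# repeat with --bs=32k or --bs=64k; the read column should stay at or near zero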

Apparently the IO merging logic for ZVOLs is not working properly. Either re-enabling the block device scheduler choice of deadline or noop, or fixing the broken IO merging logic in the DMU, should fix this performance issue.

Include any warning/errors/backtraces from the system logs

samuelxhu avatar Mar 02 '19 07:03 samuelxhu

ZVOL currently does not even support noop

samuelxhu avatar Mar 02 '19 07:03 samuelxhu

The default value of nomerges is 2. I will try setting it to 0, re-test the cases, and report back soon.

Today I can confirm that setting nomerges to 0 has no actual effect
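
For reference, a sketch of how the setting can be changed (zd0 stands for the zvol's block device; adjust the name as needed):

# check the current merge policy (2 = merging disabled, 0 = merging allowed)
cat /sys/block/zd0/queue/nomerges

# allow merging again
echo 0 > /sys/block/zd0/queue/nomerges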

samuelxhu avatar Mar 02 '19 10:03 samuelxhu

Can somebody who is familiar with the ZFS DMU code investigate the IO merging logic inside the DMU a bit? Perhaps a better solution can be found there.

I just wonder why the IO merging in the DMU is not working in this simple case (a single thread of 4KB consecutive writes).

samuelxhu avatar Mar 03 '19 02:03 samuelxhu

@samuelxhu @kpande From how I understand it, the problem is reproducible even without a zvol: if you overwrite a large-recordsize (i.e. 128k) file with 4k writes, you will encounter heavy read/modify/write. The problem does not seem related to the aggregator not doing its work; rather, it stems from the fact that on a partial-recordsize write the entire record must be copied into memory. For example:

  • a 32M sized, 128K recordsize file exists. A sequential 4k workload is generated by issuing something as simple as dd if=/dev/urandom of=<testfile> bs=4k count=1024 conv=notrunc,nocreat;

  • the previous command accumulates writes in memory - nothing is written until txg_sync;

  • by monitoring I/O on another terminal we can see that, while no writes are issued, significant read activity happens. This is because each 4k write that touches a new 128K chunk forces that whole chunk to be read into memory into the ABD (ARC buffer data) structure. In other words: the first 4k write hitting the file at offset 0 causes the entire recordsized chunk (128K) to be copied into memory, before any other 4k writes are issued and regardless of whether those writes will completely overwrite that chunk.

  • at transaction flush, the DMU aggregates these individual 4k writes into far fewer 128K ones. This can be checked by running zpool iostat -r 1 on another terminal.

So, the r/m/w behavior really seems intrinsically tied to the ARC/checksumming, rather than depending on the aggregator not doing its work.

However, in older ZFS versions (<= 0.6.4), zvols were somewhat immune to this problem. This stems from the fact that, unless doing direct I/O, zvols did not bypass the standard Linux pagecache. In the example above, running dd if=/dev/random of=/dev/zd0 bs=4k count=1024 will place all new data into the pagecache rather than in ZFS's own ARC. It is at this point, before "passing down" the writes to the ARC, that the Linux kernel has a chance to coalesce all these 4k writes into bigger ones (up to 512K by default). If it succeeds, the ARC will only see 128K+ sized requests, which cause no r/m/w. This, however, is not without drawbacks: double caching all data in the pagecache leads to much higher pressure on the ARC, causing lower hit rates and higher CPU load. Bypassing the pagecache with direct I/O will instead cause r/m/w.
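
To make the difference concrete, a rough sketch (device name and counts are illustrative):

# buffered writes: the pagecache can coalesce these 4k requests before they reach the ARC
dd if=/dev/urandom of=/dev/zd0 bs=4k count=8192

# direct writes: the pagecache is bypassed, so each 4k request reaches the zvol as-is
dd if=/dev/urandom of=/dev/zd0 bs=4k count=8192 oflag=direct

# in another terminal, watch for the read traffic that signals r/m/w
zpool iostat 1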

On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely (side note: in recent Linux kernels, none is not a noop alias anymore; rather, it really means no scheduler is in use. I also tried setting nomerges to 0, with no change in I/O speed or behavior). This increased performance for the common case (zvol with direct I/O), but prevented any merging in the pagecache.

For what it is worth, I feel the current behavior is the right one: in my opinion, zvols should not behave too differently from datasets. That said, this precludes a possible optimization (i.e. using the pagecache as a sort of "first stage" buffer where merging can be done before sending anything to ZFS).

shodanshok avatar Mar 03 '19 13:03 shodanshok

Sorry, I disagree that the current behavior of ZVOL is the right one. There are many use cases where a zvol should behave like a normal block device, e.g. as backend storage for FC and iSCSI, hosting VMs, etc. In those use cases, a scheduler such as deadline/noop can merge smaller requests into bigger ones, thereby reducing the likelihood of RMW.

AND using a scheduler to merge requests does not impose a big burden on memory usage!

samuelxhu avatar Mar 03 '19 14:03 samuelxhu

@kpande Just to confirm that setting /sys/devices/virtual/block/zdXXX/queue/nomerges to 0 does not cause contiguous IO requests to merge. It seems all kinds of IO merging are, unfortunately, disabled by the current implementation.

Ryao's original intention was to avoid double merging and let the DMU do IO merging. It is mysterious that the DMU does not do the merging correctly either.

samuelxhu avatar Mar 03 '19 14:03 samuelxhu

@samuelxhu I think the rationale for the current behavior is that you should avoid double caching by using direct I/O to the zvols; in this case, the additional merging done by the pagecache is skipped anyway, so it is better to also skip any additional processing done by the I/O scheduler. Anyway, @ryao can surely answer you in a more detailed/correct form.

The key point is that it is not the DMU failing to merge requests; it actually is doing I/O merging. You are asking for an additional buffer to "pre-merge" multiple write requests before passing them to the "real" ZFS code, in order to avoid read amplification. While I understand your request, I think this is currently out of scope, and quite different from how ZFS is expected to work.

shodanshok avatar Mar 03 '19 16:03 shodanshok

@shodanshok ZVOL has been widely used as a block device since its beginning, for example as a backend for FC, iSCSI, and VM hosting, and even stacked with md, drbd, flashcache, and rbd block devices. Therefore it is extremely important to keep ZVOL behaving like a "normal" block device, supporting a scheduler such as noop/deadline to merge incoming IO requests. By the way, having standard scheduler behavior has nothing to do with double caching.

@kpande Using a smaller volblocksize, such as volblocksize=4k, may help reduce RMW without the need for IO request merging; however, this is far from ideal: with 4KB-sector disks, it effectively prevents ZFS compression and the use of RAIDZ. Furthermore, an extremely small volblocksize has a negative impact on throughput. It is widely reported that, for hosting VMs, a volblocksize of 32KB is a better choice in practice.

On Mon, Mar 4, 2019 at 3:54 AM kpande [email protected] wrote:

just use a smaller volblocksize and be aware of raidz overhead considerations if you are not using mirrors. using native 512b storage (some NVMe, some datacentre HDD up to 4TB) and ashift=9 will allow compression to work with volblocksize=4k.
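
For reference, the configuration described above would look roughly like this (a sketch only; pool, device, and zvol names are illustrative, and it only applies to native 512-byte-sector devices):

# pool created with 512-byte allocation units
zpool create -o ashift=9 tank mirror sda sdb

# small-block zvol; compressed 4k blocks can still save space because they are stored in 512-byte sectors
zfs create -V 100G -o volblocksize=4k -o compression=lz4 tank/vm01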


samuelxhu avatar Mar 03 '19 23:03 samuelxhu

@samuelxhu but they are normal block devices; only the scheduler code was bypassed to improve performance in the common case. I have no problem understanding what you say and why, but please be aware that you are describing a pretty narrow use case/optimization: contiguous, non-direct 4k writes to a zvol is the only case where pagecache merging would be useful. If random I/O is issued, merging is not useful. If direct I/O is used, merging is again not useful.

So, while I am not against the change you suggest, please be aware of its narrow scope in real world workloads.

shodanshok avatar Mar 04 '19 06:03 shodanshok

@kpande I have over 20 ZFS storage boxes serving as FC/iSCSI backends, which use a 32KB volblocksize. We run different workloads on them and found that a 32KB volblocksize strikes the best balance between IOPS and throughput. I have several friends running ZVOLs for VMware who recommend 32KB as well. Therefore IO request merging and sorting at the ZVOL layer can effectively reduce RMW. @shodanshok Adding a scheduler layer to ZVOL will not cost much memory/CPU, but it will enable stacking ZVOLs with many other Linux block devices, embracing a much broader scope of use.

samuelxhu avatar Mar 04 '19 09:03 samuelxhu

Let me describe another ZVOL use case which requires normal block device behavior with a valid scheduler: one or multiple application servers use an FC or iSCSI LUN backed by a ZVOL; the servers use a server-side SSD cache, such as Flashcache or bcache, to reduce latency and accelerate application IO. Either flashcache or bcache will issue small but contiguous 4KB IO requests to the backend, anticipating that the backend block device will sort and merge those contiguous requests.

In the above case, any other block device, including HDD, SSD, RAID, or virtual block devices, will have no performance issues. BUT with a zvol in its current implementation, one will see significant performance degradation due to excessive and unnecessary RMW.

samuelxhu avatar Mar 04 '19 14:03 samuelxhu

In general, it is unlikely that merging will benefit overall performance. However, concurrency is important and has changed during the 0.7 evolution. Unfortunately, AFAIK, there is no comprehensive study on how to tune the concurrency. See https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads
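
For anyone experimenting with that parameter, a sketch of how it is usually adjusted (the value shown is only an example; it is typically set at module load time):

# /etc/modprobe.d/zfs.conf
options zfs zvol_threads=64

# verify after the module is reloaded
cat /sys/module/zfs/parameters/zvol_threads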

Also, there are discussions in https://github.com/zfsonlinux/zfs/issues/7834 regarding the performance changes over time, especially with the introduction of the write and DVA throttles. If you have data to add, please add it there.

richardelling avatar Mar 04 '19 17:03 richardelling

Why is using ZVOLs as backend block devices for iSCSI/FC LUNs not a common use case? Don't be narrow-minded; it is very common. This is the typical use case where ZVOL should have its own scheduler, for at least two purposes: 1) to stay compatible with the Linux block device model (extremely important for block device stacking), as applications anticipate that the backend ZVOL will do IO merging and sorting; 2) to reduce the chance of the notorious RMW, in particular for non-4KB ZVOL volblocksizes.

I do not really understand why ZVOL should be different from a normal block device. For those who use ZVOL with a 4KB volblocksize only, setting the scheduler to noop/deadline costs only a few CPU cycles, but IO merging has a big potential to reduce the chance of RMW for non-4KB volblocksizes.

Pity on me, I run more than a hundred FC/iSCSI ZFS ZVOL storage boxes with a volblocksize of 32KB or even bigger for sensible reasons; the missing scheduler in 0.7.X causes us pain with excessive RMW and thus performance degradation, preventing us from upgrading (from 0.6.4.2) to any later version.

We would like to sponsor a fund to support somebody who can make a patch restoring the scheduler feature for ZVOL in 0.7.X. Anyone who is interested, please contact me at [email protected]. The patch may or may not be accepted upstream, but we would like to pay for the work.

samuelxhu avatar Mar 05 '19 02:03 samuelxhu

@kpande Thanks a lot for pointing out the related previous commits; I will have a careful look at them and try to find a temporary remedy for the excessive RMW.

I notice that previous zvol performance testing focused primarily on 4KB or 8KB ZVOLs; perhaps that is why the RMW issue was less visible and thus ignored by many eyes.

Let me explain a bit why a larger-volblocksize ZVOL still makes sense and should not be ignored: 1) it enables the use of LZ4 compression together with RAIDZ(1/2/3) for storage space efficiency; 2) it strikes a balance between IOPS and throughput, and 32KB seems good for VM workloads since it is neither too big nor too small; 3) we have server-side flash caches (flashcache, bcache, enhanceIO, etc.) implemented on all application servers, which absorb random 4KB writes and then issue contiguous (semi-sequential) IO requests of 4KB or other small sizes, anticipating that the backend block devices (iSCSI/FC ZVOLs) will do IO merging/sorting.

In my humble opinion, eliminating the scheduler code from ZVOL really causes RMW pain for non-4KB ZVOLs, perhaps not for everyone, but at least for some ZFS fans.

samuelxhu avatar Mar 05 '19 06:03 samuelxhu

@kpande It is interesting to notice that some people are complaining about performance degradation due to commit 37f9dac as well in https://github.com/zfsonlinux/zfs/issues/4512

Maybe it is just a coincidence, maybe not.

The commit 37f9dac may perform well for zvols with direct I/O, but there are many other use cases which are suffering from performance degradation due to the missing scheduler (merging and sorting IO requests) behavior.

samuelxhu avatar Mar 05 '19 09:03 samuelxhu

It seems https://github.com/zfsonlinux/zfs/issues/361 basically covers the problem explained here.

Rather than using the pagecache (with its double-caching and increased memory pressure on ARC), I would suggest creating a small (~1M), front "write buffer" to coalesce writes before sending them to ARC.

@behlendorf @ryao any chances to implement something similar?

shodanshok avatar Mar 05 '19 10:03 shodanshok

@shodanshok good finding!

Indeed #361 deals essentially with the same RMW issue as here. It came out in 2011, when ZFS practitioners could at least use the deadline/noop scheduler (before 0.6.5.X) to reduce the chance of RMW. In #4512, a few ZFS users complained about significant write amplification right after the scheduler was removed, but for unknown reasons RMW received little attention.

Given so much evidence, it seems to be the right time to take serious efforts to solve this RMW issue for ZVOL. We volunteer to take responsibility for testing and, if needed, funding sponsorship of up to 5K USD (from Horeb Data AG, Switzerland) is possible for the code developer (if multiple developers are involved, behlendorf please divide).

samuelxhu avatar Mar 05 '19 14:03 samuelxhu

@kpande Only for database workloads do we have aligned IO on ZVOLs, and unfortunately I do not observe significant performance improvement after 0.6.5.x. The reason might be that I universally build ZFS boxes with high-end CPUs and plenty of DRAM (256GB or above), so saving a few CPU cycles has no material impact on IO performance. (The bottleneck is definitely the HDDs, not CPU cycles or memory bandwidth.)

Most of our workloads are not aligned IO, such as hosting VMs and FC/iSCSI backed by ZVOLs, where the frontend applications generate mixed workloads of all kinds. Our engineering team currently focuses on fighting RMW, and I think either #361 or #4512 should already show sufficient evidence of the issue.

Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device sitting in front of ZFS to enable IO request sorting and merging to reduce the occurrence of RMW.

samuelxhu avatar Mar 06 '19 10:03 samuelxhu

@samuelxhu one thing I'd suggest trying first is to increase the dbuf cache size. This small cache sits in front of the compressed ARC and contains an LRU of the most recently used uncompressed buffers. By increasing its size you may be able to mitigate some of the RMW penalty you're seeing. You'll need to increase the dbuf_cache_max_bytes module option.
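
For example, a rough sketch of raising it (the 1 GiB figure is arbitrary; depending on the ZFS version the parameter may only take effect at module load time):

# raise the dbuf cache to 1 GiB at runtime, if the parameter is writable
echo 1073741824 > /sys/module/zfs/parameters/dbuf_cache_max_bytes

# or persistently, via /etc/modprobe.d/zfs.conf
options zfs dbuf_cache_max_bytes=1073741824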

Before ZVOLs has an effective IO merging facility, we plan to write a shim layer block device sitting in front of ZFS to enable IO request sorting and merging to reduce the occurrence of RMWs.

You might find you can use one of Linux's many existing dm devices for this layer.

Improving the performance of volumes across a wide variety of workloads is something we're interested in, but haven't had the time to work on. If you're interested, rather than implementing your own shim layer I'd be happy to discuss a design for doing the merging in the zvol implementation. As mentioned above, the current code depends on the DMU to do the heavy lifting regarding merging. However, for volumes there's nothing preventing us from doing our own merging. Even just front/back merging or being aware of the volume's internal alignment might yield significant gains.

behlendorf avatar Mar 06 '19 20:03 behlendorf

In order to merge you need two queues: active and waiting. With the request-based scheme there is one queue with depth=zvol_threads. In other words, we'd have to pause I/Os before they become active. This is another reason why I believe merging is not the solution to the observed problem.

richardelling avatar Mar 07 '19 00:03 richardelling

@richardelling From my tests, it seems that DMU merging at writeout time is working properly. What kills the performance of smaller-than-recordsize writes (i.e. 4k on a 128K recordsize/volblocksize), for both zvols and regular datasets, is the read part of the r/m/w behavior. Basically, when a small write (i.e. 4k) is buffered by the ARC, the whole 128K record has to be brought into memory, irrespective of whether later writes overlap (and completely account for) the whole record.

Hence my idea of a "front buffer" which accepts small writes as they are (irrespective of the underlying recordsize) and, after having accumulated/merged some data (say, 1 MB), writes them out via the normal ARC buffering/flushing scheme. This would emulate what the pagecache does for regular block devices, without the added memory pressure of a real pagecache (which cannot be limited in any way, if I remember correctly).

I have no idea whether this can be implemented without lowering ZFS's excellent resilience, or how difficult doing so would be, of course.

shodanshok avatar Mar 07 '19 07:03 shodanshok

@behlendorf thanks a lot for the suggestions. It looks like front merging can easily be turned on by reverting commit 5731140, but extensive IO sorting/merging inside ZVOL/DMU may take more effort. I may not be capable of coding much myself, but I would like to contribute testing or help in other ways as much as possible.

samuelxhu avatar Mar 09 '19 04:03 samuelxhu

Just to chime in - we use ZFS heavily with VM workloads and there is a huge tradeoff between using a 128KiB volblocksize and smaller ones. Larger volblocksizes actually perform much better up to the point where throughput is saturated, while smaller volblocksizes almost always perform worse but don't cause throughput problems. And I found it quite difficult to actually predict/benchmark this behaviour because it works very differently on new unfragmented pools, new ZVOLs (no overwrites), different layers of caching (I am absolutely certain that the Linux pagecache still does something with ZFS, as I'm seeing misses that never hit the drives) and various caching problems (ZFS doesn't seem to cache everything it should or could in the ARC).

This all makes it very hard to compare performance of ZFS/ZVOLs to any other block device, it makes it hard to tune and it makes it extremely hard to compete with "dumb" solutions like mdraid when performance is all over the place.

If there is any possibility to improve merging to avoid throughput saturation, then please investigate it. The other solution (to the problems I am seeing in my environment) is to fix the performance issues with smaller volblocksizes, but I guess that will be much more difficult, and I have seen it already discussed elsewhere multiple times (like ZFS not being able to use vdev queues efficiently when those vdevs are fast, like NVMe, where I have rarely seen a queue size > 1).

zviratko avatar Apr 03 '19 08:04 zviratko

We did a lot of experimentation with ZVOLs here and I'd like to offer a few suggestions.

  1. RMW can come from above you as well as from within ZFS. Depending on what parameters you're using on your filesystem and what you set for your block device, you can end up with either the VM subsystem or user land thinking that you have a large minimum IO size, and they will try to pull in data from you before they write out.

With zvols, always always always blktrace them as you're setting up to see what is going on. We found that some filesystem options (large XFS allocsize=) could provoke RMW from the pager when things were being flushed out. If you blktrace and see reads for a block coming in before the writes do, you are in this situation (a minimal blktrace command sketch follows the example at the end of this comment).

  2. Proper setup is essential and "proper" is a matter of perspective. Usually it's best to configure a filesystem as though it was on a RAID stripe either the size of the volblocksize, or half that size. The reason you might choose a smaller size is if you are on a pool with no SLOG and you want all FIO writes to the zvol to go to ZIL blocks instead of indirect sync, as large block zvols do with full-block writes. Or, you may want to refactor your data into larger chunks for efficiency or synchronization purposes.

  3. Poor inbound IO merge. It's best to configure a filesystem on a zvol to expose a large preferred IO size to applications, allowing FIO to come through in big chunks.

  4. Always use primarycache=all.

  5. If you use XFS on zvols, use a separate 4K volblocksize ZVOL for XFS filesystem journaling. This can be small, 100MB is more than enough. This keeps the constant flushing that XFS does out of your primary ZVOL, and allows things to aggregate much more effectively.

Here's an example:

zfs create -V 1g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal

mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256K,logbufs=8 /dev/zvol/tank/xfs /somewhere

largeio + large stripe unit + separate XFS journal has been the winning combination for us.
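
Regarding the blktrace suggestion in point 1, a minimal sketch of watching a zvol live (the device name is illustrative):

# reads arriving for a block just before writes to the same block indicate RMW coming from above ZFS
blktrace -d /dev/zd0 -o - | blkparse -i -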

Hope this helps.

janetcampbell avatar Apr 06 '19 03:04 janetcampbell

Very good points. Thanks a lot

Samuel

On Sat, Apr 6, 2019 at 5:08 AM janetcampbell [email protected] wrote:

We did a lot of experimentation with ZVOLs here and I'd like to offer a few suggestions.

  1. RMW can come from above you as well as from within ZFS. Depending on what parameters you're using on your filesystem and what you set for your block device, you can end up with either the VM subsystem or user land thinking that you have a large minimum IO size, and they will try to pull in data from you before they write out.

With zvols, always always always blktrace them as you're setting up to see what is going on. We found that some filesystem options (large XFS allocsize=) could provoke RMW from the pager when things were being flushed out. If you blktrace and see reads for a block coming in before the writes do, you are in this situation.

  2. Proper setup is essential and "proper" is a matter of perspective. Usually it's best to configure a volume as though it was on a RAID stripe either the size of the volblocksize, or half that size. The reason you might choose a smaller size is if you are on a pool with no SLOG and you want all writes to the zvol to go to ZIL blocks instead of indirect sync, as large block zvols do with full-block writes. Or, you may want to refactor your data into larger chunks for efficiency or synchronization purposes.

  3. Poor inbound IO merge. It's best to configure a filesystem on a zvol to expose a large preferred IO size to applications, allowing FIO to come through in big chunks.

  4. Always use primarycache=all.

  5. If you use XFS on zvols, use a separate 4K volblocksize ZVOL for XFS filesystem journaling. This can be small, 100MB is more than enough. This keeps the constant flushing that XFS does out of your primary ZVOL, and allows things to aggregate much more effectively.

Here's an example:

zfs create -V 1g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal

mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256K,logbufs=8 /dev/zvol/tank/xfs /somewhere

largeio + large stripe unit have been the winning combination for us.

Hope this helps.


samuelxhu avatar Apr 06 '19 06:04 samuelxhu

Just to chime in - we use ZFS heavily with VM workloads and there is a huge tradeoff between using a 128KiB volblocksize or smaller. Higher volblocksizes actually perform much better up to a point when throughput is saturated, while smaller volblocksizes almost always perform worse, but don't cause throughput problems.

A little gem I came up with that I haven't seen elsewhere...

Large zvols cause more TxG commit activity. The big danger from this is RMW reads, which can stomp on other IO that's going around.

Measure TxG commit speed. Open the ZIO throttle. Then, set zfs_sync_taskq_batch_pct=1 and do a TxG commit. Raise it slowly until TxG commit speed is a little slower than it was before the test. This will rate limit the TxG commit and the RMW reads that come off of it, and also can help I/O aggregation. I came up with this approach when I developed a remote backup system that went to block devices on the far side of a WAN.

With this you can run long intervals between commits and carry plenty of dirty data, which helps reduce RMW. Once you set the sync taskq, turn the ZIO throttle on and adjust it to just before where it starts to have an effect. This will match these two parameters to the natural flow of the system. At this point you can usually turn aggregation way up and drop the number of async writers some.

Oh, and make sure your dirty data write throttle is calibrated correctly and has enough room to work. ndirty should stabilize in the middle of its range during high throughput workloads.
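
A rough sketch of where these knobs live (module parameter names as documented for ZFS on Linux; the values shown are only illustrative starting points, not recommendations):

# start the sync taskq small and raise it until TxG commit speed is acceptable
echo 1 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct

# toggle the ZIO/DVA throttle (1 = enabled, 0 = disabled)
echo 1 > /sys/module/zfs/parameters/zio_dva_throttle_enabled

# longer intervals between commits and more dirty-data headroom
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max

# larger vdev aggregation, fewer async writers
echo 1048576 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
echo 5 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active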

We mostly use 128K-256K zvols. They work very well and beat out ZPL mounts for MongoDB performance for us. Performance is more consistent than ZPL mounts provided you're good to them (don't do indirect sync writes with a small to moderate block size zvol unless you don't care about read performance).

janetcampbell avatar Apr 06 '19 13:04 janetcampbell

I realized there are a lot of comments here that are coming from the wrong place on RMW reads and how ZFS handles data going into the DMU and such. Unless in the midst of a TxG commit, ZFS will not issue RMW reads for partial blocksize writes unless they are indirect sync writes, and you can't get a partial block indirect sync write on a ZVOL due to how zvol_immediate_write_size is handled. Normally the txg commit handles all RMW reads when necessary at the start of the commit, and none happen between commits.

The RMW reads people are bothered by are actually coming from the Linux kernel, in fs/buffer.c. Here's a long winded explanation of why and how to fix it (easy with ZVOLs):

https://github.com/zfsonlinux/zfs/issues/8590

With a 4k superblock inode size you can run a ZVOL with a huge volblocksize, txg commit once a minute, and handle tiny writes without problem. Zero RMW if all the pieces of the block show up before TxG commit.

Hope this helps.

janetcampbell avatar Apr 07 '19 22:04 janetcampbell

@janetcampbell while I agree that a reasonably sized recordsize is key to extracting good read performance, especially from rotating media, I think you are missing the fact that RMW can and will happen very early in the write process, as early as when the write buffer is accepted into the DMU. Let me give a practical example:

# create a test dataset with the default 128K recordsize
[root@singularity ~]# zfs create tank/test
[root@singularity ~]# zfs get recordsize tank/test
NAME       PROPERTY    VALUE    SOURCE
tank/test  recordsize  128K     default

# create a 1GB test file and drop caches
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.94275 s, 217 MB/s
[root@singularity ~]# sync
[root@singularity ~]# echo 3 > /proc/sys/vm/drop_caches

# rewrite some sequential 4k blocks
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.05854 s, 4.0 MB/s

# on another terminal, monitor disk io - rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  2.16K   494K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      3      0   511K      0
tank        92.0G  1.72T     27      0  3.50M      0
tank        92.0G  1.72T      0    169      0  9.35M
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0

# retry the same *without* dropping the cache
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0306379 s, 137 MB/s

# no rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  3.07K   489K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0     61      0  7.63M
tank        92.0G  1.72T      0    104      0  1.78M

Please note how, on the first 4k write test, rmw (with synchronous reads) happens as soon as the write buffers are accepted in the DMU (this is reflected by the very low dd throughput). This happens even if dd, being sequential, completely overwrites the affected zfs records. In other words, we don't really have a merging problem here; rather, we see io amplification due to rmw. Merging at writeout time is working correctly.

The second identical write test, which is done without dropping the cache, avoids the rmw part (especially its synchronous read part) and shows much higher write performance. Again, merging at write time is working correctly.

This is, in my opinion, the key reason why people say ZFS needs tons of memory to have good performance: since it is so penalizing, reducing the R part of rmw by using a very large ARC can be extremely important. It should be noted that L2ARC works very well in this scenario, and this is the main reason why I often use a cache device even on workloads with a low L2ARC hit rate.
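
For example, adding a cache device is a one-liner (the device name is illustrative):

# an SSD/NVMe L2ARC device helps absorb the read half of the r/m/w cycle
zpool add tank cache /dev/nvme0n1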

shodanshok avatar Apr 08 '19 08:04 shodanshok

You can enable merging at runtime on ZVOLs; IIRC that patch was in part trying to deal with Oracle RAC/iSCSI/ZFS. Disabling merges and upping nr_requests helps with some workloads, but doesn't do as much as one would think in my testing. You can try the same via:

diff --git a/include/zfs/linux/blkdev_compat.h b/include/zfs/linux/blkdev_compat.h
index c8cdf38ef4fe..ad3e9537d5b3 100644
--- a/include/zfs/linux/blkdev_compat.h
+++ b/include/zfs/linux/blkdev_compat.h
@@ -632,4 +632,11 @@ blk_generic_end_io_acct(struct request_queue *q, int rw,
 #endif
 }
 
+static inline void blk_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+        spin_lock_irq(q->queue_lock);
+        q->nr_requests = nr;
+        spin_unlock_irq(q->queue_lock);
+}
+
 #endif /* _ZFS_BLKDEV_H */
diff --git a/fs/zfs/zfs/zvol.c b/fs/zfs/zfs/zvol.c
index 6eb926cee6ef..05823a24ce5b 100644
--- a/fs/zfs/zfs/zvol.c
+++ b/fs/zfs/zfs/zvol.c
@@ -1666,8 +1666,12 @@ zvol_alloc(dev_t dev, const char *name)
        /* Limit read-ahead to a single page to prevent over-prefetching. */
        blk_queue_set_read_ahead(zv->zv_queue, 1);
 
+
+       /* Set deeper IO queue for modern zpools: default is 128, SSDs easily do > 512*/
+       blk_update_nr_requests(zv->zv_queue, 1024);
+
        /* Disable write merging in favor of the ZIO pipeline. */
-       blk_queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);
+       // blk_queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);
 
        zv->zv_disk = alloc_disk(ZVOL_MINORS);
        if (zv->zv_disk == NULL)

sempervictus avatar Apr 09 '19 18:04 sempervictus

Any news? This issue is annoying. I'm seeing a queue size > 5K for my SSD and sometimes it starts to produce errors:
[ 2111.023567] sd 11:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2111.023570] sd 11:0:0:0: [sdd] tag#12 CDB: Write(10) 2a 00 2b 0e ea ea 00 00 03 00
[ 2111.023573] blk_update_request: I/O error, dev sdd, sector 722397930 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0

It's not a problem with the SSD, I think. It is just a slow model; SMART is OK and it works flawlessly under other workloads.

Temtaime avatar Jan 21 '20 22:01 Temtaime