Lack of fairness of sync writes
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Arch Linux |
| Distribution Version | rolling |
| Linux Kernel | Linux 5.2.7-arch1-1-ARCH #3 SMP, PREEMPT_VOLUNTARY |
| Architecture | x86-64 |
| ZFS Version | 0.8.2 |
| SPL Version | |
Describe the problem you're observing
I am observing unfair sync write scheduling and severe userspace process IO starvation in certain situations. It appears that an fsync() call on a file with a lot of unwritten dirty data stalls the system and causes FIFO-like sync write ordering, where no other process gets its share until the dirty data is flushed. On my home system this causes severe stalls when a guest VM with a cache=writeback virtio-scsi disk decides to sync the SCSI barrier while a lot of its dirty data sits in the hypervisor's RAM. All other writers on the hypervisor block completely and userspace starts failing with various timeouts and lockups. It effectively acts as a DoS.
Describe how to reproduce the problem
1). Prepare a reasonably-default dataset.
zfs get all rootpool1/arch1/varcache
NAME PROPERTY VALUE SOURCE
rootpool1/arch1/varcache type filesystem -
rootpool1/arch1/varcache creation Thu Aug 15 23:37 2019 -
rootpool1/arch1/varcache used 13.4G -
rootpool1/arch1/varcache available 122G -
rootpool1/arch1/varcache referenced 13.4G -
rootpool1/arch1/varcache compressratio 1.08x -
rootpool1/arch1/varcache mounted yes -
rootpool1/arch1/varcache quota none default
rootpool1/arch1/varcache reservation none default
rootpool1/arch1/varcache recordsize 128K default
rootpool1/arch1/varcache mountpoint /var/cache local
rootpool1/arch1/varcache sharenfs off default
rootpool1/arch1/varcache checksum on default
rootpool1/arch1/varcache compression lz4 inherited from rootpool1
rootpool1/arch1/varcache atime off inherited from rootpool1
rootpool1/arch1/varcache devices on default
rootpool1/arch1/varcache exec on default
rootpool1/arch1/varcache setuid on default
rootpool1/arch1/varcache readonly off default
rootpool1/arch1/varcache zoned off default
rootpool1/arch1/varcache snapdir hidden default
rootpool1/arch1/varcache aclinherit restricted default
rootpool1/arch1/varcache createtxg 920 -
rootpool1/arch1/varcache canmount on default
rootpool1/arch1/varcache xattr sa inherited from rootpool1
rootpool1/arch1/varcache copies 1 default
rootpool1/arch1/varcache version 5 -
rootpool1/arch1/varcache utf8only off -
rootpool1/arch1/varcache normalization none -
rootpool1/arch1/varcache casesensitivity sensitive -
rootpool1/arch1/varcache vscan off default
rootpool1/arch1/varcache nbmand off default
rootpool1/arch1/varcache sharesmb off default
rootpool1/arch1/varcache refquota none default
rootpool1/arch1/varcache refreservation none default
rootpool1/arch1/varcache guid 394889745699357232 -
rootpool1/arch1/varcache primarycache all default
rootpool1/arch1/varcache secondarycache all default
rootpool1/arch1/varcache usedbysnapshots 0B -
rootpool1/arch1/varcache usedbydataset 13.4G -
rootpool1/arch1/varcache usedbychildren 0B -
rootpool1/arch1/varcache usedbyrefreservation 0B -
rootpool1/arch1/varcache logbias latency default
rootpool1/arch1/varcache objsetid 660 -
rootpool1/arch1/varcache dedup off default
rootpool1/arch1/varcache mlslabel none default
rootpool1/arch1/varcache sync standard default
rootpool1/arch1/varcache dnodesize legacy default
rootpool1/arch1/varcache refcompressratio 1.08x -
rootpool1/arch1/varcache written 13.4G -
rootpool1/arch1/varcache logicalused 14.5G -
rootpool1/arch1/varcache logicalreferenced 14.5G -
rootpool1/arch1/varcache volmode default default
rootpool1/arch1/varcache filesystem_limit none default
rootpool1/arch1/varcache snapshot_limit none default
rootpool1/arch1/varcache filesystem_count none default
rootpool1/arch1/varcache snapshot_count none default
rootpool1/arch1/varcache snapdev hidden default
rootpool1/arch1/varcache acltype off default
rootpool1/arch1/varcache context none default
rootpool1/arch1/varcache fscontext none default
rootpool1/arch1/varcache defcontext none default
rootpool1/arch1/varcache rootcontext none default
rootpool1/arch1/varcache relatime off default
rootpool1/arch1/varcache redundant_metadata all default
rootpool1/arch1/varcache overlay on local
rootpool1/arch1/varcache encryption off default
rootpool1/arch1/varcache keylocation none default
rootpool1/arch1/varcache keyformat none default
rootpool1/arch1/varcache pbkdf2iters 0 default
rootpool1/arch1/varcache special_small_blocks 0 default
2). Prepare two terminal tabs and cd to this dataset's mount point in both. In them, prepare the following fio commands: "big-write"
fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1
and "small-write"
fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1
3). Let them run once to prepare the necessary benchmark files. In the meantime observe the iostat on the pool:
zpool iostat -qv rootpool1 0.1
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rootpool1 273G 171G 0 7.15K 0 598M 0 0 0 0 0 0 213 8 0 0 0 0
mirror 273G 171G 0 7.15K 0 597M 0 0 0 0 0 0 213 8 0 0 0 0
ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2 - - 0 3.57K 0 300M 0 0 0 0 0 0 105 4 0 0 0 0
ata-Patriot_Burst_9128079B175300025792-part2 - - 0 3.56K 0 297M 0 0 0 0 0 0 108 4 0 0 0 0
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
...
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rootpool1 274G 170G 0 7.19K 0 565M 0 0 0 7 0 0 0 0 0 0 0 0
mirror 274G 170G 0 7.18K 0 565M 0 0 0 7 0 0 0 0 0 0 0 0
ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2 - - 0 3.59K 0 282M 0 0 0 3 0 0 0 0 0 0 0 0
ata-Patriot_Burst_9128079B175300025792-part2 - - 0 3.59K 0 282M 0 0 0 4 0 0 0 0 0 0 0 0
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
Note that when fio issues 2G of async writes it calls fsync at the very end, which moves them from the async to the sync class.
4). When both fio runs have finished, do the following: start "big-write" and then, after 2-3 seconds (when "Jobs: 1" appears), start "small-write". Note that the small 128K write never finishes before the 2G one; the second fio remains blocked until the first one finishes.
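For convenience, step 4 can be scripted roughly as follows. This is only a sketch; it assumes both jobs are run from the dataset mount point and reuses the exact fio commands from step 2:
# run the big write in the background, wait a few seconds, then time the small one
fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1 &
sleep 3
time fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1
wait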
Include any warning/errors/backtraces from the system logs
Lowering zfs_dirty_data_max significantly (to 100-200M values from the default 3G) mitigates the problem for me, but with a ~50% performance drop.
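For reference, this mitigation can be applied through the module parameter. A minimal sketch, assuming the Linux sysfs parameter path (209715200 bytes = 200M):
# check the current (auto-derived) value in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# lower it at runtime to 200M
echo 209715200 > /sys/module/zfs/parameters/zfs_dirty_data_max
# persist it across module reloads
echo "options zfs zfs_dirty_data_max=209715200" >> /etc/modprobe.d/zfs.conf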
After some code investigation, the problem appears to be deeply ingrained in the write path. Multiple issues contribute to it:
- The writer's task id is effectively lost at the VFS layer (vnops). Apart from zio priority, there is no QoS-related tracking for DMU itxs and zios. While current()'s task id could be added to these structures, no such data is tracked at the moment.
- The DMU throttle that implements the zfs_dirty_data_max logic and bounds the total size of dirty dbufs in a transaction group is unfair. On a very busy system with one full txg already syncing, a single process can eat up the entire dirty-data budget and is throttled no harder than the interactive processes that arrive later, behind that single DoSer's write flow. Transaction allocation is only as fair as the CPU scheduler, relying on thread priorities. Even worse, unless their files are opened with O_SYNC, latency-sensitive applications that rely on multiple writes + fsync still go through the DMU throttle and receive a severe penalty to their write speed.
- The ZIL is serialized per dataset. I repeated the experiment from the OP on two separate datasets, and the second, small fio did manage to complete before the first, big one in some cases. The speed of the second fio was still abysmal, as nothing in the SPA/ZIO layers guarantees fairness between ZILs. Such FIFO ordering is incompatible with any form of bandwidth sharing, and this is the original cause of my problems: the ZIL processes 2G from the first fio and only then the 128K from the small fio.
- The SPA allocation throttle and the vdev schedulers are oblivious to the source of the zios they handle. They are fair neither to processes nor to any other entity; they only look at the graph ordering of zios and their priority class.
I am afraid my workaround is currently the only viable option for acceptable latency under overwhelming fsync load. ZFS is neither designed nor built to be bandwidth-fair to consumer entities.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
It's a design issue, so I guess the probability of a fix is effectively zero. Still, it is a desirable feature in both desktop and multi-tenant headless environments. Let's hear the developers out on the subject of complexity and then close/wontfix it.
Probably related to #11140
Apparently we had this in the past, and maybe I was wrong that it was resolved?
https://github.com/openzfs/zfs/issues/4603
Anyhow, what about https://github.com/openzfs/zfs/pull/11929#issuecomment-829710720 and https://github.com/openzfs/zfs/pull/11912 ?
I can still reproduce it on 0.8.4:
main big write:
WRITE: bw=62.0MiB/s (65.0MB/s), 62.0MiB/s-62.0MiB/s (65.0MB/s-65.0MB/s), io=2048MiB (2147MB), run=33055-33055msec
starved small write:
WRITE: bw=5384B/s (5384B/s), 5384B/s-5384B/s (5384B/s-5384B/s), io=128KiB (131kB), run=24344-24344msec
The dd example from #4603 seems incorrect; I did not see any non-zero numbers in the syncq_write column of zpool iostat while running it.
The solutions/comments you linked revolve around limiting queue depth and reducing latency at the cost of bandwidth. They are no substitute for a theoretical writer-aware fair scheduler, which could fix the latency without making the write queue universally shallow.
stale bot closed this on 25 Nov 2020
#4603 was open for 4 years with no activity and then closed by the stale bot. Not the best way to handle issues, and this is a real one.
Our whole Proxmox (with local ZFS storage) migration from XenServer has been stalled because of this for weeks now, and all our effort so far may go poof if we don't get this resolved.
But besides the fsync stall there may be other stalling issues in KVM. I'm putting this bug report here for reference, as it is at least related: https://bugzilla.kernel.org/show_bug.cgi?id=199727
I also think this is a significant one, rendering KVM on top of ZFS practically unusable when you have fsync-centric workloads.
Thanks for reporting the details and for your analysis!
Our whole Proxmox (with local ZFS storage) migration from XenServer has been stalled because of this for weeks now, and all our effort so far may go poof if we don't get this resolved.
Yeah, if you're multi-tenant (many VMs) you'll have better luck with boring qcow2s on ext/xfs+raid.
Not an option. I want ZFS snapshots and replication, and I care about my data.
@behlendorf wondering what your thoughts on this issue are. I'm on Proxmox as well and have occasionally noticed the same thing as @Boris-Barboris and @devZer0 have.
this is a real bummer.
The following output shows the delays that happen, for example, simply by copying a file inside a virtual machine.
You can clearly see that sync IO from ioping is getting completely starved inside the VM. I have never seen a single IO take 5.35 min to complete.
[root@gitlab backups]# ioping -WWWYy ioping.dat
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=43 time=87.7 ms
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=44 time=65.9 ms
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=45 time=24.9 ms (fast)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=46 time=25.4 ms (fast)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=47 time=42.9 ms
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=48 time=16.3 s (slow)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=49 time=5.35 min (slow)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=50 time=99.5 ms (fast)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=51 time=311.5 ms (fast)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=52 time=322.6 ms (fast)
4 KiB >>> ioping.dat (xfs /dev/dm-3): request=53 time=49.9 ms (fast)
This long starvation causes the following kernel message:
[87720.075195] INFO: task xfsaild/dm-3:871 blocked for more than 120 seconds.
[87720.075320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87720.075384] xfsaild/dm-3 D ffff8c48f43605e0 0 871 2 0x00000000
[87720.075387] Call Trace:
[87720.075405] [
I'm not completely sure this is a ZFS problem alone: with "zpool iostat -w hddpool 1" I would expect to see the outstanding IO from ioping (which hangs for minutes) in the syncq_wait column, but in the 137s row not a single IO is shown. Is there a way to make this visible at the ZFS layer?
hddpool total_wait disk_wait syncq_wait asyncq_wait
latency read write read write read write read write scrub trim
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1ns 0 0 0 0 0 0 0 0 0 0
3ns 0 0 0 0 0 0 0 0 0 0
7ns 0 0 0 0 0 0 0 0 0 0
15ns 0 0 0 0 0 0 0 0 0 0
31ns 0 0 0 0 0 0 0 0 0 0
63ns 0 0 0 0 0 0 0 0 0 0
127ns 0 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 0 0 0 0 0 0
511ns 0 0 0 0 0 0 0 4 0 0
1us 0 0 0 0 0 0 0 32 0 0
2us 0 0 0 0 0 0 0 3 0 0
4us 0 0 0 0 0 0 0 3 0 0
8us 0 0 0 0 0 0 0 0 0 0
16us 0 0 0 0 0 0 0 0 0 0
32us 0 0 0 0 0 0 0 0 0 0
65us 0 0 0 0 0 0 0 0 0 0
131us 0 0 0 0 0 0 0 1 0 0
262us 0 0 0 0 0 0 0 0 0 0
524us 0 5 0 5 0 0 0 6 0 0
1ms 0 11 0 11 0 0 0 5 0 0
2ms 0 17 0 17 0 0 0 5 0 0
4ms 0 5 0 8 0 0 0 0 0 0
8ms 0 6 0 3 0 0 0 0 0 0
16ms 0 2 0 4 0 0 0 0 0 0
33ms 0 8 0 6 0 0 0 3 0 0
67ms 0 0 0 0 0 0 0 1 0 0
134ms 0 0 0 0 0 0 0 2 0 0
268ms 0 5 0 18 0 0 0 6 0 0
536ms 0 18 0 39 0 0 0 3 0 0
1s 0 5 0 2 0 0 0 0 0 0
2s 0 0 0 0 0 0 0 0 0 0
4s 0 0 0 0 0 0 0 0 0 0
8s 0 0 0 0 0 0 0 0 0 0
17s 0 0 0 0 0 0 0 0 0 0
34s 0 0 0 0 0 0 0 0 0 0
68s 0 32 0 0 0 32 0 0 0 0
137s 0 0 0 0 0 0 0 0 0 0
--------------------------------------------------------------------------------
I know this can be mitigated to some degree by adding a SLOG, but we have an SSD and an HDD mirror or raidz on each hypervisor server, and adding another enterprise SSD just to make the HDDs run without problems feels a little ugly; at that point you might as well switch to an SSD-only system.
Any updates? Are there plans to resolve this in the future versions?
Why hasn't this been escalated as a serious issue? Performance before features imo
So it's kind of a design limitation: normally a filesystem offers a mount point and accesses a disk. Fairness is provided by the kernel IO scheduler, which attributes the individual requests to the processes issuing them.
ZFS, however, doesn't work that way. The actual IO to the disks is issued by ZFS threads, so the scheduler cannot "see" which application is behind an individual IO.
In addition, ZFS has its own scheduler built in, so an IO scheduler below it isn't considered helpful: ZFS already optimizes for low latency by making sure IOs are issued in an order that completes individual requests as fast as possible.
The ZFS scheduler also sorts requests so that synchronous reads complete first and synchronous writes second, followed by asynchronous reads and then asynchronous writes.
These priorities are not super strict, however: the number of concurrent IOs for each of the described classes is tuned up and down based on outstanding requests. The tunable you modified adjusts the limit of how much write data can be cached.
Lowering this value ramps up the number of writer threads earlier, since the thresholds are percentages. In addition, ZFS starts to throttle incoming IO from applications by introducing a sleep time as this cache fills up.
So there are a couple of things you could try to lower the impact of issues you're seeing:
Reduce parallel IO jobs per vdev
Check whether your disks can keep up with the amount of concurrency:
- Open atop as root with atop 1.
- Have a look at the header lines for the disk(s) backing the pool.
- Check how large the IO latency spikes are on them (last value on the right).
If it's often above, say, 15 ms (SSDs) / 50 ms (HDDs), the disk has trouble keeping up with the amount of concurrent IO.
The tunable zfs_vdev_max_active lets you lower the maximum number of concurrent read/write IOs per vdev.
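For example, roughly like this (a sketch assuming the Linux module parameter interface; the value 300 is only an illustration, the default is 1000):
# check and lower the per-vdev concurrency cap
cat /sys/module/zfs/parameters/zfs_vdev_max_active
echo 300 > /sys/module/zfs/parameters/zfs_vdev_max_active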
Earlier throttling and adjusting the delay introduced for throttling
Instead of lowering the maximum amount of "dirty" data for async writes, it's better IMHO to adjust the percentage at which ZFS starts throttling the writes accepted from applications.
The threshold can be configured with zfs_dirty_data_sync_percent.
In addition, zfs_delay_scale needs to be adjusted. It should be set to 1 billion divided by the IOPS your pool can sustain (according to the docs).
It's probably best to base that on a mix of random and sequential IO measured on one disk. Depending on your pool layout, you then need to multiply it (see the example after this list):
- If your pool contains multiple mirrors or raidz vdevs, multiply the single-disk IOPS by the number of mirror or raidz groups.
- If you use multiple disks in a "plain" configuration, just extending the pool over multiple disks, multiply it by the number of disks.
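Putting the two tunables together, a hedged example (the IOPS figure and the percentage are purely hypothetical; measure your own pool):
# assume a single disk measured at ~5,000 mixed IOPS and a pool of two mirror vdevs,
# giving roughly 10,000 IOPS for the pool
echo $((1000000000 / 10000)) > /sys/module/zfs/parameters/zfs_delay_scale
# as suggested above: start throttling/syncing earlier, e.g. at 10% instead of the default 20%
echo 10 > /sys/module/zfs/parameters/zfs_dirty_data_sync_percent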
@ShadowJonathan wrote:
Why hasn't this been escalated as a serious issue? Performance before features imo
It has; there's a feature request open to implement the missing functionality: balancing IO between processes and adding ionice-like levels as well, so background IO can be marked as such.
See #14151
Hello, does this new sync parallelism feature help address this problem?
https://www.phoronix.com/news/OpenZFS-Sync-Parallelism
The cause of this issue is that each fsync() request on a file appends all of that file's async write data not yet written to stable storage to the tail of the ZIL's sync write list. That sync write list is strictly ordered to ensure data consistency, which makes a later small fsync() wait for the previous large one. If both go to the same file, or are intermixed with some metadata operations, there may be no general-case solution, or at least no easy one, due to possible dependencies. If they go to different files, their ZIL writes could potentially be interleaved to implement some QoS policy, but at this moment that has not been done.
@amotin Maybe a long shot, but I think sch_cake's algorithm could be used here:
I think the solution would be to reconsider the way we handle queuing for processes that send too many requests. Traditionally, requests are accepted until the queue hits its limit. But this isn't really a fair approach when some processes use a lot of IO, as the other processes can't skip the long queue, as you explained.
So I think the solution would be to stop queuing new requests much earlier for processes issuing a lot of them, accepting requests in a fair manner, just like sch_cake sends packets for each conversation in a fair manner.
So we would assign a latency target to the queue and freely accept all requests until the latency target is exceeded. Then we start fair queuing, giving each process a chance to issue another request based on the latency "bandwidth" it has already used.
This would require us to keep statistics on how long each request took from start to completion, per process. If we have no data, because the process is new or has not issued that type of request before, we would just accept one request and accept a second one only after the first has been measured.
The system would then track the "cost" (in time) of requests for each pool and thread, roughly bucketed by size for each request type, to prevent large requests from distorting latency predictions based on many smaller ones and to keep read and write requests from being mixed.
To keep the statistics accurate, cache-served reads wouldn't be included, and write-modify requests would probably need a separate statistical category, given their higher cost due to copy-on-write.
ZFS currently distributes sync reads / async reads / sync writes / async writes across different tiers. This could be accommodated too, using sch_cake's QoS method, where upper percentage limits are defined for each category, so if the queue is full only so many requests of each class are accepted.
With a three-tier associative queue, fair queuing could also be maintained across pools, processes and threads, so a process using a lot of threads is still treated fairly compared to other processes.
This would still allow ZFS to process requests in a linear fashion, ensuring data consistency, but would cut down the size of the queues so that a lower latency for issued requests can be maintained.
@RubenKelevra You've lost me pretty quick. ;)
I think the solution would be to reconsider the way we handle queuing for processes that send too many requests. Traditionally, requests are accepted until the queue hits its limit. But this isn't really a fair approach when some processes use a lot of IO, as the other processes can't skip the long queue, as you explained.
It is not so straightforward. Let's say some process A has written 1GB of data but does not care about persistence. ZFS has already throttled its writes at the level of ARC dirty data, but not by much, and that is a different topic. After that, some process B writes 1 byte to the same file but wants it to be persistent. In this situation the ZIL will immediately receive ~1GB of data to write in order to ensure the persistence of that 1-byte write, and there is really no other way to handle it. Just after that, some process C writes 1 byte to another file and also requests persistence. On one hand, this write does not depend on the previous one and could be written to the ZIL immediately. On the other, the ZIL has only one queue, and that queue already holds 1GB of data.
In this situation process A has already completed and we can't penalize it; besides, it didn't do anything bad, so what would we penalize it for? Process B wrote only one byte, but at a cost of 1GB -- it is already disproportionately penalized and we can't help it. Process C could theoretically run in parallel, but since all on-disk ZIL structures are sequential, we would need to redesign some in-memory representations so that we could have multiple queues for multiple files where possible, while still serializing them at certain critical points. It is not so much a question of statistics or precise QoS as of redesigning and complicating internal structures. Maybe some ideas could be taken from https://github.com/openzfs/zfs/pull/12731, but it is quite a big change to grasp, going deep into Intel DCPMM specifics.
ARC-level fairness for per-dataset dirty data seems to me the cheapest (in man-hours) option (a zfs_dirty_data_max_per_dataset tunable or something like that). Redesigning the ZIL is IMHO infeasible.
@amotin well, let's try again.
In my suggestion, once a write has been committed to the ZIL (or to the dataset itself, if sync=disabled), we would consider the write "completed".
So no, the plan wasn't to change how the ZIL works. Instead, I want to redesign how we accept read/write requests.
The idea is that there are certain guarantees around read/write requests: say a process writes to a file and gets the operation reported back as completed; from exactly that moment we need to provide the new data, even if another process asks for it.
But this doesn't apply if we haven't yet acknowledged the sync write.
Which means we can stall writes, and only take in the to-be-written data once the queue is sufficiently empty, as long as we haven't returned the write as completed. The same goes for read requests: they can be stalled as necessary.
So the idea is to stall processes not when the queue is full, like we currently do, but when the latency target is exceeded.
So instead of accepting, say, 1 GB to write just because we have 1 GB of memory assigned as a write buffer, and then struggling to write it out in a timely fashion, we only accept as many writes as we can flush in a timely fashion, leaving other applications a chance to step in and issue write requests as well.
In your example, the 1 GB wouldn't have been accepted as a whole, but only in parts, until we hit a latency target of, say, 100 ms. That part would have been issued to the ZIL, and once it returned, another 100 ms worth would have been issued, and so forth.
If another application needs to write, it can issue a 1-byte write, and instead of waiting for 1 GB to be written to the ZIL, it would take only roughly 100 ms to complete, because process A has used a lot of resources in the past and therefore has a low priority for getting new operations accepted.
So by stalling application requests, we can give more concurrency on top of a linear process; that is basically the same thing sch_cake does, which also deals with a linear process, as the Ethernet wire has no concurrency.
So instead of accepting, say, 1 GB to write just because we have 1 GB of memory assigned as a write buffer, and then struggling to write it out in a timely fashion, we only accept as many writes as we can flush in a timely fashion, leaving other applications a chance to step in and issue write requests as well.
To some degree you are saying a reasonable thing without actually saying it. ZFS async write latency is effectively the TXG commit time. If the pool is faster than the incoming data stream, that latency can be small and everything should already be fine. But the moment your pool is slower, your TXG size grows up to ARC's dirty_data_max, which may take many seconds to flush if your RAM is big and the pool is slow. For async writes we don't care much, but if at that point somebody executes fsync(), it creates a huge spike of ZIL traffic and latency. What we really need to do (and I have thought about it before) is to limit TXG size not only in terms of used memory, but also in terms of the amount of data the pool can write in a reasonable time. Doing that could reduce async write performance on bursty workloads by making an app wait when not strictly necessary, but it would also reduce latency effects like this ZIL one.
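As a back-of-envelope illustration of that sizing idea (numbers purely hypothetical): if the pool sustains roughly 300 MB/s and a couple of seconds of fsync-induced ZIL backlog is tolerable, the dirty data cap would be sized to about 600 MB rather than to a fraction of RAM:
# ~300 MB/s * 2 s of acceptable backlog ≈ 600 MB
echo $((300 * 2 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max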
Is there a workaround to help with this, even if it isn't a full solution? I'm pretty sure I just encountered this issue while rsyncing 100GB of data between datasets on the same SSDs. Firefox crashed and other things ground to a halt, and then when rsync completed everything (eventually) came back.
@ryantrinkle I am not sure why rsync would request sync writes. But if it really does, and if your storage is very slow compared to the amount of RAM, you may check the automatically set dirty_data_max parameter value and possibly override it with a lower one. This reduces ZFS write caching, but also reduces the spikes caused by rare syncs. But I personally hate this kind of tuning.
@amotin I think the point is more that rsync opens and closes files very rapidly. Every written and closed file gets a flush-to-disk request from Linux automatically, which is then converted to a sync operation on the SSD by ZFS, which can be extremely slow.
Modifying dirty_data_max does not help, as the SSD's behavior is the issue here with regard to the delay after an fsync has been requested.
The solution here is to turn sync off in ZFS, to avoid ZFS requesting a large number of syncs from the SSD before returning the writes to rsync.
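That would be something like the following, per dataset (with the usual caveat that sync=disabled means acknowledged sync writes from the last few seconds can be lost on a crash or power failure; the dataset name is the one from the OP, adjust to yours):
zfs set sync=disabled rootpool1/arch1/varcache
zfs get sync rootpool1/arch1/varcache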
I think the point is more that rsync opens and closes files very rapidly
@RubenKelevra Is opening/closing files rapidly such a special use case?
In which way does rsync behave differently compared with other copy tools?
Could you perhaps check how other copy tools behave? cp, tar via pipe, unison, rclone, to mention some...
Does the problem happen there, too?
What type/model of SSD(s) do you use? There are many SSDs which perform poorly with ZFS...
@RubenKelevra Closing a file does not have fsync() semantics, otherwise copying a zillion tiny files with cp to an HDD pool would be a disaster. If, as you say, sync=disabled helps performance, it means the application explicitly calls fsync(). Whether it does that for a good reason is a question for the application. I was assuming so, but who knows.
If, as you say, sync=disabled helps performance, it means the application explicitly calls fsync()
But rsync does not fsync by default; you need to add --fsync. That option has existed since rsync 3.2.4.
# strace -f rsync -av ./test ./test2 2>&1 |grep sync
execve("/usr/bin/rsync", ["rsync", "-av", "./test", "./test2"], 0x7ffde1bd7e40 /* 22 vars */) = 0
getcwd("/root/roland/rsync", 4095) = 19
[pid 1976255] chdir("/root/roland/rsync/.") = 0
[pid 1976256] chdir("/root/roland/rsync/./test2" <unfinished ...>
# grep -ri fsync *
NEWS.md: - Added the [`--fsync`](rsync.1#opt) option (promoted from the patches repo).
options.c:int do_fsync = 0;
options.c: {"fsync", 0, POPT_ARG_NONE, &do_fsync, 0, 0, 0 },
options.c: if (do_fsync)
options.c: args[ac++] = "--fsync";
receiver.c:extern int do_fsync;
receiver.c: if (do_fsync && fd != -1 && fsync(fd) != 0) {
receiver.c: rsyserr(FERROR, errno, "fsync failed on %s", full_fname(fname));
rsync.1.md:--fsync fsync every written file
rsync.1.md:0. `--fsync`
rsync.1.md: Cause the receiving side to fsync each finished file. This may slow down
support/rrsync: 'fsync': 0,
t_stub.c:int do_fsync = 0;
util1.c:extern int do_fsync;
util1.c: if (do_fsync && fsync(ofd) < 0) {
util1.c: rsyserr(FERROR, errno, "fsync failed on %s", full_fname(dest));
root@backupvm1:~/roland/rsync-3.2.3# grep -ri fsync .
root@backupvm1:~/roland/rsync-3.2.3# grep -ri fdatasync .
root@backupvm1:~/roland/rsync-3.2.3#
I can just answer with chunks of ZFS close() code:
static int
zpl_release(struct inode *ip, struct file *filp)
{
        cred_t *cr = CRED();
        int error;
        fstrans_cookie_t cookie;

        cookie = spl_fstrans_mark();
        if (ITOZ(ip)->z_atime_dirty)
                zfs_mark_inode_dirty(ip);

        crhold(cr);
        error = -zfs_close(ip, filp->f_flags, cr);
        spl_fstrans_unmark(cookie);
        crfree(cr);
        ASSERT3S(error, <=, 0);

        return (error);
}
and
int
zfs_close(struct inode *ip, int flag, cred_t *cr)
{
        (void) cr;
        znode_t *zp = ITOZ(ip);
        zfsvfs_t *zfsvfs = ITOZSB(ip);
        int error;

        if ((error = zfs_enter_verify_zp(zfsvfs, zp, FTAG)) != 0)
                return (error);

        /* Decrement the synchronous opens in the znode */
        if (flag & O_SYNC)
                atomic_dec_32(&zp->z_sync_cnt);

        zfs_exit(zfsvfs, FTAG);
        return (0);
}
As you can see, there is nothing really done here.
You may use bpftrace/dtrace/whatever to tap on zil_commit() function to see what is calling it.
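For example, something along these lines (assuming bpftrace is available and the loaded zfs module's symbols are visible to kprobes) counts zil_commit() calls per process name:
bpftrace -e 'kprobe:zil_commit { @commits[comm] = count(); }'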
PS: There is also the system-wide sync() syscall and the respective command.