sysstat
iostat -x %util statistics on 4.17+ kernels
Environment
Debian stretch, sysstat version 11.6.1 (Debian package version 11.6.1-1)
Description
With kernels 4.17 and 4.18-rc8 I'm seeing unexpected statistics from iostat -x. On AWS EC2 instances (r5/m5.large) using EBS-backed volumes, small writes spike %util to 40-80%. On kernel 4.16 and earlier, similar writes would show as zero or single digits.
For example, on kernel 4.9 running "iostat 1 -x" I see statistics like this:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 2.51 0.00 97.49
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 7.00 130.00 28.00 25060.00 0.00 3.00 0.00 2.26 2.86 13.66 1.80 4.00 192.77 0.50 6.80
On the same instance with the newer kernels I instead see spikes like this in %util:
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.50 0.00 0.00 99.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 2.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.84 4.00 0.00 418.00 83.60
Device performance seems to be the same regardless of the kernel used.
I went through the kernel changes between 4.16 and 4.17 and found one commit that touches the statistics. Reverting it seems to make iostat report %util like it did before, but I haven't done extensive testing with this.
https://www.mail-archive.com/[email protected]/msg20916.html
I'm not familiar enough with the internals to say where the problem is, or whether this is actually expected. It also seems that not all instance types show this behaviour, so it could have something to do with the new EBS volumes that appear as NVMe devices.
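For anyone who wants to rule out sysstat itself, below is a minimal standalone sketch (written for this comment, not part of sysstat; the device name nvme1n1 and the 1-second interval are just example values). It samples the "time spent doing I/Os" field of /proc/diskstats twice and derives a %util figure from it the same way iostat does, so you can check whether the jump comes from the kernel counter itself rather than from iostat's post-processing.

/* Standalone sketch (not sysstat code): read the io_ticks field
 * ("time spent doing I/Os", in ms) for one device from /proc/diskstats,
 * sleep one second, read it again, and print the implied %util.
 * The device name below is an example; adjust as needed. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long long read_io_ticks(const char *dev)
{
    FILE *f = fopen("/proc/diskstats", "r");
    char line[512];
    long long result = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned int major, minor;
        char name[64];
        unsigned long long rd_ios, rd_merges, rd_sec, rd_ms;
        unsigned long long wr_ios, wr_merges, wr_sec, wr_ms;
        unsigned long long in_flight, io_ticks, weighted_ms;

        /* Fields 1-11 of the per-device line; newer kernels append
         * discard/flush fields, which can be ignored here. */
        if (sscanf(line,
                   "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &major, &minor, name,
                   &rd_ios, &rd_merges, &rd_sec, &rd_ms,
                   &wr_ios, &wr_merges, &wr_sec, &wr_ms,
                   &in_flight, &io_ticks, &weighted_ms) == 14 &&
            strcmp(name, dev) == 0) {
            result = (long long)io_ticks;
            break;
        }
    }
    fclose(f);
    return result;
}

int main(void)
{
    const char *dev = "nvme1n1";   /* example device name */
    long long t1, t2;

    t1 = read_io_ticks(dev);
    sleep(1);                      /* 1000 ms sampling interval */
    t2 = read_io_ticks(dev);

    if (t1 < 0 || t2 < 0) {
        fprintf(stderr, "device %s not found in /proc/diskstats\n", dev);
        return 1;
    }
    /* %util = busy time during the interval / length of the interval */
    printf("%s: io_ticks delta = %lld ms over 1000 ms -> %%util ~ %.1f%%\n",
           dev, t2 - t1, 100.0 * (double)(t2 - t1) / 1000.0);
    return 0;
}

Compile with e.g. cc -o diskutil diskutil.c and run it while the workload is active; if the printed delta is already close to the full interval, the kernel's io_ticks counter itself is what changed between kernel versions.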
This looks rather similar to an issue that appeared back in RHEL 6.5, some three years ago:
- Red Hat - iostat output is incorrect for nvme SSD disks "...iostat output for nvme disks shows high device utilization and svctm > await time for no or small load conditions..."
- kernel.org - block: fix non-atomic access to genhd inflight structures "After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically."
Any update on this? How can I get correct iostat values for an NVMe device with the latest kernels (4.20+ and 5.0)?
It seems it is still an issue in 4.19 ... my NVMe devices show almost 100% %util while writing just a few bytes. Munin also shows the misleading values, see munin-monitoring/munin#1119.
Is there any progress on this thread? I am facing the same issue on kernel 4.18.0-147.el8.x86_64 (RHEL 8.2).
FYI, the kernel commit mentioned in the original message seems to be reverted in 5.10:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a926c7afffcc0f2e35e6acbccb16921bacf34617
Hey team,
We are still seeing the same behavior on kernel 5.10. Is there any update on this thread?
- After further research and trials, it looks like there are two commits that caused the increase in the disk I/O utilization reported by iostat, which I am listing below:
- blk-mq: count allocated but not started requests in iostats inflight: this caused the uptick in reported disk utilization seen between kernel 4.16 and 4.17 and persisting through 4.19.
- block: delete part_round_stats and switch to less precise counting: this has been merged from kernel 5.0 onwards and its main target was I/O performance. The function part_round_stats (deleted in this commit) needs the in_flight counter every jiffy, and it would be too costly to sum all the percpu variables every jiffy, so it had to be removed. part_round_stats was used to calculate two counters, time_in_queue and io_ticks. With that patch, io_ticks is instead approximated by increasing the value whenever an I/O is started or ended and the jiffies value has changed; if the I/Os take less than a jiffy, the value is as exact as the previously calculated one (see the sketch after this list).
- Looking at the previously mentioned patch, it's clear that the new update_io_ticks function, which counts the io_ticks used to calculate disk utilization, does not increment only when there is in-flight I/O: checking that would require adding the blk_mq_queue_inflight() call back into the I/O path, which brings a cost that turns out to be visible and can cause soft lockups. So here we are trading I/O utilization accuracy for performance.
- It looks like there was a proposed patch for this behavior, [PATCH v5 0/2] fix inaccurate io_ticks block: fix inaccurate io_ticks, but it went nowhere because of the possible performance cost of checking for in-flight I/O, in particular on devices with a high hardware-queue count and on systems with many CPU cores, as in the scenario given on the list:
1) one device has 256 hw queues, and the system has 256 cpu cores, and
each hw queue's depth is 1k.
2) there isn't any io load on CPUs(0 ~ 254)
3) heavy io load is run on CPU 255
So with the proposed patch the code still needs to iterate over hw queues 0 to 254, and that load isn't something which can be ignored, especially since it is only there for I/O accounting.
- I have tried applying block: fix inaccurate io_ticks on the latest kernel 5.10 and I get somewhat lower disk utilization, but it is still higher than what I got on kernels before 4.17.
Before applying the patch:
[root@ip-172-31-14-152 ec2-user]# fio --name=test --filename=/dev/nvme1n1 --rw=randwrite --direct=1 --ioengine libaio --bs 8k --runtime=60
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.04 0.00 0.00 99.94
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 844.00 0.00 6.59 16.00 0.93 1.10 0.00 1.10 1.18 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.06 0.00 0.02 99.90
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 860.00 0.00 6.72 16.00 0.93 1.08 0.00 1.08 1.16 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.02 0.00 0.00 99.94
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 855.00 0.00 6.68 16.00 0.93 1.09 0.00 1.09 1.17 100.00
After applying the patch:
[root@ip-172-31-14-152 ec2-user]# fio --name=test --filename=/dev/nvme1n1 --rw=randwrite --direct=1 --ioengine libaio --bs 8k --runtime=60
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 885.00 0.00 6.91 16.00 0.96 1.09 0.00 1.09 1.10 97.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.04 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 863.00 0.00 6.74 16.00 0.97 1.12 0.00 1.12 1.10 95.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.08 0.00 0.00 99.90
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 869.00 0.00 6.79 16.00 0.96 1.11 0.00 1.11 1.08 93.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.06 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 869.00 0.00 6.79 16.00 0.96 1.11 0.00 1.11 1.10 95.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.04 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 877.00 0.00 6.85 16.00 0.96 1.10 0.00 1.10 1.08 94.40
- Red Hat has also chosen to keep the block: delete part_round_stats and switch to less precise counting patch, as mentioned in https://access.redhat.com/solutions/6257421, to avoid the potential performance degradation and soft lockups.
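To make the accuracy-for-performance trade-off above concrete, here is a heavily simplified, self-contained sketch of the jiffy-granularity accounting described in the part_round_stats bullet (an illustration written for this thread, not the actual blk-mq code): any jiffy in which at least one request starts or completes is charged as fully busy, with no check of whether I/O was actually in flight in between, so a light but steady trickle of sub-jiffy writes still reads as 100% %util.

/* Simplified illustration (not the real kernel code) of the io_ticks
 * approximation introduced by "block: delete part_round_stats and switch
 * to less precise counting": instead of summing per-cpu in_flight counters
 * every jiffy, the current jiffy is charged as busy whenever a request is
 * issued or completed in a jiffy different from the last recorded one. */
#include <stdio.h>

struct disk_stats_sketch {
    unsigned long stamp;     /* jiffy of the last io_ticks update   */
    unsigned long io_ticks;  /* "time spent doing I/Os", in jiffies */
};

/* Called on request issue and on request completion. */
static void update_io_ticks_sketch(struct disk_stats_sketch *d, unsigned long now)
{
    if (d->stamp != now) {
        d->stamp = now;
        d->io_ticks += 1;   /* charge the whole jiffy, however short the I/O */
    }
}

int main(void)
{
    struct disk_stats_sketch d = { 0, 0 };
    unsigned long jiffy;

    /* Simulate one 0.1 ms write per 1 ms jiffy for one second: the device
     * is busy ~10% of the time, but every jiffy sees an issue and a
     * completion, so every jiffy gets charged. */
    for (jiffy = 1; jiffy <= 1000; jiffy++) {
        update_io_ticks_sketch(&d, jiffy);  /* request issued    */
        update_io_ticks_sketch(&d, jiffy);  /* request completed */
    }

    /* %util over the 1000-jiffy window, as iostat would compute it. */
    printf("io_ticks = %lu of 1000 jiffies -> %%util = %.1f%%\n",
           d.io_ticks, 100.0 * d.io_ticks / 1000.0);
    return 0;
}

Running this prints io_ticks = 1000 of 1000 jiffies even though the simulated device was busy for only about a tenth of the interval, which matches the pattern seen in the iostat samples in this thread.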
Kernel 5.18.0-0.deb11.4-cloud-arm64, EC2 r6gd.4xlarge.
13k IOPS and 100% %util:
sysbench oltp_read_only --tables=10 --table_size=10000000 --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=test --mysql-db=sbtest --threads=50 --time=600 run
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 13526.00 216416.00 0.00 0.00 0.10 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.40 100.00
Here just 4k IOPS and still 100% %util:
sysbench oltp_read_only --tables=10 --table_size=10000000 --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=test --mysql-db=sbtest --threads=5 --time=600 run
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 4380.00 70080.00 0.00 0.00 0.10 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.43 100.00