sysstat
iostat -x %util statistics on 4.17+ kernels
Environment
Debian stretch, sysstat version 11.6.1 (Debian package version 11.6.1-1)
Description
With kernels 4.17 and 4.18-rc8 I'm seeing unexpected statistics from iostat -x. On AWS EC2 instances (r5/m5.large) using EBS-backed volumes, small writes spike %util to 40-80%. On kernel 4.16 and earlier, similar writes would show as zero or single digits.
For example, on kernel 4.9 running "iostat 1 -x" I see statistics like this:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 2.51 0.00 97.49
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 7.00 130.00 28.00 25060.00 0.00 3.00 0.00 2.26 2.86 13.66 1.80 4.00 192.77 0.50 6.80
On the same instance with the newer kernels I instead see spikes like this in %util:
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.50 0.00 0.00 99.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 2.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.84 4.00 0.00 418.00 83.60
Device performance seems to be the same regardless of the kernel used.
I went through the kernel changes between 4.16 and 4.17 and found one commit that touches the statistics. Reverting it seems to make iostat report %util like it did before, but I haven't done extensive testing with this.
https://www.mail-archive.com/[email protected]/msg20916.html
I'm not familiar enough with the internals to say where the problem is, or whether this is actually expected. It also seems that not all instance types show this behaviour, so it could have something to do with the new EBS volumes that appear as NVMe devices.
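For anyone who wants to rule out sysstat itself, below is a minimal standalone sketch (written for this comment, not part of sysstat; the device name nvme1n1 and the 1-second interval are just example values). It samples the "time spent doing I/Os" field of /proc/diskstats twice and derives a %util figure from it the same way iostat does, so you can check whether the jump comes from the kernel counter itself rather than from iostat's post-processing.

/* Standalone sketch (not sysstat code): read the io_ticks field
 * ("time spent doing I/Os", in ms) for one device from /proc/diskstats,
 * sleep one second, read it again, and print the implied %util.
 * The device name below is an example; adjust as needed. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long long read_io_ticks(const char *dev)
{
    FILE *f = fopen("/proc/diskstats", "r");
    char line[512];
    long long result = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned int major, minor;
        char name[64];
        unsigned long long rd_ios, rd_merges, rd_sec, rd_ms;
        unsigned long long wr_ios, wr_merges, wr_sec, wr_ms;
        unsigned long long in_flight, io_ticks, weighted_ms;

        /* Fields 1-11 of the per-device line; newer kernels append
         * discard/flush fields, which can be ignored here. */
        if (sscanf(line,
                   "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &major, &minor, name,
                   &rd_ios, &rd_merges, &rd_sec, &rd_ms,
                   &wr_ios, &wr_merges, &wr_sec, &wr_ms,
                   &in_flight, &io_ticks, &weighted_ms) == 14 &&
            strcmp(name, dev) == 0) {
            result = (long long)io_ticks;
            break;
        }
    }
    fclose(f);
    return result;
}

int main(void)
{
    const char *dev = "nvme1n1";   /* example device name */
    long long t1, t2;

    t1 = read_io_ticks(dev);
    sleep(1);                      /* 1000 ms sampling interval */
    t2 = read_io_ticks(dev);

    if (t1 < 0 || t2 < 0) {
        fprintf(stderr, "device %s not found in /proc/diskstats\n", dev);
        return 1;
    }
    /* %util = busy time during the interval / length of the interval */
    printf("%s: io_ticks delta = %lld ms over 1000 ms -> %%util ~ %.1f%%\n",
           dev, t2 - t1, 100.0 * (double)(t2 - t1) / 1000.0);
    return 0;
}

Compile with e.g. cc -o diskutil diskutil.c and run it while the workload is active; if the printed delta is already close to the full interval, the kernel's io_ticks counter itself is what changed between kernel versions.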
This looks rather similar to an issue that appeared back in RHEL 6.5, some three years ago:
- Red Hat - iostat output is incorrect for nvme SSD disks "...iostat output for nvme disks shows high device utilization and svctm > await time for no or small load conditions..."
- kernel.org - block: fix non-atomic access to genhd inflight structures "After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically."
Any update on this? How can I get correct iostat values for an NVMe device with the latest kernels (4.20+ and 5.0)?
It seems it is still an issue in 4.19 ... my NVMe devices show almost 100% %util while writing just a few bytes. Munin also shows the misleading values, see munin-monitoring/munin#1119.
Is there any progress on this thread? I am facing the same issue on kernel 4.18.0-147.el8.x86_64 (RHEL 8.2).
FYI, the kernel commit mentioned in the original message seems to be reverted in 5.10:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a926c7afffcc0f2e35e6acbccb16921bacf34617
Hey team,
We are still seeing the same behavior on kernel 5.10. Is there any update on this thread?
- After further research and trials, it looks like there are two commits that caused the increase in the disk I/O utilization reported by iostat, which I am listing below:
- blk-mq: count allocated but not started requests in iostats inflight: this caused the uptick in reported disk utilization seen between kernel 4.16 and 4.17 and persisting through 4.19.
- block: delete part_round_stats and switch to less precise counting: this has been merged from kernel 5.0 onwards and its main target was I/O performance. The function part_round_stats (deleted in this commit) needs the in_flight counter every jiffy, and it would be too costly to sum all the percpu variables every jiffy, so it had to be removed. part_round_stats was used to calculate two counters, time_in_queue and io_ticks. With that patch, io_ticks is instead approximated by increasing the value whenever an I/O is started or ended and the jiffies value has changed; if the I/Os take less than a jiffy, the value is as exact as the previously calculated one (see the sketch after this list).
- Looking at the previously mentioned patch, it's clear that the new update_io_ticks function, which counts the io_ticks used to calculate disk utilization, does not increment only when there is in-flight I/O: checking that would require adding the blk_mq_queue_inflight() call back into the I/O path, which brings a cost that turns out to be visible and can cause soft lockups. So here we are trading I/O utilization accuracy for performance.
- It looks like there was a proposed patch for this behavior, [PATCH v5 0/2] fix inaccurate io_ticks block: fix inaccurate io_ticks, but it went nowhere because of the possible performance cost of checking for in-flight I/O, in particular on devices with a high hardware-queue count and on systems with many CPU cores, as in the scenario given on the list:
1) one device has 256 hw queues, and the system has 256 cpu cores, and
each hw queue's depth is 1k.
2) there isn't any io load on CPUs(0 ~ 254)
3) heavy io load is run on CPU 255
So with the proposed patch the code still needs to iterate over hw queues 0 to 254, and that load isn't something which can be ignored, especially since it is only there for I/O accounting.
- I have tried applying block: fix inaccurate io_ticks on the latest kernel 5.10 and I get somewhat lower disk utilization, but it is still higher than what I got on kernels before 4.17.
Before applying the patch:
[root@ip-172-31-14-152 ec2-user]# fio --name=test --filename=/dev/nvme1n1 --rw=randwrite --direct=1 --ioengine libaio --bs 8k --runtime=60
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.04 0.00 0.00 99.94
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 844.00 0.00 6.59 16.00 0.93 1.10 0.00 1.10 1.18 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.06 0.00 0.02 99.90
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 860.00 0.00 6.72 16.00 0.93 1.08 0.00 1.08 1.16 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.02 0.00 0.00 99.94
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 855.00 0.00 6.68 16.00 0.93 1.09 0.00 1.09 1.17 100.00
After applying the patch:
[root@ip-172-31-14-152 ec2-user]# fio --name=test --filename=/dev/nvme1n1 --rw=randwrite --direct=1 --ioengine libaio --bs 8k --runtime=60
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 885.00 0.00 6.91 16.00 0.96 1.09 0.00 1.09 1.10 97.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.04 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 863.00 0.00 6.74 16.00 0.97 1.12 0.00 1.12 1.10 95.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.08 0.00 0.00 99.90
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 869.00 0.00 6.79 16.00 0.96 1.11 0.00 1.11 1.08 93.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.06 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 869.00 0.00 6.79 16.00 0.96 1.11 0.00 1.11 1.10 95.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.04 0.00 0.00 99.92
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 877.00 0.00 6.85 16.00 0.96 1.10 0.00 1.10 1.08 94.40
- Red Hat has also chosen to keep the block: delete part_round_stats and switch to less precise counting patch, as mentioned in https://access.redhat.com/solutions/6257421, to avoid the potential performance degradation and soft lockups.
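To make the accuracy-for-performance trade-off above concrete, here is a heavily simplified, self-contained sketch of the jiffy-granularity accounting described in the part_round_stats bullet (an illustration written for this thread, not the actual blk-mq code): any jiffy in which at least one request starts or completes is charged as fully busy, with no check of whether I/O was actually in flight in between, so a light but steady trickle of sub-jiffy writes still reads as 100% %util.

/* Simplified illustration (not the real kernel code) of the io_ticks
 * approximation introduced by "block: delete part_round_stats and switch
 * to less precise counting": instead of summing per-cpu in_flight counters
 * every jiffy, the current jiffy is charged as busy whenever a request is
 * issued or completed in a jiffy different from the last recorded one. */
#include <stdio.h>

struct disk_stats_sketch {
    unsigned long stamp;     /* jiffy of the last io_ticks update   */
    unsigned long io_ticks;  /* "time spent doing I/Os", in jiffies */
};

/* Called on request issue and on request completion. */
static void update_io_ticks_sketch(struct disk_stats_sketch *d, unsigned long now)
{
    if (d->stamp != now) {
        d->stamp = now;
        d->io_ticks += 1;   /* charge the whole jiffy, however short the I/O */
    }
}

int main(void)
{
    struct disk_stats_sketch d = { 0, 0 };
    unsigned long jiffy;

    /* Simulate one 0.1 ms write per 1 ms jiffy for one second: the device
     * is busy ~10% of the time, but every jiffy sees an issue and a
     * completion, so every jiffy gets charged. */
    for (jiffy = 1; jiffy <= 1000; jiffy++) {
        update_io_ticks_sketch(&d, jiffy);  /* request issued    */
        update_io_ticks_sketch(&d, jiffy);  /* request completed */
    }

    /* %util over the 1000-jiffy window, as iostat would compute it. */
    printf("io_ticks = %lu of 1000 jiffies -> %%util = %.1f%%\n",
           d.io_ticks, 100.0 * d.io_ticks / 1000.0);
    return 0;
}

Running this prints io_ticks = 1000 of 1000 jiffies even though the simulated device was busy for only about a tenth of the interval, which matches the pattern seen in the iostat samples in this thread.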
Kernel 5.18.0-0.deb11.4-cloud-arm64, EC2 r6gd.4xlarge.
13k IOPS and 100% %util:
sysbench oltp_read_only --tables=10 --table_size=10000000 --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=test --mysql-db=sbtest --threads=50 --time=600 run
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 13526.00 216416.00 0.00 0.00 0.10 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.40 100.00
Here just 4k IOPS and still 100% %util:
sysbench oltp_read_only --tables=10 --table_size=10000000 --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=test --mysql-db=sbtest --threads=5 --time=600 run
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 4380.00 70080.00 0.00 0.00 0.10 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.43 100.00