fio latency report not correct when using rate_iops with null engine
Please acknowledge the following before creating a ticket
- [x] I have read the GitHub issues section of REPORTING-BUGS.
Description of the bug: when rate_iops is used, the avg latency fio reports is not correct. Tested with fio on an Optane SSD P5800, a drive with very stable latency; even TLC NVMe drives show the issue.
Environment: Fedora 33
fio version: fio-3.21
Reproduction steps
Run fio against the Optane SSD or a good TLC NVMe drive with rate_iops set to 300 and then to 6000; you will see that the avg latency at 300 is worse than the avg latency at 6000.
Please check the test results below carefully, paying attention to the avg latency values and the rate_iops setting.
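For reference, the same job expressed as a fio job file (a sketch assembled from the command lines in the output below; the device path and all option values are taken from those commands):

```ini
; mirrors: fio -filename=/dev/nvme2n1 -direct=1 -iodepth=1 -thread \
;   -rw=randwrite -ioengine=libaio -bs=4k -numjobs=1 -runtime=180 \
;   -group_reporting -name=mytest -time_based -rate_iops=300
[mytest]
filename=/dev/nvme2n1
direct=1
iodepth=1
thread
rw=randwrite
ioengine=libaio
bs=4k
numjobs=1
runtime=180
time_based
group_reporting
rate_iops=300
```

Rerun with rate_iops=6000 and compare the avg clat/lat lines between the two runs.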
[root@cyp03 csal-performance-test]# fio -filename=/dev/nvme2n1 -direct=1 -iodepth=1 -thread -rw=randwrite -ioengine=libaio -bs=4k -numjobs=1 -runtime=180 -group_reporting -name=mytest -time_based -rate_iops=300
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.21
Starting 1 thread
Jobs: 1 (f=1), 0-300 IOPS: [w(1)][100.0%][w=1200KiB/s][w=300 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1779481: Sun May 1 19:29:21 2022
write: IOPS=300, BW=1200KiB/s (1229kB/s)(211MiB/180001msec); 0 zone resets
slat (usec): min=8, max=5205, avg=36.09, stdev=22.69
clat (nsec): min=4387, max=72564, avg=26220.23, stdev=6613.11
lat (usec): min=26, max=5218, avg=63.55, stdev=23.44
clat percentiles (nsec):
| 1.00th=[19072], 5.00th=[19584], 10.00th=[21120], 20.00th=[21376],
| 30.00th=[21376], 40.00th=[21632], 50.00th=[21632], 60.00th=[31104],
| 70.00th=[32128], 80.00th=[32384], 90.00th=[32640], 95.00th=[33024],
| 99.00th=[50432], 99.50th=[53504], 99.90th=[62208], 99.95th=[64768],
| 99.99th=[68096]
bw ( KiB/s): min= 1197, max= 1208, per=100.00%, avg=1200.26, stdev= 3.31, samples=359
iops : min= 299, max= 302, avg=299.92, stdev= 0.85, samples=359
lat (usec) : 10=0.20%, 20=5.71%, 50=93.06%, 100=1.02%
cpu : usr=0.89%, sys=2.53%, ctx=75538, majf=0, minf=47019
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,54001,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1200KiB/s (1229kB/s), 1200KiB/s-1200KiB/s (1229kB/s-1229kB/s), io=211MiB (221MB), run=180001-180001msec
Disk stats (read/write):
nvme2n1: ios=50/53926, merge=0/0, ticks=4/1443, in_queue=1448, util=30.47%
[root@cyp03 csal-performance-test]# fio -filename=/dev/nvme2n1 -direct=1 -iodepth=1 -thread -rw=randwrite -ioengine=libaio -bs=4k -numjobs=1 -runtime=180 -group_reporting -name=mytest -time_based -rate_iops=6000
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.21
Starting 1 thread
^Cbs: 1 (f=1), 0-6000 IOPS: [w(1)][5.6%][w=23.4MiB/s][w=6002 IOPS][eta 02m:50s]
fio: terminating on signal 2
mytest: (groupid=0, jobs=1): err= 0: pid=1779554: Sun May 1 19:30:11 2022
write: IOPS=5999, BW=23.4MiB/s (24.6MB/s)(236MiB/10067msec); 0 zone resets
slat (usec): min=6, max=5159, avg=12.28, stdev=21.06
clat (nsec): min=1133, max=865260, avg=21276.93, stdev=9599.16
lat (usec): min=26, max=5172, avg=33.82, stdev=23.12
clat percentiles (usec):
| 1.00th=[ 19], 5.00th=[ 20], 10.00th=[ 20], 20.00th=[ 21],
| 30.00th=[ 21], 40.00th=[ 21], 50.00th=[ 21], 60.00th=[ 21],
| 70.00th=[ 21], 80.00th=[ 21], 90.00th=[ 21], 95.00th=[ 22],
| 99.00th=[ 40], 99.50th=[ 53], 99.90th=[ 196], 99.95th=[ 210],
| 99.99th=[ 241]
bw ( KiB/s): min=24000, max=24048, per=100.00%, avg=24012.35, stdev=11.86, samples=20
iops : min= 6000, max= 6012, avg=6002.75, stdev= 3.09, samples=20
lat (usec) : 2=0.09%, 20=16.52%, 50=82.85%, 100=0.34%, 250=0.20%
lat (usec) : 500=0.01%, 1000=0.01%
cpu : usr=9.67%, sys=16.55%, ctx=119810, majf=0, minf=27947
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,60401,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=236MiB (247MB), run=10067-10067msec
Disk stats (read/write):
nvme2n1: ios=50/59832, merge=0/0, ticks=3/1272, in_queue=1276, util=96.81%
I see now: the slat difference is caused by the rate limit, while clat comes from the SSD + kernel. Since clat is stable, this is correct. Is anything extra being added?
@axboe I can reproduce the unstable-clat problem even with the null engine; could you please take a look at it? Below are the fio null-engine results.
----------- IOPS 300 -----------
root@smfast:~# fio -size=800g -iodepth=1 -thread -rw=randwrite -ioengine=null -bs=4k -numjobs=1 -runtime=180 -group_reporting -name=mytest -time_based -rate_iops=300
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.21
Starting 1 thread
Jobs: 1 (f=1), 0-300 IOPS: [w(1)][100.0%][w=1201KiB/s][w=300 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=245290: Wed May 4 19:13:37 2022
write: IOPS=300, BW=1200KiB/s (1229kB/s)(211MiB/180001msec); 0 zone resets
clat (nsec): min=56, max=32320, avg=870.16, stdev=497.92
lat (nsec): min=157, max=34645, avg=3269.80, stdev=951.14
clat percentiles (nsec):
| 1.00th=[ 764], 5.00th=[ 844], 10.00th=[ 852], 20.00th=[ 852],
| 30.00th=[ 852], 40.00th=[ 860], 50.00th=[ 860], 60.00th=[ 860],
| 70.00th=[ 860], 80.00th=[ 868], 90.00th=[ 892], 95.00th=[ 908],
| 99.00th=[ 924], 99.50th=[ 932], 99.90th=[ 1064], 99.95th=[16192],
| 99.99th=[19328]
bw ( KiB/s): min= 1200, max= 1208, per=100.00%, avg=1201.78, stdev= 3.33, samples=359
iops : min= 300, max= 302, avg=300.45, stdev= 0.83, samples=359
lat (nsec) : 100=0.01%, 250=0.90%, 500=0.02%, 750=0.02%, 1000=98.90%
lat (usec) : 2=0.07%, 20=0.08%, 50=0.01%
cpu : usr=0.56%, sys=0.25%, ctx=54002, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,54001,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1200KiB/s (1229kB/s), 1200KiB/s-1200KiB/s (1229kB/s-1229kB/s), io=211MiB (221MB), run=180001-180001msec
----------- IOPS 6000 -----------
root@smfast:~# fio -size=800g -iodepth=1 -thread -rw=randwrite -ioengine=null -bs=4k -numjobs=1 -runtime=180 -group_reporting -name=mytest -time_based -rate_iops=6000
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.21
Starting 1 thread
Jobs: 1 (f=1), 0-6000 IOPS: [w(1)][100.0%][w=23.4MiB/s][w=6003 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=245380: Wed May 4 19:19:05 2022
write: IOPS=6000, BW=23.4MiB/s (24.6MB/s)(4219MiB/180000msec); 0 zone resets
clat (nsec): min=46, max=3535, avg=66.96, stdev=17.88
lat (nsec): min=100, max=6357, avg=252.70, stdev=67.25
clat percentiles (nsec):
| 1.00th=[ 59], 5.00th=[ 60], 10.00th=[ 60], 20.00th=[ 61],
| 30.00th=[ 61], 40.00th=[ 62], 50.00th=[ 62], 60.00th=[ 71],
| 70.00th=[ 72], 80.00th=[ 72], 90.00th=[ 73], 95.00th=[ 76],
| 99.00th=[ 90], 99.50th=[ 91], 99.90th=[ 96], 99.95th=[ 133],
| 99.99th=[ 161]
bw ( KiB/s): min=24016, max=24040, per=100.00%, avg=24035.23, stdev= 4.11, samples=359
iops : min= 6004, max= 6010, avg=6008.81, stdev= 1.03, samples=359
lat (nsec) : 50=0.01%, 100=99.91%, 250=0.08%, 750=0.01%
lat (usec) : 2=0.01%, 4=0.01%
cpu : usr=3.98%, sys=0.29%, ctx=1079993, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1080001,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=4219MiB (4424MB), run=180000-180000msec
root@smfast:~#
I'm also able to observe large differences in completion latencies with rate_iops=300 vs 6000. I agree that completion latencies for these jobs should be similar, and any waiting time due to rate_iops should not be captured by these latencies.
From the last time I looked at this, my recollection is that timestamps are collected at three points for async ioengines:
a. when an io_u is selected for preparation
b. when the io_u is queued/submitted
c. when the io_u is completed
(b - a) is slat, (c - b) is clat, and (c - a) is lat
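To make the accounting concrete, below is a minimal standalone sketch of those three timestamps and the derived latencies. The field names start_time and issue_time mirror fio's struct io_u, but everything else is simplified and illustrative rather than fio's actual code:

```c
#include <inttypes.h>
#include <stdio.h>
#include <time.h>

/* Sketch of the three timestamps for an async ioengine; field names
 * mirror fio's struct io_u, the rest is illustrative. */
struct io_u_sketch {
	struct timespec start_time; /* (a) io_u selected for preparation */
	struct timespec issue_time; /* (b) io_u queued/submitted */
};

static int64_t ns_between(const struct timespec *s, const struct timespec *e)
{
	return (int64_t)(e->tv_sec - s->tv_sec) * 1000000000LL +
	       (e->tv_nsec - s->tv_nsec);
}

int main(void)
{
	struct io_u_sketch u;
	struct timespec complete; /* (c) io_u completed */

	clock_gettime(CLOCK_MONOTONIC, &u.start_time); /* (a) */
	/* ... io_u preparation: a rate-limiting sleep in this window
	 * inflates slat (and lat) but should never touch clat ... */
	clock_gettime(CLOCK_MONOTONIC, &u.issue_time); /* (b) */
	/* ... I/O in flight ... */
	clock_gettime(CLOCK_MONOTONIC, &complete);     /* (c) */

	printf("slat=%" PRId64 "ns clat=%" PRId64 "ns lat=%" PRId64 "ns\n",
	       ns_between(&u.start_time, &u.issue_time), /* b - a */
	       ns_between(&u.issue_time, &complete),     /* c - b */
	       ns_between(&u.start_time, &complete));    /* c - a */
	return 0;
}
```

Given this model, a clat that scales with rate_iops suggests either a timestamp taken in the wrong place or the rate-limiting wait landing between (b) and (c) on some path.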
io_u.c:rate_ddir() has a usec_sleep() call which appears to be what slows down I/O submission to obtain the desired rate_iops. It is called somewhat early in the preparation of an io_u, certainly before the io_u is actually queued. Since the rate is enforced purely by delaying submission, it is definitely a surprise to see completion latency depend on rate_iops.
This is as much investigation as I have time for now, but it is definitely something we should get to the bottom of. A sketch of the pacing idea follows for whoever picks this up.
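Here is a self-contained sketch of that pacing idea (names and structure are illustrative, not fio's actual rate_ddir()/usec_sleep() code): each I/O has a time at which it is due under the target rate, and the thread sleeps off any head start before issuing it.

```c
#include <inttypes.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Illustrative rate_iops-style pacing loop (not fio's code): I/O
 * number n is due at n / rate_iops seconds after the start, and the
 * thread sleeps until that point before issuing it. */
static int64_t now_us(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

int main(void)
{
	const int64_t rate_iops = 300;
	const int64_t start = now_us();

	for (int64_t issued = 0; issued < 5; issued++) {
		int64_t due = issued * 1000000LL / rate_iops;
		int64_t ahead = due - (now_us() - start);
		if (ahead > 0)
			usleep((useconds_t)ahead); /* analogue of usec_sleep() */
		/* prepare + submit the io_u here: if the sleep above ran
		 * after the (a) timestamp, the wait would be counted as
		 * latency the device never saw */
		printf("io %" PRId64 " issued at %" PRId64 " us\n",
		       issued + 1, now_us() - start);
	}
	return 0;
}
```

At rate_iops=300 each inter-I/O gap is roughly 3333 us, far larger than the completion latencies in any of the runs above, so any leakage of the wait into the latency accounting should be easy to spot once the timestamps are instrumented.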
root@ubuntu:~/fio-dev/fio-canonical# ./fio -size=1g -iodepth=1 -thread -rw=randwrite -ioengine=null -bs=4k -numjobs=1 -runtime=10 -group_reporting -name=mytest -time_based -rate_iops=300
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.30-20-g12db
Starting 1 thread
Jobs: 1 (f=1), 0-300 IOPS: [w(1)][100.0%][w=1201KiB/s][w=300 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=22877: Mon May 9 10:47:40 2022
write: IOPS=299, BW=1200KiB/s (1229kB/s)(11.7MiB/10001msec); 0 zone resets
clat (nsec): min=50, max=827, avg=495.18, stdev=148.99
lat (nsec): min=285, max=15020, avg=2242.98, stdev=711.68
clat percentiles (nsec):
| 1.00th=[ 118], 5.00th=[ 253], 10.00th=[ 270], 20.00th=[ 318],
| 30.00th=[ 338], 40.00th=[ 580], 50.00th=[ 588], 60.00th=[ 588],
| 70.00th=[ 596], 80.00th=[ 604], 90.00th=[ 604], 95.00th=[ 620],
| 99.00th=[ 644], 99.50th=[ 724], 99.90th=[ 780], 99.95th=[ 796],
| 99.99th=[ 828]
bw ( KiB/s): min= 1200, max= 1208, per=100.00%, avg=1201.26, stdev= 3.00, samples=19
iops : min= 300, max= 302, avg=300.32, stdev= 0.75, samples=19
lat (nsec) : 100=0.70%, 250=3.87%, 500=28.53%, 750=66.50%, 1000=0.40%
cpu : usr=0.83%, sys=0.00%, ctx=3000, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,3000,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1200KiB/s (1229kB/s), 1200KiB/s-1200KiB/s (1229kB/s-1229kB/s), io=11.7MiB (12.3MB), run=10001-10001msec
root@ubuntu:~/fio-dev/fio-canonical# ./fio -size=1g -iodepth=1 -thread -rw=randwrite -ioengine=null -bs=4k -numjobs=1 -runtime=10 -group_reporting -name=mytest -time_based -rate_iops=6000
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=null, iodepth=1
fio-3.30-20-g12db
Starting 1 thread
Jobs: 1 (f=1), 0-6000 IOPS: [w(1)][100.0%][w=23.5MiB/s][w=6004 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=22903: Mon May 9 10:48:38 2022
write: IOPS=5999, BW=23.4MiB/s (24.6MB/s)(234MiB/10001msec); 0 zone resets
clat (nsec): min=50, max=14743, avg=157.51, stdev=119.39
lat (nsec): min=132, max=16747, avg=558.40, stdev=270.29
clat percentiles (nsec):
| 1.00th=[ 75], 5.00th=[ 86], 10.00th=[ 94], 20.00th=[ 100],
| 30.00th=[ 107], 40.00th=[ 177], 50.00th=[ 181], 60.00th=[ 181],
| 70.00th=[ 183], 80.00th=[ 187], 90.00th=[ 191], 95.00th=[ 203],
| 99.00th=[ 310], 99.50th=[ 374], 99.90th=[ 532], 99.95th=[ 596],
| 99.99th=[ 4448]
bw ( KiB/s): min=24008, max=24040, per=100.00%, avg=24018.95, stdev= 7.16, samples=19
iops : min= 6002, max= 6010, avg=6004.74, stdev= 1.79, samples=19
lat (nsec) : 100=19.91%, 250=78.23%, 500=1.73%, 750=0.12%, 1000=0.01%
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%
cpu : usr=9.60%, sys=0.00%, ctx=59961, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,60000,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=23.4MiB/s (24.6MB/s), 23.4MiB/s-23.4MiB/s (24.6MB/s-24.6MB/s), io=234MiB (246MB), run=10001-10001msec