Direct IO Support
Adding O_DIRECT support to ZFS.
Motivation and Context
By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes. There are certain cases where caching data in the ARC can decrease overall performance; in particular, zpools composed of NVMe devices showed poor read/write performance due to the extra overhead of the memcpy's into the ARC.
There are also cases where caching in the ARC may not make sense, such as when the data will not be referenced again. By using the O_DIRECT flag, these unnecessary copies into the ARC can be avoided.
Closes Issue: https://github.com/zfsonlinux/zfs/issues/8381
Description
O_DIRECT support in ZFS will always ensure coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, see the same file contents at all times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to the VDEV disks, metadata will not be synced until the associated TXG is synced.
For both O_DIRECT reads and writes, the offset and request size must, at a minimum, be PAGE_SIZE aligned. If they are not, EINVAL is returned, unless the direct property is set to always.
For O_DIRECT writes: The request must also be block aligned (recordsize), or the write request will take the normal (buffered) write path. If the request is block aligned and a cached copy of the buffer exists in the ARC, it will be discarded from the ARC, forcing all further reads to retrieve the data from disk.
For O_DIRECT reads: The only alignment restriction is PAGE_SIZE alignment. If the requested data is already buffered (in the ARC), it is simply copied from the ARC into the user buffer.
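To make the alignment rules concrete, here is a minimal sketch using dd; the dataset path /tank/ds is a placeholder, and it assumes direct=standard with the default 128k recordsize:

```sh
# bs=1M is a multiple of both PAGE_SIZE (4k) and the 128k recordsize, so the
# write stays on the direct path and the read only needs page alignment.
dd if=/dev/zero of=/tank/ds/file bs=1M count=64 oflag=direct
dd if=/tank/ds/file of=/dev/null bs=1M iflag=direct

# A request size that is not a multiple of PAGE_SIZE (e.g. bs=1000) should
# fail with EINVAL under direct=standard, per the rules above.
dd if=/dev/zero of=/tank/ds/file bs=1000 count=1 oflag=direct
```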
To ensure data integrity for all data written using O_DIRECT, the user pages are made stable whenever one of the following is required:
- Checksum
- Compression
- Encryption
- Parity
By making the user pages stable, we make sure the contents of the user-provided buffer cannot be changed after any of the above operations have taken place.
A new dataset property, direct, has been added with the following 3 allowable values:
- disabled - Accepts the O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.
- standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used.
- always - Treats every write/read IO request as though it passed O_DIRECT. In the event the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.
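For example, the property can be set and inspected with the standard zfs commands (the dataset name tank/ds is a placeholder):

```sh
zfs set direct=standard tank/ds   # honor O_DIRECT, subject to the alignment rules above
zfs get direct tank/ds            # reports disabled | standard | always
```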
Direct IO does not bypass the ZIO pipeline, so all checksums, compression, etc. are still all supported with Direct IO.
Some issues that still need to be addressed:
- [ ] Create ZTS tests for O_DIRECT
- [ ] Possibly allow for DVA throttle with O_DIRECT writes
- [ ] Further testing/verification on FreeBSD (the majority of debugging has been on Linux)
- [ ] Possibly allow for O_DIRECT with zvols
- [ ] Address race conditions in dbuf code with O_DIRECT
How Has This Been Tested?
Testing was primarily done using FIO and XDD with striped, mirror, raidz, and dRAID VDEV zpools.
Tests were performed on CentOS using various kernels, including 3.10, 4.18, and 4.20.
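As a reference point, here is a hedged sketch of the kind of sequential fio throughput test described above; the file path, size, and job count are placeholders, and the working set should be much larger than the ARC for the direct path to show its benefit:

```sh
fio --name=seqwrite --filename=/tank/ds/file --rw=write --blocksize=1m \
    --size=256G --ioengine=sync --direct=1 --numjobs=8 --group_reporting
fio --name=seqread --filename=/tank/ds/file --rw=read --blocksize=1m \
    --size=256G --ioengine=sync --direct=1 --numjobs=8 --group_reporting
```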
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the ZFS on Linux code style requirements.
- [x] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
Codecov Report
Attention: Patch coverage is 63.17044% with 309 lines in your changes missing coverage. Please review.
Project coverage is 61.94%. Comparing base (161ed82) to head (04e3a35). Report is 2456 commits behind head on master.
:exclamation: Current head 04e3a35 differs from pull request most recent head a83e237. Consider uploading reports for the commit a83e237 to get more accurate results.
| Files | Patch % | Lines |
|---|---|---|
| module/zfs/dmu.c | 51.01% | 265 Missing :warning: |
| module/os/linux/zfs/abd.c | 88.30% | 20 Missing :warning: |
| module/zfs/dbuf.c | 75.71% | 17 Missing :warning: |
| lib/libzpool/kernel.c | 0.00% | 5 Missing :warning: |
| include/sys/abd.h | 50.00% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #10018 +/- ##
===========================================
- Coverage 75.17% 61.94% -13.24%
===========================================
Files 402 260 -142
Lines 128071 73582 -54489
===========================================
- Hits 96283 45578 -50705
+ Misses 31788 28004 -3784
| Flag | Coverage Δ | |
|---|---|---|
| kernel | 51.01% <43.78%> (-27.75%) :arrow_down: | |
| user | 59.10% <59.33%> (+11.67%) :arrow_up: | |
Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out.
With regards to zio_dva_throttle() and performance, I'd like to point to an older PR here: https://github.com/openzfs/zfs/pull/7560 - so it looks like skipping it might have some justification. Ideally IMHO it'd be best to leave it up to the user (ie. configurable).
I'm excited about this PR, it looks to be a solid basis for support of .splice_read()/.splice_write() in order to support IO to/from pipes. I was looking at it this week because of https://github.com/vpsfreecz/linux/commit/1a980b8cbf0059a5308eea61522f232fd03002e2 - with OverlayFS on top of ZFS, this patch makes all apps using sendfile(2) go tra-la. Issue about that one: https://github.com/openzfs/zfs/issues/1156
page.h is a really generic name. Could you please call it zfs_page.h?
Does zfs 2.0 have support for Direct I/O?
From last week's OpenZFS Leadership Meeting notes:
Status of specific PRs: DirectIO - Brian Atkinson is still working on it. We expect it to be updated soon, at which point we'll need reviewers.
FWIW I'm happy to review, to the extent of my ability.
(Summary of a private Slack discussion)
zfs_log_write currently re-reads the O_DIRECTly written block back from disk if the WR_* selection logic decides it's going to be a WR_COPIED record. The performance impact is quite significant:
for i in 1; do fio --rw=randwrite --bs=4k --filename_format '/dut/ds$jobnum/benchmark' --name=foo --time_based --size=4G --group_reporting=1 --sync=1 --runtime=30s --numjobs=$i --direct=1; done
...
./funclatency_specialized -T -i 3 zfs_write,zfs_log_write,dmu_read_by_dnode zfs_write,dmu_write_uio_dbuf -S
STACKFUNC PROBE AVG COUNT SUM
0,0 zfs_write 141386.90 7265 1027175834
0,1 zfs_log_write 87372.27 7265 634759549
0,2 dmu_read_by_dnode 85745.36 7265 622940020
STACKFUNC PROBE AVG COUNT SUM
1,0 zfs_write 142031.61 7265 1031859636
1,1 dmu_write_uio_dbuf 41214.84 7265 299425788
On this pool of Micron_7300_MTFDHBA960TDF devices, the data write itself (dmu_write_uio_dbuf) takes ~41us while zfs_log_write takes ~87us.
Proposal: If a write was written directly we should always log this write as a WR_INDIRECT record with lr_bp= the block pointer produced by dmu_write_direct().
Implementation v1:
- Add an additional argument to zfs_log_write that forces the selection algorithm to use WR_INDIRECT.
- Set this flag iff O_DIRECT.
- => the dmu_sync() call will re-use the already written block pointer because it's in the dr.
- (This is @problame's interpretation of the chat log.)
Implementation v2 (long-term):
- Break up zfs_log_write so ITXs can be allocated and filled in the zfs_write() copy loop.
- Add a facility to bubble up the block pointer that is produced by dmu_write_direct up to the zfs_write copy loop.
- => for each bubbled-up block pointer allocate a WR_INDIRECT ITX.
- After the copy loop, at the location where we currently call zfs_log_write, assign all the ITXs we just created.
I have a PoC for the zfs_log_write breakup ready. It was developed to avoid the dmu_read() overhead for WR_COPIED records but can be generalized to this use case as well. I won't have time to iterate on it for 2 weeks though, so @bwatkinson is likely going to implement v1, merge this PR and move the proposal for v2 into a separate issue.
zfs_log_write implementation v1:
This has been implemented in the updated PR. For O_DIRECT writes WR_INDIRECT log records are now always used and the block pointer is stored in the log record without any re-read.
It would be nice to funnel the ioflag through to dmu_sync so that it can assert that IMPLY(O_DIRECT was set, the bp came from the dr).
I looked in to doing exactly this, since it would be nice, but in practice it ended up being a pretty invasive change which didn't seem worthwhile in the end.
FWIW I gave this PR a spin (well, full disclosure: Brian's direct_page_aligned.wip.3) for a day or two and it appeared to work okay. No explosions. Mostly non-direct-IO workloads, though I did some noodling with dd iflag=direct/oflag=direct and some C O_DIRECT hacks.
@bwatkinson - what is the current state of this PR? Would you consider it safe for testing on semi-production systems (real data, but can be replaced)?
So the PR is still in a WIP state. It is up to date with master as of Friday (Nov. 5th). I think it is safe for experimenting with, but really only for experimentation at this point. There are a few known bugs that we are sorting through at the moment.
Hi all,
Is there any update on this PR? Is there any estimate of its merge date?
I decided to give this branch a test since I have a bunch of NVME pools and would love some performance increase.
I ran a read test with my 10 disk nvme pool and noticed a pretty dramatic difference with direct=disabled vs always.
I have 3 machines with this spec that I won't be using until some networking is in place and am available to run tests.
Enabled:
[root@ac-1f-6b-a5-ab-ea bar]# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 BTLN902103WV3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme1n1 BTLN902005083P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme2n1 BTLN9050021N3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme3n1 BTLN907504P03P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme4n1 BTLN902101103P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme5n1 BTLN905001BD3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme6n1 BTLN902004DJ3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme7n1 BTLN907504N03P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme8n1 BTLN9050027H3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
/dev/nvme9n1 BTLN85110HDJ3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170
[root@ac-1f-6b-a5-ab-ea bar]# zpool status
pool: test
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
nvme5n1 ONLINE 0 0 0
nvme6n1 ONLINE 0 0 0
nvme8n1 ONLINE 0 0 0
nvme9n1 ONLINE 0 0 0
errors: No known data errors
[root@ac-1f-6b-a5-ab-ea bar]# zfs get direct test/bar
NAME PROPERTY VALUE SOURCE
test/bar direct always local
[root@ac-1f-6b-a5-ab-ea bar]# zpool --version
zfs-2.1.99-1310_g7ac3b7ae9
zfs-kmod-2.1.99-1310_g7ac3b7ae9
[root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting
benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32
...
fio-3.19
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=140MiB/s][r=35.9k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=1289094: Fri Jul 22 18:24:47 2022
read: IOPS=50.5k, BW=197MiB/s (207MB/s)(5914MiB/30001msec)
clat (nsec): min=1772, max=4948.3k, avg=78834.03, stdev=112800.83
lat (nsec): min=1810, max=4948.4k, avg=78884.69, stdev=112806.53
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3],
| 30.00th=[ 4], 40.00th=[ 76], 50.00th=[ 86], 60.00th=[ 92],
| 70.00th=[ 100], 80.00th=[ 111], 90.00th=[ 130], 95.00th=[ 223],
| 99.00th=[ 359], 99.50th=[ 424], 99.90th=[ 1037], 99.95th=[ 2835],
| 99.99th=[ 3425]
bw ( KiB/s): min=107584, max=666672, per=100.00%, avg=202896.98, stdev=23198.14, samples=236
iops : min=26896, max=166668, avg=50724.19, stdev=5799.55, samples=236
lat (usec) : 2=0.40%, 4=33.20%, 10=3.36%, 20=0.11%, 50=0.01%
lat (usec) : 100=32.59%, 250=25.74%, 500=4.22%, 750=0.16%, 1000=0.10%
lat (msec) : 2=0.06%, 4=0.06%, 10=0.01%
cpu : usr=0.98%, sys=13.18%, ctx=952786, majf=0, minf=78
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1513906,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=197MiB/s (207MB/s), 197MiB/s-197MiB/s (207MB/s-207MB/s), io=5914MiB (6201MB), run=30001-30001msec
Disabled:
[root@ac-1f-6b-a5-ab-ea bar]# zfs set direct=disabled test/bar
[root@ac-1f-6b-a5-ab-ea bar]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done
[root@ac-1f-6b-a5-ab-ea bar]# rm -rf test
[root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting
benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32
...
fio-3.19
Starting 4 processes
benchmark: Laying out IO file (1 file / 2048MiB)
Jobs: 4 (f=4): [R(4)][100.0%][r=2095MiB/s][r=536k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=1399856: Fri Jul 22 18:25:43 2022
read: IOPS=584k, BW=2283MiB/s (2394MB/s)(66.9GiB/30001msec)
clat (nsec): min=1195, max=145299, avg=6568.14, stdev=5384.80
lat (nsec): min=1231, max=145336, avg=6603.92, stdev=5385.07
clat percentiles (nsec):
| 1.00th=[ 1512], 5.00th=[ 2576], 10.00th=[ 3152], 20.00th=[ 3920],
| 30.00th=[ 4704], 40.00th=[ 5408], 50.00th=[ 5984], 60.00th=[ 6496],
| 70.00th=[ 7008], 80.00th=[ 7584], 90.00th=[ 8384], 95.00th=[ 9280],
| 99.00th=[37120], 99.50th=[42240], 99.90th=[52480], 99.95th=[55552],
| 99.99th=[60160]
bw ( MiB/s): min= 2084, max= 3905, per=100.00%, avg=2289.42, stdev=122.06, samples=236
iops : min=533706, max=999856, avg=586091.93, stdev=31247.29, samples=236
lat (usec) : 2=2.16%, 4=18.91%, 10=75.22%, 20=0.93%, 50=2.61%
lat (usec) : 100=0.16%, 250=0.01%
cpu : usr=5.64%, sys=86.69%, ctx=348979, majf=0, minf=51
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=17534987,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=2283MiB/s (2394MB/s), 2283MiB/s-2283MiB/s (2394MB/s-2394MB/s), io=66.9GiB (71.8GB), run=30001-30001msec
Another test: Enabled:
[root@ac-1f-6b-a5-ab-ea foo]# zfs get recordsize test/foo
NAME PROPERTY VALUE SOURCE
test/foo recordsize 128K default
[root@ac-1f-6b-a5-ab-ea foo]# pwd
/test/foo
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][4.7%][w=5150MiB/s][w=41.2k IOPS][eta 04m:47s]
fio: terminating on signal 2
benchmark: (groupid=0, jobs=24): err= 0: pid=1409094: Tue Jul 26 11:41:53 2022
write: IOPS=42.8k, BW=5345MiB/s (5605MB/s)(68.5GiB/13118msec); 0 zone resets
slat (usec): min=245, max=3462, avg=558.80, stdev=130.00
clat (usec): min=2, max=101349, avg=71107.74, stdev=11837.83
lat (usec): min=416, max=102170, avg=71666.77, stdev=11911.16
clat percentiles (msec):
| 1.00th=[ 56], 5.00th=[ 61], 10.00th=[ 62], 20.00th=[ 63],
| 30.00th=[ 64], 40.00th=[ 64], 50.00th=[ 65], 60.00th=[ 69],
| 70.00th=[ 75], 80.00th=[ 83], 90.00th=[ 92], 95.00th=[ 95],
| 99.00th=[ 97], 99.50th=[ 99], 99.90th=[ 100], 99.95th=[ 101],
| 99.99th=[ 101]
bw ( MiB/s): min= 4247, max= 6095, per=99.60%, avg=5323.87, stdev=23.61, samples=624
iops : min=33980, max=48760, avg=42590.50, stdev=188.86, samples=624
lat (usec) : 4=0.01%, 10=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.02%, 10=0.05%, 20=0.08%, 50=0.24%
lat (msec) : 100=99.58%, 250=0.02%
cpu : usr=1.83%, sys=73.94%, ctx=566417, majf=0, minf=16943
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,560978,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=5345MiB/s (5605MB/s), 5345MiB/s-5345MiB/s (5605MB/s-5605MB/s), io=68.5GiB (73.5GB), run=13118-13118msec
Disabled:
[root@ac-1f-6b-a5-ab-ea foo]# zfs set direct=disabled test/foo
[root@ac-1f-6b-a5-ab-ea foo]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done
[root@ac-1f-6b-a5-ab-ea foo]# rm -rf boot
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][3.3%][w=17.7GiB/s][w=145k IOPS][eta 04m:51s]
fio: terminating on signal 2
benchmark: (groupid=0, jobs=24): err= 0: pid=1479104: Tue Jul 26 11:42:16 2022
write: IOPS=141k, BW=17.2GiB/s (18.5GB/s)(161GiB/9361msec); 0 zone resets
slat (usec): min=36, max=67814, avg=167.46, stdev=682.26
clat (usec): min=2, max=124187, avg=21584.85, stdev=9289.76
lat (usec): min=58, max=124312, avg=21752.62, stdev=9330.98
clat percentiles (usec):
| 1.00th=[10028], 5.00th=[11863], 10.00th=[13042], 20.00th=[14877],
| 30.00th=[16450], 40.00th=[17957], 50.00th=[19268], 60.00th=[20841],
| 70.00th=[22938], 80.00th=[26346], 90.00th=[32900], 95.00th=[39584],
| 99.00th=[56361], 99.50th=[64750], 99.90th=[81265], 99.95th=[90702],
| 99.99th=[98042]
bw ( MiB/s): min=14225, max=20963, per=99.79%, avg=17598.44, stdev=74.08, samples=432
iops : min=113793, max=167698, avg=140778.78, stdev=592.60, samples=432
lat (usec) : 4=0.01%, 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.03%, 10=0.87%, 20=53.54%, 50=43.80%
lat (msec) : 100=1.74%, 250=0.01%
cpu : usr=9.64%, sys=65.26%, ctx=68796, majf=0, minf=25694
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,1320646,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=17.2GiB/s (18.5GB/s), 17.2GiB/s-17.2GiB/s (18.5GB/s-18.5GB/s), io=161GiB (173GB), run=9361-9361msec
I also noticed that you can't set direct on a zvol.
"cannot set property for 'test/foo': 'direct' does not apply to datasets of this type". Do zvols suffer from the double mem copy?
@Smithx10 so I noticed a couple of things with your results. You are using libaio as the ioengine with fio for the 128k tests; however, the asynchronous engines do not behave the way you might expect with ZFS. There are no asynchronous hooks in ZFS currently, and all of those IO calls get sent down the synchronous paths. The numjobs parameter will create identical IO workloads, but iodepth will not behave as it normally would with other filesystems. In fact, there is a PR to make the asynchronous IO APIs (which the async ioengines use) work with ZFS: https://github.com/openzfs/zfs/pull/12166. Also, I am not so certain that iodepth does anything when ioengine=sync in the 4k tests. I believe the fio man page mentions this; when I have looked at the fio code in the past, it seemed to have no effect in that case. Could be wrong though. Feel free to double check me or tell me I am wrong on that one.
Also, I noticed your size was 2G, correct? There can be no expectation that O_DIRECT can surpass ARC speed if the entire working set is living in the ARC. You are just getting all the memory bandwidth the ARC can give at that point. Also, if you are only writing 2GB of data, I would have to imagine you are not reading from all of those NVMe's in the zpool. You can use zpool iostat -vq 1 to confirm or deny this. It may be the case that you are not reading from all devices. These were just some quick observations.
I think if you look at the slides from the OpenZFS Developer Summit:
https://docs.google.com/presentation/d/1f9bE1S6KqwHWVJtsOOfCu_cVKAFQO94h/edit#slide=id.p1
I was sequentially writing and reading 2TB of data to/from the zpools. The argument has never been to avoid the ARC with NVMe devices. In order to get good throughput performance with O_DIRECT, you really need to step outside the bounds of the ARC, or be reading data sets that are not cached in the ARC and are spread across devices (i.e., larger than 2GB here). There are certain situations where direct IO is a valid solution, and others where it isn't. It is not as simple as asking for it and everything improves.
I also was curious what your recordsize was set to for the 4k tests? I imagine you were trying to measure IOPS there. If it was set to the default 128k, the IOPS results are not too surprising to me. Just like normal ARC reads, a 4k request will not fetch only that amount if it has to go down to disk and the recordsize > 4k. The reason for this has to do with data validation via checksums: there is no way to validate the data unless you fetch the whole block. So every read IO issued with O_DIRECT will read back the entire block (AKA recordsize) to validate the data before returning it to the user (see the sketch after this comment).
My response is also based on the idea your ARC size was >= 2GB. If that assumption is wrong, then there is egg on my face. Please let me know if this is not the case.
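To illustrate the recordsize point above: with the default recordsize=128k, each 4k O_DIRECT read pulls a full 128k block off disk for checksum validation, a 32x read amplification. A hedged sketch of two ways to avoid that mismatch (the file and dataset names are placeholders based on the pool used in this thread):

```sh
# Read in recordsize-sized units so each O_DIRECT read maps to exactly one block.
fio --name=direct-read --filename=/test/bar/file --ioengine=sync --direct=1 \
    --rw=randread --blocksize=128k --size=16G --time_based --runtime=30 --group_reporting

# Or run the 4k test on a dataset whose recordsize matches the IO size
# (recordsize only applies to newly written files).
zfs create -o recordsize=4k test/bar4k
```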
I also noticed that you can't set direct on a zvol.
"cannot set property for 'test/foo': 'direct' does not apply to datasets of this type". Do zvols suffer from the double mem copy?
@Smithx10 that is correct. @behlendorf and I were having some issues with hooking in O_DIRECT with zvols. However, if my memory serves me correctly, that was possibly due to placing the pages in writeback which we no longer do. That needs to be revisited. However, I think that is additional work outside of this PR. If this PR is merged, then using the hooks in the zvol code should in theory be fine. It would just require some more investigation as there might still be issues there.
Thank You So much for this explanation. I learned a bunch about what is happening here. I really appreciate you taking the time to help educate me. I will go experiment with this new knowledge. Thanks again!
I decided to give this branch a test since I have a bunch of NVME pools and would love some performance increase. I ran a read test with my 10 disk nvme pool and noticed a pretty dramatic difference with direct=disabled vs always. I have 3 machines with this spec that I won't be using until some networking is in place and am available to run tests. Enabled:
[root@ac-1f-6b-a5-ab-ea bar]# nvme list Node SN Model Namespace Usage Format FW Rev --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme0n1 BTLN902103WV3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme1n1 BTLN902005083P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme2n1 BTLN9050021N3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme3n1 BTLN907504P03P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme4n1 BTLN902101103P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme5n1 BTLN905001BD3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme6n1 BTLN902004DJ3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme7n1 BTLN907504N03P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme8n1 BTLN9050027H3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 /dev/nvme9n1 BTLN85110HDJ3P2BGN INTEL SSDPE2KE032T8 1 3.20 TB / 3.20 TB 512 B + 0 B VDV10170 [root@ac-1f-6b-a5-ab-ea bar]# zpool status pool: test state: ONLINE config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 nvme1n1 ONLINE 0 0 0 nvme2n1 ONLINE 0 0 0 nvme3n1 ONLINE 0 0 0 nvme5n1 ONLINE 0 0 0 nvme6n1 ONLINE 0 0 0 nvme8n1 ONLINE 0 0 0 nvme9n1 ONLINE 0 0 0 errors: No known data errors [root@ac-1f-6b-a5-ab-ea bar]# zfs get direct test/bar NAME PROPERTY VALUE SOURCE test/bar direct always local [root@ac-1f-6b-a5-ab-ea bar]# zpool --version zfs-2.1.99-1310_g7ac3b7ae9 zfs-kmod-2.1.99-1310_g7ac3b7ae9 [root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32 ... 
fio-3.19 Starting 4 processes Jobs: 4 (f=4): [R(4)][100.0%][r=140MiB/s][r=35.9k IOPS][eta 00m:00s] benchmark: (groupid=0, jobs=4): err= 0: pid=1289094: Fri Jul 22 18:24:47 2022 read: IOPS=50.5k, BW=197MiB/s (207MB/s)(5914MiB/30001msec) clat (nsec): min=1772, max=4948.3k, avg=78834.03, stdev=112800.83 lat (nsec): min=1810, max=4948.4k, avg=78884.69, stdev=112806.53 clat percentiles (usec): | 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3], | 30.00th=[ 4], 40.00th=[ 76], 50.00th=[ 86], 60.00th=[ 92], | 70.00th=[ 100], 80.00th=[ 111], 90.00th=[ 130], 95.00th=[ 223], | 99.00th=[ 359], 99.50th=[ 424], 99.90th=[ 1037], 99.95th=[ 2835], | 99.99th=[ 3425] bw ( KiB/s): min=107584, max=666672, per=100.00%, avg=202896.98, stdev=23198.14, samples=236 iops : min=26896, max=166668, avg=50724.19, stdev=5799.55, samples=236 lat (usec) : 2=0.40%, 4=33.20%, 10=3.36%, 20=0.11%, 50=0.01% lat (usec) : 100=32.59%, 250=25.74%, 500=4.22%, 750=0.16%, 1000=0.10% lat (msec) : 2=0.06%, 4=0.06%, 10=0.01% cpu : usr=0.98%, sys=13.18%, ctx=952786, majf=0, minf=78 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=1513906,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): READ: bw=197MiB/s (207MB/s), 197MiB/s-197MiB/s (207MB/s-207MB/s), io=5914MiB (6201MB), run=30001-30001msecDisabled:
[root@ac-1f-6b-a5-ab-ea bar]# zfs set direct=disabled test/bar [root@ac-1f-6b-a5-ab-ea bar]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done [root@ac-1f-6b-a5-ab-ea bar]# rm -rf test [root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32 ... fio-3.19 Starting 4 processes benchmark: Laying out IO file (1 file / 2048MiB) Jobs: 4 (f=4): [R(4)][100.0%][r=2095MiB/s][r=536k IOPS][eta 00m:00s] benchmark: (groupid=0, jobs=4): err= 0: pid=1399856: Fri Jul 22 18:25:43 2022 read: IOPS=584k, BW=2283MiB/s (2394MB/s)(66.9GiB/30001msec) clat (nsec): min=1195, max=145299, avg=6568.14, stdev=5384.80 lat (nsec): min=1231, max=145336, avg=6603.92, stdev=5385.07 clat percentiles (nsec): | 1.00th=[ 1512], 5.00th=[ 2576], 10.00th=[ 3152], 20.00th=[ 3920], | 30.00th=[ 4704], 40.00th=[ 5408], 50.00th=[ 5984], 60.00th=[ 6496], | 70.00th=[ 7008], 80.00th=[ 7584], 90.00th=[ 8384], 95.00th=[ 9280], | 99.00th=[37120], 99.50th=[42240], 99.90th=[52480], 99.95th=[55552], | 99.99th=[60160] bw ( MiB/s): min= 2084, max= 3905, per=100.00%, avg=2289.42, stdev=122.06, samples=236 iops : min=533706, max=999856, avg=586091.93, stdev=31247.29, samples=236 lat (usec) : 2=2.16%, 4=18.91%, 10=75.22%, 20=0.93%, 50=2.61% lat (usec) : 100=0.16%, 250=0.01% cpu : usr=5.64%, sys=86.69%, ctx=348979, majf=0, minf=51 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=17534987,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): READ: bw=2283MiB/s (2394MB/s), 2283MiB/s-2283MiB/s (2394MB/s-2394MB/s), io=66.9GiB (71.8GB), run=30001-30001msecAnother test: Enabed:
[root@ac-1f-6b-a5-ab-ea foo]# zfs get recordsize test/foo NAME PROPERTY VALUE SOURCE test/foo recordsize 128K default [root@ac-1f-6b-a5-ab-ea foo]# pwd /test/foo [root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128 ... fio-3.19 Starting 24 processes benchmark: Laying out IO file (1 file / 2048MiB) ^Cbs: 24 (f=24): [w(24)][4.7%][w=5150MiB/s][w=41.2k IOPS][eta 04m:47s] fio: terminating on signal 2 benchmark: (groupid=0, jobs=24): err= 0: pid=1409094: Tue Jul 26 11:41:53 2022 write: IOPS=42.8k, BW=5345MiB/s (5605MB/s)(68.5GiB/13118msec); 0 zone resets slat (usec): min=245, max=3462, avg=558.80, stdev=130.00 clat (usec): min=2, max=101349, avg=71107.74, stdev=11837.83 lat (usec): min=416, max=102170, avg=71666.77, stdev=11911.16 clat percentiles (msec): | 1.00th=[ 56], 5.00th=[ 61], 10.00th=[ 62], 20.00th=[ 63], | 30.00th=[ 64], 40.00th=[ 64], 50.00th=[ 65], 60.00th=[ 69], | 70.00th=[ 75], 80.00th=[ 83], 90.00th=[ 92], 95.00th=[ 95], | 99.00th=[ 97], 99.50th=[ 99], 99.90th=[ 100], 99.95th=[ 101], | 99.99th=[ 101] bw ( MiB/s): min= 4247, max= 6095, per=99.60%, avg=5323.87, stdev=23.61, samples=624 iops : min=33980, max=48760, avg=42590.50, stdev=188.86, samples=624 lat (usec) : 4=0.01%, 10=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.02%, 10=0.05%, 20=0.08%, 50=0.24% lat (msec) : 100=99.58%, 250=0.02% cpu : usr=1.83%, sys=73.94%, ctx=566417, majf=0, minf=16943 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=0,560978,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=128 Run status group 0 (all jobs): WRITE: bw=5345MiB/s (5605MB/s), 5345MiB/s-5345MiB/s (5605MB/s-5605MB/s), io=68.5GiB (73.5GB), run=13118-13118msecDisabed:
[root@ac-1f-6b-a5-ab-ea foo]# zfs set direct=disabled test/foo [root@ac-1f-6b-a5-ab-ea foo]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done [root@ac-1f-6b-a5-ab-ea foo]# rm -rf boot [root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128 ... fio-3.19 Starting 24 processes benchmark: Laying out IO file (1 file / 2048MiB) ^Cbs: 24 (f=24): [w(24)][3.3%][w=17.7GiB/s][w=145k IOPS][eta 04m:51s] fio: terminating on signal 2 benchmark: (groupid=0, jobs=24): err= 0: pid=1479104: Tue Jul 26 11:42:16 2022 write: IOPS=141k, BW=17.2GiB/s (18.5GB/s)(161GiB/9361msec); 0 zone resets slat (usec): min=36, max=67814, avg=167.46, stdev=682.26 clat (usec): min=2, max=124187, avg=21584.85, stdev=9289.76 lat (usec): min=58, max=124312, avg=21752.62, stdev=9330.98 clat percentiles (usec): | 1.00th=[10028], 5.00th=[11863], 10.00th=[13042], 20.00th=[14877], | 30.00th=[16450], 40.00th=[17957], 50.00th=[19268], 60.00th=[20841], | 70.00th=[22938], 80.00th=[26346], 90.00th=[32900], 95.00th=[39584], | 99.00th=[56361], 99.50th=[64750], 99.90th=[81265], 99.95th=[90702], | 99.99th=[98042] bw ( MiB/s): min=14225, max=20963, per=99.79%, avg=17598.44, stdev=74.08, samples=432 iops : min=113793, max=167698, avg=140778.78, stdev=592.60, samples=432 lat (usec) : 4=0.01%, 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01% lat (usec) : 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.03%, 10=0.87%, 20=53.54%, 50=43.80% lat (msec) : 100=1.74%, 250=0.01% cpu : usr=9.64%, sys=65.26%, ctx=68796, majf=0, minf=25694 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=0,1320646,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=128 Run status group 0 (all jobs): WRITE: bw=17.2GiB/s (18.5GB/s), 17.2GiB/s-17.2GiB/s (18.5GB/s-18.5GB/s), io=161GiB (173GB), run=9361-9361msec@Smithx10 so I noticed a couple of things with your results. So you are doing
libaio for the ioengine with fio for the 128k tests; however, the asynchronous engines do not behave the way you might expect with ZFS. There are no asynchronous hooks in ZFS currently, so all of those IO calls get sent down the synchronous paths. The numjobs parameter will create identical IO workloads, but iodepth will not behave as it normally would with other FS's. In fact there is a PR to make the asynchronous IO APIs (which the async ioengines use) work with ZFS: #12166. Also, I am not certain that iodepth does anything when ioengine=sync, as in the 4k tests. I believe the fio man page might mention this; when I have looked at the FIO code in the past it seemed to have no effect in that case, but I could be wrong, so feel free to double check me on that one. Also, I noticed your size was 2G, correct? There can be no expectation that O_DIRECT can surpass ARC speed if the entire working set is living in the ARC; you are just getting all the memory bandwidth the ARC can give at that point. And if you are only writing 2GB of data, I would have to imagine you are not reading from all of those NVMe's in the Zpool. You can use zpool iostat -vq 1 to confirm or deny this; it may be the case you are not reading from all devices. These were just some quick observations. If you look at the slides from the OpenZFS Developer Summit (https://docs.google.com/presentation/d/1f9bE1S6KqwHWVJtsOOfCu_cVKAFQO94h/edit#slide=id.p1), I was sequentially writing and reading 2TB of data to/from the Zpools. The argument has never been to not use the ARC with NVMe devices. In order to get good throughput performance with O_DIRECT you really need to step outside the bounds of the ARC, or be reading data sets that are not cached in the ARC and are spread across devices (larger than 2GB). There are certain situations where direct IO is a valid solution, and others where it isn't; it is not as simple as asking for it and having everything improve. I was also curious what your recordsize was set to for the 4k tests? I imagine you were trying to measure IOPS there. If it was set to the default 128k, the IOPS results are not too surprising to me. Just like normal ARC reads, a 4k request will not fetch just that amount if it has to go down to disk and the recordsize > 4k. The reason for this has to do with data validation via checksums: there is no way to validate the data unless you fetch the whole block. So every time a read IO with O_DIRECT is issued, it will read back the entire block (AKA recordsize) to validate the data before returning it to the user. My response is also based on the assumption that your ARC size was >= 2GB. If that assumption is wrong, then there is egg on my face. Please let me know if this is not the case.

Thank you so much for this explanation. I learned a bunch about what is happening here. I really appreciate you taking the time to help educate me. I will go experiment with this new knowledge. Thanks again!
No problem, and thank you for taking this PR for a test drive. Let me know if you run into any other performance concerns. I would be happy to help resolve any issues you uncover with this work.
Also, there is one caveat to the data I shared at the OpenZFS Developer Summit for the O_DIRECT results. Previously compression defaulted to off; however, compression is now set to on by default. I believe that FIO, by default, uses random data for the buffer it writes, so I don't think this contributed to your read results. When reading a compressed buffer with O_DIRECT, though, the buffer will be decompressed before being returned to user space. ZFS will only compress a buffer that is at a minimum 1/8 compressible, so more than likely your reads were not decompressing any data buffers. You can always double check this with zfs get compressratio on the dataset. The results in my slides also used random data, so the compression ratio would more than likely still have been 1.0x; however, those runs were not going through the compression code for writes at all. Just something to keep in mind when measuring O_DIRECT performance.
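For anyone following along, here is a minimal sketch of what a correctly aligned O_DIRECT read looks like from user space. This is not code from this PR; the file path is hypothetical and the buffer size assumes a dataset recordsize of 128k. It just illustrates the PAGE_SIZE alignment requirement and the fact that even a small aligned request still ends up reading and checksumming the full record, as described above.

```c
/*
 * Sketch only: aligned O_DIRECT read (hypothetical path, assumed 128k recordsize).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t bufsize = 128 * 1024;   /* assumed to match recordsize=128k */
    void *buf;

    /* O_DIRECT requires the user buffer, offset, and length to be aligned. */
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), bufsize) != 0) {
        perror("posix_memalign");
        return 1;
    }

    int fd = open("/test/foo/file", O_RDONLY | O_DIRECT);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /*
     * Aligned offset and length: this takes the Direct I/O path under
     * direct=standard.  The whole record is still read back and its
     * checksum verified before data is handed to the caller.
     */
    ssize_t n = pread(fd, buf, bufsize, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes via O_DIRECT\n", n);

    close(fd);
    free(buf);
    return 0;
}
```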
Away from my workstation with the big screen, excuse the e-mail reply; assume all notes apply to all instances in the diff.
+In the event the checksum is not valid then the I/O operation will return
+EINVAL and the write will not be committed.
... will return
.Er EINVAL
and the write ...
+Controls the behavior of direct requests (e.g.
+.Sx O_DIRECT Ns
+). The

Wrong ‒ e.g. ends the sentence, Ns-at-EOL is awful (and almost always an error), O_DIRECT is very much not Sx (well, unless there's an O_DIRECT section but I don't think there is?), and you called it "Direct I/O" above:
... of Direct I/O requests
.Pq e.g. Dv O_DIRECT .
The
Best,
I went ahead and updated the man pages.
Hi all,
Is there any update on this PR? Is there any estimation on its merge date?
Sorry for such a late reply on this. My hope is that it will be merged soon; reviews of the PR are under way. When you originally asked about this, I was still working on figuring out whether we could do write-protected user pages in Linux, but that does not seem possible.
I do observe the failure from vm_fault_quick_hold_pages() on FreeBSD 13 but not 14. pmap_extract_and_hold() is failing and then vm_fault() fails with ENOENT. @markjdb can you think of what would be causing different behavior between FreeBSD 13 and 14 there?
Good news: This PR improved write performance by ~60% on one of our test systems :+1: :smile:
Bad news: When I tried to do direct IO reads with a block size greater than 16MB, I got kernel assertions:
$ for i in {1..64} ; do dd if=/tank1/test$i of=/dev/null bs=32M iflag=direct & true ; done
...
Message from syslogd@localhost at Oct 12 17:27:48 ...
kernel: PANIC at abd_os.c:858:abd_alloc_from_pages()
Message from syslogd@localhost at Oct 12 17:27:48 ...
kernel: VERIFY3(size <= SPA_MAXBLOCKSIZE) failed (33554432 <= 16777216)
Message from syslogd@localhost at Oct 12 17:27:48 ...
kernel: PANIC at abd_os.c:858:abd_alloc_from_pages()
Oct 12 17:27:48 localhost kernel: Call Trace:
Oct 12 17:27:48 localhost kernel: dump_stack+0x41/0x60
Oct 12 17:27:48 localhost kernel: spl_panic+0xd0/0xf3 [spl]
Oct 12 17:27:48 localhost kernel: ? __get_user_pages+0x1fb/0x800
Oct 12 17:27:48 localhost kernel: ? gup_pgd_range+0x2fd/0xc60
Oct 12 17:27:48 localhost kernel: ? get_user_pages_unlocked+0xd5/0x2a0
Oct 12 17:27:48 localhost kernel: ? get_user_pages_unlocked+0x1f5/0x2a0
Oct 12 17:27:48 localhost kernel: abd_alloc_from_pages+0x196/0x1a0 [zfs]
Oct 12 17:27:48 localhost kernel: ? spl_kmem_alloc+0x11e/0x140 [spl]
Oct 12 17:27:48 localhost kernel: dmu_read_uio_direct+0x3e/0x90 [zfs]
Oct 12 17:27:48 localhost kernel: dmu_read_uio_dnode+0xfa/0x110 [zfs]
Oct 12 17:27:48 localhost kernel: ? zfs_rangelock_enter_impl+0x25b/0x560 [zfs]
Oct 12 17:27:48 localhost kernel: ? xas_load+0x8/0x80
Oct 12 17:27:48 localhost kernel: ? xas_find+0x173/0x1b0
Oct 12 17:27:48 localhost kernel: dmu_read_uio_dbuf+0x3f/0x60 [zfs]
Oct 12 17:27:48 localhost kernel: zfs_read+0x143/0x3d0 [zfs]
Oct 12 17:27:48 localhost kernel: zpl_iter_read_direct+0x182/0x220 [zfs]
Oct 12 17:27:48 localhost kernel: ? _cond_resched+0x15/0x30
Oct 12 17:27:48 localhost kernel: ? mutex_lock+0x21/0x40
Oct 12 17:27:48 localhost kernel: ? rrw_exit+0x65/0x150 [zfs]
Oct 12 17:27:48 localhost kernel: zpl_iter_read+0xae/0xe0 [zfs]
Oct 12 17:27:48 localhost kernel: new_sync_read+0x10f/0x150
Oct 12 17:27:48 localhost kernel: vfs_read+0xa3/0x160
Oct 12 17:27:48 localhost kernel: ksys_read+0x4f/0xb0
Oct 12 17:27:48 localhost kernel: do_syscall_64+0x5b/0x1a0
Oct 12 17:27:48 localhost kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Oct 12 17:27:48 localhost kernel: RIP: 0033:0x155555081505
@tonyhutter good catch! I went ahead and updated the ABD size ASSERT checks to account for DMU_MAX_ACCESS in the event the ABD flag ABD_FLAG_FROM_PAGES is set. This should resolve the issue you observed (at least it did when I tested it out).
Hello, today I tried to test this PR on the following system:
2x E5-2670 v2 (2.5GHz, 10C)
226GB DDR3
root@iser-nvme:/home/vlosev# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 50026B7684C433E7 KINGSTON SKC2500M81000G 1 735.87 GB / 1.00 TB 4 KiB + 0 B S7780101
/dev/nvme1n1 50026B7684C43738 KINGSTON SKC2500M81000G 1 679.65 GB / 1.00 TB 4 KiB + 0 B S7780101
/dev/nvme2n1 S48ENC0N701594L Samsung SSD 983 DCT M.2 960GB 1 573.66 GB / 960.20 GB 4 KiB + 0 B EDA7602Q
/dev/nvme3n1 S48ENC0N700767V Samsung SSD 983 DCT M.2 960GB 1 625.36 GB / 960.20 GB 4 KiB + 0 B EDA7602Q
/dev/nvme4n1 S4EMNX0R627163 SAMSUNG MZVLB1T0HBLR-000L7 1 702.80 GB / 1.02 TB 512 B + 0 B 5M2QEXF7
/dev/nvme5n1 S64FNE0RB06622 SAMSUNG MZQL2960HCJR-00A07 1 639.14 GB / 960.20 GB 4 KiB + 0 B GDC5502Q
root@iser-nvme:/home/vlosev# zfs version
zfs-2.1.99-1446_gf9bb9f26c
zfs-kmod-2.1.99-1446_gf9bb9f26c
I have created the zpool with the following settings:
root@iser-nvme:/home/vlosev# zpool create -O direct=always -o ashift=12 -O atime=off -O recordsize=8k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 -f
But when I try to benchmark the read/write performance with the following fio options:
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
I got a very strange result:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
Write:
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
WithDirect: Laying out IO file (1 file / 409600MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=278MiB/s][w=35.6k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=1087394: Thu Oct 13 19:51:45 2022
write: IOPS=34.2k, BW=267MiB/s (280MB/s)(1600GiB/6134181msec); 0 zone resets
slat (usec): min=77, max=120593, avg=112.50, stdev=424.76
clat (usec): min=4, max=240172, avg=3628.84, stdev=2601.48
lat (usec): min=92, max=249482, avg=3741.99, stdev=2651.57
clat percentiles (usec):
| 1.00th=[ 2966], 5.00th=[ 3032], 10.00th=[ 3097], 20.00th=[ 3163],
| 30.00th=[ 3195], 40.00th=[ 3228], 50.00th=[ 3261], 60.00th=[ 3326],
| 70.00th=[ 3392], 80.00th=[ 3490], 90.00th=[ 3687], 95.00th=[ 4293],
| 99.00th=[11994], 99.50th=[16909], 99.90th=[31851], 99.95th=[57410],
| 99.99th=[98042]
bw ( KiB/s): min=48967, max=334032, per=99.99%, avg=273463.15, stdev=12311.44, samples=49072
iops : min= 6120, max=41754, avg=34182.76, stdev=1538.93, samples=49072
lat (usec) : 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=92.79%, 10=5.72%, 20=1.13%, 50=0.30%
lat (msec) : 100=0.05%, 250=0.01%
cpu : usr=4.80%, sys=65.10%, ctx=256489223, majf=0, minf=63432
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=1600GiB (1718GB), run=6134181-6134181msec
Read:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][0.3%][r=251MiB/s][r=32.1k IOPS][eta 01h:54m:26s]
and if I disable direct by:
root@iser-nvme:/home/vlosev/zfs# zfs set direct=disabled nvme
root@iser-nvme:/home/vlosev/zfs# zfs get direct nvme
NAME PROPERTY VALUE SOURCE
nvme direct disabled local
Write:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=944MiB/s][w=121k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=4050640: Thu Oct 13 21:20:13 2022
write: IOPS=123k, BW=963MiB/s (1009MB/s)(1600GiB/1701924msec); 0 zone resets
slat (usec): min=7, max=26674, avg=29.66, stdev=76.38
clat (usec): min=2, max=27658, avg=1007.56, stdev=443.89
lat (usec): min=17, max=27690, avg=1037.56, stdev=451.44
clat percentiles (usec):
| 1.00th=[ 469], 5.00th=[ 676], 10.00th=[ 824], 20.00th=[ 922],
| 30.00th=[ 963], 40.00th=[ 979], 50.00th=[ 996], 60.00th=[ 1012],
| 70.00th=[ 1037], 80.00th=[ 1074], 90.00th=[ 1139], 95.00th=[ 1205],
| 99.00th=[ 1434], 99.50th=[ 1532], 99.90th=[ 8979], 99.95th=[11600],
| 99.99th=[12911]
bw ( KiB/s): min=816048, max=1426608, per=99.98%, avg=985625.01, stdev=13358.13, samples=13612
iops : min=102006, max=178326, avg=123202.98, stdev=1669.77, samples=13612
lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=1.56%, 750=5.38%, 1000=46.13%
lat (msec) : 2=46.66%, 4=0.02%, 10=0.18%, 20=0.07%, 50=0.01%
cpu : usr=10.83%, sys=71.37%, ctx=205048515, majf=0, minf=93114
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=963MiB/s (1009MB/s), 963MiB/s-963MiB/s (1009MB/s-1009MB/s), io=1600GiB (1718GB), run=1701924-1701924msec
Read:
@Dante4 try cranking up the number of reader / writer threads.
During my initial testing I was getting better write performance without Direct IO. That is because non-Direct IO writes are async writes, which work well when the number of writer threads is low (but at the cost of two memory copies). Direct IO writes are handled synchronously, with fewer to no memory copies. Once I cranked up the number of writers from 64 parallel dd writes to 512 parallel dd writes, I got much better write performance with Direct IO.
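To make the "many synchronous writers" point concrete, here is a rough sketch of the kind of workload that benefits from Direct IO writes: many threads, each issuing aligned, synchronous O_DIRECT writes in parallel. The thread count, block size, and file paths below are illustrative assumptions, not values from this PR.

```c
/* Sketch only: many concurrent synchronous O_DIRECT writers,
 * analogous to running hundreds of parallel dd processes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWRITERS   64               /* scale this up to drive more vdevs */
#define BLOCKSIZE  (128 * 1024)     /* assumed to match recordsize=128k */
#define NBLOCKS    256

static void *writer(void *arg)
{
    long id = (long)arg;
    char path[64];
    void *buf;

    snprintf(path, sizeof (path), "/tank1/direct-test.%ld", id);  /* hypothetical */
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), BLOCKSIZE) != 0)
        return NULL;
    memset(buf, 0xab, BLOCKSIZE);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return NULL;
    }

    /* Each write completes synchronously on the Direct I/O path;
     * throughput comes from keeping many of these threads in flight. */
    for (int i = 0; i < NBLOCKS; i++)
        if (pwrite(fd, buf, BLOCKSIZE, (off_t)i * BLOCKSIZE) != BLOCKSIZE)
            break;

    close(fd);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tids[NWRITERS];

    for (long i = 0; i < NWRITERS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (int i = 0; i < NWRITERS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```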
Hello, today I tried to test this PR on the following system: 2x E5-2670 v2 @ 2.50GHz, 226GB DDR3
root@iser-nvme:/home/vlosev/zfs# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 50026B7684C433E7 KINGSTON SKC2500M81000G 1 833.65 GB / 1.00 TB 4 KiB + 0 B S7780101
/dev/nvme1n1 50026B7684C43738 KINGSTON SKC2500M81000G 1 771.43 GB / 1.00 TB 4 KiB + 0 B S7780101
/dev/nvme2n1 S48ENC0N701594L Samsung SSD 983 DCT M.2 960GB 1 714.24 GB / 960.20 GB 4 KiB + 0 B EDA7602Q
/dev/nvme3n1 S48ENC0N700767V Samsung SSD 983 DCT M.2 960GB 1 752.24 GB / 960.20 GB 4 KiB + 0 B EDA7602Q
/dev/nvme4n1 S4EMNX0R627163 SAMSUNG MZVLB1T0HBLR-000L7 1 807.06 GB / 1.02 TB 512 B + 0 B 5M2QEXF7
/dev/nvme5n1 S64FNE0RB06622 SAMSUNG MZQL2960HCJR-00A07 1 733.04 GB / 960.20 GB 4 KiB + 0 B GDC5502Q
root@iser-nvme:/home/vlosev/zfs# zfs version
zfs-2.1.99-1446_gf9bb9f26c
zfs-kmod-2.1.99-1446_gf9bb9f26c
I have created the zpool with the following settings:
root@iser-nvme:/home/vlosev# zpool create -O direct=always -o ashift=12 -O atime=off -O recordsize=8k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 -f
But when I try to benchmark the read/write performance with the following fio options:
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
I got a very strange result:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
Write:
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
WithDirect: Laying out IO file (1 file / 409600MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=278MiB/s][w=35.6k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=1087394: Thu Oct 13 19:51:45 2022
write: IOPS=34.2k, BW=267MiB/s (280MB/s)(1600GiB/6134181msec); 0 zone resets
slat (usec): min=77, max=120593, avg=112.50, stdev=424.76
clat (usec): min=4, max=240172, avg=3628.84, stdev=2601.48
lat (usec): min=92, max=249482, avg=3741.99, stdev=2651.57
clat percentiles (usec):
| 1.00th=[ 2966], 5.00th=[ 3032], 10.00th=[ 3097], 20.00th=[ 3163],
| 30.00th=[ 3195], 40.00th=[ 3228], 50.00th=[ 3261], 60.00th=[ 3326],
| 70.00th=[ 3392], 80.00th=[ 3490], 90.00th=[ 3687], 95.00th=[ 4293],
| 99.00th=[11994], 99.50th=[16909], 99.90th=[31851], 99.95th=[57410],
| 99.99th=[98042]
bw ( KiB/s): min=48967, max=334032, per=99.99%, avg=273463.15, stdev=12311.44, samples=49072
iops : min= 6120, max=41754, avg=34182.76, stdev=1538.93, samples=49072
lat (usec) : 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=92.79%, 10=5.72%, 20=1.13%, 50=0.30%
lat (msec) : 100=0.05%, 250=0.01%
cpu : usr=4.80%, sys=65.10%, ctx=256489223, majf=0, minf=63432
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=1600GiB (1718GB), run=6134181-6134181msec
Read:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][0.3%][r=251MiB/s][r=32.1k IOPS][eta 01h:54m:26s]
and if I disable direct by:
root@iser-nvme:/home/vlosev/zfs# zfs set direct=disabled nvme
root@iser-nvme:/home/vlosev/zfs# zfs get direct nvme
NAME PROPERTY VALUE SOURCE
nvme direct disabled local
Write:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=944MiB/s][w=121k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=4050640: Thu Oct 13 21:20:13 2022
write: IOPS=123k, BW=963MiB/s (1009MB/s)(1600GiB/1701924msec); 0 zone resets
slat (usec): min=7, max=26674, avg=29.66, stdev=76.38
clat (usec): min=2, max=27658, avg=1007.56, stdev=443.89
lat (usec): min=17, max=27690, avg=1037.56, stdev=451.44
clat percentiles (usec):
| 1.00th=[ 469], 5.00th=[ 676], 10.00th=[ 824], 20.00th=[ 922],
| 30.00th=[ 963], 40.00th=[ 979], 50.00th=[ 996], 60.00th=[ 1012],
| 70.00th=[ 1037], 80.00th=[ 1074], 90.00th=[ 1139], 95.00th=[ 1205],
| 99.00th=[ 1434], 99.50th=[ 1532], 99.90th=[ 8979], 99.95th=[11600],
| 99.99th=[12911]
bw ( KiB/s): min=816048, max=1426608, per=99.98%, avg=985625.01, stdev=13358.13, samples=13612
iops : min=102006, max=178326, avg=123202.98, stdev=1669.77, samples=13612
lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=1.56%, 750=5.38%, 1000=46.13%
lat (msec) : 2=46.66%, 4=0.02%, 10=0.18%, 20=0.07%, 50=0.01%
cpu : usr=10.83%, sys=71.37%, ctx=205048515, majf=0, minf=93114
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=963MiB/s (1009MB/s), 963MiB/s-963MiB/s (1009MB/s-1009MB/s), io=1600GiB (1718GB), run=1701924-1701924msec
Read:
root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=1254MiB/s][r=160k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=2106034: Thu Oct 13 21:44:08 2022
read: IOPS=163k, BW=1273MiB/s (1335MB/s)(1600GiB/1286575msec)
slat (usec): min=4, max=3737, avg=21.73, stdev=13.06
clat (usec): min=2, max=4462, avg=761.81, stdev=47.08
lat (usec): min=18, max=4489, avg=783.84, stdev=48.26
clat percentiles (usec):
| 1.00th=[ 668], 5.00th=[ 693], 10.00th=[ 709], 20.00th=[ 725],
| 30.00th=[ 742], 40.00th=[ 750], 50.00th=[ 758], 60.00th=[ 766],
| 70.00th=[ 783], 80.00th=[ 791], 90.00th=[ 816], 95.00th=[ 840],
| 99.00th=[ 930], 99.50th=[ 955], 99.90th=[ 996], 99.95th=[ 1020],
| 99.99th=[ 1057]
bw ( MiB/s): min= 472, max= 1408, per=99.98%, avg=1273.26, stdev=11.33, samples=10292
iops : min=60504, max=180282, avg=162977.27, stdev=1450.14, samples=10292
lat (usec) : 4=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec) : 750=42.04%, 1000=57.87%
lat (msec) : 2=0.10%, 4=0.01%, 10=0.01%
cpu : usr=13.76%, sys=77.17%, ctx=79434728, majf=0, minf=2622
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=209715200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=1273MiB/s (1335MB/s), 1273MiB/s-1273MiB/s (1335MB/s-1335MB/s), io=1600GiB (1718GB), run=1286575-1286575msec
I'd also recommend leaving the property set to direct=standard when testing. You shouldn't need to set this, and you can request Direct I/O as normal with the --direct=1 fio flag. This has the advantage that if fio isn't creating correctly aligned I/O, an error will be reported. When direct=always is set, a misaligned Direct I/O request will take the buffered path rather than fail.
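As a small illustration of why direct=standard is friendlier for testing (a sketch only; the path is hypothetical): a request that violates the PAGE_SIZE alignment rules described earlier in this PR is expected to come back with EINVAL instead of silently being redirected through the ARC, which is what direct=always would do.

```c
/* Sketch: under direct=standard a misaligned O_DIRECT request should be
 * rejected with EINVAL rather than silently taking the buffered path. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), 8192) != 0)
        return 1;

    int fd = open("/nvme/t.1", O_RDONLY | O_DIRECT);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* A 512-byte length is not PAGE_SIZE aligned: under direct=standard this
     * read is expected to fail with EINVAL, flagging the misaligned request. */
    if (pread(fd, buf, 512, 0) < 0 && errno == EINVAL)
        fprintf(stderr, "misaligned O_DIRECT request rejected (EINVAL)\n");

    close(fd);
    free(buf);
    return 0;
}
```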
Thank you for your answer. Sadly there were no changes when I used standard instead of always:
direct=standard
``` root@iser-nvme:/home/vlosev# zfs set direct=standard nvme root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32 ... fio-3.16 Starting 4 processes ^Cbs: 4 (f=4): [R(4)][0.2%][r=290MiB/s][r=37.1k IOPS][eta 01h:40m:04s] fio: terminating on signal 2WithDirect: (groupid=0, jobs=4): err= 0: pid=3312285: Fri Oct 14 08:53:21 2022 read: IOPS=35.8k, BW=280MiB/s (293MB/s)(3415MiB/12210msec) slat (usec): min=41, max=2797, avg=104.58, stdev=37.71 clat (usec): min=3, max=6690, avg=3357.27, stdev=839.75 lat (usec): min=114, max=6832, avg=3462.38, stdev=863.39 clat percentiles (usec): | 1.00th=[ 1860], 5.00th=[ 1975], 10.00th=[ 2073], 20.00th=[ 2409], | 30.00th=[ 2737], 40.00th=[ 3195], 50.00th=[ 3621], 60.00th=[ 3851], | 70.00th=[ 3949], 80.00th=[ 4113], 90.00th=[ 4359], 95.00th=[ 4490], | 99.00th=[ 4686], 99.50th=[ 4817], 99.90th=[ 5080], 99.95th=[ 5145], | 99.99th=[ 5997] bw ( KiB/s): min=62560, max=328448, per=99.98%, avg=286327.67, stdev=12347.72, samples=96 iops : min= 7820, max=41056, avg=35790.75, stdev=1543.47, samples=96 lat (usec) : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=6.54%, 4=66.10%, 10=27.35% cpu : usr=4.20%, sys=27.97%, ctx=437188, majf=0, minf=319 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwts: total=437080,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs): READ: bw=280MiB/s (293MB/s), 280MiB/s-280MiB/s (293MB/s-293MB/s), io=3415MiB (3581MB), run=12210-12210msec root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32 ... fio-3.16 Starting 4 processes ^Cbs: 4 (f=4): [W(4)][0.2%][w=296MiB/s][w=37.9k IOPS][eta 01h:35m:48s] fio: terminating on signal 2
WithDirect: (groupid=0, jobs=4): err= 0: pid=3344955: Fri Oct 14 08:53:41 2022 write: IOPS=37.3k, BW=291MiB/s (305MB/s)(2997MiB/10289msec); 0 zone resets slat (usec): min=79, max=13452, avg=101.92, stdev=93.20 clat (usec): min=4, max=17890, avg=3297.29, stdev=556.83 lat (usec): min=117, max=18015, avg=3399.85, stdev=567.83 clat percentiles (usec): | 1.00th=[ 2933], 5.00th=[ 2999], 10.00th=[ 3064], 20.00th=[ 3130], | 30.00th=[ 3163], 40.00th=[ 3195], 50.00th=[ 3228], 60.00th=[ 3261], | 70.00th=[ 3326], 80.00th=[ 3392], 90.00th=[ 3523], 95.00th=[ 3589], | 99.00th=[ 4228], 99.50th=[ 4359], 99.90th=[15270], 99.95th=[17171], | 99.99th=[17695] bw ( KiB/s): min=225824, max=315472, per=99.89%, avg=297972.85, stdev=4771.26, samples=80 iops : min=28228, max=39434, avg=37246.50, stdev=596.40, samples=80 lat (usec) : 10=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=97.43%, 10=2.36%, 20=0.19% cpu : usr=4.72%, sys=71.28%, ctx=478440, majf=0, minf=308 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwts: total=0,383656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs): WRITE: bw=291MiB/s (305MB/s), 291MiB/s-291MiB/s (305MB/s-305MB/s), io=2997MiB (3143MB), run=10289-10289msec root@iser-nvme:/home/vlosev# zfs get direct nvme NAME PROPERTY VALUE SOURCE nvme direct standard local
Sadly, increasing the number of jobs did not really help.
Spoiler
``` root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=512 --rw=read --blocksize=8k --group_reporting WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32 ... fio-3.16 Starting 512 processes ^Cbs: 512 (f=512): [R(512)][0.0%][r=1879MiB/s][r=241k IOPS][eta 03d:08h:59m:03s] fio: terminating on signal 2WithDirect: (groupid=0, jobs=512): err= 0: pid=3515147: Fri Oct 14 10:13:12 2022 read: IOPS=175k, BW=1366MiB/s (1432MB/s)(30.1GiB/22542msec) slat (usec): min=71, max=738837, avg=1545.77, stdev=5307.16 clat (usec): min=3, max=1113.1k, avg=47984.72, stdev=39541.12 lat (usec): min=131, max=1113.3k, avg=49531.37, stdev=40460.94 clat percentiles (msec): | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 14], 40.00th=[ 35], 50.00th=[ 45], 60.00th=[ 56], | 70.00th=[ 70], 80.00th=[ 83], 90.00th=[ 101], 95.00th=[ 116], | 99.00th=[ 150], 99.50th=[ 169], 99.90th=[ 222], 99.95th=[ 255], | 99.99th=[ 510] bw ( MiB/s): min= 975, max= 5321, per=100.00%, avg=2307.45, stdev= 4.95, samples=12401 iops : min=124860, max=681156, avg=295342.34, stdev=633.42, samples=12401 lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%, 500=0.01% lat (usec) : 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=6.44%, 10=22.23%, 20=3.39%, 50=19.71% lat (msec) : 100=38.03%, 250=10.12%, 500=0.04%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=0.20%, sys=5.43%, ctx=4844909, majf=0, minf=61325 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwts: total=3941484,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs): READ: bw=1366MiB/s (1432MB/s), 1366MiB/s-1366MiB/s (1432MB/s-1432MB/s), io=30.1GiB (32.3GB), run=22542-22542msec root@iser-nvme:/home/vlosev# zfs set direct=disabled nvme root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=512 --rw=read --blocksize=8k --group_reporting WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32 ... fio-3.16 Starting 512 processes ^Cbs: 512 (f=512): [R(512)][0.0%][r=2958MiB/s][r=379k IOPS][eta 53d:03h:38m:11s] fio: terminating on signal 2
WithDirect: (groupid=0, jobs=512): err= 0: pid=3615628: Fri Oct 14 10:13:32 2022 read: IOPS=284k, BW=2217MiB/s (2325MB/s)(25.0GiB/11998msec) slat (usec): min=4, max=1900.4k, avg=944.51, stdev=25750.21 clat (usec): min=2, max=3711.2k, avg=29291.26, stdev=151762.12 lat (usec): min=52, max=3711.3k, avg=30236.06, stdev=154212.86 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 388], 10.00th=[ 1045], | 20.00th=[ 1713], 30.00th=[ 2278], 40.00th=[ 2835], | 50.00th=[ 3064], 60.00th=[ 3163], 70.00th=[ 3261], | 80.00th=[ 3359], 90.00th=[ 3490], 95.00th=[ 6980], | 99.00th=[ 893387], 99.50th=[ 952108], 99.90th=[1803551], | 99.95th=[1887437], 99.99th=[2298479] bw ( MiB/s): min= 307, max=19832, per=100.00%, avg=5501.44, stdev=15.48, samples=4311 iops : min=39284, max=2538513, avg=704094.52, stdev=1981.29, samples=4311 lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01% lat (usec) : 250=0.01%, 500=8.04%, 750=0.49%, 1000=1.10% lat (msec) : 2=15.27%, 4=69.79%, 10=0.55%, 20=0.67%, 50=0.23% lat (msec) : 100=0.06%, 250=0.44%, 500=0.66%, 750=0.92%, 1000=1.45% lat (msec) : 2000=0.29%, >=2000=0.01% cpu : usr=0.13%, sys=7.50%, ctx=30559, majf=0, minf=46152 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.5%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwts: total=3405068,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs): READ: bw=2217MiB/s (2325MB/s), 2217MiB/s-2217MiB/s (2325MB/s-2325MB/s), io=25.0GiB (27.9GB), run=11998-11998msec