
Direct IO Support

Open bwatkinson opened this issue 5 years ago • 61 comments

Adding O_DIRECT support to ZFS.

Motivation and Context

By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes. There are certain cases where caching data in the ARC can decrease overall performance. In particular, zpools composed of NVMe devices showed poor read/write performance due to the extra overhead of the memcpy's issued to the ARC.

There are also cases where caching in the ARC may not make sense, such as when data will not be referenced later. By using the O_DIRECT flag, unnecessary data copies to the ARC can be avoided.

Closes Issue: https://github.com/zfsonlinux/zfs/issues/8381

Description

O_DIRECT support in ZFS will always ensure coherency between buffered and O_DIRECT IO requests, so all IO requests, whether buffered or direct, see the same file contents at all times. Just as with other filesystems, O_DIRECT does not imply O_SYNC: while data is written directly to the VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT reads and writes, the offset and request size must, at a minimum, be PAGE_SIZE aligned. If they are not, EINVAL is returned, except when the direct property is set to always.

For O_DIRECT writes: The request must also be block aligned (recordsize), otherwise the write takes the normal (buffered) write path. If the request is block aligned and a cached copy of the buffer exists in the ARC, that copy is discarded from the ARC, forcing all further reads to retrieve the data from disk.

For O_DIRECT reads: The only alignment restriction is PAGE_SIZE alignment. If the requested data is already buffered (in the ARC), it is simply copied from the ARC into the user buffer.
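
As an illustration (the mountpoint and the default 128K recordsize below are assumptions, not part of this PR), these rules would surface roughly like this with dd:

# assumed: dataset mounted at /tank/fs with the default 128K recordsize

# page- and recordsize-aligned: issued as true O_DIRECT writes
dd if=/dev/zero of=/tank/fs/file bs=1M count=256 oflag=direct

# page-aligned but not recordsize-aligned: accepted, but takes the normal (buffered) write path
dd if=/dev/zero of=/tank/fs/file bs=4k count=256 oflag=direct

# not page-aligned: the write should fail with EINVAL
# (unless the direct property is set to always, in which case it is redirected through the ARC)
dd if=/dev/zero of=/tank/fs/file bs=1000 count=256 oflag=direct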

To ensure data integrity for all data written using O_DIRECT, all user pages are made stable in the event one of the following is required:

  • Checksum
  • Compression
  • Encryption
  • Parity

By making the user pages stable, we make sure the contents of the user-provided buffer cannot be changed after any of the above operations have taken place.

A new dataset property direct has been added with the following 3 allowable values:

  • disabled - Accepts the O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.

  • standard - Follows the alignment restrictions outlined above for read/write IO requests when the O_DIRECT flag is used.

  • always - Treats every read/write IO request as though it was issued with O_DIRECT. If the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.

Direct IO does not bypass the ZIO pipeline, so checksums, compression, etc. are all still supported with Direct IO.
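
As a usage sketch (the dataset name here is just an example), the property is managed like any other dataset property:

# example dataset name; standard honors O_DIRECT subject to the alignment rules above
zfs set direct=standard tank/fs
zfs get direct tank/fs

Applications then opt in per open(2) call with O_DIRECT, or direct=always can be used to apply the same behavior to every read and write on the dataset.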

Some issues that still need to be addressed:

  • [ ] Create ZTS tests for O_DIRECT
  • [ ] Possibly allow for DVA throttle with O_DIRECT writes
  • [ ] Further testing/verification on FreeBSD (the majority of debugging has been on Linux)
  • [ ] Possibly allow for O_DIRECT with zvols
  • [ ] Address race conditions in dbuf code with O_DIRECT

How Has This Been Tested?

Testing was primarily done using FIO and XDD with striped, mirror, raidz, and dRAID VDEV zpools.

Tests were performed on CentOS using various kernels, including 3.10, 4.18, and 4.20.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [x] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the ZFS on Linux code style requirements.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

bwatkinson avatar Feb 18 '20 17:02 bwatkinson

Codecov Report

Attention: Patch coverage is 63.17044%, with 309 lines in your changes missing coverage. Please review.

Project coverage is 61.94%. Comparing base (161ed82) to head (04e3a35). Report is 2456 commits behind head on master.

:exclamation: Current head 04e3a35 differs from pull request most recent head a83e237. Consider uploading reports for the commit a83e237 to get more accurate results

Files Patch % Lines
module/zfs/dmu.c 51.01% 265 Missing :warning:
module/os/linux/zfs/abd.c 88.30% 20 Missing :warning:
module/zfs/dbuf.c 75.71% 17 Missing :warning:
lib/libzpool/kernel.c 0.00% 5 Missing :warning:
include/sys/abd.h 50.00% 2 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10018       +/-   ##
===========================================
- Coverage   75.17%   61.94%   -13.24%     
===========================================
  Files         402      260      -142     
  Lines      128071    73582    -54489     
===========================================
- Hits        96283    45578    -50705     
+ Misses      31788    28004     -3784     
Flag Coverage Δ
kernel 51.01% <43.78%> (-27.75%) :arrow_down:
user 59.10% <59.33%> (+11.67%) :arrow_up:

Flags with carried forward coverage won't be shown.

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Feb 25 '20 00:02 codecov[bot]

Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out.

With regards to zio_dva_throttle() and performance, I'd like to point to an older PR here: https://github.com/openzfs/zfs/pull/7560 - so it looks like skipping it might have some justification. Ideally, IMHO, it'd be best to leave it up to the user (i.e. configurable).

I'm excited about this PR, it looks to be a solid basis for support of .splice_read()/.splice_write() in order to support IO to/from pipes. I was looking at it this week because of https://github.com/vpsfreecz/linux/commit/1a980b8cbf0059a5308eea61522f232fd03002e2 - with OverlayFS on top of ZFS, this patch makes all apps using sendfile(2) go tra-la. Issue about that one: https://github.com/openzfs/zfs/issues/1156

snajpa avatar Jun 20 '20 18:06 snajpa

page.h is a really generic name. Could you please call it zfs_page.h?

mattmacy avatar Nov 27 '20 23:11 mattmacy

Does zfs 2.0 have support for Direct I/O?

denogio avatar Dec 01 '20 12:12 denogio

From last week's OpenZFS Leadership Meeting notes:

Status of specific PRs: DirectIO - Brian Atkinson is still working on it. We expect it to be updated soon, at which point we'll need reviewers.

FWIW I'm happy to review, to the extent of my ability.

adamdmoss avatar Feb 08 '21 19:02 adamdmoss

(Summary of a private Slack discussion)

zfs_log_write currently re-reads the O_DIRECTly written block back from disk if the WR_* selection logic decides it's going to be a WR_COPIED record. The performance impact is quite significant:

for i in 1; do fio --rw=randwrite --bs=4k --filename_format '/dut/ds$jobnum/benchmark' --name=foo --time_based --size=4G --group_reporting=1 --sync=1 --runtime=30s  --numjobs=$i --direct=1; done

...

./funclatency_specialized -T -i 3 zfs_write,zfs_log_write,dmu_read_by_dnode  zfs_write,dmu_write_uio_dbuf -S 
STACKFUNC             PROBE       AVG COUNT        SUM
      0,0         zfs_write 141386.90  7265 1027175834
      0,1     zfs_log_write  87372.27  7265  634759549
      0,2 dmu_read_by_dnode  85745.36  7265  622940020
STACKFUNC              PROBE       AVG COUNT        SUM
      1,0          zfs_write 142031.61  7265 1031859636
      1,1 dmu_write_uio_dbuf  41214.84  7265  299425788

On this pool of Micron_7300_MTFDHBA960TDF drives, the data write itself (dmu_write_uio_dbuf) takes ~40us while zfs_log_write takes ~87us.
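
(As an aside, a rough sketch of how similar per-function latency numbers could be gathered with stock bpftrace instead of the custom funclatency_specialized script, assuming kprobes can attach to the loaded zfs module's symbols:)

# print a latency histogram (in microseconds) for zfs_log_write
bpftrace -e '
kprobe:zfs_log_write { @start[tid] = nsecs; }
kretprobe:zfs_log_write /@start[tid]/ {
    @zfs_log_write_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'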

Proposal: If a write was written directly we should always log this write as a WR_INDIRECT record with lr_bp= the block pointer produced by dmu_write_direct().

Implementation v1:

  • Add an additional argument to zfs_log_write that forces the selection algorithm to use WR_INDIRECT.
  • Set this flag iff O_DIRECT
  • => the dmu_sync() call will re-use the already written block pointer because it's in the dr
    • (This is @problame 's interpretation of the chat log)

Implementation v2 (long-term):

  • Break up zfs_log_write so ITXs can be allocated and filled in the zfs_write() copy loop.
  • Add a facility to bubble up the block pointer that is produced by dmu_write_direct up to the zfs_write copy loop.
  • => for each bubbled-up block pointer allocate a WR_INDIRECT ITX
  • After the copy loop, at the location where we currently call zfs_log_write, assign all the ITXs we just created.

I have a PoC for the zfs_log_write breakup ready. It was developed to avoid the dmu_read() overhead for WR_COPIED records but can be generalized to this use case as well. I won't have time to iterate on it for 2 weeks though, so @bwatkinson is likely going to implement v1, merge this PR and move the proposal for v2 into a separate issue.

problame avatar Mar 01 '21 20:03 problame

zfs_log_write implementation v1:

This has been implemented in the updated PR. For O_DIRECT writes WR_INDIRECT log records are now always used and the block pointer is stored in the log record without any re-read.

behlendorf avatar Mar 03 '21 16:03 behlendorf

It would be nice to funnel the ioflag through to dmu_sync so that it can assert that IMPLY(O_DIRECT was set, got the bp from the dr).

I looked in to doing exactly this, since it would be nice, but in practice it ended up being a pretty invasive change which didn't seem worthwhile in the end.

behlendorf avatar Mar 15 '21 17:03 behlendorf

FWIW I gave this PR a spin (well, full disclosure: Brian's direct_page_aligned.wip.3) for a day or two and it appeared to work okay. No explosions. Mostly non-direct-IO workloads, though I did some noodling with dd iflag=direct/oflag=direct and some C O_DIRECT hacks.

adamdmoss avatar Mar 29 '21 19:03 adamdmoss

@bwatkinson - what is the current state of this PR? Would you consider it safe for testing on semi-production systems (real data, but can be replaced)?

sempervictus avatar Nov 11 '21 00:11 sempervictus

@bwatkinson - what is the current state of this PR? Would you consider it safe for testing on semi-production systems (real data, but can be replaced)?

So the PR is still in a WIP state. It is up to date with master as of Friday (Nov. 5th). I think it is safe to experiment with, but really only for experimentation at this point. There are a few known bugs that we are sorting through at the moment.

bwatkinson avatar Nov 11 '21 01:11 bwatkinson

hi All,

Is there any update on this PR? Is there any estimation on its merge date?

tomposmiko avatar Mar 02 '22 21:03 tomposmiko

I decided to give this branch a test since I have a bunch of NVME pools and would love some performance increase.

I ran a read test with my 10 disk nvme pool and noticed a pretty dramatic difference with direct=disabled vs always.

I have 3 machines with this spec that I won't be using until some networking is in place and am available to run tests.

Enabled:

[root@ac-1f-6b-a5-ab-ea bar]# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          BTLN902103WV3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme1n1          BTLN902005083P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme2n1          BTLN9050021N3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme3n1          BTLN907504P03P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme4n1          BTLN902101103P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme5n1          BTLN905001BD3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme6n1          BTLN902004DJ3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme7n1          BTLN907504N03P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme8n1          BTLN9050027H3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
/dev/nvme9n1          BTLN85110HDJ3P2BGN   INTEL SSDPE2KE032T8                      1           3.20  TB /   3.20  TB    512   B +  0 B   VDV10170
[root@ac-1f-6b-a5-ab-ea bar]# zpool status
  pool: test
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
          nvme5n1   ONLINE       0     0     0
          nvme6n1   ONLINE       0     0     0
          nvme8n1   ONLINE       0     0     0
          nvme9n1   ONLINE       0     0     0

errors: No known data errors

[root@ac-1f-6b-a5-ab-ea bar]# zfs get direct test/bar
NAME      PROPERTY  VALUE     SOURCE
test/bar  direct    always    local
[root@ac-1f-6b-a5-ab-ea bar]# zpool --version
zfs-2.1.99-1310_g7ac3b7ae9
zfs-kmod-2.1.99-1310_g7ac3b7ae9

[root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting
benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32
...
fio-3.19
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=140MiB/s][r=35.9k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=1289094: Fri Jul 22 18:24:47 2022
  read: IOPS=50.5k, BW=197MiB/s (207MB/s)(5914MiB/30001msec)
    clat (nsec): min=1772, max=4948.3k, avg=78834.03, stdev=112800.83
     lat (nsec): min=1810, max=4948.4k, avg=78884.69, stdev=112806.53
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[   76], 50.00th=[   86], 60.00th=[   92],
     | 70.00th=[  100], 80.00th=[  111], 90.00th=[  130], 95.00th=[  223],
     | 99.00th=[  359], 99.50th=[  424], 99.90th=[ 1037], 99.95th=[ 2835],
     | 99.99th=[ 3425]
   bw (  KiB/s): min=107584, max=666672, per=100.00%, avg=202896.98, stdev=23198.14, samples=236
   iops        : min=26896, max=166668, avg=50724.19, stdev=5799.55, samples=236
  lat (usec)   : 2=0.40%, 4=33.20%, 10=3.36%, 20=0.11%, 50=0.01%
  lat (usec)   : 100=32.59%, 250=25.74%, 500=4.22%, 750=0.16%, 1000=0.10%
  lat (msec)   : 2=0.06%, 4=0.06%, 10=0.01%
  cpu          : usr=0.98%, sys=13.18%, ctx=952786, majf=0, minf=78
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1513906,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=197MiB/s (207MB/s), 197MiB/s-197MiB/s (207MB/s-207MB/s), io=5914MiB (6201MB), run=30001-30001msec

Disabled:

[root@ac-1f-6b-a5-ab-ea bar]# zfs set direct=disabled test/bar

[root@ac-1f-6b-a5-ab-ea bar]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done

[root@ac-1f-6b-a5-ab-ea bar]# rm -rf test

[root@ac-1f-6b-a5-ab-ea bar]# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=./test --ioengine=sync --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=4k --group_reporting
benchmark: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=32
...
fio-3.19
Starting 4 processes
benchmark: Laying out IO file (1 file / 2048MiB)
Jobs: 4 (f=4): [R(4)][100.0%][r=2095MiB/s][r=536k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=1399856: Fri Jul 22 18:25:43 2022
  read: IOPS=584k, BW=2283MiB/s (2394MB/s)(66.9GiB/30001msec)
    clat (nsec): min=1195, max=145299, avg=6568.14, stdev=5384.80
     lat (nsec): min=1231, max=145336, avg=6603.92, stdev=5385.07
    clat percentiles (nsec):
     |  1.00th=[ 1512],  5.00th=[ 2576], 10.00th=[ 3152], 20.00th=[ 3920],
     | 30.00th=[ 4704], 40.00th=[ 5408], 50.00th=[ 5984], 60.00th=[ 6496],
     | 70.00th=[ 7008], 80.00th=[ 7584], 90.00th=[ 8384], 95.00th=[ 9280],
     | 99.00th=[37120], 99.50th=[42240], 99.90th=[52480], 99.95th=[55552],
     | 99.99th=[60160]
   bw (  MiB/s): min= 2084, max= 3905, per=100.00%, avg=2289.42, stdev=122.06, samples=236
   iops        : min=533706, max=999856, avg=586091.93, stdev=31247.29, samples=236
  lat (usec)   : 2=2.16%, 4=18.91%, 10=75.22%, 20=0.93%, 50=2.61%
  lat (usec)   : 100=0.16%, 250=0.01%
  cpu          : usr=5.64%, sys=86.69%, ctx=348979, majf=0, minf=51
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=17534987,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2283MiB/s (2394MB/s), 2283MiB/s-2283MiB/s (2394MB/s-2394MB/s), io=66.9GiB (71.8GB), run=30001-30001msec

Another test: Enabled:

[root@ac-1f-6b-a5-ab-ea foo]# zfs get recordsize test/foo
NAME      PROPERTY    VALUE    SOURCE
test/foo  recordsize  128K     default
[root@ac-1f-6b-a5-ab-ea foo]# pwd
/test/foo
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][4.7%][w=5150MiB/s][w=41.2k IOPS][eta 04m:47s]
fio: terminating on signal 2

benchmark: (groupid=0, jobs=24): err= 0: pid=1409094: Tue Jul 26 11:41:53 2022
  write: IOPS=42.8k, BW=5345MiB/s (5605MB/s)(68.5GiB/13118msec); 0 zone resets
    slat (usec): min=245, max=3462, avg=558.80, stdev=130.00
    clat (usec): min=2, max=101349, avg=71107.74, stdev=11837.83
     lat (usec): min=416, max=102170, avg=71666.77, stdev=11911.16
    clat percentiles (msec):
     |  1.00th=[   56],  5.00th=[   61], 10.00th=[   62], 20.00th=[   63],
     | 30.00th=[   64], 40.00th=[   64], 50.00th=[   65], 60.00th=[   69],
     | 70.00th=[   75], 80.00th=[   83], 90.00th=[   92], 95.00th=[   95],
     | 99.00th=[   97], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
     | 99.99th=[  101]
   bw (  MiB/s): min= 4247, max= 6095, per=99.60%, avg=5323.87, stdev=23.61, samples=624
   iops        : min=33980, max=48760, avg=42590.50, stdev=188.86, samples=624
  lat (usec)   : 4=0.01%, 10=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.05%, 20=0.08%, 50=0.24%
  lat (msec)   : 100=99.58%, 250=0.02%
  cpu          : usr=1.83%, sys=73.94%, ctx=566417, majf=0, minf=16943
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,560978,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=5345MiB/s (5605MB/s), 5345MiB/s-5345MiB/s (5605MB/s-5605MB/s), io=68.5GiB (73.5GB), run=13118-13118msec

Disabled:

[root@ac-1f-6b-a5-ab-ea foo]# zfs set direct=disabled test/foo
[root@ac-1f-6b-a5-ab-ea foo]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done
[root@ac-1f-6b-a5-ab-ea foo]# rm -rf boot
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][3.3%][w=17.7GiB/s][w=145k IOPS][eta 04m:51s]
fio: terminating on signal 2

benchmark: (groupid=0, jobs=24): err= 0: pid=1479104: Tue Jul 26 11:42:16 2022
write: IOPS=141k, BW=17.2GiB/s (18.5GB/s)(161GiB/9361msec); 0 zone resets
  slat (usec): min=36, max=67814, avg=167.46, stdev=682.26
  clat (usec): min=2, max=124187, avg=21584.85, stdev=9289.76
   lat (usec): min=58, max=124312, avg=21752.62, stdev=9330.98
  clat percentiles (usec):
   |  1.00th=[10028],  5.00th=[11863], 10.00th=[13042], 20.00th=[14877],
   | 30.00th=[16450], 40.00th=[17957], 50.00th=[19268], 60.00th=[20841],
   | 70.00th=[22938], 80.00th=[26346], 90.00th=[32900], 95.00th=[39584],
   | 99.00th=[56361], 99.50th=[64750], 99.90th=[81265], 99.95th=[90702],
   | 99.99th=[98042]
 bw (  MiB/s): min=14225, max=20963, per=99.79%, avg=17598.44, stdev=74.08, samples=432
 iops        : min=113793, max=167698, avg=140778.78, stdev=592.60, samples=432
lat (usec)   : 4=0.01%, 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec)   : 750=0.01%, 1000=0.01%
lat (msec)   : 2=0.01%, 4=0.03%, 10=0.87%, 20=53.54%, 50=43.80%
lat (msec)   : 100=1.74%, 250=0.01%
cpu          : usr=9.64%, sys=65.26%, ctx=68796, majf=0, minf=25694
IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
   submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
   issued rwts: total=0,1320646,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
WRITE: bw=17.2GiB/s (18.5GB/s), 17.2GiB/s-17.2GiB/s (18.5GB/s-18.5GB/s), io=161GiB (173GB), run=9361-9361msec

Smithx10 avatar Jul 22 '22 22:07 Smithx10

I also noticed that you can't set direct on a zvol.

"cannot set property for 'test/foo': 'direct' does not apply to datasets of this type". Do zvols suffer from the double mem copy?

Smithx10 avatar Jul 23 '22 01:07 Smithx10

I decided to give this branch a test since I have a bunch of NVME pools and would love some performance increase.

I ran a read test with my 10 disk nvme pool and noticed a pretty dramatic difference with direct=disabled vs always.

I have 3 machines with this spec that I won't be using until some networking is in place and am available to run tests.


@Smithx10 so I noticed a couple of things with your results. You are using libaio as the fio ioengine for the 128k tests; however, the asynchronous engines do not behave the way you might expect with ZFS. There are currently no asynchronous hooks in ZFS, so all of those IO calls get sent down the synchronous paths. The numjobs parameter will still create identical IO workloads, but iodepth will not behave the way it normally would with other filesystems. In fact, there is a PR to make the asynchronous IO APIs (which the async ioengines use) work with ZFS: https://github.com/openzfs/zfs/pull/12166. Also, I am not so certain that iodepth does anything when ioengine=sync, as in the 4k tests. I believe the fio man page might mention this, and when I have looked at the FIO code in the past it seemed to have no effect in that case. I could be wrong though, so feel free to double-check me or tell me I am wrong on that one.

Also, I noticed your size was 2G, correct? There can be no expectation that O_DIRECT can surpass ARC speed if the entire working set is living in the ARC; at that point you are just getting all the memory bandwidth the ARC can give. Also, if you are only writing 2GB of data, I would have to imagine you are not reading from all of those NVMe's in the zpool. You can use zpool iostat -vq 1 to confirm or deny this; it may be the case that you are not reading from all devices. These were just some quick observations.

I think if you look at the slides from the OpenZFS Developer Summit (https://docs.google.com/presentation/d/1f9bE1S6KqwHWVJtsOOfCu_cVKAFQO94h/edit#slide=id.p1), I was sequentially writing and reading 2TB of data to/from the zpools. The argument has never been to not use the ARC with NVMe devices. In order to get good throughput performance with O_DIRECT, you really need to step outside the bounds of the ARC, or be reading data sets that are not cached in the ARC and are spread across devices (i.e. larger than 2GB). There are certain situations where Direct IO is a valid solution, and others where it isn't. It is not as simple as asking for it and everything improves.

I was also curious what your recordsize was set to for the 4k tests? I imagine you were trying to measure IOPS there. If it was set to the default 128k, the IOPS results are not too surprising to me. Just like normal ARC reads, a 4k request will not fetch only that amount if it has to go down to disk and the recordsize is greater than 4k. The reason for this is data validation with checksums: there is no way to validate the data unless you fetch the whole block. So every time a read IO with O_DIRECT is issued, it will read back the entire block (AKA the recordsize) to validate the data before returning it to the user.

My response is also based on the assumption that your ARC size was >= 2GB. If that assumption is wrong, then there is egg on my face. Please let me know if this is not the case.
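
As a sketch (the dataset name and mountpoint below are just illustrative, and recordsize has to be set before the benchmark file is laid out), matching the recordsize to the IO size would keep each 4k O_DIRECT read confined to a single block:

# illustrative only: create a dataset whose block size matches the 4k IO size
zfs create -o recordsize=4k -o direct=always test/bar4k
fio --time_based --name=benchmark --size=2G --runtime=30 --filename=/test/bar4k/bench \
    --ioengine=sync --direct=1 --numjobs=4 --rw=read --blocksize=4k --group_reporting

Whether that trade-off (smaller blocks, more metadata and indirection) is worth it depends entirely on the workload.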

bwatkinson avatar Aug 09 '22 21:08 bwatkinson

I also noticed that you can't set direct on a zvol.

"cannot set property for 'test/foo': 'direct' does not apply to datasets of this type". Do zvols suffer from the double mem copy?

@Smithx10 that is correct. @behlendorf and I were having some issues with hooking in O_DIRECT with zvols. However, if my memory serves me correctly, that was possibly due to placing the pages in writeback which we no longer do. That needs to be revisited. However, I think that is additional work outside of this PR. If this PR is merged, then using the hooks in the zvol code should in theory be fine. It would just require some more investigation as there might still be issues there.

bwatkinson avatar Aug 09 '22 21:08 bwatkinson

Thank You So much for this explanation. I learned a bunch about what is happening here. I really appreciate you taking the time to help educate me. I will go experiment with this new knowledge. Thanks again!

Smithx10 avatar Aug 10 '22 02:08 Smithx10

     | 30.00th=[ 4704], 40.00th=[ 5408], 50.00th=[ 5984], 60.00th=[ 6496],
     | 70.00th=[ 7008], 80.00th=[ 7584], 90.00th=[ 8384], 95.00th=[ 9280],
     | 99.00th=[37120], 99.50th=[42240], 99.90th=[52480], 99.95th=[55552],
     | 99.99th=[60160]
   bw (  MiB/s): min= 2084, max= 3905, per=100.00%, avg=2289.42, stdev=122.06, samples=236
   iops        : min=533706, max=999856, avg=586091.93, stdev=31247.29, samples=236
  lat (usec)   : 2=2.16%, 4=18.91%, 10=75.22%, 20=0.93%, 50=2.61%
  lat (usec)   : 100=0.16%, 250=0.01%
  cpu          : usr=5.64%, sys=86.69%, ctx=348979, majf=0, minf=51
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=17534987,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2283MiB/s (2394MB/s), 2283MiB/s-2283MiB/s (2394MB/s-2394MB/s), io=66.9GiB (71.8GB), run=30001-30001msec

Another test. Enabled:

[root@ac-1f-6b-a5-ab-ea foo]# zfs get recordsize test/foo
NAME      PROPERTY    VALUE    SOURCE
test/foo  recordsize  128K     default
[root@ac-1f-6b-a5-ab-ea foo]# pwd
/test/foo
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][4.7%][w=5150MiB/s][w=41.2k IOPS][eta 04m:47s]
fio: terminating on signal 2

benchmark: (groupid=0, jobs=24): err= 0: pid=1409094: Tue Jul 26 11:41:53 2022
  write: IOPS=42.8k, BW=5345MiB/s (5605MB/s)(68.5GiB/13118msec); 0 zone resets
    slat (usec): min=245, max=3462, avg=558.80, stdev=130.00
    clat (usec): min=2, max=101349, avg=71107.74, stdev=11837.83
     lat (usec): min=416, max=102170, avg=71666.77, stdev=11911.16
    clat percentiles (msec):
     |  1.00th=[   56],  5.00th=[   61], 10.00th=[   62], 20.00th=[   63],
     | 30.00th=[   64], 40.00th=[   64], 50.00th=[   65], 60.00th=[   69],
     | 70.00th=[   75], 80.00th=[   83], 90.00th=[   92], 95.00th=[   95],
     | 99.00th=[   97], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
     | 99.99th=[  101]
   bw (  MiB/s): min= 4247, max= 6095, per=99.60%, avg=5323.87, stdev=23.61, samples=624
   iops        : min=33980, max=48760, avg=42590.50, stdev=188.86, samples=624
  lat (usec)   : 4=0.01%, 10=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.05%, 20=0.08%, 50=0.24%
  lat (msec)   : 100=99.58%, 250=0.02%
  cpu          : usr=1.83%, sys=73.94%, ctx=566417, majf=0, minf=16943
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,560978,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=5345MiB/s (5605MB/s), 5345MiB/s-5345MiB/s (5605MB/s-5605MB/s), io=68.5GiB (73.5GB), run=13118-13118msec

Disabled:

[root@ac-1f-6b-a5-ab-ea foo]# zfs set direct=disabled test/foo
[root@ac-1f-6b-a5-ab-ea foo]# for i in {1..3}; do sync; echo $i > /proc/sys/vm/drop_caches; done
[root@ac-1f-6b-a5-ab-ea foo]# rm -rf boot
[root@ac-1f-6b-a5-ab-ea foo]# fio --time_based --name=benchmark --size=2G --runtime=300 --filename=./boot --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=24 --rw=randwrite --blocksize=128k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=128
...
fio-3.19
Starting 24 processes
benchmark: Laying out IO file (1 file / 2048MiB)
^Cbs: 24 (f=24): [w(24)][3.3%][w=17.7GiB/s][w=145k IOPS][eta 04m:51s]
fio: terminating on signal 2

benchmark: (groupid=0, jobs=24): err= 0: pid=1479104: Tue Jul 26 11:42:16 2022
write: IOPS=141k, BW=17.2GiB/s (18.5GB/s)(161GiB/9361msec); 0 zone resets
  slat (usec): min=36, max=67814, avg=167.46, stdev=682.26
  clat (usec): min=2, max=124187, avg=21584.85, stdev=9289.76
   lat (usec): min=58, max=124312, avg=21752.62, stdev=9330.98
  clat percentiles (usec):
   |  1.00th=[10028],  5.00th=[11863], 10.00th=[13042], 20.00th=[14877],
   | 30.00th=[16450], 40.00th=[17957], 50.00th=[19268], 60.00th=[20841],
   | 70.00th=[22938], 80.00th=[26346], 90.00th=[32900], 95.00th=[39584],
   | 99.00th=[56361], 99.50th=[64750], 99.90th=[81265], 99.95th=[90702],
   | 99.99th=[98042]
 bw (  MiB/s): min=14225, max=20963, per=99.79%, avg=17598.44, stdev=74.08, samples=432
 iops        : min=113793, max=167698, avg=140778.78, stdev=592.60, samples=432
lat (usec)   : 4=0.01%, 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec)   : 750=0.01%, 1000=0.01%
lat (msec)   : 2=0.01%, 4=0.03%, 10=0.87%, 20=53.54%, 50=43.80%
lat (msec)   : 100=1.74%, 250=0.01%
cpu          : usr=9.64%, sys=65.26%, ctx=68796, majf=0, minf=25694
IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
   submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
   issued rwts: total=0,1320646,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
WRITE: bw=17.2GiB/s (18.5GB/s), 17.2GiB/s-17.2GiB/s (18.5GB/s-18.5GB/s), io=161GiB (173GB), run=9361-9361msec

@Smithx10 so I noticed a couple of things with your results.

You are using libaio as the ioengine with fio for the 128k tests; however, the asynchronous engines do not behave the way you might expect with ZFS. There are no asynchronous hooks in ZFS currently, and all of those IO calls get sent down the synchronous paths. The numjobs parameter will create the identical IO workloads, but iodepth will not behave as it normally would with other FS's. In fact, there is a PR to make the asynchronous IO APIs (which the async ioengines use) work with ZFS: #12166. Also, I am not so certain that iodepth does anything when ioengine=sync with the 4k tests. I believe the fio man page mentions this; when I have looked at the fio code in the past, it seemed to have no effect in that case. Could be wrong though, so feel free to double check me or tell me I am wrong on that one.

Also, I noticed your size was 2G, correct? There can be no expectation that O_DIRECT can surpass ARC speed if the entire working set is living in the ARC. You are just getting all the memory bandwidth the ARC can give at that point. And if you are only writing 2GB of data, I would have to imagine you are not reading from all of those NVMe's in the Zpool. You can use zpool iostat -vq 1 to confirm or deny this; it may be the case that you are not reading from all devices.

These were just some quick observations. If you look at the slides from the OpenZFS Developer Summit (https://docs.google.com/presentation/d/1f9bE1S6KqwHWVJtsOOfCu_cVKAFQO94h/edit#slide=id.p1), I was sequentially writing and reading 2TB of data to/from the Zpools. The argument has never been to not use the ARC with NVMe devices. In order to get good throughput performance with O_DIRECT you really need to step outside the bounds of the ARC, or be reading data sets that are not cached in the ARC and are spread across devices (larger than 2GB). There are certain situations where direct IO is a valid solution, and others where it isn't. It is not as simple as asking for it and everything improves.

I was also curious what your recordsize was set to for the 4k tests? I imagine you were trying to measure IOPS there? If it was set to the default 128k, the IOPS results are not too surprising to me. Just like normal ARC reads, a 4k request will not fetch only that amount if it has to go down to disk and the recordsize is > 4k. The reason for this is checksum validation: there is no way to validate the data unless the whole block is fetched. So every time a read IO with O_DIRECT is issued, it will read back the entire block (i.e. the recordsize) to validate the data before returning it to the user.

My response is also based on the idea your ARC size was >= 2GB. If that assumption is wrong, then there is egg on my face. Please let me know if this is not the case.
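
To confirm whether a run is actually hitting all of the vdevs rather than the ARC, something along these lines can be used while the test is running; the pool/dataset names and the 16G working set are only example values, chosen to be larger than a ~2GB ARC:

# per-vdev bandwidth and queue depths, refreshed every second
zpool iostat -vq test 1

# working set larger than the ARC so reads must come from the devices
fio --name=bigread --directory=/test/bar --size=16G --ioengine=psync \
    --direct=1 --rw=read --blocksize=128k --numjobs=8 --group_reporting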

Thank you so much for this explanation. I learned a bunch about what is happening here. I really appreciate you taking the time to help educate me. I will go experiment with this new knowledge. Thanks again!

No problem, and thank you for taking this PR for a test drive. Let me know if you run into any other performance concerns. I would be happy to help resolve any issues you uncover with this work.

Also, there is one caveat to the data that I shared at the OpenZFS Developer Summit for the O_DIRECT results. Previously compression defaulted to off; however, compression is now set to on by default. I believe that FIO, by default, uses random data for the buffer it writes, so I don't think this contributed to your read results. If the data being read with O_DIRECT was compressed on disk, though, the buffer will be decompressed before being returned to user space. ZFS will only compress a buffer that is at a minimum 1/8 compressible, so more than likely your reads were not decompressing any data buffers. You can always double check this with zfs get compressratio on the dataset. The results in my slides were using random data, so the compression ratio would still have been 1.0x more than likely; however, those runs also were not going through the compression code for writes. Just something to keep in mind when measuring O_DIRECT performance.
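
Checking whether decompression is in play is cheap; a quick sketch using the dataset from the earlier report:

# a compressratio of 1.00x means reads are not paying any decompression cost
zfs get compression,compressratio test/bar

# for a strict apples-to-apples comparison with the older (compression=off) numbers,
# compression can be disabled before laying the test file out again
zfs set compression=off test/bar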

bwatkinson avatar Aug 11 '22 03:08 bwatkinson

Away from my workstation with the big screen, excuse the e-mail reply; assume all notes apply to all instances in the diff.

+In the event the checksum is not valid then the I/O operation will return
+EINVAL and the write will not be committed.

... will return
.Er EINVAL
and the write ...

+Controls the behavior of direct requests (e.g.
+.Sx O_DIRECT Ns
+). The

Wrong ‒ e.g. ends the sentence, Ns-at-EOL is awful (and almost always an error), O_DIRECT is very much not Sx (well, unless there's an O_DIRECT section but I don't think there is?), and you called it "Direct I/O" above:

... of Direct I/O requests
.Pq e.g. Dv O_DIRECT .
The

Best,

nabijaczleweli avatar Aug 11 '22 21:08 nabijaczleweli


I went ahead and updated the man pages.

bwatkinson avatar Aug 12 '22 20:08 bwatkinson

hi All,

Is there any update on this PR? Is there any estimation on its merge date?

Sorry for such a late reply on this. My hope is that it will be merged soon. Reviews of the PR are under way. When you originally asked about this, I was still working on figuring out if we could do write-protected user pages in Linux, but that does not seem possible.

bwatkinson avatar Aug 12 '22 20:08 bwatkinson

I do observe the failure from vm_fault_quick_hold_pages() on FreeBSD 13 but not 14. pmap_extract_and_hold() is failing and then vm_fault() fails with ENOENT. @markjdb can you think of what would be causing different behavior between FreeBSD 13 and 14 there?

ghost avatar Aug 15 '22 14:08 ghost

Good news: This PR improved write performance by ~60% on one of our test systems :+1: :smile:

Bad news: When I tried to do direct IO reads with a block size greater than 16MB, I got kernel assertions:

$ for i in {1..64} ; do dd if=/tank1/test$i of=/dev/null bs=32M iflag=direct & true ; done
...
Message from syslogd@localhost at Oct 12 17:27:48 ...
 kernel: PANIC at abd_os.c:858:abd_alloc_from_pages()

Message from syslogd@localhost at Oct 12 17:27:48 ...
 kernel: VERIFY3(size <= SPA_MAXBLOCKSIZE) failed (33554432 <= 16777216)

Message from syslogd@localhost at Oct 12 17:27:48 ...
 kernel: PANIC at abd_os.c:858:abd_alloc_from_pages()
Oct 12 17:27:48 localhost kernel: Call Trace:
Oct 12 17:27:48 localhost kernel:  dump_stack+0x41/0x60
Oct 12 17:27:48 localhost kernel:  spl_panic+0xd0/0xf3 [spl]
Oct 12 17:27:48 localhost kernel:  ? __get_user_pages+0x1fb/0x800
Oct 12 17:27:48 localhost kernel:  ? gup_pgd_range+0x2fd/0xc60
Oct 12 17:27:48 localhost kernel:  ? get_user_pages_unlocked+0xd5/0x2a0
Oct 12 17:27:48 localhost kernel:  ? get_user_pages_unlocked+0x1f5/0x2a0
Oct 12 17:27:48 localhost kernel:  abd_alloc_from_pages+0x196/0x1a0 [zfs]
Oct 12 17:27:48 localhost kernel:  ? spl_kmem_alloc+0x11e/0x140 [spl]
Oct 12 17:27:48 localhost kernel:  dmu_read_uio_direct+0x3e/0x90 [zfs]
Oct 12 17:27:48 localhost kernel:  dmu_read_uio_dnode+0xfa/0x110 [zfs]
Oct 12 17:27:48 localhost kernel:  ? zfs_rangelock_enter_impl+0x25b/0x560 [zfs]
Oct 12 17:27:48 localhost kernel:  ? xas_load+0x8/0x80
Oct 12 17:27:48 localhost kernel:  ? xas_find+0x173/0x1b0
Oct 12 17:27:48 localhost kernel:  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
Oct 12 17:27:48 localhost kernel:  zfs_read+0x143/0x3d0 [zfs]
Oct 12 17:27:48 localhost kernel:  zpl_iter_read_direct+0x182/0x220 [zfs]
Oct 12 17:27:48 localhost kernel:  ? _cond_resched+0x15/0x30
Oct 12 17:27:48 localhost kernel:  ? mutex_lock+0x21/0x40
Oct 12 17:27:48 localhost kernel:  ? rrw_exit+0x65/0x150 [zfs]
Oct 12 17:27:48 localhost kernel:  zpl_iter_read+0xae/0xe0 [zfs]
Oct 12 17:27:48 localhost kernel:  new_sync_read+0x10f/0x150
Oct 12 17:27:48 localhost kernel:  vfs_read+0xa3/0x160
Oct 12 17:27:48 localhost kernel:  ksys_read+0x4f/0xb0
Oct 12 17:27:48 localhost kernel:  do_syscall_64+0x5b/0x1a0
Oct 12 17:27:48 localhost kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
Oct 12 17:27:48 localhost kernel: RIP: 0033:0x155555081505

tonyhutter avatar Oct 13 '22 00:10 tonyhutter


@tonyhutter good catch! I went ahead and updated the ABD size ASSERT checks to account for DMU_MAX_ACCESS in the event the ABD flag ABD_FLAG_FROM_PAGES is set. This should resolve the issue you observed (at least it did when I tested it out).
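
For anyone who wants to re-check this on the updated branch, the original reproducer (reusing the /tank1 paths from the report above) can simply be re-run; it previously tripped the VERIFY3(size <= SPA_MAXBLOCKSIZE) in abd_alloc_from_pages():

for i in {1..64}; do dd if=/tank1/test$i of=/dev/null bs=32M iflag=direct & done; wait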

bwatkinson avatar Oct 13 '22 19:10 bwatkinson

Hello, today I tried to test this PR under the following system: 2x E5-2670 v2 (2.5GHz, 10C), 226GB DDR3

root@iser-nvme:/home/vlosev# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7684C433E7     KINGSTON SKC2500M81000G                  1         735.87  GB /   1.00  TB      4 KiB +  0 B   S7780101
/dev/nvme1n1     50026B7684C43738     KINGSTON SKC2500M81000G                  1         679.65  GB /   1.00  TB      4 KiB +  0 B   S7780101
/dev/nvme2n1     S48ENC0N701594L      Samsung SSD 983 DCT M.2 960GB            1         573.66  GB / 960.20  GB      4 KiB +  0 B   EDA7602Q
/dev/nvme3n1     S48ENC0N700767V      Samsung SSD 983 DCT M.2 960GB            1         625.36  GB / 960.20  GB      4 KiB +  0 B   EDA7602Q
/dev/nvme4n1     S4EMNX0R627163       SAMSUNG MZVLB1T0HBLR-000L7               1         702.80  GB /   1.02  TB    512   B +  0 B   5M2QEXF7
/dev/nvme5n1     S64FNE0RB06622       SAMSUNG MZQL2960HCJR-00A07               1         639.14  GB / 960.20  GB      4 KiB +  0 B   GDC5502Q
root@iser-nvme:/home/vlosev# zfs version
zfs-2.1.99-1446_gf9bb9f26c
zfs-kmod-2.1.99-1446_gf9bb9f26c

I have created the zpool with the following settings:

root@iser-nvme:/home/vlosev# zpool create -O direct=always -o ashift=12 -O atime=off -O recordsize=8k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 -f

But when I try to benchmark the read/write performance with the following fio options:

fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting

I got a very strange result:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting

Write:

WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
WithDirect: Laying out IO file (1 file / 409600MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=278MiB/s][w=35.6k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=1087394: Thu Oct 13 19:51:45 2022
  write: IOPS=34.2k, BW=267MiB/s (280MB/s)(1600GiB/6134181msec); 0 zone resets
    slat (usec): min=77, max=120593, avg=112.50, stdev=424.76
    clat (usec): min=4, max=240172, avg=3628.84, stdev=2601.48
     lat (usec): min=92, max=249482, avg=3741.99, stdev=2651.57
    clat percentiles (usec):
     |  1.00th=[ 2966],  5.00th=[ 3032], 10.00th=[ 3097], 20.00th=[ 3163],
     | 30.00th=[ 3195], 40.00th=[ 3228], 50.00th=[ 3261], 60.00th=[ 3326],
     | 70.00th=[ 3392], 80.00th=[ 3490], 90.00th=[ 3687], 95.00th=[ 4293],
     | 99.00th=[11994], 99.50th=[16909], 99.90th=[31851], 99.95th=[57410],
     | 99.99th=[98042]
   bw (  KiB/s): min=48967, max=334032, per=99.99%, avg=273463.15, stdev=12311.44, samples=49072
   iops        : min= 6120, max=41754, avg=34182.76, stdev=1538.93, samples=49072
  lat (usec)   : 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=92.79%, 10=5.72%, 20=1.13%, 50=0.30%
  lat (msec)   : 100=0.05%, 250=0.01%
  cpu          : usr=4.80%, sys=65.10%, ctx=256489223, majf=0, minf=63432
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=1600GiB (1718GB), run=6134181-6134181msec

Read:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][0.3%][r=251MiB/s][r=32.1k IOPS][eta 01h:54m:26s]

and if I disable direct by:

root@iser-nvme:/home/vlosev/zfs# zfs set direct=disabled nvme
root@iser-nvme:/home/vlosev/zfs# zfs get direct nvme
NAME  PROPERTY  VALUE     SOURCE
nvme  direct    disabled  local

Write:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=944MiB/s][w=121k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=4050640: Thu Oct 13 21:20:13 2022
  write: IOPS=123k, BW=963MiB/s (1009MB/s)(1600GiB/1701924msec); 0 zone resets
    slat (usec): min=7, max=26674, avg=29.66, stdev=76.38
    clat (usec): min=2, max=27658, avg=1007.56, stdev=443.89
     lat (usec): min=17, max=27690, avg=1037.56, stdev=451.44
    clat percentiles (usec):
     |  1.00th=[  469],  5.00th=[  676], 10.00th=[  824], 20.00th=[  922],
     | 30.00th=[  963], 40.00th=[  979], 50.00th=[  996], 60.00th=[ 1012],
     | 70.00th=[ 1037], 80.00th=[ 1074], 90.00th=[ 1139], 95.00th=[ 1205],
     | 99.00th=[ 1434], 99.50th=[ 1532], 99.90th=[ 8979], 99.95th=[11600],
     | 99.99th=[12911]
   bw (  KiB/s): min=816048, max=1426608, per=99.98%, avg=985625.01, stdev=13358.13, samples=13612
   iops        : min=102006, max=178326, avg=123202.98, stdev=1669.77, samples=13612
  lat (usec)   : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=1.56%, 750=5.38%, 1000=46.13%
  lat (msec)   : 2=46.66%, 4=0.02%, 10=0.18%, 20=0.07%, 50=0.01%
  cpu          : usr=10.83%, sys=71.37%, ctx=205048515, majf=0, minf=93114
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=963MiB/s (1009MB/s), 963MiB/s-963MiB/s (1009MB/s-1009MB/s), io=1600GiB (1718GB), run=1701924-1701924msec

Read:

Dante4 avatar Oct 13 '22 21:10 Dante4

@Dante4 try cranking up the number of reader / writer threads.

During my initial testing I was getting better write performance without Direct IO. That is because non-Direct IO writes are async writes, which work well when there is a low number of writer threads (but at the cost of two memory copies). Direct IO writes are handled synchronously, with fewer to no memory copies. Once I cranked up the number of writers from 64 parallel dd writes to 512 parallel dd writes, I got much better write performance with Direct IO.
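
A rough sketch of that kind of scaling test, with hypothetical file paths and sizes; each writer issues page- and record-aligned O_DIRECT writes, and the writer count is the knob to turn:

# 512 concurrent O_DIRECT writers (compare against the same loop with 64)
for i in $(seq 1 512); do
    dd if=/dev/zero of=/tank1/file$i bs=1M count=1024 oflag=direct &
done
wait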

tonyhutter avatar Oct 13 '22 21:10 tonyhutter

Hello, today I tried to test this PR under the following system: 2x E5-2670 v2 @ 2.50GHz, 226GB DDR3

root@iser-nvme:/home/vlosev/zfs# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7684C433E7     KINGSTON SKC2500M81000G                  1         833.65  GB /   1.00  TB      4 KiB +  0 B   S7780101
/dev/nvme1n1     50026B7684C43738     KINGSTON SKC2500M81000G                  1         771.43  GB /   1.00  TB      4 KiB +  0 B   S7780101
/dev/nvme2n1     S48ENC0N701594L      Samsung SSD 983 DCT M.2 960GB            1         714.24  GB / 960.20  GB      4 KiB +  0 B   EDA7602Q
/dev/nvme3n1     S48ENC0N700767V      Samsung SSD 983 DCT M.2 960GB            1         752.24  GB / 960.20  GB      4 KiB +  0 B   EDA7602Q
/dev/nvme4n1     S4EMNX0R627163       SAMSUNG MZVLB1T0HBLR-000L7               1         807.06  GB /   1.02  TB    512   B +  0 B   5M2QEXF7
/dev/nvme5n1     S64FNE0RB06622       SAMSUNG MZQL2960HCJR-00A07               1         733.04  GB / 960.20  GB      4 KiB +  0 B   GDC5502Q
root@iser-nvme:/home/vlosev/zfs# zfs version
zfs-2.1.99-1446_gf9bb9f26c
zfs-kmod-2.1.99-1446_gf9bb9f26c

I have created the zpool with the following settings:

root@iser-nvme:/home/vlosev# zpool create -O direct=always -o ashift=12 -O atime=off -O recordsize=8k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 -f

But when I try to benchmark the read/write performance with the following fio options:

fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
fio --name=WithDirect --size=300G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting

I got a very strange result:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting

Write:

WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
WithDirect: Laying out IO file (1 file / 409600MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=278MiB/s][w=35.6k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=1087394: Thu Oct 13 19:51:45 2022
  write: IOPS=34.2k, BW=267MiB/s (280MB/s)(1600GiB/6134181msec); 0 zone resets
    slat (usec): min=77, max=120593, avg=112.50, stdev=424.76
    clat (usec): min=4, max=240172, avg=3628.84, stdev=2601.48
     lat (usec): min=92, max=249482, avg=3741.99, stdev=2651.57
    clat percentiles (usec):
     |  1.00th=[ 2966],  5.00th=[ 3032], 10.00th=[ 3097], 20.00th=[ 3163],
     | 30.00th=[ 3195], 40.00th=[ 3228], 50.00th=[ 3261], 60.00th=[ 3326],
     | 70.00th=[ 3392], 80.00th=[ 3490], 90.00th=[ 3687], 95.00th=[ 4293],
     | 99.00th=[11994], 99.50th=[16909], 99.90th=[31851], 99.95th=[57410],
     | 99.99th=[98042]
   bw (  KiB/s): min=48967, max=334032, per=99.99%, avg=273463.15, stdev=12311.44, samples=49072
   iops        : min= 6120, max=41754, avg=34182.76, stdev=1538.93, samples=49072
  lat (usec)   : 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=92.79%, 10=5.72%, 20=1.13%, 50=0.30%
  lat (msec)   : 100=0.05%, 250=0.01%
  cpu          : usr=4.80%, sys=65.10%, ctx=256489223, majf=0, minf=63432
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=1600GiB (1718GB), run=6134181-6134181msec

Read:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][0.3%][r=251MiB/s][r=32.1k IOPS][eta 01h:54m:26s]

and if I disable direct by:

root@iser-nvme:/home/vlosev/zfs# zfs set direct=disabled nvme
root@iser-nvme:/home/vlosev/zfs# zfs get direct nvme
NAME  PROPERTY  VALUE     SOURCE
nvme  direct    disabled  local

Write:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=944MiB/s][w=121k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=4050640: Thu Oct 13 21:20:13 2022
  write: IOPS=123k, BW=963MiB/s (1009MB/s)(1600GiB/1701924msec); 0 zone resets
    slat (usec): min=7, max=26674, avg=29.66, stdev=76.38
    clat (usec): min=2, max=27658, avg=1007.56, stdev=443.89
     lat (usec): min=17, max=27690, avg=1037.56, stdev=451.44
    clat percentiles (usec):
     |  1.00th=[  469],  5.00th=[  676], 10.00th=[  824], 20.00th=[  922],
     | 30.00th=[  963], 40.00th=[  979], 50.00th=[  996], 60.00th=[ 1012],
     | 70.00th=[ 1037], 80.00th=[ 1074], 90.00th=[ 1139], 95.00th=[ 1205],
     | 99.00th=[ 1434], 99.50th=[ 1532], 99.90th=[ 8979], 99.95th=[11600],
     | 99.99th=[12911]
   bw (  KiB/s): min=816048, max=1426608, per=99.98%, avg=985625.01, stdev=13358.13, samples=13612
   iops        : min=102006, max=178326, avg=123202.98, stdev=1669.77, samples=13612
  lat (usec)   : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=1.56%, 750=5.38%, 1000=46.13%
  lat (msec)   : 2=46.66%, 4=0.02%, 10=0.18%, 20=0.07%, 50=0.01%
  cpu          : usr=10.83%, sys=71.37%, ctx=205048515, majf=0, minf=93114
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,209715200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=963MiB/s (1009MB/s), 963MiB/s-963MiB/s (1009MB/s-1009MB/s), io=1600GiB (1718GB), run=1701924-1701924msec

Read:

root@iser-nvme:/home/vlosev/zfs# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=1254MiB/s][r=160k IOPS][eta 00m:00s]
WithDirect: (groupid=0, jobs=4): err= 0: pid=2106034: Thu Oct 13 21:44:08 2022
  read: IOPS=163k, BW=1273MiB/s (1335MB/s)(1600GiB/1286575msec)
    slat (usec): min=4, max=3737, avg=21.73, stdev=13.06
    clat (usec): min=2, max=4462, avg=761.81, stdev=47.08
     lat (usec): min=18, max=4489, avg=783.84, stdev=48.26
    clat percentiles (usec):
     |  1.00th=[  668],  5.00th=[  693], 10.00th=[  709], 20.00th=[  725],
     | 30.00th=[  742], 40.00th=[  750], 50.00th=[  758], 60.00th=[  766],
     | 70.00th=[  783], 80.00th=[  791], 90.00th=[  816], 95.00th=[  840],
     | 99.00th=[  930], 99.50th=[  955], 99.90th=[  996], 99.95th=[ 1020],
     | 99.99th=[ 1057]
   bw (  MiB/s): min=  472, max= 1408, per=99.98%, avg=1273.26, stdev=11.33, samples=10292
   iops        : min=60504, max=180282, avg=162977.27, stdev=1450.14, samples=10292
  lat (usec)   : 4=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=42.04%, 1000=57.87%
  lat (msec)   : 2=0.10%, 4=0.01%, 10=0.01%
  cpu          : usr=13.76%, sys=77.17%, ctx=79434728, majf=0, minf=2622
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=209715200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1273MiB/s (1335MB/s), 1273MiB/s-1273MiB/s (1335MB/s-1335MB/s), io=1600GiB (1718GB), run=1286575-1286575msec

Dante4 avatar Oct 13 '22 22:10 Dante4

I'd also recommend leaving the property set to direct=standard when testing. You shouldn't need to set this property and can request Direct I/O as normal with the --direct=1 fio flag. This has the advantage that if fio isn't creating correctly aligned I/O, an error will be reported. When direct=always is set, rather than fail a misaligned Direct I/O it'll take the buffered path.
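
Concretely, something along these lines keeps the property at its default and lets misaligned requests fail loudly instead of silently taking the buffered path (pool/file names reuse the example above):

zfs set direct=standard nvme
zfs get direct nvme

# with direct=standard, an O_DIRECT request that is not page-aligned should
# return EINVAL rather than quietly falling back to buffered I/O, e.g.:
dd if=/nvme/t.1 of=/dev/null bs=1000 count=1 iflag=direct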

behlendorf avatar Oct 13 '22 22:10 behlendorf

I'd also recommend leaving the property set to direct=standard when testing. You shouldn't need to set this property and can request Direct I/O as normal with the --direct=1 fio flag. This has the advantage that if fio isn't creating correctly aligned I/O, an error will be reported. When direct=always is set, rather than fail a misaligned Direct I/O it'll take the buffered path.

Thank you for your answer. Sadly, there were no changes when I used standard instead of always:

direct=standard:

root@iser-nvme:/home/vlosev# zfs set direct=standard nvme
root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
^Cbs: 4 (f=4): [R(4)][0.2%][r=290MiB/s][r=37.1k IOPS][eta 01h:40m:04s]
fio: terminating on signal 2

WithDirect: (groupid=0, jobs=4): err= 0: pid=3312285: Fri Oct 14 08:53:21 2022
  read: IOPS=35.8k, BW=280MiB/s (293MB/s)(3415MiB/12210msec)
    slat (usec): min=41, max=2797, avg=104.58, stdev=37.71
    clat (usec): min=3, max=6690, avg=3357.27, stdev=839.75
     lat (usec): min=114, max=6832, avg=3462.38, stdev=863.39
    clat percentiles (usec):
     |  1.00th=[ 1860],  5.00th=[ 1975], 10.00th=[ 2073], 20.00th=[ 2409],
     | 30.00th=[ 2737], 40.00th=[ 3195], 50.00th=[ 3621], 60.00th=[ 3851],
     | 70.00th=[ 3949], 80.00th=[ 4113], 90.00th=[ 4359], 95.00th=[ 4490],
     | 99.00th=[ 4686], 99.50th=[ 4817], 99.90th=[ 5080], 99.95th=[ 5145],
     | 99.99th=[ 5997]
   bw (  KiB/s): min=62560, max=328448, per=99.98%, avg=286327.67, stdev=12347.72, samples=96
   iops        : min= 7820, max=41056, avg=35790.75, stdev=1543.47, samples=96
  lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=6.54%, 4=66.10%, 10=27.35%
  cpu          : usr=4.20%, sys=27.97%, ctx=437188, majf=0, minf=319
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=437080,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=280MiB/s (293MB/s), 280MiB/s-280MiB/s (293MB/s-293MB/s), io=3415MiB (3581MB), run=12210-12210msec

root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=write --blocksize=8k --group_reporting
WithDirect: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
^Cbs: 4 (f=4): [W(4)][0.2%][w=296MiB/s][w=37.9k IOPS][eta 01h:35m:48s]
fio: terminating on signal 2

WithDirect: (groupid=0, jobs=4): err= 0: pid=3344955: Fri Oct 14 08:53:41 2022
  write: IOPS=37.3k, BW=291MiB/s (305MB/s)(2997MiB/10289msec); 0 zone resets
    slat (usec): min=79, max=13452, avg=101.92, stdev=93.20
    clat (usec): min=4, max=17890, avg=3297.29, stdev=556.83
     lat (usec): min=117, max=18015, avg=3399.85, stdev=567.83
    clat percentiles (usec):
     |  1.00th=[ 2933],  5.00th=[ 2999], 10.00th=[ 3064], 20.00th=[ 3130],
     | 30.00th=[ 3163], 40.00th=[ 3195], 50.00th=[ 3228], 60.00th=[ 3261],
     | 70.00th=[ 3326], 80.00th=[ 3392], 90.00th=[ 3523], 95.00th=[ 3589],
     | 99.00th=[ 4228], 99.50th=[ 4359], 99.90th=[15270], 99.95th=[17171],
     | 99.99th=[17695]
   bw (  KiB/s): min=225824, max=315472, per=99.89%, avg=297972.85, stdev=4771.26, samples=80
   iops        : min=28228, max=39434, avg=37246.50, stdev=596.40, samples=80
  lat (usec)   : 10=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=97.43%, 10=2.36%, 20=0.19%
  cpu          : usr=4.72%, sys=71.28%, ctx=478440, majf=0, minf=308
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,383656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=291MiB/s (305MB/s), 291MiB/s-291MiB/s (305MB/s-305MB/s), io=2997MiB (3143MB), run=10289-10289msec

root@iser-nvme:/home/vlosev# zfs get direct nvme
NAME  PROPERTY  VALUE     SOURCE
nvme  direct    standard  local

Dante4 avatar Oct 14 '22 08:10 Dante4

@Dante4 try cranking up the number of reader / writer threads.

During my initial testing I was getting better write performance without Direct IO. That is because non-Direct IO writes are async writes, which work well when there is a low number of writer threads (but at the cost of two memory copies). Direct IO writes are handled synchronously, with fewer to no memory copies. Once I cranked up the number of writers from 64 parallel dd writes to 512 parallel dd writes, I got much better write performance with Direct IO.

Sadly, increasing the number of jobs did not really help.

root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=512 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 512 processes
^Cbs: 512 (f=512): [R(512)][0.0%][r=1879MiB/s][r=241k IOPS][eta 03d:08h:59m:03s]
fio: terminating on signal 2

WithDirect: (groupid=0, jobs=512): err= 0: pid=3515147: Fri Oct 14 10:13:12 2022
  read: IOPS=175k, BW=1366MiB/s (1432MB/s)(30.1GiB/22542msec)
    slat (usec): min=71, max=738837, avg=1545.77, stdev=5307.16
    clat (usec): min=3, max=1113.1k, avg=47984.72, stdev=39541.12
     lat (usec): min=131, max=1113.3k, avg=49531.37, stdev=40460.94
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[   14], 40.00th=[   35], 50.00th=[   45], 60.00th=[   56],
     | 70.00th=[   70], 80.00th=[   83], 90.00th=[  101], 95.00th=[  116],
     | 99.00th=[  150], 99.50th=[  169], 99.90th=[  222], 99.95th=[  255],
     | 99.99th=[  510]
   bw (  MiB/s): min=  975, max= 5321, per=100.00%, avg=2307.45, stdev= 4.95, samples=12401
   iops        : min=124860, max=681156, avg=295342.34, stdev=633.42, samples=12401
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=6.44%, 10=22.23%, 20=3.39%, 50=19.71%
  lat (msec)   : 100=38.03%, 250=10.12%, 500=0.04%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.20%, sys=5.43%, ctx=4844909, majf=0, minf=61325
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=3941484,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1366MiB/s (1432MB/s), 1366MiB/s-1366MiB/s (1432MB/s-1432MB/s), io=30.1GiB (32.3GB), run=22542-22542msec

root@iser-nvme:/home/vlosev# zfs set direct=disabled nvme
root@iser-nvme:/home/vlosev# fio --name=WithDirect --size=400G --filename=/nvme/t.1 --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=512 --rw=read --blocksize=8k --group_reporting
WithDirect: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 512 processes
^Cbs: 512 (f=512): [R(512)][0.0%][r=2958MiB/s][r=379k IOPS][eta 53d:03h:38m:11s]
fio: terminating on signal 2

WithDirect: (groupid=0, jobs=512): err= 0: pid=3615628: Fri Oct 14 10:13:32 2022
  read: IOPS=284k, BW=2217MiB/s (2325MB/s)(25.0GiB/11998msec)
    slat (usec): min=4, max=1900.4k, avg=944.51, stdev=25750.21
    clat (usec): min=2, max=3711.2k, avg=29291.26, stdev=151762.12
     lat (usec): min=52, max=3711.3k, avg=30236.06, stdev=154212.86
    clat percentiles (usec):
     |  1.00th=[    322],  5.00th=[    388], 10.00th=[   1045],
     | 20.00th=[   1713], 30.00th=[   2278], 40.00th=[   2835],
     | 50.00th=[   3064], 60.00th=[   3163], 70.00th=[   3261],
     | 80.00th=[   3359], 90.00th=[   3490], 95.00th=[   6980],
     | 99.00th=[ 893387], 99.50th=[ 952108], 99.90th=[1803551],
     | 99.95th=[1887437], 99.99th=[2298479]
   bw (  MiB/s): min=  307, max=19832, per=100.00%, avg=5501.44, stdev=15.48, samples=4311
   iops        : min=39284, max=2538513, avg=704094.52, stdev=1981.29, samples=4311
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=8.04%, 750=0.49%, 1000=1.10%
  lat (msec)   : 2=15.27%, 4=69.79%, 10=0.55%, 20=0.67%, 50=0.23%
  lat (msec)   : 100=0.06%, 250=0.44%, 500=0.66%, 750=0.92%, 1000=1.45%
  lat (msec)   : 2000=0.29%, >=2000=0.01%
  cpu          : usr=0.13%, sys=7.50%, ctx=30559, majf=0, minf=46152
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=3405068,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2217MiB/s (2325MB/s), 2217MiB/s-2217MiB/s (2325MB/s-2325MB/s), io=25.0GiB (27.9GB), run=11998-11998msec

Dante4 avatar Oct 14 '22 10:10 Dante4