
Add Linux posix_fadvise support

Open Finix1979 opened this issue 3 years ago • 9 comments

The purpose of this PR is to accept the fadvise ioctl from userland to perform read-ahead on demand.

It can dramatically improve sequential read performance, especially when primarycache is set to metadata or zfs_prefetch_disable is 1.

If the file is mmapped, generic_fadvise is also called for page cache read-ahead in addition to dmu_prefetch.

Only POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL are supported in this PR currently.

Signed-off-by: Finix Yan [email protected]

Motivation and Context

fadvise support was initially proposed in PR#9807. After testing, I found it could dramatically improve sequential read performance.

Description

We leverage generic_fadvise for additional page cache read-ahead only if the file is mmapped.

We treat POSIX_FADV_SEQUENTIAL the same as POSIX_FADV_WILLNEED for now.

How Has This Been Tested?

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

Finix1979 avatar Jul 26 '22 07:07 Finix1979

You mention "could dramatically improve performance" - do you have benchmarks for how well it performs in some situations without and with this PR?

rincebrain avatar Jul 26 '22 09:07 rincebrain

And we surely need some tests, which was cumbersome in https://github.com/openzfs/zfs/pull/9807

gmelikov avatar Jul 26 '22 09:07 gmelikov

You mention "could dramatically improve performance" - do you have benchmarks for how well it performs in some situations without and with this PR?

On our production machines, we always set primarycache to metadata and disable prefetch. Under these circumstances, single-threaded sequential read runs at about 40MB/s without fadvise, while we get around 1GB/s by triggering fadvise after every 128K read. 128K is the default dmu_prefetch length.

BTW, the original PR#9807 (https://github.com/openzfs/zfs/pull/9807) has some test results too.

Finix1979 avatar Jul 26 '22 09:07 Finix1979

I have no conceptual objection to this, but I do wonder - what about your workload made you disable the prefetcher and data caching entirely, only to turn around and want to manually prefetch? Was it that poor at prediction, for you?

Because correcting the reason you felt the need to disable prefetch or data caching seems like the "right" approach to your issue, to me, long-term, even if this solution works for you right now, so that people don't have to rely on {having, building} a special case setup like that configuration and the software that manually prefetches for you to get good performance.

rincebrain avatar Jul 26 '22 10:07 rincebrain

If database calls fadvise for read ahead, that means application knows what they want exactly. With fadvise supported, ZFS could prefetch data accurately.

The reason we set primarycache to metadata is that we use ZFS as a backup storage system, and written data is rarely read again. The data therefore does not need to stay in ARC, and by caching only metadata, ARC consumes less memory. When data recovery is needed, we can still get excellent performance by calling the fadvise ioctl appropriately, without changing primarycache or zfs_prefetch_disable.

Finix1979 avatar Jul 26 '22 10:07 Finix1979

Interestingly, this change appears to cause the following two test case failures. Before this can be merged, they will need to be understood and sorted out: http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/9178/steps/shell_4/logs/summary

    FAIL checksum/filetest_002_pos (expected PASS)
    FAIL fault/auto_spare_002_pos (expected PASS)

It looks like we'll also need a configure check to verify that generic_fadvise() is available. It appears it wasn't exported as a symbol until the 5.3 kernel, but the ->fadvise callback was added in 4.19.
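A rough sketch of how such a guard might look on the ZFS kernel side; the HAVE_GENERIC_FADVISE define, the function name, and the surrounding structure are illustrative assumptions here, not the actual patch:

```c
/*
 * Illustrative only: a configure check would define
 * HAVE_GENERIC_FADVISE when the symbol exported in Linux 5.3+
 * is available, so older kernels still build.
 */
static int
zpl_fadvise(struct file *filp, loff_t offset, loff_t len, int advice)
{
	int error = 0;

	switch (advice) {
	case POSIX_FADV_SEQUENTIAL:
	case POSIX_FADV_WILLNEED:
		/* Prefetch at the DMU level in all cases. */
		/* dmu_prefetch(...); */
#if defined(HAVE_GENERIC_FADVISE)
		/* Page cache read-ahead only if the file is mmapped. */
		if (mapping_mapped(filp->f_mapping))
			error = generic_fadvise(filp, offset, len, advice);
#endif
		break;
	default:
		error = -EINVAL;
		break;
	}
	return (error);
}
```

The ->fadvise file_operations callback itself exists from 4.19, so the configure check only needs to gate the generic_fadvise() call, not the whole handler.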

behlendorf avatar Jul 28 '22 23:07 behlendorf

Thank you. I will take a look at it.

Finix1979 avatar Jul 29 '22 01:07 Finix1979

Hi @behlendorf The reason checksum/filetest_002_pos fails is that the cat command uses fadvise internally, and the zio_checksum_verify function does not record a checksum error if the zio flags contain ZIO_FLAG_SPECULATIVE:

    if (error == ECKSUM && !(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
        (void) zfs_ereport_start_checksum(zio->io_spa, zio->io_vd,
            &zio->io_bookmark, zio, zio->io_offset, zio->io_size, &info);
        mutex_enter(&zio->io_vd->vdev_stat_lock);
        zio->io_vd->vdev_stat.vs_checksum_errors++;
        mutex_exit(&zio->io_vd->vdev_stat_lock);
    }

I have no idea why the checksum error is ignored if ZIO_FLAG_SPECULATIVE is set. Please help.

Finix1979 avatar Aug 09 '22 10:08 Finix1979

I have no idea why checksum error is ignored if ZIO_FLAG_SPECULATIVE is set.

Checksum errors aren't reported for prefetch I/O since the prefetch read may legitimately fail if the file is being modified at the same time. A demand read will be issued for any data that is actually needed, and a checksum error will be logged for that. I'm not sure why you're not seeing that, but we do want to suppress this prefetch false positive.

Presumably switching the test to use dd instead of cat, which I believe doesn't use fadvise, would fix the test case.

behlendorf avatar Aug 09 '22 18:08 behlendorf

Can you please squash these commits, rebase this branch on the current master branch, and force-update the PR? That should help with some of the failed CI bots.

behlendorf avatar Aug 11 '22 16:08 behlendorf

@Finix1979 if you can rebase this one more time and resolve the outstanding checkstyle warnings this is good to go.

https://github.com/openzfs/zfs/runs/8044573481?check_suite_focus=true

    Run make -j$(nproc) --no-print-directory --silent checkstyle
    no errors found
    ./tests/zfs-tests/tests/functional/fadvise/cleanup.ksh
    ./tests/zfs-tests/tests/functional/fadvise/fadvise_sequential.ksh
    ./tests/zfs-tests/tests/functional/fadvise/setup.ksh
    make: *** [Makefile:13790: testscheck] Error 1
    make: *** Waiting for unfinished jobs....
    Error: Process completed with exit code 2.

behlendorf avatar Sep 02 '22 20:09 behlendorf

Can we get this in a stable release?

stefantalpalaru avatar May 24 '23 18:05 stefantalpalaru