
ZFS Interface for Accelerators (Z.I.A.)


Motivation and Context

ZFS provides many powerful features such as compression, checksumming, and erasure coding. Such operations can be CPU/memory intensive. In particular, compressing with gzip reduces a zpool's performance significantly. Offloading data to hardware accelerators such as the Intel QAT can improve performance. However, offloading stages individually results in many data transfers to and from the accelerators. Z.I.A. provides a write path parallel to the ZIO pipeline that keeps data offloaded for as long as possible and allows for arbitrary accelerators to be used rather than integrating specific accelerators into the ZFS codebase.

Presentations:

  • Z.I.A. + DPUSM.pdf
  • Dec 7, 2021 OpenZFS Leadership Meeting
  • SDC 2022
  • OpenZFS Developer Summit 2023

Description

The ZIO pipeline has been modified to allow external, alternative implementations of existing operations to be used. The original ZFS functions remain in the code as a fallback in case the external implementation fails.

Definitions:

  • Accelerator: an entity (usually hardware) that is intended to accelerate operations
  • Offloader: synonym of accelerator; used interchangeably
  • Data Processing Unit Services Module (DPUSM):
    • https://github.com/hpc/dpusm
    • Defines a "provider API" for accelerator vendors to set up
    • Defines a "user API" for accelerator consumers to call
    • Maintains the list of providers and coordinates interactions between providers and consumers
  • Provider: a DPUSM wrapper for an accelerator's API
  • Offload: moving data from ZFS/memory to the accelerator
  • Onload: the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM, which is then used to acquire handles to providers.
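
A rough sketch of that handle-acquisition flow is shown below. The type and function names (dpusm_uf_t, dpusm_get_api, dpusm_get_provider) are illustrative placeholders, not the actual DPUSM user API, which is defined in the dpusm repository.

/*
 * Illustrative sketch only: the identifiers below are hypothetical
 * stand-ins for the DPUSM user API (https://github.com/hpc/dpusm).
 */
#include <stddef.h>

typedef struct dpusm_uf dpusm_uf_t;          /* handle to the DPUSM itself */
typedef struct dpusm_provider dpusm_ph_t;    /* handle to one provider */

extern dpusm_uf_t *dpusm_get_api(void);                       /* hypothetical */
extern dpusm_ph_t *dpusm_get_provider(dpusm_uf_t *dpusm,
    const char *name);                                        /* hypothetical */

/* Z.I.A. side: resolve the DPUSM once, then look up the configured provider */
static dpusm_ph_t *
zia_bind_provider(const char *provider_name)
{
	dpusm_uf_t *dpusm = dpusm_get_api();
	if (dpusm == NULL)
		return (NULL);   /* no DPUSM loaded; stay on the software path */

	/* provider_name comes from the zia_provider zpool property */
	return (dpusm_get_provider(dpusm, provider_name));
}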

Using ZFS with Z.I.A.:

  1. Build and start the DPUSM
  2. Implement, build, and register a provider with the DPUSM
  3. Reconfigure ZFS with --with-zia=<DPUSM root>
  4. Rebuild and start ZFS
  5. Create a zpool
  6. Select the provider: zpool set zia_provider=<provider name> <zpool>
  7. Select operations to offload: zpool set zia_<property>=on <zpool>

The operations that have been modified are:

  • compression
    • non-raw-writes only
  • decompression
  • checksum
    • not handling embedded checksums
    • checksum compute and checksum error call the same function
  • raidz
    • generation
    • reconstruction
  • vdev_file
    • open
    • write
    • close
  • vdev_disk
    • open
    • invalidate
    • write
    • flush
    • close

Successful operations do not bring data back into memory after they complete, allowing subsequent offloader operations to reuse the data. This results in only one data movement per ZIO: the transfer at the beginning of the pipeline that is necessary to get data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible, the offloaded data will be onloaded (or dropped if it still matches the in-memory copy) for that ZIO pipeline stage and processed with ZFS. This can cause thrashing if a later stage offloads the data again, but it should not happen often, since constant errors (and the resulting data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline restarts (if necessary) in order to complete the original ZIO using the software path.
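
As a rough illustration of the error handling above, a single offload-capable stage might be wired up as follows. The names here (zia_stage, zia_onload_or_drop, zio_stage_software, zio_restart_pipeline, and ZIA_ACCELERATOR_DOWN as a return value) are placeholders for this sketch, not the actual Z.I.A. entry points.

/*
 * Sketch of the per-stage fallback described above. All names are
 * placeholders for illustration only.
 */
typedef struct zio zio_t;               /* opaque for this sketch */

enum { ZIA_OK = 0, ZIA_ERROR = -1, ZIA_ACCELERATOR_DOWN = -2 };

extern int zia_stage(zio_t *zio);              /* offloaded implementation */
extern void zia_onload_or_drop(zio_t *zio);    /* bring data back to memory */
extern int zio_stage_software(zio_t *zio);     /* original ZFS code path */
extern int zio_restart_pipeline(zio_t *zio);   /* redo the ZIO in software */

static int
zio_stage_with_zia(zio_t *zio)
{
	int err = zia_stage(zio);
	if (err == ZIA_OK)
		return (0);                     /* data stays on the offloader */

	if (err == ZIA_ACCELERATOR_DOWN)
		return (zio_restart_pipeline(zio)); /* unrecoverable: software path */

	/* recoverable error: onload (or drop) and fall back for this stage */
	zia_onload_or_drop(zio);
	return (zio_stage_software(zio));
}

The point of the pattern is that the software path is always available, so an accelerator failure degrades performance rather than failing the write.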

The modifications to ZFS can be thought of as two sets of changes:

  • The ZIO write pipeline
    • compression, checksum, RAIDZ generation, and write
    • Each stage starts by offloading data that was not previously offloaded
      • This allows for ZIOs to be offloaded at any point in the pipeline
  • Resilver
    • vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ generation), and write
    • Because the core of resilver is vdev_raidz_io_done, data is only offloaded once at the beginning of vdev_raidz_io_done
      • Errors cause data to be onloaded, but will not re-offload in subsequent steps within resilver
      • Write is a separate ZIO pipeline stage, so it will attempt to offload data

The zio_decompress function has been modified to allow for offloading, but the ZIO read pipeline as a whole has not, so it is not part of the above list.

An example provider implementation can be found in module/zia-software-provider:

  • The provider's hardware is actually software - data is "offloaded" to memory not owned by ZFS
  • Calls ZFS functions in order to not reimplement operations
  • Has kernel module parameters that can be used to trigger ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional void *<prefix>_zia_handle member. These opaque handles point to data that is located on an offloader. abds are still allocated, but their contents are expected to diverge from the offloaded copy as operations are run.
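
Schematically, the change looks like the following. The existing members are elided, and the raidz_row_t/vdev_t member prefixes shown here are guesses based on the `<prefix>_zia_handle` naming described above.

/* Sketch only: existing members elided; prefixes other than abd_ are guesses. */
typedef struct abd {
	/* ... abd_flags, abd_size, payload pointers ... */
	void *abd_zia_handle;     /* opaque pointer to the offloaded copy, or NULL */
} abd_t;

typedef struct raidz_row {
	/* ... existing members ... */
	void *rr_zia_handle;      /* offloaded row data */
} raidz_row_t;

typedef struct vdev {
	/* ... existing members ... */
	void *vdev_zia_handle;    /* offloaded vdev file/disk handle */
} vdev_t;

Because the handle is opaque, ZFS never interprets what it points to; only the provider does.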

Encryption and deduplication are disabled for zpools with Z.I.A. operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

TODO/Need help with:

  • [ ] Fix/Clean up build system
    • [ ] autoconf
    • [ ] m4
    • [ ] rpm spec
    • [ ] make install
  • [ ] Configuring with Z.I.A. enabled in GitHub Actions
  • [ ] Move example provider into contrib?

How Has This Been Tested?

Testing was done using FIO and XDD with stripe and raidz 2 zpools writing to direct-attached NVMe and NVMe-oF devices. Tests were performed on Ubuntu 20.04 and Rocky Linux 8.6 running kernels 5.13 and 5.14.

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [x] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [x] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

calccrypto avatar Jul 05 '22 20:07 calccrypto

Having not looked at the code at all, "ZFS data structures are still allocated, but their contents are expected to diverge from the offloaded copy as operations are run." strikes fear into my heart. Could, instead, the existing ZFS code be made to appear as a DPUSM so that there's one source of truth and no risk of divergence?

nwf avatar Jul 06 '22 11:07 nwf

@nwf I should have worded that better. The only data that diverges from ZFS is the abd payloads. I intentionally did not deallocate zio->io_abd and rc->rc_abd in order to:

  1. Reuse existing data instead of storing extra data in Z.I.A. in order to recreate the abd.
    • ZFS expects abds with valid data other than the payloads such as abd_flags and abd_size
    • Z.I.A. handles are stored in the abd_t struct.
  2. Not spend extra time deallocating/reallocating abds in the middle of a pipeline.
  3. Not invalidate reference abds.
  4. Avoid having to do an onload if data is not modified during a ZIO stage, since most stages do not modify the source data.
  5. Maintain an already allocated location to onload data into when an onload is needed.

Can you elaborate on what you mean by

Could, instead, the existing ZFS code be made to appear as a DPUSM

? Are you saying that the existing functionality should be wrapped into a provider so that it is sort of an offload? If so, that was done to create module/zia-software-provider. However, I do not plan on removing the existing code, leaving only Z.I.A. calls. This was done in anticipation of hardware accelerators failing in live systems: Z.I.A. will return errors, and ZFS falls back to processing with the original code path rather than completely fail.

calccrypto avatar Jul 06 '22 14:07 calccrypto

I'm just trying to understand this architecture... using compression as an example:

zio_write_compress()
	zia_compress()
		zio_compress_impl()
			dpusm->compress()
------------------------- dpusm layer -------------------------
				sw_provider_compress()
					kernel_offloader_compress()
						// does a gzip compression

I'm confused why sw_provider_compress() and below were being checked into the ZFS repository, considering they're part of the lower-level "dpusm" layer. I did see this comment:

 * Providers and offloaders are usually separate entities. However, to
 * keep things simple, the kernel offloader is compiled into this
 * provider.

... but I still don't understand. If you checked it into the dpusm module, all users of dpusm could use it, not just ZFS. You could also test and develop it independently of ZFS.

tonyhutter avatar Jul 08 '22 01:07 tonyhutter

@tonyhutter The software provider/kernel offloader is included with this pull request because it links with ZFS and reuses ZFS functions instead of implementing its own operations. The software provider can also be used as an example to show ZFS developers how to create other providers, such as one for the Intel QAT, if someone chooses to do so. The dpusm already has example providers, but they do not have very much code in them.

Additionally, the software provider allows for Z.I.A. to be used immediately rather than requiring users to buy hardware accelerators and develop providers.

calccrypto avatar Jul 08 '22 01:07 calccrypto

The software provider/kernel offloader is included with this pull request because it links with ZFS and reuses ZFS functions instead of implementing its own operations.

I think the whole idea is that dpusm should be implementing its own operations, since it's an external module. That's why ZFS would want to call it - because it's more optimized/efficient than ZFS's internal functions. It should be a black box one layer below ZFS.

It would be nice if dpusm provided reference implementations for all of its APIs within its own module. That way we can at least functionally test against it. It looks like many of the functions are already implemented:

const dpusm_pf_t example_dpusm_provider_functions = {
    .algorithms         = dpusm_provider_algorithms,
    .alloc              = dpusm_provider_alloc,
    .alloc_ref          = dpusm_provider_alloc_ref,
    .get_size           = dpusm_provider_get_size,
    .free               = dpusm_provider_free,
    .copy_from_mem      = dpusm_provider_copy_from_mem,
    .copy_to_mem        = dpusm_provider_copy_to_mem,
    .mem_stats          = NULL,
    .zero_fill          = NULL,
    .all_zeros          = NULL,
    .compress           = NULL,
    .decompress         = NULL,
    .checksum           = NULL,
    .raid               = {
                              .alloc       = NULL,
                              .free        = NULL,
                              .gen         = NULL,
                              .new_parity  = NULL,
                              .cmp         = NULL,
                              .rec         = NULL,
                          },
    .file               = {
                              .open        = NULL,
                              .write       = NULL,
                              .close       = NULL,

                          },
    .disk               = {
                              .open        = NULL,
                              .invalidate  = NULL,
                              .write       = NULL,
                              .flush       = NULL,
                              .close       = NULL,
                          },
};

Note: the reference implementations don't have to be optimized, they just have to functionally work. checksum() could literally be a simple xor over the data, for example. compress() could just return a copy of the data with a "0% compression ratio".
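
For instance, a functional-only reference checksum in the spirit described above might be nothing more than the following. The callback signature is made up for illustration and does not match the real dpusm provider API.

#include <stddef.h>
#include <stdint.h>

/*
 * Toy reference implementation: "checksums" a buffer with a plain xor.
 * Good enough for functional testing, useless for production. The
 * signature is hypothetical, not the dpusm callback signature.
 */
static int
reference_checksum(const void *buf, size_t size, uint8_t out[16])
{
	const uint8_t *p = buf;
	uint8_t acc = 0;

	for (size_t i = 0; i < size; i++)
		acc ^= p[i];

	for (size_t i = 0; i < 16; i++)
		out[i] = acc;         /* replicate the xor byte into the digest */

	return (0);
}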

The other thing that came to mind when looking at all this is that ZFS already has an API for pluggable checksum algorithms (https://github.com/openzfs/zfs/blob/cb01da68057dcb9e612e8d2e97d058c46c3574af/module/zfs/zio_checksum.c#L163-L202) and compression algorithms (https://github.com/openzfs/zfs/blob/cb01da68057dcb9e612e8d2e97d058c46c3574af/module/zfs/zio_compress.c#L52-L71). I think it would make sense to add dpusm as selectable checksum and compression algorithms as a first step, and after that's checked in, then look into integrating your other accelerated functions into ZFS.

tonyhutter avatar Jul 09 '22 01:07 tonyhutter

I think the whole idea is that dpusm should be implementing its own operations, since it's an external module. That's why ZFS would want to call it - because it's more optimized/efficient that ZFS's internal functions.

You are correct that providers registered to the dpusm should provide better implementations than ZFS. The software provider is special in that there is no backing hardware accelerator - it uses ZFS defined operations. It is not meant to be used for anything other than as an example and for testing.

Providers do not implement their own operations. Rather, they are meant to call custom hardware accelerator APIs on behalf of the user to run operations on hardware. The software provider is special in that its "hardware accelerator API" is functions exported from ZFS.

It should be a black box one layer below ZFS.

ZFS and Z.I.A. should never reach down into the dpusm or provider to attempt to manipulate data. That is why all of the handles are opaque pointers. A few ZFS pointers do get passed into the provider, but those pointers are simple, such as arrays of handles, or are just passed along without being modified.

Similarly, the dpusm, providers, and hardware accelerators never know who they are offloading for or what format the data they are offloading is in. They do not know anything about ZFS structures.
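
To make the layering concrete, a provider callback typically just casts the opaque handles it minted earlier and forwards them to the vendor SDK. Everything named vendor_* below, and the callback signature itself, is hypothetical.

#include <stddef.h>

typedef struct vendor_buf vendor_buf_t;               /* accelerator-side buffer */

/* hypothetical vendor SDK call */
extern int vendor_compress(vendor_buf_t *src, vendor_buf_t *dst,
    int algorithm, int level, size_t *compressed_len);

/*
 * The DPUSM passes back the opaque void * handles this provider created
 * during offload. No ZFS types appear anywhere in the provider.
 */
static int
my_provider_compress(int algorithm, int level,
    void *src_handle, void *dst_handle, size_t *compressed_len)
{
	vendor_buf_t *src = src_handle;
	vendor_buf_t *dst = dst_handle;

	return (vendor_compress(src, dst, algorithm, level, compressed_len));
}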

It would be nice if dpusm provided reference implementations for all of its APIs within its own module.

The example you copied contains the minimum set of functions required to have a valid provider. It shows how to create wrappers around opaque hardware accelerator handles. Providers are not expected to have all operations defined. A reference implementation like the one you recommend would effectively be a bunch of no-ops which may as well not exist.

That way we can at least functionally test against it. It looks like many of the functions are already implemented: ... Note: the reference implementations don't have to be optimized, they just have to functionally work. checksum() could literally be a simple xor over the data, for example. compress() could just return a copy of the data with a "0% compression ratio".

That is what the software provider is, except with real operations, and shows that Z.I.A. works. The software provider is not meant for speed. If anything, it is slower than raw ZFS since it performs memcpys to move data out of ZFS memory space and then runs the same implementations of algorithms that ZFS runs.

I think it would make sense to add dpusm as selectable checksum and compression algorithms as a first step, and after that's checked in, then look into integrating your other accelerated functions into ZFS.

I considered doing that early on during development. However, the goal of Z.I.A. is not to add new algorithms and end up with zpools with data encoded with proprietary algorithms. It is to provide alternative data paths for existing algorithms. When hardware accelerators fail, ZFS can still fall back to running the software code path without breaking the zpool. This additionally allows for providers/hardware accelerators to be swapped out or even removed from the system and still have usable zpools.

calccrypto avatar Jul 09 '22 04:07 calccrypto

I've a problem with this part:

Reconfigure ZFS with --with-zia=<DPUSM root>
Rebuild and start ZFS

There are a lot of products shipping ZFS where the product is hardware agnostic. We already have this problem with QAT support, where ZFS has to be rebuilt to allow users to use it, which does not work well with hardware-agnostic downstreams.

That being said: I applaud the idea of more modularity; it's just that the modularity needs to take the above into account as well.

Ornias1993 avatar Jul 25 '22 14:07 Ornias1993

@Ornias1993 Can you elaborate on the QAT issues you have experienced? In theory, ZFS should work with or without Z.I.A. enabled (perhaps the #ifdef guards can be removed when Z.I.A. is merged); it's just that with Z.I.A., operations are accelerated. Data modifications such as compression should always result in data compatible with the ZFS implementation, so that if stock ZFS were loaded after writing with Z.I.A., the data would still be accessible.

There is no need to link against the accelerator, since that would be the provider's responsibility. All accelerators would use the same code path within ZFS, so figuring out what is broken would be obvious: either ZFS or the provider/accelerator, never the accelerator specific code in ZFS, because it wouldn't exist.

calccrypto avatar Jul 25 '22 17:07 calccrypto

I'm not having issues. Might be best to reread what I wrote....

Downstreams that include ZFS and are hardware agnostic currently do not implement QAT, for example, mainly because it needs to be enabled at build time. This is a major problem with the current QAT implementation, and the same problem is present in this design.

These things should be able to be set up AFTER the binary has been built.

Ornias1993 avatar Jul 25 '22 18:07 Ornias1993

@Ornias1993 The configuration can be changed to always try to find dpusm symbols (allowing for it not to be found) and the include guards can be removed so that Z.I.A. always builds. ZFS + Z.I.A. without a dpusm will still run. There is a weak pointer in Z.I.A. that allows for the dpusm to be missing.

calccrypto avatar Jul 25 '22 20:07 calccrypto

@Ornias1993 The configuration can be changed to always try to find dpusm symbols (allowing for it not to be found) and the include guards can be removed so that Z.I.A. always builds. ZFS + Z.I.A. without a dpusm will still run. There is a weak pointer in Z.I.A. that allows for the dpusm to be missing.

Could you elaborate a bit on the weak pointer bit? That sounds like something which might complicate life on hardened kernels, so I'm curious to see where/how that's implemented upstream, if you happen to know.

In terms of accelerators being able to fail down to ZFS-internal code paths, how much runtime testing is needed when initializing those APIs before we're sure all possible offloaded computational products match the internal ones at the currently running version of ZFS? For example, if an offloader does ZSTD at v1.5 but the ZFS internals move to vX, how safe is it to fail midstream, so to speak, and fall back to the older compressor?

sempervictus avatar Jul 31 '22 23:07 sempervictus

@sempervictus The weak pointer check is done at initialization and finalization. The functions are only defined if they were available at ZFS module load time.

https://github.com/openzfs/zfs/blob/73adbe60b1b988342d29ed634e24c5088a31202b/module/zfs/zia.c#L220-L229
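
The general shape of that check is the weak-symbol pattern sketched below (names illustrative; see the linked zia.c lines for the real code): if the dpusm module is absent when ZFS loads, the weak symbols resolve to NULL and Z.I.A. simply stays inert.

#include <stddef.h>

/* Illustrative names; the real check is in module/zfs/zia.c (linked above). */
extern void *dpusm_initialize(void) __attribute__((weak));
extern void dpusm_finalize(void) __attribute__((weak));

static void *zia_dpusm_handle = NULL;

static void
zia_init(void)
{
	/* only touch the DPUSM if its symbols were resolvable at load time */
	if (dpusm_initialize != NULL)
		zia_dpusm_handle = dpusm_initialize();
}

static void
zia_fini(void)
{
	if (zia_dpusm_handle != NULL && dpusm_finalize != NULL)
		dpusm_finalize();
	zia_dpusm_handle = NULL;
}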

Testing of providers by the dpusm is not done because the dpusm does not provide any functionality by itself. It is possible to add functionality checks, say, when a provider registers with the dpusm. However, that would bloat the dpusm codebase and add dependencies that users might not have or want since reference implementations for each algorithm would be required (the dpusm does not link with ZFS). Users can create a separate testing module that links with the dpusm and does whatever testing they wish with their provider/accelerator.

The dpusm API has a function that the provider fills out to tell it what algorithms the accelerator exposes. It is the responsibility of the provider to not lie.

For example, if an offloader does ZSTD at v1.5 but zfs innards move to vX then how safe is it to fail midstream so to speak and fall back to the older compressor?

Offloading occurs based on the data found in a zio_t. If new zios arrive with a different algorithm, Z.I.A. will see the new algorithm and process them accordingly: if the provider says it does not support the algorithm, the data will not be offloaded. If the algorithm is supported, the data will be offloaded and processed by the accelerator. This should be no different than switching algorithms halfway through writing today.
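
In code, that per-ZIO decision might look roughly like this. The bitmask representation and the provider_compress_algorithms query are assumptions for illustration, standing in for the dpusm "algorithms" callback mentioned above.

#include <stddef.h>
#include <stdint.h>

/* hypothetical query: which compression algorithms the provider advertises */
extern uint64_t provider_compress_algorithms(void *provider);

/* decide per ZIO whether this compression algorithm may be offloaded */
static int
zia_can_offload_compress(void *provider, int zio_compress_alg)
{
	if (provider == NULL)
		return (0);                 /* no provider bound to the zpool */

	const uint64_t supported = provider_compress_algorithms(provider);
	return ((supported & (1ULL << zio_compress_alg)) != 0);
}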

calccrypto avatar Aug 01 '22 16:08 calccrypto

@calccrypto where does this stand? Might be useful for #15731 :)

sempervictus avatar Mar 15 '24 00:03 sempervictus

@sempervictus I agree that Z.I.A. would be useful for #15731.

Code reviews and people testing the changes would be helpful in getting Z.I.A. merged into ZFS.

calccrypto avatar Mar 18 '24 16:03 calccrypto

I hope this project finally gets merged, as we are in need of accelerators. ZFS is very slow compared to filesystems that are built specifically for NVMe. U.3 is standard storage in almost any modern server nowadays, and using something like 12x Micron 7450 MAX or Samsung PM9A3 drives in a hypervisor is a no-go with ZFS, because ZFS is a big bottleneck. We want to use ZFS for the features, but we cannot because of the poor performance.

ZIA would finally allow the use of accelerators, and I believe that once ZIA is merged, hardware will follow and more companies will move to ZFS. ZFS not being a clustering filesystem is the other big downside, but there are luckily workarounds for that, for example the way Proxmox does it, syncing VM storage between ZFS nodes every minute or two, or using zfs send. However, the main issue is how slow ZFS is on modern NVMe drives.

Intel QAT, for example, was great 3 years ago but is basically dead, for two reasons:

  1. With today's NVMe drives and processors, plus the optimizations in ZFS itself and the kernel, QAT acceleration no longer brings any real performance benefit. It makes no difference whether you use QAT or not.
  2. The problems with compiling and maintaining QAT+ZFS make it a really awkward solution, which is why hardly anyone uses it even on Xeons with QAT. It would be a whole other story if ZFS included QAT support by default, so that QAT gets used if it is available and skipped if not, like AES-NI.

To avoid making the same mistakes again, ZIA should be included in ZFS by default, and if providers exist there should be an easy way to use them instead of recompiling ZFS. If this can be accomplished, ZFS will definitely have a future and gain attention. Otherwise, as we move more and more towards faster NVMe drives, ZFS will surely die.

I personally don't even see a reason to use ZFS anywhere anymore, except on my backup servers. On our NVMe servers the performance is so bad (20-30% of raw performance) that we had to switch to proprietary, painfully expensive filesystems. On other servers, where possible, we are using Ceph or LVM/LVM-thin or LVM/ext4, and none of them are nearly as nice as ZFS, but they have two to almost three times the performance.

I don't know how others feel about this, but is there really so little interest in accelerating ZFS to make it usable by today's standards again? It is by far my favorite filesystem, and I think yours too, or else we wouldn't be here!

Cheers

Ramalama2 avatar Mar 30 '24 17:03 Ramalama2

Thanks for the enthusiasm @Ramalama2! We have noticed the same issues when backing ZFS with NVMe devices. Z.I.A. is intended to recover the performance that ZFS already achieves but loses when useful features such as compression are added. The changes that address the performance ZFS cannot currently extract from NVMe devices can be found in #10018.

I agree with your comments on reconfiguring ZFS to use QATs. As mentioned in this comment, reconfiguring ZFS sometimes causes problems. With Z.I.A., there would no longer be custom accelerator code in ZFS that requires reconfiguring to use. This has the additional benefit of reducing the cost of developing accelerator code for ZFS: developers would no longer have to learn the intricacies of ZFS before even starting - they would just have to wrap their functions in other functions with the corresponding DPUSM signatures. Today, accelerator-specific code would also have to be upstreamed after it is developed, which may take a while, if it happens at all, whereas custom providers do not have to be merged and can even be kept private.

I have a commit in my fork of ZFS containing changes that move the QAT code from ZFS into a provider. It is completely untested, but it does exist.

calccrypto avatar Apr 01 '24 15:04 calccrypto

Thanks a lot @calccrypto for the ZIA work! And thanks to @bwatkinson as well; direct I/O is a different approach to the same problem, so both solutions are great!

TBH, direct I/O is extremely difficult for me to understand, since for compression, for example, the blocks still need to be in memory for the compression itself before being written to disk. I cannot imagine where the performance gain comes from; I think I simply don't understand it in detail. But I don't need to understand it either, I will be extremely happy if it speeds things up. On the other hand, ZIA is extremely easy to understand: it is just, let's say, an API interface for accelerators, and it doesn't change how ZFS works. That gives me fewer headaches or "fears" about using it in a production environment.

About the QAT fork, I checked that out as well, but I would need to buy an Intel 8970 adapter to test it. That is honestly not a big deal, because they are cheap nowadays, but the issue is that in all the benchmarks I have found comparing QAT vs. no QAT on newer hardware like Milan/Genoa, there is absolutely nothing to gain: no performance improvement, not even lower CPU utilization, basically exactly the same results. Not worse, but not better either. So QAT itself isn't really promising, and Intel no longer makes the accelerator cards; they integrated it into the CPU, which makes sense, since there QAT has direct access to memory and avoids the extra operations a separate PCIe card would need. AMD has something very similar to QAT called CCP (only for checksums), but somehow that hasn't gained any attention at all; there is absolutely no software I can find that uses it.

However, that's just talk. Thank you both a lot for the work! I'm unbelievably happy that there is work going on!

Ramalama2 avatar Apr 01 '24 22:04 Ramalama2

One little question: I see functions for RAIDZ 1/2/3 generation/reconstruction. Mirrored pools like RAID 1/10 don't need those functions or any acceleration, because they just copy all the blocks over to the reconstructing drive, correct? Or did you simply not implement / forget about mirrors?

Sorry if this sounds stupid; I just went over your code and that's the only thing that caught my attention :-) Thanks again so much for your work! Cheers

Ramalama2 avatar Apr 10 '24 17:04 Ramalama2

Only RAIDZ generation and reconstruction were modified to offload data.

calccrypto avatar Apr 10 '24 19:04 calccrypto