zfs icon indicating copy to clipboard operation
zfs copied to clipboard

icp: Port AVX2 implementation of aes-gcm from BoringSSL

Open lowjoel opened this issue 10 months ago • 37 comments

Motivation and Context

Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be4391d96e81cee022f77f7bab85d51cf4e.

Description

I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be4391d96e81cee022f77f7bab85d51cf4e (but picking the tip of master), as well as from google/boringssl@62f9751ade8373ae3339daee581c80a173206321 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)

Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the third commit.

The fifth to seventh commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.

How Has This Been Tested?

Compile tested.

I'm now running it on my ZFS-on-root main machine.

@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [x] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

lowjoel avatar Feb 15 '25 07:02 lowjoel

Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources. I've ported the other two changes, for the round offset in the AES_KEY struct, as well as the OpenSSL vs ICP representation of number of AES rounds. I've also fixed the gcm_init_vpclmulqdq_avx2 method which you mentioned ICP stores H in network order.

lowjoel avatar Feb 15 '25 08:02 lowjoel

Strong opening move! I've had a little play with it, here's what I got.

Before the final thing, you should run make checkstyle, and fix the errors it throws out.

I added this patch to let me see what it thinks is happening on my test machines:

diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
 		    "pclmulqdq", zfs_pclmulqdq_available());
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "movbe", zfs_movbe_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vaes", zfs_vaes_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vpclmulqdq", zfs_vpclmulqdq_available());
 
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE));

With that, my old Intel 2019 junker laptop says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    0
vpclmulqdq              0

My much nicer Ryzen 5 from last year says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    1
vpclmulqdq              1

Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:

# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] avx avx2 generic pclmulqdq

So I'd say it's all wired up right, which is half the fun.

I set it to avx and created a dataset, then unmounted and exported. Then I set it to avx2 and tried to zfs load-key, but that didn't work:

# zfs load-key -a
Enter passphrase for 'tank/enc':
Key load error: Incorrect key provided for 'tank/enc'.

Trying to create the pool and dataset with avx2 selected gets a crash:

# zpool create tank -O encryption=aes-256-gcm -O keyformat=passphrase /home/robn/blk
Enter new passphrase:
Re-enter new passphrase:
Killed

And the kernel has a nice complaint:

[ 2769.438918] BUG: unable to handle page fault for address: ffff9c23c8307e10
[ 2769.439133] #PF: supervisor read access in kernel mode
[ 2769.439290] #PF: error_code(0x0000) - not-present page
[ 2769.439447] PGD 110e01067 P4D 110e01067 PUD 0
[ 2769.439588] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2769.439733] CPU: 0 PID: 73114 Comm: zpool Tainted: P           OE      6.1.0-25-amd64 #1  Debian 6.1.106-3
[ 2769.440035] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 2769.440236] RIP: 0010:aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.440596] Code: 0c 24 c4 e2 71 00 c8 c4 42 7d 5a 18 c4 62 25 00 d8 44 8b 91 78 01 00 00 46 8d 14 95 f0 ff ff ff 4e 8d 5c 91 60 c4 62 7d 5a 09 <c4> 42 7d 5a 13 c5 25 fe 1d d5 ed 0b 00 48 83 fa 7f 0f 8
6 31 03 00
[ 2769.441171] RSP: 0018:ffffb1b6c37df358 EFLAGS: 00010086
[ 2769.441339] RAX: ffffffffc089b140 RBX: ffffb1b6c37df43c RCX: ffff9c214b6eae00
[ 2769.441566] RDX: 0000000000000060 RSI: ffff9c208e918500 RDI: ffff9c208e918500
[ 2769.441793] RBP: 0000000000000006 R08: ffffb1b6c37df430 R09: ffff9c20e8eba3c0
[ 2769.442019] R10: 000000009f3073ec R11: ffff9c23c8307e10 R12: ffffb1b6c37df498
[ 2769.442102] R13: ffffffffc089b140 R14: ffffb1b6c37df420 R15: ffffb1b6c37df488
[ 2769.442102] FS:  00007f2d7ead5840(0000) GS:ffff9c23afc00000(0000) knlGS:0000000000000000
[ 2769.442102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2769.442102] CR2: ffff9c23c8307e10 CR3: 00000001cccaa000 CR4: 00000000003506f0
[ 2769.442102] Call Trace:
[ 2769.442102]  <TASK>
[ 2769.442102]  ? __die_body.cold+0x1a/0x1f
[ 2769.442102]  ? page_fault_oops+0xd2/0x2b0
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? search_bpf_extables+0x5b/0x80
[ 2769.442102]  ? exc_page_fault+0xca/0x170
[ 2769.442102]  ? asm_exc_page_fault+0x22/0x30
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx2+0x2a/0x50 [zfs]
[ 2769.442102]  ? gcm_decrypt_final_avx+0x167/0x460 [zfs]
[ 2769.442102]  ? crypto_update_uio+0xc0/0x120 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_decrypt_atomic+0x1ca/0x310 [zfs]
[ 2769.442102]  ? crypto_decrypt+0x78/0x1c0 [zfs]
[ 2769.442102]  ? zio_do_crypt_uio+0x2ce/0x400 [zfs]
[ 2769.442102]  ? zio_crypt_key_unwrap+0x254/0x480 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? spa_keystore_dsl_key_hold_dd+0x12e/0x280 [zfs]
[ 2769.442102]  ? __kmalloc_node+0x4c/0x150
[ 2769.442102]  ? spa_keystore_create_mapping+0x79/0x200 [zfs]
[ 2769.442102]  ? dsl_dataset_hold_obj_flags+0x48/0x90 [zfs]
[ 2769.442102]  ? dsl_pool_create+0x1bc/0x490 [zfs]
[ 2769.442102]  ? spa_create+0x83d/0xdb0 [zfs]
[ 2769.442102]  ? zfs_ioc_pool_create+0xab/0x310 [zfs]
[ 2769.442102]  ? zfsdev_ioctl_common+0x6a0/0x7c0 [zfs]
[ 2769.442102]  ? __kmalloc_node+0xbf/0x150
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? zfsdev_ioctl+0x4f/0xd0 [zfs]
[ 2769.442102]  ? __x64_sys_ioctl+0x90/0xd0
[ 2769.442102]  ? do_syscall_64+0x55/0xb0

That's all I have time for tonight. This is a good start!

robn avatar Feb 15 '25 09:02 robn

Oh the other thing I forgot to add, after loading the module it says fastest, and fails as above. So it is selecting avx2 as fastest, which is good! (we need a stat for it, for sure).

robn avatar Feb 15 '25 09:02 robn

With #17089, looks pretty great:

$ ./tests/zfs-tests/cmd/crypto_test -c tests/zfs-tests/tests/functional/crypto/aes_gcm_test.txt
AES-GCM [generic+generic]: tests=316: passed=316 failed=0
AES-GCM [x86_64+generic]: tests=316: passed=316 failed=0
AES-GCM [aesni+generic]: tests=316: passed=316 failed=0
AES-GCM [generic+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [x86_64+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [aesni+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [x86_64+avx]: tests=316: passed=316 failed=0
AES-GCM [aesni+avx]: tests=316: passed=316 failed=0
AES-GCM [x86_64+avx2]: tests=316: passed=316 failed=0
AES-GCM [aesni+avx2]: tests=316: passed=316 failed=0

$ ./tests/zfs-tests/cmd/crypto_test -p AES-GCM
avg encrypt (1000 rounds)        1K     2K     4K     8K    16K    32K    64K   128K   256K   512K
AES-GCM [generic+generic]      26us   54us  149us  215us  458us  938us    1ms    3ms    7ms   14ms
AES-GCM [x86_64+generic]       14us   29us   89us  201us  428us  881us    1ms    3ms    7ms   14ms
AES-GCM [aesni+generic]        11us   24us   78us  182us  393us  809us    1ms    3ms    6ms   13ms
AES-GCM [generic+pclmulqdq]     7us   12us   21us   41us   79us  156us  310us  618us    1ms    2ms
AES-GCM [x86_64+pclmulqdq]      6us   10us   18us   33us   64us  127us  252us  501us    1ms    1ms
AES-GCM [aesni+pclmulqdq]       4us    6us   11us   19us   37us   73us  144us  287us  572us    1ms
AES-GCM [x86_64+avx]            2us    2us    3us    4us    5us    8us   15us   28us   55us  107us
AES-GCM [aesni+avx]             2us    2us    3us    3us    5us    8us   15us   28us   55us  106us
AES-GCM [x86_64+avx2]           2us    2us    2us    3us    4us    6us   10us   18us   34us   63us
AES-GCM [aesni+avx2]            2us    2us    2us    3us    4us    6us   10us   18us   34us   63us

robn avatar Feb 24 '25 11:02 robn

I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources.

Of the top of my head:

Openssl initializes the keyschedule with zeros while ICP does not. Later on the openssl code uses that fact to determine some parameter (the number of iterations IIRC). For IPC we have to cope with that since the keyschedule is created by the aes routines which I didn't want to touch.

If you need the patch or not depends on how the BoringSSL asm routine is handling this.

AttilaFueloep avatar Feb 25 '25 20:02 AttilaFueloep

@lowjoel I just merged https://github.com/openzfs/zfs/pull/17089. Would you mind rebasing this PR so we can have it run against that new test code?

tonyhutter avatar Feb 26 '25 01:02 tonyhutter

Given #16601 and #14531 we have now three PRs modifying gcm.c in incompatible ways. I already started refactoring the gcm routines in #14531. I'll try to extend that to make it easy to plug in additional SIMD routines and come up with something over the weekend. I think I've a plan already.

AttilaFueloep avatar Feb 26 '25 04:02 AttilaFueloep

@lowjoel I just merged #17089. Would you mind rebasing this PR so we can have it run against that new test code?

@tonyhutter rebased cleanly. Letting the test suite run.

lowjoel avatar Feb 27 '25 01:02 lowjoel

@lowjoel here's two commits you may or may not wish to include here:

  • robn/zfs@a1cdddc63bb92d33c355a32b1a01e22c895e42c6: actually test avx2 in crypto_test
  • robn/zfs@cedd824e2ba30d5167abdfdd5ed08d1763e03507: include vaes and vpclmulqdq in /proc/spl/kstat/zfs/simd

robn avatar Feb 27 '25 02:02 robn

@robn thanks - pulled into this PR.

lowjoel avatar Feb 27 '25 02:02 lowjoel

Another rebase on top of master should fix the Fedora build errors.

tonyhutter avatar Mar 04 '25 17:03 tonyhutter

I made some progress in refactoring the GCM SIMD interface. There's still some refinements and a lot of cleanup and documentation to do, so I'll need another couple of days to come up with something usefull.

AttilaFueloep avatar Mar 04 '25 21:03 AttilaFueloep

@AttilaFueloep is merging this PR going to mess up your work? I am all in favour of better abstractions inside the ICP, but that's going to take time to finish, review and merge, while this PR is ready to go right now. I'd like to not to hold it up unless it's really going to create a new mess for you.

Is there anything I can help you with?

robn avatar Mar 04 '25 22:03 robn

Another rebase on top of master should fix the Fedora build errors.

Done, thanks @tonyhutter

lowjoel avatar Mar 04 '25 23:03 lowjoel

@robn As I mentioned above, there are currently two other open PRs adding gcm SIMD: #14531 which is mine and #16601, opened by @w0xel . If you look at gcm.c in the three commits, you'll see what I mean. I'm currently refactoring the mess by writing generalized gcm_{enc,dec,..}_simd() functions where you could "just" plug in your ported assembly. So it's not about refactoring ICP but just gcm.c to enable multiple SIMD implementations. All three mentioned commits do that in different ways currently.

Would be good to hear the thoughts of @lowjoel and @w0xel as well.

is merging this PR going to mess up your work?

This depends on how you define mess up. If this is merged I can easily revert it in the upcoming refactoring PR and then rebase it on top of the refactoring. So basically merging will take time pressure off me but add additional work.

Is there anything I can help you with?

Yes, I've no real test rig, so I'd greatly appreciate any help with testing once I'm done..

AttilaFueloep avatar Mar 05 '25 00:03 AttilaFueloep

All three mentioned commits do that in different ways currently.

@AttilaFueloep if you are able to split the refactoring part out separately I would not mind rebasing my changes on top of yours, if that helps make an easier decision for the maintainers. Otherwise whichever is easiest for maintainers would be my preference.

Or, if you'd like to coordinate over the zfs-slack we can discuss there too (mainly how @robn and I communcate)

lowjoel avatar Mar 05 '25 00:03 lowjoel

Yes. I'm fine either way as well. After thinking more about it, I'd say I tend to prefer merging, so I don't feel pressed.

Or, if you'd like to coordinate over the zfs-slack

No, sorry. No Slack, IRC or whatever. Once bitten twice shy.

AttilaFueloep avatar Mar 05 '25 00:03 AttilaFueloep

@lowjoel BTW, did you see https://github.com/openzfs/zfs/pull/17058#issuecomment-2683240119 already?

AttilaFueloep avatar Mar 05 '25 01:03 AttilaFueloep

@AttilaFueloep sounds like merging now suits everyone: you get some breathing room, @lowjoel gets his gold star, and fancy folk with AVX2 machines get a nice bump (maybe in time for 2.3.1!)

And sure, once you've got a PR I'd be happy to stick it on the test rig and put it through its paces.

robn avatar Mar 05 '25 01:03 robn

@robn That would be nice!

While I have your attention, a review of #14531, once rebased on the refactoring, would be nice as well. To be honest, I was a bit disappointed to see no review feedback on it for years. It has a couple of satisfied users, so I'm trying to get it over the line currently.

AttilaFueloep avatar Mar 05 '25 01:03 AttilaFueloep

No, sorry. No Slack, IRC or whatever. Once bitten twice shy.

oops, sorry to hear that.

@lowjoel BTW, did you see #17058 (comment) already?

Thanks, I did, but the comment seemed out of place relative to the other asm comments. If I understand the assembly right, the line where the comment was added was intended to break the loop if we've reached the number of rounds for AES-256, but that seems to disagree with the comment. Maybe I'm misunderstanding something. Or does that affect some other key size?

I'm now also wondering if I'm missing testing for some corner case because that change hasn't been ported. And AFAIK BoringSSL uses the same API as OpenSSL

lowjoel avatar Mar 05 '25 01:03 lowjoel

@lowjoel Again, off the top of my head,

break the loop if we've reached the number of rounds for AES-256,

Right, the original code used a zero detection to break the loop, while my change checks for the number of rounds given. Of course this depends on how the assembly is written, it may be totally different in your case.

Let me look it up more thoroughly, it's been six years since I wrote this so my memories are a bit hazy.

AttilaFueloep avatar Mar 05 '25 01:03 AttilaFueloep

@lowjoel Ok, had a quick look:

module/icp/asm-x86_64/modes/aesni-gcm-avx2.S

384	movl	240(%rcx),%r10d
385	leal	-20(,%r10,4),%r10d
....
473	cmpl	$24,%r10d
474	jl	.Laes128__func1
475	je	.Laes192__func1

So %r10d seems to hold the number of rounds. 24 seems to be a strange value though, not sure what '%r10d' contains. Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.

AttilaFueloep avatar Mar 05 '25 02:03 AttilaFueloep

@AttilaFueloep

24 seems to be a strange value though, not sure what '%r10d' contains.

I also interpreted it to be number of rounds, 0-based. The instruction leal -20(,%r10,4),%r10d then converts r10 into one of {16,24,32} (had to fix the math for ICP), hence the later cmp against 24 checks for AES-192. r10 is also used to produce r11 which loads the later round key hence the mathing around... I had to stare at it for a good while myself.

Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.

Thanks!

lowjoel avatar Mar 05 '25 13:03 lowjoel

@lowjoel

24 seems to be a strange value though, not sure what '%r10d' contains.

I also interpreted it to be number of rounds, 0-based. The instruction leal -20(,%r10,4),%r10d then converts r10 into one of {16,24,32}

Sure, I was just wondering why it's expected to be 24 and not 11. If I read the code correctly %r10 is only used in compare and branch context where you could use both. I may miss something obvious though.

AttilaFueloep avatar Mar 06 '25 02:03 AttilaFueloep

At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2 but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?

Yeah. I think you're right. The commit calls it VAES+AVX2; I should have stuck with it. I'll push this later today

So modulo Htab size and module parameter name we are good to go I'd say.

Added htab refactor in https://github.com/openzfs/zfs/pull/17058/commits/c49a94cdbda560270967a897693c9a06e8b2be0a - the method seemed to not fit in the program flow any more hence the slightly more aggressive rewrite. Thanks @AttilaFueloep

lowjoel avatar Mar 08 '25 03:03 lowjoel

@AttilaFueloep renamed to avx2-vaes in https://github.com/openzfs/zfs/pull/17058/commits/59fb58a9d3ede6efe91c427527869871049c6c88; identifiers in code still remains just avx2 for symmetry with the assembly.

lowjoel avatar Mar 08 '25 10:03 lowjoel

@lowjoel Sorry for not being clear enough, was in a hurry. My main concern was the user visible module parameter that is shown in e.g. cat /sys/module/zfs/parameters/icp_gcm_impl. I think it would be good enough to just change the string, and keep the #define as is.

gcm.c L851

static const struct {
	const char *name;
	uint32_t sel;
} gcm_impl_opts[] = {
		{ "cycle",	IMPL_CYCLE },
		{ "fastest",	IMPL_FASTEST },
#ifdef CAN_USE_GCM_ASM
		{ "avx",	IMPL_AVX },
=>		{ "avx2",	IMPL_AVX2 },
#endif
};

AttilaFueloep avatar Mar 10 '25 15:03 AttilaFueloep

@lowjoel Regarding the Htab size: I'd have passed the GCM context to gcm_simd_get_htab_size() and returned an appropriate value there, but your change works as well.

AttilaFueloep avatar Mar 10 '25 15:03 AttilaFueloep

@AttilaFueloep @lowjoel just checking in - are you guys pretty happy with where this PR is right now? Are you just waiting on more approvals?

tonyhutter avatar Mar 25 '25 17:03 tonyhutter