icp: Port AVX2 implementation of aes-gcm from BoringSSL
Motivation and Context
Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be4391d96e81cee022f77f7bab85d51cf4e.
Description
I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be4391d96e81cee022f77f7bab85d51cf4e (but picking the tip of master), as well as from google/boringssl@62f9751ade8373ae3339daee581c80a173206321 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)
Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the third commit.
The fifth to seventh commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.
How Has This Been Tested?
Compile tested.
I'm now running it on my ZFS-on-root main machine.
@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [x] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain
Signed-off-by.
Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources. I've ported the other two changes, for the round offset in the AES_KEY struct, as well as the OpenSSL vs ICP representation of number of AES rounds. I've also fixed the gcm_init_vpclmulqdq_avx2 method which you mentioned ICP stores H in network order.
Strong opening move! I've had a little play with it, here's what I got.
Before the final thing, you should run make checkstyle, and fix the errors it throws out.
I added this patch to let me see what it thinks is happening on my test machines:
diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
"pclmulqdq", zfs_pclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"movbe", zfs_movbe_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vaes", zfs_vaes_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vpclmulqdq", zfs_vpclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE));
With that, my old Intel 2019 junker laptop says:
# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes 0
vpclmulqdq 0
My much nicer Ryzen 5 from last year says:
# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes 1
vpclmulqdq 1
Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:
# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] avx avx2 generic pclmulqdq
So I'd say it's all wired up right, which is half the fun.
I set it to avx and created a dataset, then unmounted and exported. Then I set it to avx2 and tried to zfs load-key, but that didn't work:
# zfs load-key -a
Enter passphrase for 'tank/enc':
Key load error: Incorrect key provided for 'tank/enc'.
Trying to create the pool and dataset with avx2 selected gets a crash:
# zpool create tank -O encryption=aes-256-gcm -O keyformat=passphrase /home/robn/blk
Enter new passphrase:
Re-enter new passphrase:
Killed
And the kernel has a nice complaint:
[ 2769.438918] BUG: unable to handle page fault for address: ffff9c23c8307e10
[ 2769.439133] #PF: supervisor read access in kernel mode
[ 2769.439290] #PF: error_code(0x0000) - not-present page
[ 2769.439447] PGD 110e01067 P4D 110e01067 PUD 0
[ 2769.439588] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2769.439733] CPU: 0 PID: 73114 Comm: zpool Tainted: P OE 6.1.0-25-amd64 #1 Debian 6.1.106-3
[ 2769.440035] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 2769.440236] RIP: 0010:aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.440596] Code: 0c 24 c4 e2 71 00 c8 c4 42 7d 5a 18 c4 62 25 00 d8 44 8b 91 78 01 00 00 46 8d 14 95 f0 ff ff ff 4e 8d 5c 91 60 c4 62 7d 5a 09 <c4> 42 7d 5a 13 c5 25 fe 1d d5 ed 0b 00 48 83 fa 7f 0f 8
6 31 03 00
[ 2769.441171] RSP: 0018:ffffb1b6c37df358 EFLAGS: 00010086
[ 2769.441339] RAX: ffffffffc089b140 RBX: ffffb1b6c37df43c RCX: ffff9c214b6eae00
[ 2769.441566] RDX: 0000000000000060 RSI: ffff9c208e918500 RDI: ffff9c208e918500
[ 2769.441793] RBP: 0000000000000006 R08: ffffb1b6c37df430 R09: ffff9c20e8eba3c0
[ 2769.442019] R10: 000000009f3073ec R11: ffff9c23c8307e10 R12: ffffb1b6c37df498
[ 2769.442102] R13: ffffffffc089b140 R14: ffffb1b6c37df420 R15: ffffb1b6c37df488
[ 2769.442102] FS: 00007f2d7ead5840(0000) GS:ffff9c23afc00000(0000) knlGS:0000000000000000
[ 2769.442102] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2769.442102] CR2: ffff9c23c8307e10 CR3: 00000001cccaa000 CR4: 00000000003506f0
[ 2769.442102] Call Trace:
[ 2769.442102] <TASK>
[ 2769.442102] ? __die_body.cold+0x1a/0x1f
[ 2769.442102] ? page_fault_oops+0xd2/0x2b0
[ 2769.442102] ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102] ? search_bpf_extables+0x5b/0x80
[ 2769.442102] ? exc_page_fault+0xca/0x170
[ 2769.442102] ? asm_exc_page_fault+0x22/0x30
[ 2769.442102] ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102] ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102] ? aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.442102] ? aesni_gcm_decrypt_avx2+0x2a/0x50 [zfs]
[ 2769.442102] ? gcm_decrypt_final_avx+0x167/0x460 [zfs]
[ 2769.442102] ? crypto_update_uio+0xc0/0x120 [zfs]
[ 2769.442102] ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102] ? aes_decrypt_atomic+0x1ca/0x310 [zfs]
[ 2769.442102] ? crypto_decrypt+0x78/0x1c0 [zfs]
[ 2769.442102] ? zio_do_crypt_uio+0x2ce/0x400 [zfs]
[ 2769.442102] ? zio_crypt_key_unwrap+0x254/0x480 [zfs]
[ 2769.442102] ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102] ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102] ? spa_keystore_dsl_key_hold_dd+0x12e/0x280 [zfs]
[ 2769.442102] ? __kmalloc_node+0x4c/0x150
[ 2769.442102] ? spa_keystore_create_mapping+0x79/0x200 [zfs]
[ 2769.442102] ? dsl_dataset_hold_obj_flags+0x48/0x90 [zfs]
[ 2769.442102] ? dsl_pool_create+0x1bc/0x490 [zfs]
[ 2769.442102] ? spa_create+0x83d/0xdb0 [zfs]
[ 2769.442102] ? zfs_ioc_pool_create+0xab/0x310 [zfs]
[ 2769.442102] ? zfsdev_ioctl_common+0x6a0/0x7c0 [zfs]
[ 2769.442102] ? __kmalloc_node+0xbf/0x150
[ 2769.442102] ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102] ? zfsdev_ioctl+0x4f/0xd0 [zfs]
[ 2769.442102] ? __x64_sys_ioctl+0x90/0xd0
[ 2769.442102] ? do_syscall_64+0x55/0xb0
That's all I have time for tonight. This is a good start!
Oh the other thing I forgot to add, after loading the module it says fastest, and fails as above. So it is selecting avx2 as fastest, which is good! (we need a stat for it, for sure).
With #17089, looks pretty great:
$ ./tests/zfs-tests/cmd/crypto_test -c tests/zfs-tests/tests/functional/crypto/aes_gcm_test.txt
AES-GCM [generic+generic]: tests=316: passed=316 failed=0
AES-GCM [x86_64+generic]: tests=316: passed=316 failed=0
AES-GCM [aesni+generic]: tests=316: passed=316 failed=0
AES-GCM [generic+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [x86_64+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [aesni+pclmulqdq]: tests=316: passed=316 failed=0
AES-GCM [x86_64+avx]: tests=316: passed=316 failed=0
AES-GCM [aesni+avx]: tests=316: passed=316 failed=0
AES-GCM [x86_64+avx2]: tests=316: passed=316 failed=0
AES-GCM [aesni+avx2]: tests=316: passed=316 failed=0
$ ./tests/zfs-tests/cmd/crypto_test -p AES-GCM
avg encrypt (1000 rounds) 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K
AES-GCM [generic+generic] 26us 54us 149us 215us 458us 938us 1ms 3ms 7ms 14ms
AES-GCM [x86_64+generic] 14us 29us 89us 201us 428us 881us 1ms 3ms 7ms 14ms
AES-GCM [aesni+generic] 11us 24us 78us 182us 393us 809us 1ms 3ms 6ms 13ms
AES-GCM [generic+pclmulqdq] 7us 12us 21us 41us 79us 156us 310us 618us 1ms 2ms
AES-GCM [x86_64+pclmulqdq] 6us 10us 18us 33us 64us 127us 252us 501us 1ms 1ms
AES-GCM [aesni+pclmulqdq] 4us 6us 11us 19us 37us 73us 144us 287us 572us 1ms
AES-GCM [x86_64+avx] 2us 2us 3us 4us 5us 8us 15us 28us 55us 107us
AES-GCM [aesni+avx] 2us 2us 3us 3us 5us 8us 15us 28us 55us 106us
AES-GCM [x86_64+avx2] 2us 2us 2us 3us 4us 6us 10us 18us 34us 63us
AES-GCM [aesni+avx2] 2us 2us 2us 3us 4us 6us 10us 18us 34us 63us
I can't figure out what you mean by the change with the comment
// ICP does not zero key schedule.in the modified assembly sources.
Of the top of my head:
Openssl initializes the keyschedule with zeros while ICP does not. Later on the openssl code uses that fact to determine some parameter (the number of iterations IIRC). For IPC we have to cope with that since the keyschedule is created by the aes routines which I didn't want to touch.
If you need the patch or not depends on how the BoringSSL asm routine is handling this.
@lowjoel I just merged https://github.com/openzfs/zfs/pull/17089. Would you mind rebasing this PR so we can have it run against that new test code?
Given #16601 and #14531 we have now three PRs modifying gcm.c in incompatible ways. I already started refactoring the gcm routines in #14531. I'll try to extend that to make it easy to plug in additional SIMD routines and come up with something over the weekend. I think I've a plan already.
@lowjoel I just merged #17089. Would you mind rebasing this PR so we can have it run against that new test code?
@tonyhutter rebased cleanly. Letting the test suite run.
@lowjoel here's two commits you may or may not wish to include here:
- robn/zfs@a1cdddc63bb92d33c355a32b1a01e22c895e42c6: actually test
avx2incrypto_test - robn/zfs@cedd824e2ba30d5167abdfdd5ed08d1763e03507: include
vaesandvpclmulqdqin/proc/spl/kstat/zfs/simd
@robn thanks - pulled into this PR.
Another rebase on top of master should fix the Fedora build errors.
I made some progress in refactoring the GCM SIMD interface. There's still some refinements and a lot of cleanup and documentation to do, so I'll need another couple of days to come up with something usefull.
@AttilaFueloep is merging this PR going to mess up your work? I am all in favour of better abstractions inside the ICP, but that's going to take time to finish, review and merge, while this PR is ready to go right now. I'd like to not to hold it up unless it's really going to create a new mess for you.
Is there anything I can help you with?
Another rebase on top of master should fix the Fedora build errors.
Done, thanks @tonyhutter
@robn As I mentioned above, there are currently two other open PRs adding gcm SIMD: #14531 which is mine and #16601, opened by @w0xel . If you look at gcm.c in the three commits, you'll see what I mean. I'm currently refactoring the mess by writing generalized gcm_{enc,dec,..}_simd() functions where you could "just" plug in your ported assembly. So it's not about refactoring ICP but just gcm.c to enable multiple SIMD implementations. All three mentioned commits do that in different ways currently.
Would be good to hear the thoughts of @lowjoel and @w0xel as well.
is merging this PR going to mess up your work?
This depends on how you define mess up. If this is merged I can easily revert it in the upcoming refactoring PR and then rebase it on top of the refactoring. So basically merging will take time pressure off me but add additional work.
Is there anything I can help you with?
Yes, I've no real test rig, so I'd greatly appreciate any help with testing once I'm done..
All three mentioned commits do that in different ways currently.
@AttilaFueloep if you are able to split the refactoring part out separately I would not mind rebasing my changes on top of yours, if that helps make an easier decision for the maintainers. Otherwise whichever is easiest for maintainers would be my preference.
Or, if you'd like to coordinate over the zfs-slack we can discuss there too (mainly how @robn and I communcate)
Yes. I'm fine either way as well. After thinking more about it, I'd say I tend to prefer merging, so I don't feel pressed.
Or, if you'd like to coordinate over the zfs-slack
No, sorry. No Slack, IRC or whatever. Once bitten twice shy.
@lowjoel BTW, did you see https://github.com/openzfs/zfs/pull/17058#issuecomment-2683240119 already?
@AttilaFueloep sounds like merging now suits everyone: you get some breathing room, @lowjoel gets his gold star, and fancy folk with AVX2 machines get a nice bump (maybe in time for 2.3.1!)
And sure, once you've got a PR I'd be happy to stick it on the test rig and put it through its paces.
@robn That would be nice!
While I have your attention, a review of #14531, once rebased on the refactoring, would be nice as well. To be honest, I was a bit disappointed to see no review feedback on it for years. It has a couple of satisfied users, so I'm trying to get it over the line currently.
No, sorry. No Slack, IRC or whatever. Once bitten twice shy.
oops, sorry to hear that.
@lowjoel BTW, did you see #17058 (comment) already?
Thanks, I did, but the comment seemed out of place relative to the other asm comments. If I understand the assembly right, the line where the comment was added was intended to break the loop if we've reached the number of rounds for AES-256, but that seems to disagree with the comment. Maybe I'm misunderstanding something. Or does that affect some other key size?
I'm now also wondering if I'm missing testing for some corner case because that change hasn't been ported. And AFAIK BoringSSL uses the same API as OpenSSL
@lowjoel Again, off the top of my head,
break the loop if we've reached the number of rounds for AES-256,
Right, the original code used a zero detection to break the loop, while my change checks for the number of rounds given. Of course this depends on how the assembly is written, it may be totally different in your case.
Let me look it up more thoroughly, it's been six years since I wrote this so my memories are a bit hazy.
@lowjoel Ok, had a quick look:
module/icp/asm-x86_64/modes/aesni-gcm-avx2.S
384 movl 240(%rcx),%r10d
385 leal -20(,%r10,4),%r10d
....
473 cmpl $24,%r10d
474 jl .Laes128__func1
475 je .Laes192__func1
So %r10d seems to hold the number of rounds. 24 seems to be a strange value though, not sure what '%r10d' contains. Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.
@AttilaFueloep
24seems to be a strange value though, not sure what '%r10d' contains.
I also interpreted it to be number of rounds, 0-based. The instruction leal -20(,%r10,4),%r10d then converts r10 into one of {16,24,32} (had to fix the math for ICP), hence the later cmp against 24 checks for AES-192. r10 is also used to produce r11 which loads the later round key hence the mathing around... I had to stare at it for a good while myself.
Anyway you seem to be good since this code doesn't depend on a zeroed keyschedule.
Thanks!
@lowjoel
24seems to be a strange value though, not sure what '%r10d' contains.I also interpreted it to be number of rounds, 0-based. The instruction
leal -20(,%r10,4),%r10dthen converts r10 into one of {16,24,32}
Sure, I was just wondering why it's expected to be 24 and not 11. If I read the code correctly %r10 is only used in compare and branch context where you could use both. I may miss something obvious though.
At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called
avx2but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something likevaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?
Yeah. I think you're right. The commit calls it VAES+AVX2; I should have stuck with it. I'll push this later today
So modulo Htab size and module parameter name we are good to go I'd say.
Added htab refactor in https://github.com/openzfs/zfs/pull/17058/commits/c49a94cdbda560270967a897693c9a06e8b2be0a - the method seemed to not fit in the program flow any more hence the slightly more aggressive rewrite. Thanks @AttilaFueloep
@AttilaFueloep renamed to avx2-vaes in https://github.com/openzfs/zfs/pull/17058/commits/59fb58a9d3ede6efe91c427527869871049c6c88; identifiers in code still remains just avx2 for symmetry with the assembly.
@lowjoel Sorry for not being clear enough, was in a hurry. My main concern was the user visible module parameter that is shown in e.g. cat /sys/module/zfs/parameters/icp_gcm_impl. I think it would be good enough to just change the string, and keep the #define as is.
gcm.c L851
static const struct {
const char *name;
uint32_t sel;
} gcm_impl_opts[] = {
{ "cycle", IMPL_CYCLE },
{ "fastest", IMPL_FASTEST },
#ifdef CAN_USE_GCM_ASM
{ "avx", IMPL_AVX },
=> { "avx2", IMPL_AVX2 },
#endif
};
@lowjoel Regarding the Htab size: I'd have passed the GCM context to gcm_simd_get_htab_size() and returned an appropriate value there, but your change works as well.
@AttilaFueloep @lowjoel just checking in - are you guys pretty happy with where this PR is right now? Are you just waiting on more approvals?