Add SIMD support in BitLocker CPU format
FWIW, I just ran all of our CPU benchmarks on AWS c5.xlarge vs. c6i.xlarge. Both are AVX-512 capable, so I was unsure which would be faster. I was surprised to find out, thanks to relbench -v, that BitLocker became 3.5x faster! It turns out that's because we lack SIMD support for it, while the newer CPUs support SHA-256 instructions (and presumably OpenSSL on Amazon Linux 2 is recent enough to use them).
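One quick way to check whether a given instance advertises the SHA extensions is to look for the sha_ni flag in /proc/cpuinfo. Below is a minimal sketch, assuming Linux and that flag name; has_sha_ni is a hypothetical helper for illustration, not something in our tree:

```python
def has_sha_ni(cpuinfo_text):
    """Return True if any 'flags' line in the cpuinfo text lists sha_ni."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at the CPU feature flags line, not e.g. the model name.
        if sep and key.strip() == "flags" and "sha_ni" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("sha_ni present:", has_sha_ni(f.read()))
    except OSError:
        print("/proc/cpuinfo not available on this system")
```

This only tells us what the CPU offers; whether OpenSSL actually takes that code path is a separate question, which is what the gdb check is for.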
c5.xlarge (4 vCPUs / 2 cores in Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz):
Benchmarking: BitLocker, BitLocker [SHA-256 AES 32/64]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1048576
Raw: 5.4 c/s real, 1.3 c/s virtual
c6i.xlarge (4 vCPUs / 2 cores in Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz):
Benchmarking: BitLocker, BitLocker [SHA-256 AES 32/64]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1048576
Raw: 19.1 c/s real, 4.7 c/s virtual
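For reference, the speedup can be recomputed from the two "Raw:" lines above. This is just a sketch of the arithmetic, not how relbench does it; real_cps is a hypothetical helper name:

```python
import re


def real_cps(raw_line):
    """Extract the 'real' c/s figure from a 'Raw:' benchmark line."""
    match = re.search(r"Raw:\s*([\d.]+)\s*c/s real", raw_line)
    if match is None:
        raise ValueError("not a Raw benchmark line: %r" % raw_line)
    return float(match.group(1))


c5 = real_cps("Raw: 5.4 c/s real, 1.3 c/s virtual")
c6i = real_cps("Raw: 19.1 c/s real, 4.7 c/s virtual")
speedup = c6i / c5
print("speedup: %.2fx" % speedup)  # prints "speedup: 3.54x"
```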
To confirm, I interrupted a re-run of the latter in gdb:
0x00007ffff77900fe <sha256_block_data_order_shaext+318>: movdqa %xmm5,%xmm7
0x00007ffff7790102 <sha256_block_data_order_shaext+322>: palignr $0x4,%xmm4,%xmm7
0x00007ffff7790108 <sha256_block_data_order_shaext+328>: nop
0x00007ffff7790109 <sha256_block_data_order_shaext+329>: paddd %xmm7,%xmm6
0x00007ffff779010d <sha256_block_data_order_shaext+333>: sha256msg1 %xmm4,%xmm3
0x00007ffff7790111 <sha256_block_data_order_shaext+337>: sha256rnds2 %xmm0,%xmm2,%xmm1
0x00007ffff7790115 <sha256_block_data_order_shaext+341>: movdqa 0x40(%rcx),%xmm0
0x00007ffff779011a <sha256_block_data_order_shaext+346>: paddd %xmm5,%xmm0
0x00007ffff779011e <sha256_block_data_order_shaext+350>: sha256msg2 %xmm5,%xmm6
0x00007ffff7790122 <sha256_block_data_order_shaext+354>: sha256rnds2 %xmm0,%xmm1,%xmm2
0x00007ffff7790126 <sha256_block_data_order_shaext+358>: pshufd $0xe,%xmm0,%xmm0
0x00007ffff779012b <sha256_block_data_order_shaext+363>: movdqa %xmm6,%xmm7
0x00007ffff779012f <sha256_block_data_order_shaext+367>: palignr $0x4,%xmm5,%xmm7
0x00007ffff7790135 <sha256_block_data_order_shaext+373>: nop
0x00007ffff7790136 <sha256_block_data_order_shaext+374>: paddd %xmm7,%xmm3
=> 0x00007ffff779013a <sha256_block_data_order_shaext+378>: sha256msg1 %xmm5,%xmm4
0x00007ffff779013e <sha256_block_data_order_shaext+382>: sha256rnds2 %xmm0,%xmm2,%xmm1
0x00007ffff7790142 <sha256_block_data_order_shaext+386>: movdqa 0x60(%rcx),%xmm0
0x00007ffff7790147 <sha256_block_data_order_shaext+391>: paddd %xmm6,%xmm0
0x00007ffff779014b <sha256_block_data_order_shaext+395>: sha256msg2 %xmm6,%xmm3
0x00007ffff779014f <sha256_block_data_order_shaext+399>: sha256rnds2 %xmm0,%xmm1,%xmm2
0x00007ffff7790153 <sha256_block_data_order_shaext+403>: pshufd $0xe,%xmm0,%xmm0
0x00007ffff7790158 <sha256_block_data_order_shaext+408>: movdqa %xmm3,%xmm7
0x00007ffff779015c <sha256_block_data_order_shaext+412>: palignr $0x4,%xmm6,%xmm7
0x00007ffff7790162 <sha256_block_data_order_shaext+418>: nop
0x00007ffff7790163 <sha256_block_data_order_shaext+419>: paddd %xmm7,%xmm4
0x00007ffff7790167 <sha256_block_data_order_shaext+423>: sha256msg1 %xmm6,%xmm5
0x00007ffff779016b <sha256_block_data_order_shaext+427>: sha256rnds2 %xmm0,%xmm2,%xmm1
0x00007ffff779016f <sha256_block_data_order_shaext+431>: movdqa 0x80(%rcx),%xmm0
I wonder what the nops are for in there.
Just thought I'd share. SIMD should be faster yet, since we can fit 16 parallel SHA-256 computations in AVX-512.
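The figure of 16 follows from SHA-256 working on 32-bit words: an interleaved SIMD implementation keeps one word from each independent hash in each 32-bit lane of a vector register. A small sketch of that arithmetic (sha256_lanes is a hypothetical name, and this says nothing about actual throughput per lane):

```python
SHA256_WORD_BITS = 32  # SHA-256 state and message words are 32 bits wide


def sha256_lanes(vector_bits):
    """Number of independent SHA-256 computations per vector register."""
    return vector_bits // SHA256_WORD_BITS


for name, bits in (("SSE2", 128), ("AVX2", 256), ("AVX-512", 512)):
    print("%s: %d lanes" % (name, sha256_lanes(bits)))
```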