Add SIMD support in BitLocker CPU format

Open kholia opened this issue 7 years ago • 1 comment

kholia avatar Apr 22 '17 07:04 kholia

FWIW, I just ran all of our CPU benchmarks on AWS c5.xlarge vs. c6i.xlarge. Both are AVX-512 capable, so I was unsure which would be faster. I was surprised to find out, thanks to relbench -v, that BitLocker became 3.5x faster! Turns out that's because we lack SIMD support for it, so the format falls back to OpenSSL's SHA-256, and the newer CPUs support SHA-256 instructions (presumably OpenSSL on Amazon Linux 2 is recent enough to use them).
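A quick way to check whether a Linux x86 machine advertises the SHA extensions is to look for the sha_ni flag in /proc/cpuinfo. The sketch below is only an illustration: the helper names cpu_flags and has_sha_ni are made up for this example, it assumes the kernel's usual "flags : ..." line format, and it says nothing about whether the installed OpenSSL actually uses the instructions.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text.

    Only the first "flags" line is used; on typical systems every
    logical CPU reports the same set.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly so that e.g. "vmx flags" is not picked up.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sha_ni(cpuinfo_text):
    """True if the text lists the sha_ni flag (x86 SHA extensions)."""
    return "sha_ni" in cpu_flags(cpuinfo_text)
```

On a Linux box this could be called as has_sha_ni(open("/proc/cpuinfo").read()); I'd expect it to be false on the c5 instance and true on the c6i one, but I have not run it on either.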

c5.xlarge (4 vCPUs / 2 cores in Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz):

Benchmarking: BitLocker, BitLocker [SHA-256 AES 32/64]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1048576
Raw:    5.4 c/s real, 1.3 c/s virtual

c6i.xlarge (4 vCPUs / 2 cores in Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz):

Benchmarking: BitLocker, BitLocker [SHA-256 AES 32/64]... (4xOMP) DONE
Speed for cost 1 (iteration count) of 1048576
Raw:    19.1 c/s real, 4.7 c/s virtual

To confirm, I interrupted a re-run of the latter in gdb:

   0x00007ffff77900fe <sha256_block_data_order_shaext+318>:     movdqa %xmm5,%xmm7
   0x00007ffff7790102 <sha256_block_data_order_shaext+322>:     palignr $0x4,%xmm4,%xmm7
   0x00007ffff7790108 <sha256_block_data_order_shaext+328>:     nop
   0x00007ffff7790109 <sha256_block_data_order_shaext+329>:     paddd  %xmm7,%xmm6
   0x00007ffff779010d <sha256_block_data_order_shaext+333>:     sha256msg1 %xmm4,%xmm3
   0x00007ffff7790111 <sha256_block_data_order_shaext+337>:     sha256rnds2 %xmm0,%xmm2,%xmm1
   0x00007ffff7790115 <sha256_block_data_order_shaext+341>:     movdqa 0x40(%rcx),%xmm0
   0x00007ffff779011a <sha256_block_data_order_shaext+346>:     paddd  %xmm5,%xmm0
   0x00007ffff779011e <sha256_block_data_order_shaext+350>:     sha256msg2 %xmm5,%xmm6
   0x00007ffff7790122 <sha256_block_data_order_shaext+354>:     sha256rnds2 %xmm0,%xmm1,%xmm2
   0x00007ffff7790126 <sha256_block_data_order_shaext+358>:     pshufd $0xe,%xmm0,%xmm0
   0x00007ffff779012b <sha256_block_data_order_shaext+363>:     movdqa %xmm6,%xmm7
   0x00007ffff779012f <sha256_block_data_order_shaext+367>:     palignr $0x4,%xmm5,%xmm7
   0x00007ffff7790135 <sha256_block_data_order_shaext+373>:     nop
   0x00007ffff7790136 <sha256_block_data_order_shaext+374>:     paddd  %xmm7,%xmm3
=> 0x00007ffff779013a <sha256_block_data_order_shaext+378>:     sha256msg1 %xmm5,%xmm4
   0x00007ffff779013e <sha256_block_data_order_shaext+382>:     sha256rnds2 %xmm0,%xmm2,%xmm1
   0x00007ffff7790142 <sha256_block_data_order_shaext+386>:     movdqa 0x60(%rcx),%xmm0
   0x00007ffff7790147 <sha256_block_data_order_shaext+391>:     paddd  %xmm6,%xmm0
   0x00007ffff779014b <sha256_block_data_order_shaext+395>:     sha256msg2 %xmm6,%xmm3
   0x00007ffff779014f <sha256_block_data_order_shaext+399>:     sha256rnds2 %xmm0,%xmm1,%xmm2
   0x00007ffff7790153 <sha256_block_data_order_shaext+403>:     pshufd $0xe,%xmm0,%xmm0
   0x00007ffff7790158 <sha256_block_data_order_shaext+408>:     movdqa %xmm3,%xmm7
   0x00007ffff779015c <sha256_block_data_order_shaext+412>:     palignr $0x4,%xmm6,%xmm7
   0x00007ffff7790162 <sha256_block_data_order_shaext+418>:     nop
   0x00007ffff7790163 <sha256_block_data_order_shaext+419>:     paddd  %xmm7,%xmm4
   0x00007ffff7790167 <sha256_block_data_order_shaext+423>:     sha256msg1 %xmm6,%xmm5
   0x00007ffff779016b <sha256_block_data_order_shaext+427>:     sha256rnds2 %xmm0,%xmm2,%xmm1
   0x00007ffff779016f <sha256_block_data_order_shaext+431>:     movdqa 0x80(%rcx),%xmm0

I wonder what the nops are for in there.

Just thought I'd share. SIMD should be faster yet, since we can fit 16 parallel SHA-256 computations in one AVX-512 vector.
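The figure of 16 follows from SHA-256 working on 32-bit words: a 512-bit register holds 512 / 32 = 16 of them, so 16 independent hashes can advance in lockstep, one per lane. A small sketch of that arithmetic (sha256_lanes is a hypothetical helper, and lane count alone is not a throughput prediction, since clock speed and instruction costs also matter):

```python
SHA256_WORD_BITS = 32  # SHA-256 state and message schedule use 32-bit words


def sha256_lanes(vector_bits):
    """Number of independent SHA-256 computations per SIMD register."""
    return vector_bits // SHA256_WORD_BITS


# Register widths of the common x86 SIMD instruction sets.
for name, bits in (("SSE2", 128), ("AVX2", 256), ("AVX-512", 512)):
    print(name, sha256_lanes(bits))
```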

solardiz avatar Mar 23 '23 20:03 solardiz