Make BitLocker OpenCL use shared SHA-256 code
Right now, opencl_bitlocker.h and bitlocker_kernel.cl have their own implementation of SHA-256. We could both reduce source code size/duplication and likely speed things up by switching to our shared and maintained SHA-256 code.
In particular, I notice that BitLocker's current code does not use bitselect - it can only use LUT3 and classic (two-operand) Boolean operations. This likely makes it less optimal on AMD GPUs. It also does not use rotate. So a temporary fix (before the refactoring to use shared code) may be to add such usage (in the same cases where we do it in our shared SHA-256 code).
We also have a few #if 0 && SHA256_LUT3 blocks in the shared code, in places where other alternatives are faster. We currently use LUT3 for Ch and Maj but not for the Sigma/sigma functions. That's subject to change though, as it hasn't been tested with recent drivers.
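The bitselect and rotate substitutions discussed above can be sketched in plain C. This is only an illustration under stated assumptions: `bitselect32` and `rotl32` are hypothetical stand-ins for OpenCL's `bitselect()` and `rotate()`, and the operand orders shown are the standard identities, not necessarily the exact macros in the shared code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL bitselect(a, b, c):
 * each result bit comes from b where c has a 1, otherwise from a. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Hypothetical stand-in for OpenCL rotate(x, n): rotate left, 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Classic two-operand forms from the SHA-256 definition. */
static uint32_t ch_classic(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

static uint32_t maj_classic(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Ch picks y where x is 1, else z: a single select. */
static uint32_t ch_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(z, y, x);
}

/* Maj is x where x == z, else y: a single select on (z ^ x). */
static uint32_t maj_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(x, y, z ^ x);
}

/* Sigma1 with each right rotation written as shift+or. */
static uint32_t Sigma1_shift_or(uint32_t x)
{
    return ((x >> 6) | (x << 26)) ^ ((x >> 11) | (x << 21)) ^
           ((x >> 25) | (x << 7));
}

/* Same function via left rotate: ROTR(x, n) == ROTL(x, 32 - n). */
static uint32_t Sigma1_rotate(uint32_t x)
{
    return rotl32(x, 26) ^ rotl32(x, 21) ^ rotl32(x, 7);
}

/* Asserts that the alternative forms agree for one input triple. */
void check_equivalence(uint32_t x, uint32_t y, uint32_t z)
{
    assert(ch_select(x, y, z) == ch_classic(x, y, z));
    assert(maj_select(x, y, z) == maj_classic(x, y, z));
    assert(Sigma1_rotate(x) == Sigma1_shift_or(x));
}
```

Calling `check_equivalence()` on a few arbitrary words confirms the forms are interchangeable, so any change in speed or binary size comes from how the compiler lowers them, not from different results.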
NVIDIA 2080 Ti, speeds at -lws=64 -gws=8704 (not an optimal size; the format is so incredibly slow that I just picked a number for a rough evaluation):
Original kernel binary size: 885314 bytes and 2371 c/s. Using rotate(): 884633 bytes and 2378 c/s. Also using our shared lut3() function instead of the existing, slightly different, one: 882102 bytes and 2397 c/s.
The speed differences might just be the usual run-to-run fluctuation, but the sizes are interesting.
I tried using plain code instead of lut3 for all three functions (one at a time) but only one of them made for smaller code size (the xor) while not really gaining any speed.
More interesting would be testing it on AMD of course. I'm not aware of bitselect alternatives for the "xor" or the "xorand" though. Current code is in #5278.
The difference of lut3 implementation is interesting. Here's our shared one:
inline uint lut3(uint a, uint b, uint c, uint imm)
{
    uint r;
    asm("lop3.b32 %0, %1, %2, %3, %4;"
        : "=r" (r)
        : "r" (a), "r" (b), "r" (c), "i" (imm));
    return r;
}
Here's the old one in bitlocker:
inline unsigned int LOP3LUT_XOR(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int d;
    asm("lop3.b32 %0, %1, %2, %3, 0x96;": "=r"(d):"r"(a), "r"(b), "r"(c));
    return d;
}
Is the difference that the latter somehow isn't using "immediate" for the 0x96??
For whatever a single test run is worth, the AMD GPU in super (with the other session paused) went from 1779 c/s to 1800 c/s. Not a lot.
> The difference of lut3 implementation is interesting. Here's our shared one:
BTW with the new code, the number of lop3 instructions in the PTX somehow goes up by one, from 2592 to 2593. The only difference apart from the exact inline assembler for LUT3 would be using rotate() instead of shift+or.
> Is the difference that the latter somehow isn't using "immediate" for the 0x96??
Looking at PTX output, there's no difference in the actual instruction:
Before:
$ grep -C3 -m1 lop3 bitlocker_kernel.cl.bin.1
mov.u32 %r58, 1340744138;
mov.u32 %r59, -2027339864;
// begin inline asm
lop3.b32 %r56, %r57, %r58, %r59, 0x96;
// end inline asm
mov.u32 %r61, 528734635;
mov.u32 %r62, 1359893119;
Using shared (150 == 0x96):
$ grep -C3 -m1 lop3 bitlocker_kernel.cl.bin
shf.l.wrap.b32 %r69, %r72, %r72, 7;
shf.l.wrap.b32 %r68, %r72, %r72, 21;
// begin inline asm
lop3.b32 %r66, %r67, %r68, %r69, 150;
// end inline asm
mov.u32 %r71, 528734635;
mov.u32 %r73, -1694144372;
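To make the 0x96 immediate less opaque: per NVIDIA's PTX documentation for lop3, the 8-bit immediate is a truth table, obtained by applying the desired three-input function to the constants 0xF0, 0xCC and 0xAA. Below is a minimal plain-C model of that behaviour. The names `lut3_emulated`, `check_lut3_model` and the `IMM_*` constants are hypothetical and used only for illustration; this is not the OpenCL helper itself.

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table columns for the three inputs, as used for the lop3 immediate. */
enum { TA = 0xF0, TB = 0xCC, TC = 0xAA };

/* Immediates derived by applying each function to the columns above. */
enum {
    IMM_XOR3 = TA ^ TB ^ TC,                       /* three-input XOR */
    IMM_CH   = (TA & TB) ^ (~TA & TC),             /* SHA-2 Ch(a, b, c) */
    IMM_MAJ  = (TA & TB) ^ (TA & TC) ^ (TB & TC)   /* SHA-2 Maj(a, b, c) */
};

/* Software model of a three-input LUT: at every bit position the bits of
 * a, b, c form a 3-bit index (a is the high bit) into the table imm. */
static uint32_t lut3_emulated(uint32_t a, uint32_t b, uint32_t c, unsigned imm)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                       ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << bit;
    }
    return r;
}

/* Asserts that the model matches the plain Boolean forms for one triple. */
void check_lut3_model(uint32_t a, uint32_t b, uint32_t c)
{
    assert(lut3_emulated(a, b, c, IMM_XOR3) == (a ^ b ^ c));
    assert(lut3_emulated(a, b, c, IMM_CH) == ((a & b) ^ (~a & c)));
    assert(lut3_emulated(a, b, c, IMM_MAJ) == ((a & b) ^ (a & c) ^ (b & c)));
}
```

Here `IMM_XOR3` works out to 0x96, which is 150 in decimal, so the "before" and "after" PTX lines encode the same three-input XOR and differ only in how the constant is printed.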
#5278 was a start, but yours truly isn't going to do anything more with that format - it's very hard to spot where a standard SHA-2 operation starts or ends. And testing is slow!
> it's very hard to spot where a standard SHA-2 operation starts or ends
I guess it would be easier to rewrite based on the CPU format, which is pretty simple.