
Switch to Sovyn Y.'s DES S-box expressions

solardiz opened this issue 2 years ago

These improve gate counts over DeepLearningJohnDoe's.
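For context, here is a toy sketch of the bitslicing these gate counts matter for. This is an illustrative 2-gate function, not one of the actual S-box expressions (the real ones map 6 inputs to 4 outputs through many more gates); the point is that each gate costs one bitwise instruction across all lanes, so fewer gates means fewer instructions per batch of DES instances.

```python
import random

# Each 64-bit integer holds one input bit for 64 independent DES instances
# ("lanes"); a single bitwise operator then acts as one logic gate applied
# to all lanes at once. (With AVX-512 the lane count is 512 per register.)
def toy_gates(x0, x1, x2):
    t = x0 & x1      # gate 1: AND across all 64 lanes
    return t ^ x2    # gate 2: XOR across all 64 lanes

random.seed(1)
x0, x1, x2 = (random.getrandbits(64) for _ in range(3))
y = toy_gates(x0, x1, x2)

# Cross-check lane by lane against the scalar boolean function.
for i in range(64):
    a, b, c = (x0 >> i) & 1, (x1 >> i) & 1, (x2 >> i) & 1
    assert (y >> i) & 1 == (a & b) ^ c
print("all 64 lanes agree")
```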

solardiz avatar Mar 12 '24 23:03 solardiz

Done for AVX-512. Yet to do for OpenCL.

I've been switching to the new S-boxes one by one, watching code size. It decreased for all but S4, where there was a slight increase compared to Roman Rusakov's (also 17 gates), so I kept the latter enabled by default for now (this is now easy to switch back and forth in the source file).

For some others, code size decreased, but performance might have been hurt - hopefully not! It's hard to tell, as my current testing is in a VM on a laptop running some other VMs/apps - we need to retest on an otherwise-idle system and not in a VM. For now, I optimized the S-box choice by code size (as gcc generates here), not by speed (which fluctuates here).

Overall, the DES_bs_b.o .text section decreased by 4.3% (41922 to 40117 bytes) and there is definitely some speedup. :-)
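As a quick sanity check on the quoted shrink (byte counts from this comment):

```python
# .text section of DES_bs_b.o before and after switching S-boxes.
old_size, new_size = 41922, 40117
print(f"{(old_size - new_size) / old_size:.1%}")  # -> 4.3%
```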

solardiz avatar Oct 20 '24 02:10 solardiz

Also, need to add this to NEWS crediting Sovyn Y. - this is definitely newsworthy. Perhaps along with the OpenCL implementation, as I forgot to update NEWS in today's commit for AVX-512.

Also, the bitslice DES code has historically had mentions in CREDITS, so we need to add there as well. Though in jumbo it's weird and maybe misleading to single out this and a few other bits of functionality and ingenuity while omitting lots of others, so we may reconsider having those credits at all later.

solardiz avatar Oct 20 '24 03:10 solardiz

Testing on a GTX 1080 with driver version 418.39 and CUDA version 10.1, I get for our old code:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'DES_bs_25' for 'sm_61'
ptxas info    : Function properties for DES_bs_25
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 186 registers, 28676 bytes smem, 336 bytes cmem[0]
binary size 290647
Salt compiled from Source:0

For all 8 new S-boxes (Sovyn Y.'s):

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'DES_bs_25' for 'sm_61'
ptxas info    : Function properties for DES_bs_25
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 216 registers, 28676 bytes smem, 336 bytes cmem[0]
binary size 278704
Salt compiled from Source:0

Reverting to Roman Rusakov's S4 (same gate count) decreases both register pressure and code size:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'DES_bs_25' for 'sm_61'
ptxas info    : Function properties for DES_bs_25
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 211 registers, 28676 bytes smem, 336 bytes cmem[0]
binary size 277179
Salt compiled from Source:0

Also reverting to DeepLearningJohnDoe's S8 (same gate count) decreases register pressure a lot but increases code size a little:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'DES_bs_25' for 'sm_61'
ptxas info    : Function properties for DES_bs_25
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 190 registers, 28676 bytes smem, 336 bytes cmem[0]
binary size 277243
Salt compiled from Source:0

I think I'll keep this final selection. With it, we went from 186 to 190 registers (+2.2%), but from 290647 to 277243 bytes (-4.6%).
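A quick check of those deltas (numbers taken from the ptxas logs in this comment):

```python
# Register pressure and code size, old S-boxes vs. the final selection.
regs_old, regs_new = 186, 190
size_old, size_new = 290647, 277243

print(f"registers: {(regs_new - regs_old) / regs_old:+.1%}")  # -> +2.2%
print(f"code size: {(size_new - size_old) / size_old:+.1%}")  # -> -4.6%
```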

Speeds change from:

Device 4: GeForce GTX 1080
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=64 GWS=65536 x950 DONE
Warning: "Many salts" test limited: 11/256
Many salts:     337365K c/s real, 326119K c/s virtual, Dev#4 util: 100%
Only one salt:  320023K c/s real, 318535K c/s virtual, Dev#4 util: 100%

to:

Device 4: GeForce GTX 1080
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=128 GWS=65536 x950 DONE
Warning: "Many salts" test limited: 12/256
Many salts:     357469K c/s real, 355766K c/s virtual, Dev#4 util: 100%
Only one salt:  337365K c/s real, 337365K c/s virtual, Dev#4 util: 98%

or forced to old LWS/GWS:

Device 4: GeForce GTX 1080
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=64 GWS=65536 x950 DONE
Warning: "Many salts" test limited: 12/256
Many salts:     355766K c/s real, 345884K c/s virtual, Dev#4 util: 100%
Only one salt:  339035K c/s real, 337365K c/s virtual, Dev#4 util: 98%

That's almost 6% speedup. (All of these speeds are pretty bad compared to hashcat. I assume that's because we don't have a rolled version of the code as mentioned in #1908. If we ever fix that issue, then the new S-boxes should help on top of that. Also, apparently that issue is not nearly as bad on newer GPUs.)
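For reference, the "almost 6%" is the many-salts real c/s change between the first and second benchmarks above:

```python
# Many-salts benchmark, real c/s, before vs. after the S-box switch.
before, after = 337365, 357469
print(f"{(after - before) / before:.2%}")  # -> 5.96%
```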

solardiz avatar Oct 21 '24 01:10 solardiz

There appears to be a regression for LM: register pressure increases a lot. Was:

[solar@super run]$ LWS=128 GWS=65536 ./john -te -form=lm-opencl -dev=4 -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 4: GeForce GTX 1080
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... Loaded 9 hashes with 1 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-nv-verbose -DSM_MAJOR=6 -DSM_MINOR=1 -D__GPU__ -DDEVICE_INFO=524306 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -D FULL_UNROLL=1 -D USE_LOCAL_MEM=1 -D WORK_GROUP_SIZE=8 -D OFFSET_TABLE_SIZE=13 -D HASH_TABLE_SIZE=9 -D MASK_ENABLE=0 -D ITER_COUNT=1 -D LOC_0=-1 -D LOC_1=-1 -D LOC_2=-1 -D LOC_3=-1 -D IS_STATIC_GPU_MASK=0 -D CONST_CACHE_SIZE=65536 -D SELECT_CMP_STEPS=2 -D BITMAP_MASK=0x3ffffU -D REQ_BITMAP_BITS=18 $JOHN/opencl/lm_kernel_f.cl
Build log:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'lm_bs_f' for 'sm_61'
ptxas info    : Function properties for lm_bs_f
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 1796 bytes smem, 384 bytes cmem[0], 72 bytes cmem[2]
binary size 293486
LWS=8 GWS=64 PASS,
Test mask: ?a?a?l?u?d?d?s
Options used: -I opencl -cl-mad-enable -cl-nv-verbose -DSM_MAJOR=6 -DSM_MINOR=1 -D__GPU__ -DDEVICE_INFO=524306 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -D FULL_UNROLL=1 -D USE_LOCAL_MEM=1 -D WORK_GROUP_SIZE=128 -D OFFSET_TABLE_SIZE=13 -D HASH_TABLE_SIZE=9 -D MASK_ENABLE=1 -D ITER_COUNT=561 -D LOC_0=0 -D LOC_1=2 -D LOC_2=4 -D LOC_3=-1 -D IS_STATIC_GPU_MASK=1 -D CONST_CACHE_SIZE=65536 -D SELECT_CMP_STEPS=2 -D BITMAP_MASK=0x3ffffU -D REQ_BITMAP_BITS=18 $JOHN/opencl/lm_kernel_f.cl
Build log:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'lm_bs_f' for 'sm_61'
ptxas info    : Function properties for lm_bs_f
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 28676 bytes smem, 384 bytes cmem[0], 72 bytes cmem[2]
binary size 295381
LWS=128 GWS=65536 x17940 DONE
Raw:    7142M c/s real, 6915M c/s virtual, Dev#4 util: 97%

Now (with S-box selection as in my comment above):

[solar@super run]$ LWS=128 GWS=65536 ./john -te -form=lm-opencl -dev=4 -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 4: GeForce GTX 1080
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... Loaded 9 hashes with 1 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-nv-verbose -DSM_MAJOR=6 -DSM_MINOR=1 -D__GPU__ -DDEVICE_INFO=524306 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -D FULL_UNROLL=1 -D USE_LOCAL_MEM=1 -D WORK_GROUP_SIZE=8 -D OFFSET_TABLE_SIZE=13 -D HASH_TABLE_SIZE=9 -D MASK_ENABLE=0 -D ITER_COUNT=1 -D LOC_0=-1 -D LOC_1=-1 -D LOC_2=-1 -D LOC_3=-1 -D IS_STATIC_GPU_MASK=0 -D CONST_CACHE_SIZE=65536 -D SELECT_CMP_STEPS=2 -D BITMAP_MASK=0x3ffffU -D REQ_BITMAP_BITS=18 $JOHN/opencl/lm_kernel_f.cl
Build time: 1.456 s
Build log:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'lm_bs_f' for 'sm_61'
ptxas info    : Function properties for lm_bs_f
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 1796 bytes smem, 384 bytes cmem[0], 72 bytes cmem[2]
binary size 280126
LWS=8 GWS=64 PASS,
Test mask: ?a?a?l?u?d?d?s
Options used: -I opencl -cl-mad-enable -cl-nv-verbose -DSM_MAJOR=6 -DSM_MINOR=1 -D__GPU__ -DDEVICE_INFO=524306 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -D FULL_UNROLL=1 -D USE_LOCAL_MEM=1 -D WORK_GROUP_SIZE=128 -D OFFSET_TABLE_SIZE=13 -D HASH_TABLE_SIZE=9 -D MASK_ENABLE=1 -D ITER_COUNT=561 -D LOC_0=0 -D LOC_1=2 -D LOC_2=4 -D LOC_3=-1 -D IS_STATIC_GPU_MASK=1 -D CONST_CACHE_SIZE=65536 -D SELECT_CMP_STEPS=2 -D BITMAP_MASK=0x3ffffU -D REQ_BITMAP_BITS=18 $JOHN/opencl/lm_kernel_f.cl
Build time: 1.483 s
Build log:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'lm_bs_f' for 'sm_61'
ptxas info    : Function properties for lm_bs_f
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 222 registers, 28676 bytes smem, 384 bytes cmem[0], 72 bytes cmem[2]
binary size 282011
LWS=128 GWS=65536 x17940 DONE
Raw:    5368M c/s real, 5225M c/s virtual, Dev#4 util: 98%

The first build stays at 168 registers like before, but the second build somehow increases from 168 to 222.

solardiz avatar Oct 21 '24 01:10 solardiz

> There appears to be a regression for LM: register pressure increases a lot.
>
> The first build stays at 168 registers like before, but the second build somehow increases from 168 to 222.

Reverting S1 and S3 avoids this problem (getting 168 for both builds then), but hurts code size and speeds at descrypt (there's still a speedup compared to what we had, but it's smaller).

solardiz avatar Oct 21 '24 01:10 solardiz

I don't know how much of a performance regression it actually was for LM, because benchmark results fluctuate between 5xxxM and 7xxxM for both old and new S-boxes. I guess longer actual-cracking runs would be needed to figure this out. Also on more and newer devices. But for now I'll make usage of the new S1 and S3 conditional on non-LM hash builds, because the register pressure regression is clear at least in these test builds here and may very well hurt performance a lot elsewhere.

solardiz avatar Oct 21 '24 01:10 solardiz