
descrypt-opencl should allow for use of a generic kernel (not force per-salt kernels)

Open solardiz opened this issue 8 years ago • 7 comments

Often it is important to minimize startup time rather than maximize runtime performance. Right now, descrypt-opencl always(?) uses per-salt kernels, which typically take up to ~2 hours to build from source and up to tens of minutes to "build from binary". It should be possible to request use of a generic kernel instead - a kernel that would run slower (needing pointer indirection for the salts), but would only be built once. We used to have that, but lost it since. We should reintroduce it as an option (maybe even as the default), and should print a "Note: ..." to the user indicating how to enable the other behavior.
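As a toy sketch of the tradeoff (not JtR's actual kernel code; the bit positions and function names below are made up for illustration): the generic version tests the salt bits at runtime through a loaded value, while a "per-salt" build compiles the salt in as a constant, so the compiler can resolve every salt-dependent swap at build time:

```c
#include <assert.h>

/* Toy illustration only.  In descrypt the salt selects which pairs of
   expansion outputs get swapped; a generic kernel must test the salt
   bits at runtime, whereas a per-salt kernel bakes the salt in as a
   compile-time constant so the tests fold away. */

/* generic: salt arrives as a runtime value, one conditional swap per bit */
static unsigned salt_swap_generic(unsigned x, unsigned salt)
{
    for (int i = 0; i < 12; i++)
        if (salt & (1u << i)) {
            unsigned diff = ((x >> i) ^ (x >> (i + 12))) & 1u;
            x ^= diff << i;
            x ^= diff << (i + 12);  /* swap bit i with bit i+12 */
        }
    return x;
}

/* "per-salt": identical logic, but the salt is a compile-time constant,
   which a real per-salt build would specialize on and fold away */
#define HARDCODED_SALT 0x5A5u
static unsigned salt_swap_specialized(unsigned x)
{
    return salt_swap_generic(x, HARDCODED_SALT);
}
```

The specialized variant computes the same thing as the generic one for its hard-coded salt; the per-salt build wins only by removing the runtime tests.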

IIRC, previously the HARDCODE_SALT setting controlled this, but now it's at 0 yet we do hard-code salts into the multiple kernels.

Now there's also the USE_BASIC_KERNEL setting, currently only enabled when the device is a CPU, or on OS X. I thought that maybe this was what we needed. I tried setting "--device" to a CPU to test it, but first I got unrealistically good speeds for a CPU (200M+ c/s on a machine that only does under 80M with C+intrinsics) with AMD OpenCL and a segfault with Intel's. Restarting these, I also got an instant segfault with AMD's. So this is actually unreliable at least with those (outdated versions of) OpenCL backends. Trying it on NVIDIA GPU (by forcing "#define USE_BASIC_KERNEL 1" in the source), I got:

OpenCL CL_INVALID_KERNEL_ARGS error in opencl_DES_bs_b_plug.c:677 - Enque kernel DES_bs_25 failed.

I'd rather leave (re)implementing this to someone familiar with the code. ;-)

solardiz avatar Aug 14 '17 19:08 solardiz

https://github.com/openwall/john/pull/5902#issuecomment-3565101416

BTW since some time now, we can use JOHN_DES_KERNEL=bs_b to force a kernel that is not as fast, but doesn't use per-salt kernels. For short runs with tons of salts, it's a gem.

Please add this note to #2666, and can we make this feature more apparent to users? Maybe even the default when there are many salts loaded or many missing per-salt kernels?

For defaulting to revert to bs_b when more than n salts are loaded, we need a value of n. Actually counting missing (not yet cached) kernels is a great idea, except it's probably somewhat complicated. Making it even more complicated: for e.g. a hundred salts, even cached binaries may be slow. So maybe cached binaries (or some cached binaries) should merely increase n accordingly? A rough assumption could be that a cached binary takes half the time to build (edit: cached is sometimes much faster, but it depends on the runtime).
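One way to express that rough assumption (hypothetical; should_use_generic and the half-cost weighting are illustrative, not existing JtR code) is to weigh a cached binary as half the cost of a from-source build and fall back to the generic kernel when the estimated total exceeds n:

```c
/* Hypothetical heuristic, not existing JtR code: weigh a cached kernel
   binary as roughly half the cost of building one from source, and
   fall back to the generic bs_b kernel when the estimated total build
   cost for this session exceeds the threshold n (in full-build units). */
static int should_use_generic(int missing, int cached, int n)
{
    /* work in half-build units to avoid fractions:
       one missing kernel = 2 units, one cached kernel = 1 unit */
    int cost = 2 * missing + cached;
    return cost > 2 * n;
}
```

With n = 10, eleven missing kernels trip the fallback, while e.g. eight missing plus four cached (8 + 4/2 = 10) just stays under it.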

For starters, I guess we could simply always print a notice when more than 10 salts are loaded (10 being the current threshold for printing the "Building %d per-salt kernels, one dot per three salts" message at all). I can take care of that. So what exact message should I add? Something like "Notice: Setting environment variable JOHN_DES_KERNEL=bs_b will speed up loading at the expense of slower cracking.".

magnumripper avatar Nov 22 '25 00:11 magnumripper

we could simply always print a notice when more than 10 salts are loaded [...] Something like "Notice: Setting environment variable JOHN_DES_KERNEL=bs_b will speed up loading at the expense of slower cracking.".

I agree that's a good start, and maybe all we need long-term as well because any magic like what I had suggested would be complicated and confusing. Your suggested message looks fine to me.

However, JOHN_DES_KERNEL=bs_b is unnecessarily cryptic, exposing implementation detail/naming to the user, and is specific to just this one format. Maybe we should use a uniform way of altering formats' configuration in john.conf (we already have a few settings like this in there) and refer to that in the message.
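A sketch of what that lookup order could look like (hypothetical: cfg_get_param_stub stands in for JtR's real john.conf API, the "DESKernel" setting name is invented, and bs_f is assumed as the built-in default; the environment variable is kept as a per-run override):

```c
#include <stdlib.h>

/* Hypothetical sketch of a uniform kernel-selection knob.
   cfg_get_param_stub stands in for JtR's real john.conf API;
   the section and setting names here are made up. */
static const char *cfg_get_param_stub(const char *section, const char *name)
{
    (void)section; (void)name;
    return NULL;                    /* pretend nothing is set in john.conf */
}

static const char *des_kernel_choice(void)
{
    const char *k = getenv("JOHN_DES_KERNEL");      /* explicit per-run override */
    if (!k || !*k)
        k = cfg_get_param_stub("Options:OpenCL", "DESKernel");
    if (!k || !*k)
        k = "bs_f";                                 /* assumed built-in default */
    return k;
}
```

The notice could then point at the documented john.conf setting instead of a raw environment variable.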

solardiz avatar Nov 22 '25 01:11 solardiz

Oh and BTW on my 5080, our DEScrypt-opencl does 1113M/1091M c/s while hashcat somehow manages to reach 4389.5 MH/s

Edit: I accidentally ran john with bs_b at first. JtR and Hashcat are in fact almost exactly equally fast, at ~4388M c/s, but Hashcat does it without salt-specific kernels. If we could achieve that performance without salt-specific kernels, this problem would become obsolete...

magnumripper avatar Nov 23 '25 08:11 magnumripper

$ for i in bs_b bs_h bs_f ; do JOHN_DES_KERNEL=$i ../run/john -test -form:descrypt-opencl ; done
Device 1: NVIDIA GeForce RTX 5080
Using basic kernel (bs_b)
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=256 GWS=65536 x3300 DONE
Warning: "Many salts" test limited: 11/256
Many salts:     1106M c/s real, 1111M c/s virtual, Dev#1 util: 100%
Only one salt:  1086M c/s real, 1086M c/s virtual, Dev#1 util: 100%

Device 1: NVIDIA GeForce RTX 5080
Using salt-specific kernels (bs_h)
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=128 GWS=131072 x3300 DONE
Warning: "Many salts" test limited: 12/256
Many salts:     2495M c/s real, 2495M c/s virtual, Dev#1 util: 100%
Only one salt:  2380M c/s real, 2380M c/s virtual, Dev#1 util: 100%

Device 1: NVIDIA GeForce RTX 5080
Using fully unrolled, salt-specific kernels (bs_f)
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL/mask accel]... LWS=64 GWS=131072 x3300 DONE
Warning: "Many salts" test limited: 21/256
Many salts:     4366M c/s real, 4346M c/s virtual, Dev#1 util: 100%
Only one salt:  4048M c/s real, 4048M c/s virtual, Dev#1 util: 98%

And for LM:

$ for i in bs_b bs_f ; do JOHN_DES_KERNEL=$i ../run/john -test -form:lm-opencl ; done
Device 1: NVIDIA GeForce RTX 5080
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... Using basic kernel (lm_bs_b)
LWS=160 GWS=65536 x47610 DONE
Raw: 26129M c/s real, 26129M c/s virtual, Dev#1 util: 97%

Device 1: NVIDIA GeForce RTX 5080
Benchmarking: LM-opencl [DES BS OpenCL/mask accel]... Using fully unrolled kernel (lm_bs_f)
LWS=32 GWS=65536 x47610 DONE
Raw: 30589M c/s real, 30589M c/s virtual, Dev#1 util: 99%

magnumripper avatar Nov 23 '25 09:11 magnumripper

Someone should implement the loop rolling as mentioned in #1908, I guess. Then per-salt kernels may still provide a further optional speedup. (Even on CPU, I estimate that a ~7% speedup is possible through code specialization per salt. We just don't bother for now.)

solardiz avatar Nov 23 '25 09:11 solardiz

Someone should implement the loop rolling as mentioned in #1908, I guess.

I implemented such rolling (if I understood it correctly, looking at Hashcat's current kernel) and it made no difference at all. But this was on a 5080 using a recent driver.

magnumripper avatar Dec 05 '25 12:12 magnumripper

Someone should implement the loop rolling as mentioned in #1908, I guess.

I implemented such rolling (if I understood it correctly, looking at Hashcat's kernel) and it made no difference at all. But this was on a 5080 using a recent driver.

Looking at it again, I'm not sure I did anything right at all. In https://github.com/openwall/john/issues/1908#issuecomment-2131311023 you said "Looking at our code now, we have two kinds of kernels - with fully unrolled DES and with 2 rounds unrolled (which is quite natural as it allows for fixed indices to be used). We do not have a rolled version. We should probably implement that.".

I am completely lost - the one with 2 rounds unrolled is otherwise rolled, no? So that would be exactly what Atom said? This all is apparently above my skills. I'm not even sure what exactly a round is - is s1 a round, or is it s1..s8 together? Or is it even H1+H2?
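For what it's worth, here is a toy Feistel sketch (not DES itself, just the control structure) of why unrolling two rounds at a time is "quite natural": a fully rolled loop has to swap the halves every round, while processing rounds in pairs lets the halves keep fixed roles with no swap:

```c
/* Toy Feistel network, for structure only (toy_round is NOT a DES round).
   Both variants compute the same result; the 2-rounds-unrolled form
   avoids the per-round swap by alternating which half is updated. */
static unsigned toy_round(unsigned half, unsigned key)
{
    return (half ^ key) * 2654435761u;
}

/* fully rolled: one round per iteration, halves swapped every round */
static unsigned run_rolled(unsigned L, unsigned R, const unsigned k[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned t = L ^ toy_round(R, k[i]);
        L = R;
        R = t;
    }
    return L ^ R;
}

/* 2 rounds unrolled: halves keep fixed roles, no swap needed */
static unsigned run_unrolled2(unsigned L, unsigned R, const unsigned k[16])
{
    for (int i = 0; i < 16; i += 2) {
        L ^= toy_round(R, k[i]);
        R ^= toy_round(L, k[i + 1]);
    }
    return L ^ R;
}
```

In this sense the existing 2-rounds-unrolled kernel is indeed "otherwise rolled"; a fully rolled version would pay for the swap (or for indexed halves) on every round.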

magnumripper avatar Dec 05 '25 13:12 magnumripper