md5crypt-opencl on Intel Alder Lake GPU: CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Hi, I've got an Intel Alder Lake (N100) with "Intel Xe (Gen 12.2) GPU" and I'm looking to use md5crypt-opencl on it.
I'm getting this error:
$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl shadow
Device 1: Intel(R) Graphics [0x46d1]
Using default input encoding: UTF-8
Loaded 1 password hash (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Segmentation fault
I searched for similar reports and also tried setting each of these:
GWS=64
GWS=524288
Neither changed the outcome (still CL_OUT_OF_RESOURCES).
Code link from error message: https://github.com/openwall/john/blob/f55f42067431c0e8f67e600768cd8a3ad8439818/src/opencl_md5crypt_fmt_plug.c#L404
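For anyone who wants to repeat the GWS experiment over more values, here is a minimal sketch of a sweep helper. It assumes that john reads the `GWS` and `LWS` environment variables (as used above) and that `john` and `shadow` are the right binary and hash file paths for your setup. The function names `sweep_commands` and `run_sweep` are hypothetical, not part of John the Ripper.

```python
import os
import subprocess


def sweep_commands(john, hash_file, gws_values, lws=None):
    """Build one (env, argv) pair per GWS value to try."""
    runs = []
    for gws in gws_values:
        # Copy the current environment and override GWS (and optionally LWS).
        env = dict(os.environ, GWS=str(gws))
        if lws is not None:
            env["LWS"] = str(lws)
        argv = [john, "--format=md5crypt-opencl", hash_file]
        runs.append((env, argv))
    return runs


def run_sweep(runs, timeout=30):
    """Run each command; map GWS -> True if CL_OUT_OF_RESOURCES was printed."""
    results = {}
    for env, argv in runs:
        try:
            proc = subprocess.run(
                argv, env=env, capture_output=True, text=True, timeout=timeout
            )
            output = proc.stdout + proc.stderr
            results[env["GWS"]] = "CL_OUT_OF_RESOURCES" in output
        except subprocess.TimeoutExpired:
            # Still running after the timeout: no immediate failure was seen.
            results[env["GWS"]] = False
    return results
```

Calling `run_sweep(sweep_commands("john", "shadow", [64, 1024, 524288]))` would then report which sizes fail right away. The timeout is an arbitrary choice so that a working run does not keep cracking indefinitely.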
clinfo
Number of platforms 1
Platform Name Intel(R) OpenCL HD Graphics
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 3.0
Platform Profile FULL_PROFILE
...
Platform Name Intel(R) OpenCL HD Graphics
Number of devices 1
Device Name Intel(R) Graphics [0x46d1]
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 3.0 NEO
Device UUID 86800000-d146-0000-0000-000000000000
Driver UUID 32322e34-332e-3234-3539-350000000000
...
Driver Version 22.43.24595
Device OpenCL C Version OpenCL C 1.2
...
Latest conformance test passed v2022-04-22-00
Device Type GPU
Device PCI bus info (KHR) PCI-E, 0000:00:02.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 24
Max clock frequency 750MHz
Device IP (Intel) 0xc0000 (0.192.0)
Device ID (Intel) 18129
Slices (Intel) 1
Sub-slices per slice (Intel) 2
EUs per sub-slice (Intel) 16
Threads per EU (Intel) 7
Feature capabilities (Intel) DP4A
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple (device) 64
Preferred work group size multiple (kernel) 64
Max sub-groups per work group 64
Sub-group sizes (Intel) 8, 16, 32
...
Global memory size 13229461504 (12.32GiB)
Error Correction support No
Max memory allocation 4294959104 (4GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Unified Shared Memory (USM) (cl_intel_unified_shared_memory)
Host USM capabilities (Intel) USM access, USM atomic access
Device USM capabilities (Intel) USM access, USM atomic access
Single-Device USM caps (Intel) USM access, USM atomic access
Cross-Device USM caps (Intel) (n/a)
Shared System USM caps (Intel) (n/a)
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
...
john --list=build-info
$ /snap/john-the-ripper/610/run/john --list=build-info
Version: 1.9.0-jumbo-1+bleeding-39db7dd63e 2023-09-20 17:02:33 -0300
Build: linux-gnu 64-bit x86_64 AVX2 AC OMP OPENCL
SIMD: AVX2, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
System-wide exec: /snap/john-the-ripper/current/run
System-wide home: /snap/john-the-ripper/current/run
Private home: ~/.john
CPU tests: AVX2
CPU fallback binary: john-avx-omp
OMP fallback binary: john-avx2
$JOHN is /snap/john-the-ripper/current/run/
Format interface version: 14
Max. number of reported tunable costs: 4
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
SALT_HASH_SIZE: 1048576
SINGLE_IDX_MAX: 2147483648
SINGLE_BUF_MAX: 4294967295
Effective limit: Number of salts vs. SingleMaxBufferSize
Max. Markov mode level: 400
Max. Markov mode password length: 30
gcc version: 11.4.0
GNU libc version: 2.35 (loaded: 2.36)
OpenCL headers version: 1.2
Crypto library: OpenSSL
OpenSSL library version: 030000020 (loaded: 0300000b0)
OpenSSL 3.0.2 15 Mar 2022 (loaded: OpenSSL 3.0.11 19 Sep 2023)
GMP library version: 6.2.1
File locking: fcntl()
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's
times(2) sysconf(_SC_CLK_TCK) is 100
Using times(2) for timers, resolution 10 ms
HR timer: clock_gettime(), latency 42 ns
Total physical host memory: 15770 MiB
Available physical host memory: 12074 MiB
Terminal locale string: en_US.UTF-8
Parsed terminal locale: UTF-8
Input file
root:$1$uMJfnnig$O6<snip>X1:16314:0:99999:7:::
I'm familiar with modifying and building code, so let me know if there's something I can try in the code.
Hi @cpatulea. Thank you for reporting this. Can you try these:
john --format=md5crypt-opencl --test -v=5
john --format=phpass-opencl --test -v=5
john --format=pbkdf2-hmac-md5-opencl --test -v=5
john --format=md5crypt-opencl --skip-self-test shadow
john --format=md5crypt-opencl --test -v=5
$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Loaded 68 hashes with 33 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15 $JOHN/opencl/cryptmd5_kernel.cl
binary size 340456
LWS=7 GWS=49 (7 blocks) 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Segmentation fault
john --format=phpass-opencl --test -v=5
$ /snap/john-the-ripper/610/run/john --format=phpass-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: phpass-opencl ($P$9) [MD5 OpenCL 4x]... Loaded 49 hashes with 18 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DV_WIDTH=4 -DPLAINTEXT_LENGTH=39 $JOHN/opencl/phpass_kernel.cl
binary size 124976
LWS=7 GWS=49 (7 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=16; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for iteration count of 2048 and password length 7
xfer: 5.052 us, crypt: 12.135 ms, xfer: 4.856 us
gws: 256 84314 c/s 84314 rounds/s 12.145 ms per crypt_all()!
xfer: 7.447 us, crypt: 18.871 ms, xfer: 5.802 us
gws: 512 108447 c/s 108447 rounds/s 18.884 ms per crypt_all()+
xfer: 10.364 us, crypt: 33.706 ms, xfer: 8.351 us
gws: 1024 121453 c/s 121453 rounds/s 33.724 ms per crypt_all()+
xfer: 21.041 us, crypt: 59.523 ms, xfer: 13.635 us
gws: 2048 137546 c/s 137546 rounds/s 59.558 ms per crypt_all()+
xfer: 31.614 us, crypt: 116.152 ms (exceeds 100 ms)
xfer: 15.937 us, crypt: 33.715 ms, xfer: 8.237 us
gws: 1024 121400 c/s 121400 rounds/s 33.739 ms per crypt_all()-
Calculating best LWS for GWS=2048
Testing LWS=16 GWS=2048 ... 238.094 ms+
Testing LWS=32 GWS=2048 ... 238.107 ms
Testing LWS=64 GWS=2048 ... 238.093 ms
Testing LWS=128 GWS=2048 ... 238.127 ms
Testing LWS=256 GWS=2048 ... 238.402 ms
Testing LWS=512 GWS=2048 ... 288.140 ms
Calculating best GWS for LWS=16; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
xfer: 6.875 us, crypt: 10.545 ms, xfer: 4.036 us
gws: 192 72754 c/s 72754 rounds/s 10.556 ms per crypt_all()!
xfer: 6.718 us, crypt: 12.438 ms, xfer: 4.725 us
gws: 384 123377 c/s 123377 rounds/s 12.449 ms per crypt_all()+
xfer: 8.645 us, crypt: 23.740 ms, xfer: 6.641 us
gws: 768 129316 c/s 129316 rounds/s 23.755 ms per crypt_all()+
xfer: 16.666 us, crypt: 44.440 ms, xfer: 10.786 us
gws: 1536 138167 c/s 138167 rounds/s 44.467 ms per crypt_all()+
xfer: 19.739 us, crypt: 85.624 ms, xfer: 24.266 us
gws: 3072 143436 c/s 143436 rounds/s 85.668 ms per crypt_all()+
xfer: 44.791 us, crypt: 168.012 ms, xfer: 58.454 us
gws: 6144 146185 c/s 146185 rounds/s 168.115 ms per crypt_all()+
xfer: 73.333 us, crypt: 332.766 ms (exceeds 200 ms)
xfer: 20.052 us, crypt: 85.626 ms, xfer: 20.820 us
gws: 3072 143438 c/s 143438 rounds/s 85.667 ms per crypt_all()-
LWS=16 GWS=6144 (384 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Warning: "Many salts" test limited: 12/256
Many salts: 145996 c/s real, 145996 c/s virtual
Only one salt: 145996 c/s real, 145276 c/s virtual
john --format=pbkdf2-hmac-md5-opencl --test -v=5
$ /snap/john-the-ripper/610/run/john --format=pbkdf2-hmac-md5-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 4x]... Loaded 20 hashes with 19 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DHASH_LOOPS=333 -DOUTLEN=16 -DPLAINTEXT_LENGTH=64 -DV_WIDTH=4 $JOHN/opencl/pbkdf2_hmac_md5_kernel.cl
binary size 504112
LWS=7 GWS=49 (7 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=16; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for iterations of 1000 and password length 7
P xfer: 5.781 us, init: 145.312 us, loop: 3x3.691 ms, final: 10.052 us, res xfer: 4.222 us
gws: 256 90922 c/s 182025844 rounds/s 11.262 ms per crypt_all()!
P xfer: 8.750 us, init: 154.166 us, loop: 3x5.932 ms, final: 15.677 us, res xfer: 6.247 us
gws: 512 113664 c/s 227555328 rounds/s 18.017 ms per crypt_all()+
P xfer: 12.083 us, init: 180.156 us, loop: 3x10.390 ms, final: 36.458 us, res xfer: 7.585 us
gws: 1024 130155 c/s 260570310 rounds/s 31.469 ms per crypt_all()+
P xfer: 18.385 us, init: 325 us, loop: 3x18.268 ms, final: 54.427 us, res xfer: 12.175 us
gws: 2048 148067 c/s 296430134 rounds/s 55.326 ms per crypt_all()+
P xfer: 36.041 us, init: 637.291 us, loop: 3x35.554 ms, final: 106.927 us, res xfer: 34.276 us
gws: 4096 152137 c/s 304578274 rounds/s 107.691 ms per crypt_all()+
P xfer: 87.604 us, init: 1.125 ms, loop: 3x68.558 ms, final: 205.416 us, res xfer: 73.477 us
gws: 8192 157857 c/s 316029714 rounds/s 207.578 ms per crypt_all()+
P xfer: 259.843 us, init: 2.111 ms, loop: 3x136.129 ms (exceeds 100 ms)
P xfer: 46.927 us, init: 642.760 us, loop: 3x35.560 ms, final: 109.739 us, res xfer: 30.445 us
gws: 4096 152089 c/s 304482178 rounds/s 107.726 ms per crypt_all()-
Calculating best LWS for GWS=8192
Testing LWS=16 GWS=8192 ... 205.680 ms+
Testing LWS=32 GWS=8192 ... 205.680 ms
Testing LWS=64 GWS=8192 ... 205.679 ms
Testing LWS=128 GWS=8192 ... 205.702 ms
Testing LWS=256 GWS=8192 ... 205.743 ms
Testing LWS=512 GWS=8192 ... 217.335 ms
Calculating best GWS for LWS=16; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
P xfer: 5.416 us, init: 169.791 us, loop: 3x3.689 ms, final: 8.750 us, res xfer: 9.778 us
gws: 192 68051 c/s 136238102 rounds/s 11.285 ms per crypt_all()!
P xfer: 7.395 us, init: 154.218 us, loop: 3x5.924 ms, final: 17.864 us, res xfer: 5.451 us
gws: 384 85362 c/s 170894724 rounds/s 17.993 ms per crypt_all()+
P xfer: 11.510 us, init: 165.156 us, loop: 3x7.387 ms, final: 21.822 us, res xfer: 7.245 us
gws: 768 137070 c/s 274414140 rounds/s 22.411 ms per crypt_all()+
P xfer: 17.760 us, init: 298.906 us, loop: 3x13.659 ms, final: 42.343 us, res xfer: 12.499 us
gws: 1536 148296 c/s 296888592 rounds/s 41.430 ms per crypt_all()+
P xfer: 28.281 us, init: 488.020 us, loop: 3x26.237 ms, final: 83.177 us, res xfer: 18.874 us
gws: 3072 154590 c/s 309489180 rounds/s 79.487 ms per crypt_all()+
P xfer: 57.656 us, init: 841.458 us, loop: 3x51.375 ms, final: 155.781 us, res xfer: 56.939 us
gws: 6144 157996 c/s 316307992 rounds/s 155.547 ms per crypt_all()+
P xfer: 166.718 us, init: 1.643 ms, loop: 3x101.661 ms, final: 301.093 us, res xfer: 129.374 us
gws: 12288 159669 c/s 319657338 rounds/s 307.835 ms per crypt_all()+
P xfer: 594.479 us, init: 3.195 ms, loop: 3x202.240 ms (exceeds 200 ms)
P xfer: 59.062 us, init: 848.020 us, loop: 3x51.389 ms, final: 160.260 us, res xfer: 51.444 us
gws: 6144 157949 c/s 316213898 rounds/s 155.594 ms per crypt_all()-
LWS=16 GWS=12288 (768 blocks) DONE
Speed for cost 1 (iterations) of 1000 and 10000
Raw: 29170 c/s real, 29170 c/s virtual
john --format=md5crypt-opencl --skip-self-test shadow
$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl --skip-self-test shadow
Device 1: Intel(R) Graphics [0x46d1]
Using default input encoding: UTF-8
Loaded 1 password hash (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
LWS=32 GWS=192 (6 blocks)
Proceeding with single, rules:Single
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Thanks @cpatulea. So the issue is specific to md5crypt-opencl. While we could possibly have a bug in there causing this, that format works just fine on many other devices, including on older Intel HD Graphics with older Intel OpenCL backend. So I don't see what we'd do about your report now, other than being aware of it.
Also, as you can see, these other related formats' speeds are quite low, so even if you do get this working, the speed will probably be similar to what you're getting on the CPU cores. You'd at most double the total speed by using both CPU and GPU at once (or less than double, especially if the total TDP limit kicks in). You can estimate this by benchmarking --format=phpass on the CPU and comparing to what you're getting on this GPU. It should scale similarly for md5crypt.
OTOH, this isn't bad for a 6W, $50 CPU.
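The CPU-plus-GPU estimate described above can be sketched as simple arithmetic. The GPU figure is the phpass-opencl "Many salts" result from the benchmark earlier in this thread. The CPU figure is a made-up placeholder that you would replace with your own `--format=phpass --test` result. The `efficiency` parameter is a hypothetical knob for modeling slowdown when the shared TDP limit kicks in.

```python
def combined_estimate(cpu_cps, gpu_cps, efficiency=1.0):
    """Estimate total c/s and speedup over CPU alone when running both.

    efficiency < 1.0 models both devices slowing down under a shared
    power limit; 1.0 means perfectly additive speeds.
    """
    total = (cpu_cps + gpu_cps) * efficiency
    return total, total / cpu_cps


# phpass-opencl "Many salts" speed reported on this GPU.
GPU_PHPASS_CPS = 145996
# PLACEHOLDER, not a measurement: substitute your own CPU phpass benchmark.
CPU_PHPASS_CPS = 150000

total, speedup = combined_estimate(CPU_PHPASS_CPS, GPU_PHPASS_CPS)
print(f"combined: {total:.0f} c/s, {speedup:.2f}x over CPU alone")
```

If the CPU speed is close to the GPU speed, the speedup stays just under 2x, and any `efficiency` below 1.0 pushes it lower, which matches the "at most double" expectation.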