john argon2-opencl fix for macOS

The Apple driver doesn't let us use inline assembler.

With this patch:

Device 3: AMD Radeon Pro Vega 20 Compute Engine
Benchmarking: argon2-opencl [Blake2 OpenCL]... |
Trying to compute 62 hashes at a time using 992 of 4080 MiB device memory
LWS=[64-192] GWS=[1984-5952] ([10-93] blocks) => Mode: LOCAL_MEMORY
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 4096, cost 3 (p) of 1, cost 4 (type [0:Argon2d 1:Argon2i 2:Argon2id]) of 0 and 1
Raw:	1099 c/s real, 15942 c/s virtual

There are other problems I haven't looked into yet, with other devices:

Device 1: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Testing: argon2-opencl [Blake2 OpenCL]... 0: OpenCL CL_INVALID_BUFFER_SIZE (-61) error in opencl_argon2_fmt_plug.c:608 - Error creating memory buffer

Device 2: Intel(R) UHD Graphics 630
Testing: argon2-opencl [Blake2 OpenCL]... Binary Build log: <program source>:68:7: warning: no previous prototype for function 'u64_shuffle'
ulong u64_shuffle(ulong v, uint thread_src, uint thread, __local ulong *buf)
      ^
<program source>:115:7: warning: no previous prototype for function 'block_th_get'
ulong block_th_get(const struct block_th *b, uint idx)
      ^
<program source>:125:6: warning: no previous prototype for function 'block_th_set'
void block_th_set(struct block_th *b, uint idx, ulong v)
     ^
<program source>:133:7: warning: no previous prototype for function 'mul_wide_u32'
ulong mul_wide_u32(ulong a, ulong b)
      ^
<program source>:149:6: warning: no previous prototype for function 'g'
void g(struct block_th *block)
     ^
<program source>:171:6: warning: no previous prototype for function 'apply_shuffle_shift2'
uint apply_shuffle_shift2(uint thread, uint idx)
     ^
<program source>:182:6: warning: no previous prototype for function 'shuffle_block'
void shuffle_block(struct block_th *block, uint thread, __local ulong *buf)
     ^
<program source>:260:6: warning: no previous prototype for function 'next_addresses'
void next_addresses(struct block_th *addr, uint thread_input, uint thread, __local ulong *buf)
     ^
<program source>:295:6: warning: no previous prototype for function 'blake2b_block'
void blake2b_block(ulong m[16], ulong out_len, ulong message_size)
     ^

FAILED (cmp_one(10))

Jan 08 '24 01:01 magnumripper

Thank you for testing this on macOS, and I'm happy to see you're active with the project again.

The Apple driver doesn't let us use inline assembler.

Maybe that's a property of the AMD driver? e.g. AMDGPU-Pro on super doesn't let us use inline asm either (but manages fine without a compiler memory barrier), which is why we have the version check here.

There are other problems I haven't looked into yet, with other devices:

That's interesting. This format is known to fail on CPU devices and Intel HD Graphics, but for me it fails differently - with FAILED (cmp_one(1)), see #5417.

FAILED (cmp_one(10))

That's the very last test vector, meaning the code almost worked for you on HD Graphics.

Jan 08 '24 09:01 solardiz

The Apple drivers never seemed to share a single line of code with Windows/Linux. I've only used macOS with older nvidia but I'm sure it never had inline asm. Also, none of the proprietary other things ever worked, such as temp/fan sensors, querying compute capability (unless CUDA), cl_amd_media_ops, nothing of the sorts. Also there were never any similarity whatsoever in version numbers. The driver version number is just 1.2 (for OpenCL 1.2) or sometimes 1.1, and a date. This means things like DEV_VER_MAJOR < 2500 will always match...

Anyway I'll have a look at that host side thing, and perhaps also see if I can find some old nvidia macbook, just because why not. And a look-see at the Intel UHD Grahpics thing, in case it's close to fully working.

Jan 09 '24 11:01 magnumripper

FWIW, I just noticed this in hashcat, could also be relevant to getting this format fully working on macOS:

            // work around, for some reason apple opencl can't have buffers larger 2^31
            // typically runs into trap 6
            // maybe 32/64 bit problem affecting size_t?
            // this seems to affect global memory as well no just single allocations
            if ((device_param->opencl_platform_vendor_id == VENDOR_ID_APPLE) && (device_param->is_metal == false))
            {
              const size_t undocumented_single_allocation_apple = 0x7fffffff;
              if (((c + 1 + 1) * MAX_ALLOC_CHECKS_SIZE) >= undocumented_single_allocation_apple) break;
            }

@magnumripper Perhaps you can try getting it to allocate more than 2 GiB on your laptop and see what happens? For this experiment, I'd try increasing the 6 here in opencl_argon2_fmt_plug.c:

        // Use almost all GPU memory by default
        unsigned int warps = 6, limit, target;

or do a real cracking run with a hash requiring more memory (e.g., one of our test vectors that use 16 MiB, or you can also increase the value of m in there - no need for the hash to remain crackable).

Jan 09 '24 14:01 solardiz

This was sitting here for 3 months with no progress, so I went ahead and merged it as-is, then made my own follow-up commit c042fa3e31217a96160dd9f35762dbbea57bdc56 to fix the NVIDIA host part (untested). Consistent with how we do it elsewhere in the codebase, on host I check for __APPLE__. I notice in OpenCL we check for __OS_X__. I hope this is right, and will match on the same systems, @magnumripper? Is there a good way for us to make this more reliable in case of future changes? For example, if Apple releases another OS where it sets __APPLE__, but does not set __OS_X__ in OpenCL? BTW, I have no idea what macros they set in iOS, which we don't currently support here.

I did not try addressing the 2G single allocation limit mentioned above. I think this needs testing first.

Apr 06 '24 18:04 solardiz

I notice in OpenCL we check for OS_X. I hope this is right, and will match

This is something we created and control at:

https://github.com/openwall/john/blob/c042fa3e31217a96160dd9f35762dbbea57bdc56/src/opencl_common.c#L1154

Plus: https://github.com/openwall/john/blob/c042fa3e31217a96160dd9f35762dbbea57bdc56/src/opencl_common.c#L1155

Apr 06 '24 19:04 claudioandre-br