argon2-opencl fix for macOS
The Apple driver doesn't let us use inline assembler.
With this patch:
Device 3: AMD Radeon Pro Vega 20 Compute Engine
Benchmarking: argon2-opencl [Blake2 OpenCL]... |
Trying to compute 62 hashes at a time using 992 of 4080 MiB device memory
LWS=[64-192] GWS=[1984-5952] ([10-93] blocks) => Mode: LOCAL_MEMORY
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 4096, cost 3 (p) of 1, cost 4 (type [0:Argon2d 1:Argon2i 2:Argon2id]) of 0 and 1
Raw: 1099 c/s real, 15942 c/s virtual
There are other problems I haven't looked into yet, with other devices:
Device 1: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Testing: argon2-opencl [Blake2 OpenCL]... 0: OpenCL CL_INVALID_BUFFER_SIZE (-61) error in opencl_argon2_fmt_plug.c:608 - Error creating memory buffer
Device 2: Intel(R) UHD Graphics 630
Testing: argon2-opencl [Blake2 OpenCL]... Binary Build log: <program source>:68:7: warning: no previous prototype for function 'u64_shuffle'
ulong u64_shuffle(ulong v, uint thread_src, uint thread, __local ulong *buf)
^
<program source>:115:7: warning: no previous prototype for function 'block_th_get'
ulong block_th_get(const struct block_th *b, uint idx)
^
<program source>:125:6: warning: no previous prototype for function 'block_th_set'
void block_th_set(struct block_th *b, uint idx, ulong v)
^
<program source>:133:7: warning: no previous prototype for function 'mul_wide_u32'
ulong mul_wide_u32(ulong a, ulong b)
^
<program source>:149:6: warning: no previous prototype for function 'g'
void g(struct block_th *block)
^
<program source>:171:6: warning: no previous prototype for function 'apply_shuffle_shift2'
uint apply_shuffle_shift2(uint thread, uint idx)
^
<program source>:182:6: warning: no previous prototype for function 'shuffle_block'
void shuffle_block(struct block_th *block, uint thread, __local ulong *buf)
^
<program source>:260:6: warning: no previous prototype for function 'next_addresses'
void next_addresses(struct block_th *addr, uint thread_input, uint thread, __local ulong *buf)
^
<program source>:295:6: warning: no previous prototype for function 'blake2b_block'
void blake2b_block(ulong m[16], ulong out_len, ulong message_size)
^
FAILED (cmp_one(10))
Thank you for testing this on macOS, and I'm happy to see you're active with the project again.
The Apple driver doesn't let us use inline assembler.
Maybe that's a property of the AMD driver? e.g. AMDGPU-Pro on super doesn't let us use inline asm either (but manages fine without a compiler memory barrier), which is why we have the version check here.
There are other problems I haven't looked into yet, with other devices:
That's interesting. This format is known to fail on CPU devices and Intel HD Graphics, but for me it fails differently - with FAILED (cmp_one(1)), see #5417.
FAILED (cmp_one(10))
That's the very last test vector, meaning the code almost worked for you on HD Graphics.
The Apple drivers never seemed to share a single line of code with Windows/Linux. I've only used macOS with older nvidia but I'm sure it never had inline asm. Also, none of the proprietary other things ever worked, such as temp/fan sensors, querying compute capability (unless CUDA), cl_amd_media_ops, nothing of the sorts. Also there were never any similarity whatsoever in version numbers. The driver version number is just 1.2 (for OpenCL 1.2) or sometimes 1.1, and a date. This means things like DEV_VER_MAJOR < 2500 will always match...
Anyway I'll have a look at that host side thing, and perhaps also see if I can find some old nvidia macbook, just because why not. And a look-see at the Intel UHD Grahpics thing, in case it's close to fully working.
FWIW, I just noticed this in hashcat, could also be relevant to getting this format fully working on macOS:
// work around, for some reason apple opencl can't have buffers larger 2^31
// typically runs into trap 6
// maybe 32/64 bit problem affecting size_t?
// this seems to affect global memory as well no just single allocations
if ((device_param->opencl_platform_vendor_id == VENDOR_ID_APPLE) && (device_param->is_metal == false))
{
const size_t undocumented_single_allocation_apple = 0x7fffffff;
if (((c + 1 + 1) * MAX_ALLOC_CHECKS_SIZE) >= undocumented_single_allocation_apple) break;
}
@magnumripper Perhaps you can try getting it to allocate more than 2 GiB on your laptop and see what happens? For this experiment, I'd try increasing the 6 here in opencl_argon2_fmt_plug.c:
// Use almost all GPU memory by default
unsigned int warps = 6, limit, target;
or do a real cracking run with a hash requiring more memory (e.g., one of our test vectors that use 16 MiB, or you can also increase the value of m in there - no need for the hash to remain crackable).
This was sitting here for 3 months with no progress, so I went ahead and merged it as-is, then made my own follow-up commit c042fa3e31217a96160dd9f35762dbbea57bdc56 to fix the NVIDIA host part (untested). Consistent with how we do it elsewhere in the codebase, on host I check for __APPLE__. I notice in OpenCL we check for __OS_X__. I hope this is right, and will match on the same systems, @magnumripper? Is there a good way for us to make this more reliable in case of future changes? For example, if Apple releases another OS where it sets __APPLE__, but does not set __OS_X__ in OpenCL? BTW, I have no idea what macros they set in iOS, which we don't currently support here.
I did not try addressing the 2G single allocation limit mentioned above. I think this needs testing first.
I notice in OpenCL we check for OS_X. I hope this is right, and will match
This is something we created and control at:
https://github.com/openwall/john/blob/c042fa3e31217a96160dd9f35762dbbea57bdc56/src/opencl_common.c#L1154
Plus: https://github.com/openwall/john/blob/c042fa3e31217a96160dd9f35762dbbea57bdc56/src/opencl_common.c#L1155