Argon2: Precompute or cache 2i/2id indices
As I wrote in https://github.com/openwall/john/issues/2738#issuecomment-328273076
-- cut --
... a potential optimization for 2i and 2id: the data-independent indices can be reused across hash computations, rather than recomputed each time like upstream does. In our current hack of older upstream code, we're already passing the pseudo_rands array from the application, yet somehow we don't appear to be making this optimization. In the latest upstream code, this array is gone - as far as I can tell, the data-independent indices are now calculated in smaller portions, which makes sense for that approach. We'll probably need to reintroduce an equivalent of the array (as an option), written to on first use or when invoked with higher parameters, and reused on subsequent calls with the same or lower parameters, in the places where the new upstream code normally does these things (so that we don't deviate from upstream too much). That probably means in next_addresses().
I think https://gitlab.com/omos/argon2-gpu already has this optimization (for GPU) - grep it for "precompute". -- cut --
However, now that we've got a revision of the above OpenCL implementation into our tree here, it doesn't appear to have this optimization. Maybe it was dropped at some point, or maybe I was wrong that it was there, or I'm wrong that it isn't there now. Either way, right now it looks like we'd need to try implementing this for both CPU and OpenCL.
The pseudo_rands array is indeed gone from our tree with #5557.
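To illustrate the caching pattern I have in mind (not the actual next_addresses() logic - the index function below is a made-up stand-in for Argon2's G-based address generation), a minimal C sketch of a parameter-keyed index cache could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for Argon2i's data-independent address
 * generation.  The real indices are produced in next_addresses() via
 * the compression function G; the point here is only the caching
 * pattern: the indices depend on the parameters, but not on the
 * password, so they can be computed once and reused across hashes. */
static uint32_t toy_index(uint32_t m_cost, uint32_t t_cost, uint32_t i)
{
    uint32_t x = i * 2654435761u ^ m_cost * 40503u ^ t_cost;
    x ^= x >> 15;
    return x % m_cost;  /* a block index in [0, m_cost) */
}

struct index_cache {
    uint32_t m_cost, t_cost;  /* parameters the cache was built for */
    uint32_t *idx;            /* cached data-independent indices */
    size_t n;                 /* number of cached indices */
};

/* Return the cached indices, recomputing only when the parameters
 * change.  (The real thing could also reuse a cache built for higher
 * parameters on calls with lower ones, per the comment above.) */
static const uint32_t *get_indices(struct index_cache *c,
                                   uint32_t m_cost, uint32_t t_cost)
{
    size_t n = (size_t)m_cost * t_cost;

    if (c->idx && c->m_cost == m_cost && c->t_cost == t_cost)
        return c->idx;  /* cache hit: skip recomputation entirely */

    free(c->idx);
    c->idx = malloc(n * sizeof(uint32_t));
    for (size_t i = 0; i < n; i++)
        c->idx[i] = toy_index(m_cost, t_cost, (uint32_t)i);
    c->m_cost = m_cost;
    c->t_cost = t_cost;
    c->n = n;
    return c->idx;
}
```

In the real code, such a cache would presumably be filled from inside next_addresses() on a miss and keyed on every index-affecting parameter (m_cost, t_cost, lanes, version, type), so we'd stay close to upstream's structure.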
A further and more important optimization for 2i would be to introduce index mapping and store only those blocks that will actually be read later, reusing memory for blocks that will never be read again. Per the Balloon paper, this reduces memory to 1/4 or even 1/5 in single-pass Argon2i, and asymptotically to 1/e, with no increase in computation (only a tiny overhead for the extra index indirection). So we could actually lower our per-hash memory allocation and fit more concurrent hash computations in the same GPU memory, which should provide a speedup at high m_cost.
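The scheduling half of that index-mapping idea can be sketched in isolation. The reference pattern ref[] below is hypothetical (block j reads block ref[j] < j; single pass, single lane), not Argon2i's actual indexing - the point is only that, with data-independent references known in advance, each block's physical slot can be recycled once no later block will read it:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Given a data-independent reference pattern (block j reads block
 * ref[j], with ref[j] < j), fill map[] with a block -> physical slot
 * assignment that reuses slots of dead blocks, and return the peak
 * slot count.  A real implementation would apply map[] inside the
 * memory-filling loop; this only demonstrates the slot planning. */
static uint32_t plan_slots(const uint32_t *ref, uint32_t m, uint32_t *map)
{
    uint32_t *last_use = malloc(m * sizeof *last_use);
    uint32_t *free_stack = malloc(m * sizeof *free_stack);
    int32_t *die_head = malloc(m * sizeof *die_head);
    int32_t *die_next = malloc(m * sizeof *die_next);
    uint32_t n_free = 0, next_slot = 0;

    for (uint32_t i = 0; i < m; i++)
        last_use[i] = i;           /* alive at least until written */
    for (uint32_t j = 1; j < m; j++)
        last_use[ref[j]] = j;      /* j increases, so this ends at max */

    /* Bucket blocks by the step after which they die. */
    for (uint32_t j = 0; j < m; j++)
        die_head[j] = -1;
    for (uint32_t i = 0; i < m; i++) {
        die_next[i] = die_head[last_use[i]];
        die_head[last_use[i]] = (int32_t)i;
    }

    /* Assign slots, recycling those of blocks no longer needed. */
    for (uint32_t j = 0; j < m; j++) {
        map[j] = n_free ? free_stack[--n_free] : next_slot++;
        for (int32_t i = die_head[j]; i >= 0; i = die_next[i])
            free_stack[n_free++] = map[i];
    }

    free(last_use); free(free_stack); free(die_head); free(die_next);
    return next_slot;              /* peak number of physical slots */
}
```

For reference patterns resembling Argon2i's, the peak would be expected to come out well below m_cost, which is exactly the memory reduction described above; the extra cost at fill time is one table lookup per block access.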
For 2id, this is trickier because we'd then need different memory allocations and different GWS for the i and d phases - e.g., perhaps run the i kernel once and then the d kernel 2 to 5 times sequentially.
https://eprint.iacr.org/2016/027
The above attacks were against Argon2i version 1.2.1, whereas in version 1.3, which is in use now, "The blocks are XORed with, not overwritten in the second pass and later" specifically to mitigate the attacks. However, note how this says "in the second pass and later", suggesting that single-pass Argon2i (and perhaps the similar single pass in 2id?) may still be attackable for a 1/4 or even 1/5 reduction in memory usage, and thus greater concurrency.
Also relevant:
Joël Alwen, Jeremiah Blocki, "Towards Practical Attacks on Argon2i and Balloon Hashing"
https://ieeexplore.ieee.org/document/7961977
but that one focuses on larger m_cost than is likely practical on GPUs anyway (1+ GB).