Hamish Ivey-Law
See p. 24 of https://cryptojedi.org/peter/data/space-20141020.pdf
See, for example, https://www.shiftleft.org/blog/faster_arithmetic/
A Wallace tree reduces the number of carry-reduction layers from O(n) to O(log(n)). See: https://en.wikipedia.org/wiki/Wallace_tree
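As a rough illustration of the layer count, the sketch below sums a list of 64-bit words with carry-save adders (3:2 compressors) arranged in Wallace-style layers. It is a word-level analogy rather than the bit-level partial-product tree used in hardware multipliers, and the names `carry_save_add`, `wallace_sum`, and `Reduction` are hypothetical, not part of cuda-fixnum.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// 3:2 compressor (carry-save adder): a + b + c == sum + carry (mod 2^64).
// Every bit position is handled independently, so no carry chain is formed.
inline std::pair<std::uint64_t, std::uint64_t>
carry_save_add(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    std::uint64_t sum = a ^ b ^ c;                             // per-bit sum
    std::uint64_t carry = ((a & b) | (a & c) | (b & c)) << 1;  // per-bit majority
    return {sum, carry};
}

struct Reduction {
    std::uint64_t total;  // sum of all operands modulo 2^64
    std::size_t layers;   // number of carry-save layers applied
};

// Each layer turns every group of three terms into two, so the number of
// terms shrinks by roughly a factor of 2/3 per layer: O(log n) layers.
inline Reduction wallace_sum(std::vector<std::uint64_t> terms) {
    std::size_t layers = 0;
    while (terms.size() > 2) {
        std::vector<std::uint64_t> next;
        std::size_t i = 0;
        for (; i + 3 <= terms.size(); i += 3) {
            std::pair<std::uint64_t, std::uint64_t> sc =
                carry_save_add(terms[i], terms[i + 1], terms[i + 2]);
            next.push_back(sc.first);
            next.push_back(sc.second);
        }
        for (; i < terms.size(); ++i) next.push_back(terms[i]);  // leftovers pass through
        terms = std::move(next);
        ++layers;
    }
    // Only this final step is an ordinary carry-propagating addition.
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < terms.size(); ++i) total += terms[i];
    return Reduction{total, layers};
}
```

With nine operands the term count goes 9, 6, 4, 3, 2, so four carry-save layers are needed before the single carry-propagating add, whereas a sequential chain would need eight dependent additions.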
The main advantage of a signed-digit representation is faster addition, because it eliminates chains of dependent carries. 2-NAF is the special case in binary. See: https://en.wikipedia.org/wiki/Signed-digit_representation
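For concreteness, here is a minimal sketch of converting an unsigned integer to its binary non-adjacent form (digits in {-1, 0, 1}, no two adjacent digits nonzero). It only illustrates the 2-NAF special case mentioned above; the helper names `to_naf`, `from_naf`, and `is_non_adjacent` are hypothetical and do not come from cuda-fixnum.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the NAF digits of n, least significant first.
inline std::vector<int> to_naf(std::uint64_t n) {
    std::vector<int> digits;
    while (n != 0) {
        int d = 0;
        if (n & 1) {
            // d = 1 if n == 1 (mod 4), d = -1 if n == 3 (mod 4);
            // either way (n - d) is divisible by 4, forcing the next digit to 0.
            d = 2 - static_cast<int>(n & 3);
        }
        digits.push_back(d);
        // (n - d) / 2, written so that n + 1 cannot overflow.
        n = (n >> 1) + (d < 0 ? 1 : 0);
    }
    return digits;
}

// Evaluates a digit string back to an integer modulo 2^64 (Horner's rule).
inline std::uint64_t from_naf(const std::vector<int>& digits) {
    std::uint64_t value = 0;
    for (std::size_t i = digits.size(); i-- > 0;) {
        value = 2 * value
              + static_cast<std::uint64_t>(static_cast<std::int64_t>(digits[i]));
    }
    return value;
}

// Checks the defining property: no two neighbouring digits are both nonzero.
inline bool is_non_adjacent(const std::vector<int>& digits) {
    for (std::size_t i = 1; i < digits.size(); ++i) {
        if (digits[i] != 0 && digits[i - 1] != 0) return false;
    }
    return true;
}
```

For example, `to_naf(7)` yields the digits -1, 0, 0, 1 (least significant first), that is 8 - 1, which has two nonzero digits instead of the three in plain binary.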
- For modular operations we should load the modulus into shared memory so it can be used across the block without wasting registers.
- Need to consider access patterns and...
See [IACR 2018/300](https://eprint.iacr.org/2018/300).
From https://github.com/data61/cuda-fixnum/issues/50:

> Including, in no particular order
> - [ ] Generate data and graph it, rather than generating large difficult-to-interpret tables.
> - [ ] Generate data for...
Consider implementing a [register cache in shared memory](https://devblogs.nvidia.com/register-cache-warp-cuda/) to ease register pressure. Particularly relevant with slot_layout/grids with 'large' height.
From https://github.com/data61/cuda-fixnum/issues/45:

> Guidance is provided [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x).
The [Wycheproof Project](https://github.com/google/wycheproof/) provides tools for testing crypto libraries against known attacks. Its tests use inputs that are known to cause serious security failures in implementations of RSA and...