cuda-fixnum
Extended-precision modular arithmetic library that targets CUDA.
Could make use of [CUB's "block-wide collective primitives"](https://nvlabs.github.io/cub/group___block_module.html).
Generating all the 256-byte modexp test cases currently takes 3 hours (1 CPU). The `modexp_256` test cases take over 10 GiB, much of which is redundant.
First saw the idea [here](https://arxiv.org/abs/1303.0328).
It would relieve register pressure and allow data to be shared easily between warps. Any other benefits? It could cause bank conflicts when a single word is shared between multiple threads; this needs investigation.
Potentially useful instructions include:
- min and max without branching
- sum of absolute differences: `sad.u32`
- [funnel shift](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
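For reference, the funnel shift concatenates two 32-bit words and shifts across the boundary, which is exactly the primitive needed for multi-word shifts of a fixnum. A portable sketch of what the wrap variant (`shf.r.wrap` / `__funnelshift_r`, which masks the shift to 0..31) computes:

```cpp
#include <cassert>
#include <cstdint>

// Model of a right funnel shift: shift the 64-bit concatenation hi:lo
// right by `shift` bits and return the low 32 bits of the result.
// The wrap variant masks the shift amount to the range 0..31.
static uint32_t funnelshift_r(uint32_t lo, uint32_t hi, uint32_t shift) {
    shift &= 31;  // wrap semantics; also keeps the C++ shift well-defined
    uint64_t concat = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(concat >> shift);
}
```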
Seems like a leaky abstraction. For example, fixnum implementations probably want right/left shift and rotation functions defined in terms of the `slot_layout` shuffles.
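To illustrate the idea: with one word of a fixnum per thread, a rotation by whole words is just a shuffle, where each thread reads the word held by another lane in its slot. This is a host-side model, not library code; on the device the indexed read would be a `__shfl_sync` within the slot, and the word ordering assumed here is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of a slot of WIDTH "threads", each holding one word of a fixnum.
constexpr size_t WIDTH = 4;

// Rotate the fixnum right by k whole words: thread i takes the word
// previously held by thread (i + k) % WIDTH. On the GPU this indexed
// read is the slot_layout shuffle.
static void rotate_words_right(uint32_t slot[WIDTH], unsigned k) {
    uint32_t out[WIDTH];
    for (size_t i = 0; i < WIDTH; ++i)
        out[i] = slot[(i + k) % WIDTH];  // models the shuffle source lane
    for (size_t i = 0; i < WIDTH; ++i)
        slot[i] = out[i];
}
```

Bit-granular shifts would combine such a shuffle with a funnel shift of each word against its neighbour.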
There is an implementation of this in the `attic` that definitely performs better than the full version. It would be nicer still to have a single version that works well...
`assert`s in device code are surprising in two ways: (i) they are compiled in even without debugging flags, and (ii) they are somewhat expensive. They should be kept in for debug builds only.
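One conventional way to get that behaviour is an assert macro that is explicitly compiled out unless the build is a debug build; the macro name here is hypothetical, not the library's:

```cpp
#include <cassert>
#include <cstdio>

// Compiled out entirely unless the build defines a debug configuration,
// so release device code pays no cost for the checks.
#if !defined(NDEBUG)
#define FIXNUM_ASSERT(cond) assert(cond)
#else
#define FIXNUM_ASSERT(cond) ((void)0)  // no code emitted in release builds
#endif
```

Passing `-DNDEBUG` in release builds then strips the checks, matching the behaviour of host-side `assert`.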
Montgomery reduction requires an odd modulus, and that is the only case implemented at present. Even moduli could be handled by having special code for moduli that are powers of...
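The oddness requirement comes from the precomputed factor n' = -n⁻¹ mod R, which exists only when gcd(n, R) = 1, i.e. when n is odd (R being a power of 2). A single-word sketch of REDC with R = 2⁶⁴, illustrative rather than the library's multi-word implementation:

```cpp
#include <cassert>
#include <cstdint>

// Single-word Montgomery reduction: returns t * R^{-1} mod n for
// t < n*R, where R = 2^64 and n_prime = -n^{-1} mod R.
static uint64_t mont_redc(__uint128_t t, uint64_t n, uint64_t n_prime) {
    uint64_t m = (uint64_t)t * n_prime;              // m = t * n' mod R
    __uint128_t u = (t + (__uint128_t)m * n) >> 64;  // exact division by R
    uint64_t r = (uint64_t)u;
    return r >= n ? r - n : r;                       // u < 2n, one subtract
}

// Newton iteration for n' = -n^{-1} mod 2^64; converges only for odd n.
static uint64_t mont_ninv(uint64_t n) {
    uint64_t x = n;              // for odd n, n is its own inverse mod 8
    for (int i = 0; i < 5; ++i)
        x *= 2 - n * x;          // each step doubles the correct low bits
    return (uint64_t)0 - x;      // negate to get -n^{-1} mod 2^64
}
```

For a power-of-two modulus the reduction is just a mask, which is presumably the special case the note above is pointing at.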