cuda-fixnum icon indicating copy to clipboard operation
cuda-fixnum copied to clipboard

Extended-precision modular arithmetic library that targets CUDA.

Results 38 cuda-fixnum issues
Sort by recently updated
recently updated
newest added

Could make use of [CUB's "block-wide collective primitives"](https://nvlabs.github.io/cub/group___block_module.html).

Currently takes 3 hours (1 CPU) to generate all the 256 byte modexp test cases.

The `modexp_256` test cases take over 10GiB, much of which is redundant.

First saw the idea [here](https://arxiv.org/abs/1303.0328).

It will relieve register pressure. Allows easily sharing the data between warps. Any other benefits? Could cause bank conflicts when sharing a single word between multiple threads; to investigate.

Potentially useful instructions include - min and max without branching - sum of absolute differences: `sad.u32` - [funnel shift](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)

Seems like a leaky abstraction. For example, probably want right/left shift & rotation functions in fixnum implementations that are defined in terms of the slot_layout shuffles.

There is an implementation of this in the `attic` that definitely performs better than the full version. It would be nicer still to have a single version that works well...

`assert`s in device code are surprisingly (i) compiled in even without debugging flags and (ii) somewhat expensive. They should be kept in for debug builds only.

Montgomery reduction requires an odd modulus, and that's the only one that's implemented at present. Even moduli could be handled by having special code for moduli that are powers of...