icicle
split_scalars_kernel kernel function
The for loop in this kernel can be eliminated by integrating cooperative groups. Instead of a single thread looping over all the limbs of a single scalar, multiple threads can each access a different limb (or sub-parts of the same limb) of the same scalar in parallel. This would require refactoring the arithmetic to support multi-threaded field operations. It is a longer-term optimization, worth looking into to determine whether it is right for your codebase.
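To make the idea concrete, here is a minimal CUDA sketch of a tile of threads sharing one scalar. It is not icicle's actual kernel: the `Scalar` layout (8 limbs of 32 bits), the names `split_scalars_tiled`, `get_digit`, and `split_scalars_host`, the tile size, and the output layout are all assumptions made for illustration. Each thread in a cooperative-groups tile extracts a different c-bit digit of the same scalar, so the per-scalar loop shrinks to a stride over the tile.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

namespace cg = cooperative_groups;

constexpr unsigned LIMBS = 8;  // assumption: 256-bit scalar stored as 8 x 32-bit limbs
constexpr unsigned TILE = 8;   // threads cooperating on one scalar (power of two, <= 32)

struct Scalar { uint32_t limbs[LIMBS]; };  // hypothetical layout, not icicle's type

// Return the c-bit digit number `window` of the scalar (1 <= c <= 32).
__host__ __device__ inline uint32_t get_digit(const Scalar& s, unsigned window, unsigned c) {
  unsigned bit = window * c;
  unsigned limb = bit / 32, shift = bit % 32;
  uint64_t v = s.limbs[limb];
  if (limb + 1 < LIMBS) v |= (uint64_t)s.limbs[limb + 1] << 32;  // digit may straddle two limbs
  return (uint32_t)((v >> shift) & ((1ull << c) - 1));
}

// One tile of TILE threads handles one scalar; the block size must be a multiple of TILE.
// Assumed output layout: digits[w * n + i] holds digit w of scalar i.
__global__ void split_scalars_tiled(uint32_t* digits, const Scalar* scalars,
                                    unsigned n, unsigned windows, unsigned c) {
  cg::thread_block block = cg::this_thread_block();
  cg::thread_block_tile<TILE> tile = cg::tiled_partition<TILE>(block);
  unsigned scalar_idx = (blockIdx.x * blockDim.x + threadIdx.x) / TILE;
  if (scalar_idx >= n) return;  // every thread of a tile shares scalar_idx, so tiles exit whole
  // Each thread starts at its rank in the tile and strides by TILE;
  // when windows <= TILE, each thread performs at most one iteration.
  for (unsigned w = tile.thread_rank(); w < windows; w += TILE)
    digits[w * n + scalar_idx] = get_digit(scalars[scalar_idx], w, c);
}

// Hypothetical host wrapper: copies scalars to the device, launches the kernel,
// and returns the digits. Error checking is omitted to keep the sketch short.
std::vector<uint32_t> split_scalars_host(const std::vector<Scalar>& scalars, unsigned c) {
  unsigned n = (unsigned)scalars.size();
  unsigned windows = (LIMBS * 32 + c - 1) / c;  // number of c-bit digits per scalar
  std::vector<uint32_t> out((size_t)windows * n);
  if (n == 0) return out;
  Scalar* d_scalars = nullptr;
  uint32_t* d_digits = nullptr;
  cudaMalloc(&d_scalars, n * sizeof(Scalar));
  cudaMalloc(&d_digits, out.size() * sizeof(uint32_t));
  cudaMemcpy(d_scalars, scalars.data(), n * sizeof(Scalar), cudaMemcpyHostToDevice);
  unsigned threads = 128;  // multiple of TILE
  unsigned blocks = (n * TILE + threads - 1) / threads;
  split_scalars_tiled<<<blocks, threads>>>(d_digits, d_scalars, n, windows, c);
  cudaMemcpy(out.data(), d_digits, out.size() * sizeof(uint32_t), cudaMemcpyDeviceToHost);
  cudaFree(d_scalars);
  cudaFree(d_digits);
  return out;
}
```

In this sketch every thread still reads the limbs it needs from global memory. A fuller version could have each thread load one limb and exchange limbs with tile collectives such as `tile.shfl`, which is closer to the multi-threaded field arithmetic described above, but that requires all threads of a tile to reach each collective together.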