Iterator adaptors for vectorized point decompression
curve25519-dalek has multiple implementations of finite field arithmetic: serial implementations (using u32s or u64s), and vector implementations (using SIMD vectors). The vector implementations provide arithmetic on vectors of field elements, performing the same operations on multiple field elements simultaneously.
The vector implementations are used for scalar multiplication, which uses parallelism present in the curve formulas to be able to use vectors of field elements. However, point decompression requires a square root computation, which doesn't have "internal" parallelism. For this reason, point decompression uses the serial implementation, even when scalar multiplication uses the vector implementation. However, if an application wanted to decompress multiple points at the same time, it would be possible to parallelize across decompression operations and use the vectorized field arithmetic.
One application where this would be useful would be in batch verification of Ed25519 signatures, where the work breaks into two parts: decompressing points that are part of the signatures, then doing a multiscalar multiplication involving the decompressed points. Right now, the accelerated finite field arithmetic is only used in the second step, not the first.
Here's a sketch of a design for vectorized point decompression that I think would be flexible and reusable:
-
Use the serial implementation of
sqrt_ratio_ito implement thesqrt_ratio_ifunction for the vector field element types, which would compute four square roots simultaneously. -
Define traits
BatchEdwardsDecompress,BatchRistrettoDecompress(naming?) along the lines of
trait BatchRistrettoDecompress {
fn batch_decompress(self) -> impl Iterator<Item=Option<RistrettoPoint>>;
}
impl<T: Iterator<Item=CompressedRistretto>> BatchRistrettoDecompress for T {
/* ... */
}
so that any iterator with Item=CompressedRistretto can be turned into an iterator with Item=Option<RistrettoPoint>.
-
Define a type (maybe
BatchRistrettoDecompressor(naming?)) that implementsIterator<Item=Option<RistrettoPoint>>and holds onto a source iterator withItem=CompressedRistretto. It should take compressed points from the source iterator in groups of four, convert each group into a vector, perform the decompression in parallel to get fourRistrettoPoints, and then return them one at a time as the output ofnext(). (In other words, it should keep a buffer of four results, then refill the buffer with a single vectorized operation). This is what would be returned by thebatch_decompressfunction, but I think the decompressor type can be hidden behind animpl Iterator. -
There are two edge cases to handle in the above: what happens when the source iterator is not a multiple of four (probably this should be handled by padding the vector and then discarding the results), and what happens when some of the input points are invalid (so that decompression gives
None).
When this is all done, the workflow this is supposed to enable is something like the following:
use BatchRistrettoDecompress;
let points: Vec<CompressedRistretto> = /* ... */;
let scalars: Vec<Scalar> = /* ... */;
let result = RistrettoPoint::optional_multiscalar_mul(
scalars.iter(),
points.iter().batch_decompress(),
);
When optional_multiscalar_mul starts consuming the second iterator, it will call next(), which loads four compressed points, decompresses them with vector operations, returns the first one. The following three next() calls return the already-decompressed points, emptying the buffer, and the process repeats for the fourth next() call.
Note: implementing this requires working with the vector backends, which are in the middle of being rewritten / moved around in #215.
I'd like to work on this issue!
Cool! As noted, the vector backends are all rearranged by #215, which hasn't landed yet, so there's no rush. A good first step might be measuring how much time of the ed25519-dalek batch verification is spent on decompression.
I'll try to land #215 so that this can start.
Instead of starting with the vectorized field ops, this could start by defining the iterator adapters, and internally just making them do four (serial) field operations at a time, so that the field arithmetic code can be dropped in later, and that way starting on this doesn't depend on #215