curve25519-dalek Iterator adaptors for vectorized point decompression

curve25519-dalek has multiple implementations of finite field arithmetic: serial implementations (using u32s or u64s), and vector implementations (using SIMD vectors). The vector implementations provide arithmetic on vectors of field elements, performing the same operations on multiple field elements simultaneously.

The vector implementations are used for scalar multiplication, which uses parallelism present in the curve formulas to be able to use vectors of field elements. However, point decompression requires a square root computation, which doesn't have "internal" parallelism. For this reason, point decompression uses the serial implementation, even when scalar multiplication uses the vector implementation. However, if an application wanted to decompress multiple points at the same time, it would be possible to parallelize across decompression operations and use the vectorized field arithmetic.

One application where this would be useful would be in batch verification of Ed25519 signatures, where the work breaks into two parts: decompressing points that are part of the signatures, then doing a multiscalar multiplication involving the decompressed points. Right now, the accelerated finite field arithmetic is only used in the second step, not the first.

Here's a sketch of a design for vectorized point decompression that I think would be flexible and reusable:

Use the serial implementation of sqrt_ratio_i to implement the sqrt_ratio_i function for the vector field element types, which would compute four square roots simultaneously.
Define traits BatchEdwardsDecompress, BatchRistrettoDecompress (naming?) along the lines of

trait BatchRistrettoDecompress {
    fn batch_decompress(self) -> impl Iterator<Item=Option<RistrettoPoint>>;
}

impl<T: Iterator<Item=CompressedRistretto>> BatchRistrettoDecompress for T {
     /* ... */
}

so that any iterator with Item=CompressedRistretto can be turned into an iterator with Item=Option<RistrettoPoint>.

Define a type (maybe BatchRistrettoDecompressor (naming?)) that implements Iterator<Item=Option<RistrettoPoint>> and holds onto a source iterator with Item=CompressedRistretto. It should take compressed points from the source iterator in groups of four, convert each group into a vector, perform the decompression in parallel to get four RistrettoPoints, and then return them one at a time as the output of next(). (In other words, it should keep a buffer of four results, then refill the buffer with a single vectorized operation). This is what would be returned by the batch_decompress function, but I think the decompressor type can be hidden behind an impl Iterator.
There are two edge cases to handle in the above: what happens when the source iterator is not a multiple of four (probably this should be handled by padding the vector and then discarding the results), and what happens when some of the input points are invalid (so that decompression gives None).

When this is all done, the workflow this is supposed to enable is something like the following:

use BatchRistrettoDecompress;
let points: Vec<CompressedRistretto> = /* ... */;
let scalars: Vec<Scalar> = /* ... */;
let result = RistrettoPoint::optional_multiscalar_mul(
    scalars.iter(),
    points.iter().batch_decompress(),
);

When optional_multiscalar_mul starts consuming the second iterator, it will call next(), which loads four compressed points, decompresses them with vector operations, returns the first one. The following three next() calls return the already-decompressed points, emptying the buffer, and the process repeats for the fourth next() call.

Note: implementing this requires working with the vector backends, which are in the middle of being rewritten / moved around in #215.

Dec 12 '18 19:12 hdevalence

I'd like to work on this issue!

Dec 12 '18 19:12 DebugSteven

Cool! As noted, the vector backends are all rearranged by #215, which hasn't landed yet, so there's no rush. A good first step might be measuring how much time of the ed25519-dalek batch verification is spent on decompression.

Dec 12 '18 21:12 hdevalence

I'll try to land #215 so that this can start.

Jan 12 '19 23:01 hdevalence

Instead of starting with the vectorized field ops, this could start by defining the iterator adapters, and internally just making them do four (serial) field operations at a time, so that the field arithmetic code can be dropped in later, and that way starting on this doesn't depend on #215

Jan 17 '19 01:01 hdevalence