Hoisting optimization for repeated rotations of the same ciphertext
I came across this tidbit when reading the Gazelle paper [1]:
Book-keeping: Hoisting
The hoisting optimization reduces the cost of the ciphertext rotation when the same ciphertext must be rotated by multiple shift amounts. The idea, roughly speaking, is to “look inside” the ciphertext rotation operation, and hoist out the part of the computation that would be common to these rotations and then compute it only once thus amortizing it over many rotations. It turns out that this common computation involves computing the NTT^{-1} (taking the ciphertext to the coefficient domain), followed by a
w_{relin}-bit decomposition that splits the ciphertext into (log2 q)/w_{relin} ciphertexts and finally takes these ciphertexts back to the evaluation domain using separate applications of NTT. The parameter w_{relin} is called the relinearization window and represents a tradeoff between the speed and noise growth of the Perm operation. This computation, which we denote as PermDecomp, has Θ(n log n) complexity because of the number theoretic transforms. In contrast, the independent computation in each rotation, denoted by PermAuto, is a simple Θ(n) multiply and accumulate operation. As such, hoisting can provide substantial savings in contrast with direct applications of the Perm operation and this is also borne out by the benchmarks in Table VII.
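To make the structure concrete, here is a toy sketch (not from the paper) over a power-of-two ring Z_q[X]/(X^n + 1). It ignores the NTTs and key-switching keys entirely and only checks the algebraic fact that makes hoisting work: the base-B digit decomposition of a ciphertext polynomial can be computed once and reused for every rotation amount, because the automorphism X -> X^k is a ring homomorphism and so commutes with the gadget reconstruction. All parameter values are illustrative.

```cpp
// Toy model of hoisting over R_q = Z_q[X]/(X^n + 1): the digit decomposition
// (analogue of PermDecomp) is computed ONCE; only the automorphism and a
// multiply-accumulate (analogue of PermAuto) are repeated per shift amount.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

using Poly = std::vector<int64_t>;  // coefficients in [0, q)

constexpr int64_t q = 1 << 20;      // toy modulus
constexpr size_t n = 8;             // ring degree (power of two)
constexpr int64_t B = 1 << 5;       // decomposition base 2^{w_relin}
constexpr size_t numDigits = 4;     // ceil(log2(q) / w_relin)

// sigma_k: X^i -> X^{i*k mod 2n}, with a sign flip when the exponent wraps
// past n (since X^n = -1 in Z_q[X]/(X^n + 1)). k must be odd.
Poly automorphism(const Poly& c, size_t k) {
  Poly out(n, 0);
  for (size_t i = 0; i < n; ++i) {
    size_t e = (i * k) % (2 * n);
    if (e < n) out[e] = (out[e] + c[i]) % q;
    else       out[e - n] = (out[e - n] + q - c[i]) % q;
  }
  return out;
}

// Coefficient-wise base-B digit decomposition: c = sum_j digits[j] * B^j.
std::vector<Poly> decompose(const Poly& c) {
  std::vector<Poly> digits(numDigits, Poly(n, 0));
  for (size_t i = 0; i < n; ++i) {
    int64_t rem = c[i];
    for (size_t j = 0; j < numDigits; ++j) {
      digits[j][i] = rem % B;
      rem /= B;
    }
  }
  return digits;
}

int main() {
  std::mt19937_64 rng(42);
  Poly c(n);
  for (auto& coef : c) coef = rng() % q;

  // Hoisted part: done once for the ciphertext, shared by all rotations.
  std::vector<Poly> digits = decompose(c);

  // Per-rotation part: apply sigma_k to each precomputed digit; in a real
  // scheme these rotated digits would be multiplied by the key-switching
  // key and accumulated.
  for (size_t k : {3, 5, 7}) {
    Poly direct = automorphism(c, k);

    Poly reconstructed(n, 0);
    int64_t basePow = 1;
    for (size_t j = 0; j < numDigits; ++j) {
      Poly rotatedDigit = automorphism(digits[j], k);
      for (size_t i = 0; i < n; ++i)
        reconstructed[i] = (reconstructed[i] + rotatedDigit[i] * basePow) % q;
      basePow *= B;
    }
    // Rotating the digits and reconstructing gives the rotated ciphertext.
    assert(reconstructed == direct);
    std::cout << "sigma_" << k << ": hoisted digits reconstruct correctly\n";
  }
  return 0;
}
```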
This seems like a relevant optimization for us, particularly when we lower SIMD FHE schemes to polynomial.
[1]: Gazelle: A Low Latency Framework for Secure Neural Network Inference. Chiraag Juvekar, Vinod Vaikuntanathan, Anantha Chandrakasan. https://arxiv.org/abs/1801.05507
For reference, this is the original work on hoisting: https://eprint.iacr.org/2018/244.
I wonder to what extent CSE/canonicalization can give us (some of these) for "free"...
I would assume at a low enough level it would, but only at the point where we run element-wise-to-affine and unroll the loop that performs the digit decomposition and the automorphism, which is pretty low-level (but this is the poly level, so I think hardware accelerators would get that!). Otherwise, it relies on the fact that the automorphism and digit decomposition "commute," since the automorphism is a linear operator.
This would be a really useful optimization for the diagonal matrix-vector multiplication method, which needs to perform many rotations of the same input vector.
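For reference, a minimal plaintext sketch of that access pattern (sizes and values are made up): every diagonal term reads a rotation of the same vector x, which is exactly the situation where a single hoisted precomputation can be shared by all the rotations.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<long long>;
using Mat = std::vector<Vec>;

// Diagonal matrix-vector product:
//   y = sum_i diag_i(A) .* rotate(x, i),
// where diag_i(A)[j] = A[j][(j + i) mod n] and rotate(x, i)[j] = x[(j + i) mod n].
// Every term rotates the SAME vector x, so in the encrypted setting all n
// rotations can share one hoisted decomposition of x's ciphertext.
Vec diagonalMatVec(const Mat& A, const Vec& x) {
  const size_t n = x.size();
  Vec y(n, 0);
  for (size_t i = 0; i < n; ++i) {      // one term per generalized diagonal
    for (size_t j = 0; j < n; ++j) {
      // x[(j + i) % n] plays the role of the i-step ciphertext rotation.
      y[j] += A[j][(j + i) % n] * x[(j + i) % n];
    }
  }
  return y;
}

int main() {
  Mat A = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
  Vec x = {1, 1, 2, 3};

  // Compare against the usual row-by-row product.
  Vec expected(4, 0);
  for (size_t j = 0; j < 4; ++j)
    for (size_t k = 0; k < 4; ++k) expected[j] += A[j][k] * x[k];

  assert(diagonalMatVec(A, x) == expected);
  std::cout << "diagonal method matches the row-wise product\n";
  return 0;
}
```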
Not particularly about the SIMD schemes -> polynomial flow, but for reference, this is implemented in libraries other than HElib as well. Relevantly, OpenFHE includes APIs for the hoisted and optimized operations as EvalFastRotationPrecompute and EvalFastRotation. See the advanced-real-numbers.cpp example in OpenFHE.
So cool, thanks @eymay - it's great that there's OpenFHE support for this; we can definitely add it to the pipeline and create a pattern rewrite from repeated rotations to the fast rotation API. @lawrencekhlim, check this out for the vector rotations in H-V / squat matrix multiplication! https://github.com/openfheorg/openfhe-development/blob/main/src/pke/examples/advanced-real-numbers.cpp#L748-L759
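For concreteness, a trimmed sketch in the spirit of the linked snippet: EvalFastRotationPrecompute and EvalFastRotation are the APIs named above, but the surrounding CKKS setup (parameter values, enabled features) is only illustrative and may need tweaking across OpenFHE versions.

```cpp
// Hoisted rotations via OpenFHE's fast-rotation API: one precompute over the
// ciphertext, reused by every rotation index, instead of repeated EvalRotate.
#include <iostream>
#include <vector>

#include "openfhe.h"

using namespace lbcrypto;

int main() {
    CCParams<CryptoContextCKKSRNS> parameters;
    parameters.SetMultiplicativeDepth(1);
    parameters.SetScalingModSize(50);
    parameters.SetBatchSize(8);

    CryptoContext<DCRTPoly> cc = GenCryptoContext(parameters);
    cc->Enable(PKE);
    cc->Enable(KEYSWITCH);
    cc->Enable(LEVELEDSHE);

    auto keys = cc->KeyGen();
    cc->EvalRotateKeyGen(keys.secretKey, {1, 2, 3, 4});

    std::vector<double> x = {0, 1, 2, 3, 4, 5, 6, 7};
    auto ct = cc->Encrypt(keys.publicKey, cc->MakeCKKSPackedPlaintext(x));

    // Cyclotomic order, required by EvalFastRotation.
    uint32_t M = 2 * cc->GetRingDimension();

    // Hoisted part: the expensive decomposition of ct is computed once...
    auto precomp = cc->EvalFastRotationPrecompute(ct);

    // ...and reused for every rotation index; only the cheap per-rotation
    // work is repeated.
    for (int32_t idx : {1, 2, 3, 4}) {
        auto ctRot = cc->EvalFastRotation(ct, idx, M, precomp);

        Plaintext result;
        cc->Decrypt(keys.secretKey, ctRot, &result);
        result->SetLength(x.size());
        std::cout << "rotate by " << idx << ": " << result << std::endl;
    }
    return 0;
}
```

A pattern rewrite in the pipeline could match N rotations of the same ciphertext and emit one EvalFastRotationPrecompute followed by N EvalFastRotation calls, mirroring this shape.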
Should we close this, given that we can do this (at least for the openfhe dialect), and file an issue for lattigo / others?