Add optimized ciphertext-plaintext matmul from Tricycle paper
A continuation of https://github.com/google/heir/issues/1376
We added a ct-ct matmul kernel, with modifications from the Tricycle paper. The same paper has a more optimized kernel for ct-pt matmul (6.1), which should go alongside (or replace?) our existing ct-pt matmul kernel that just "stacks" the Halevi-Shoup kernel.
Admittedly, this one is trickier to implement because while the paper discusses how to derive the diagonal traversal, it omits some details (for brevity reasons) on how to calculate the negative rotations properly (last paragraph of 6.1):
To fix the remaining incorrect slots, we perform additional rotations (in this case, a -2 rotation) to obtain the remaining correct slots (the latter 4 values), which are then also masked by a plaintext properly.
What's also not discussed is how to have these additional rotations be optimized with BSGS.
Still though, it's a good first issue as the paper's example in 6.1 (I believe) should provide enough to figure it out as you go.
@lawrencekhlim do you have a public reference implementation?
My implementation is not publicly available yet unfortunately, but I can help talk through things that may need to be tweaked.