Horner's method for POLY1305-AVX2
Again, not an issue, just an implementation question.
I noticed that while you followed Bernstein and Schwabe's approach of using an unrolled version of Horner's method for the NEON and AVX implementations, i.e.
(inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ (inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
for AVX2 you instead use the plain, non-unrolled Horner's method with the same power of r for all but the tail blocks:
((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4
+ ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3
+ ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2
+ ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1
This has the ramification of performing the reduction twice as often.
I was wondering why you didn't choose the same level of unrolling for 256-bit vectors as for 128-bit ones, e.g. for l==24:
((m[0]*r^8 + m[4]*r^4 + m[ 8])*r^8 + m[12]*r^4 + m[16])*r^8 + m[20]*r^4
+ ((m[1]*r^8 + m[5]*r^4 + m[ 9])*r^8 + m[13]*r^4 + m[17])*r^7 + m[21]*r^3
+ ((m[2]*r^8 + m[6]*r^4 + m[10])*r^8 + m[14]*r^4 + m[18])*r^6 + m[22]*r^2
+ ((m[3]*r^8 + m[7]*r^4 + m[11])*r^8 + m[15]*r^4 + m[19])*r^5 + m[23]*r
This would come at the additional overhead of pre-calculating r^5 through r^8, but it should make the main loop noticeably more performant, no?
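To double-check my own algebra, here is a quick Python sanity check that the groupings above all evaluate the same polynomial (simplified on purpose: blocks treated as plain integers, r unclamped, the 2^128 padding bit omitted):

```python
# Sanity check that all three groupings evaluate the same polynomial
# mod 2^130 - 5. Simplified: blocks as plain integers, r unclamped.
import random

P = (1 << 130) - 5

def horner(m, r):                       # reference: plain Horner
    h = 0
    for b in m:
        h = (h + b) * r % P
    return h

r = random.randrange(P)
r2, r3, r4 = pow(r, 2, P), pow(r, 3, P), pow(r, 4, P)

# 8 blocks, 2 lanes, Bernstein-Schwabe unrolling (NEON/AVX):
m = [random.randrange(1 << 128) for _ in range(8)]
bs = ((m[0]*r4 + m[2]*r2 + m[4])*r4 + m[6]*r2 +
      (m[1]*r4 + m[3]*r2 + m[5])*r3 + m[7]*r) % P
assert bs == horner(m, r)

# 12 blocks, 4 lanes, plain Horner with r^4 (AVX2):
m = [random.randrange(1 << 128) for _ in range(12)]
avx2 = (((m[0]*r4 + m[4])*r4 + m[ 8])*r4 +
        ((m[1]*r4 + m[5])*r4 + m[ 9])*r3 +
        ((m[2]*r4 + m[6])*r4 + m[10])*r2 +
        ((m[3]*r4 + m[7])*r4 + m[11])*r) % P
assert avx2 == horner(m, r)

# 24 blocks, 4 lanes, proposed r^8/r^4 unrolling:
m = [random.randrange(1 << 128) for _ in range(24)]
r8 = pow(r, 8, P)
prop = sum((((m[i]*r8 + m[i+4]*r4 + m[i+8])*r8 +
             m[i+12]*r4 + m[i+16]) * pow(r, 8 - i, P) +
            m[i+20] * pow(r, 4 - i, P)) for i in range(4)) % P
assert prop == horner(m, r)
```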
I was just wondering whether you tried this and it proved less efficient, or whether you decided from the start to stick with the original Horner's method for some other reason? Thanks!
Sigh... I mean, there are so many variables... But briefly: on ARM you have to consider that there are in-order execution cores, and ARM processors also tend to have higher vector instruction latency. This means that the reduction ~~will~~ can cost you dearly, so you're absolutely motivated to amortize it, hence the unrolling of Horner's method. On Intel, on the other hand, all AVX processors are out-of-order and low-latency, so the said unrolling is less important. One can even say it's not that important, because the non-unrolled 256-bit AVX2 code was observed to be just a few percent short of a 2x improvement over the unrolled 128-bit AVX code. I mean, all things being equal, AVX2 should be ~2x faster just because it processes twice as much data in the same unit of time. [And it almost was, even with things not being equal.]

And maybe more importantly, not unrolling comes with the advantage that you don't need to change the size of the state structure on the C side. This made the AVX2 [implementation] binary-compatible with application code, yet it delivered an almost 2x improvement...
This makes perfect sense, thank you very much for the explanation!
May I ask another question regarding your (this time armv8) implementation of POLY1305?
In the main loop, when contracting over the IN01 terms, you start from IN01_2, not IN01_0 (starting from l630). This seems to defer the loading of the next iteration's input (and make it more congested). I wonder whether this was for performance reasons (although I fail to understand how) or for security reasons? I am guessing this was not by mistake, as the same pattern appears in the "tail" part of the code.
I am finishing my Neon->SVE2 porting of the 2-way (i.e. 128-bit) vectorisation of POLY1305 (hopefully to be PR'd to OpenSSL soon), and decided to do this in order (IN01_0 -> IN01_1 -> IN01_2 ...), as the nature of the SVE2 umullb/umullt and umlalb/umlalt instructions requires an even/odd-element layout of data in registers and adds an extra instruction to each GPR->SVE2 data copy, making things even more congested down the line. Such reordering did not seem to worsen performance, but I am worried that I am somehow compromising the security of the algorithm this way.
Btw, I am seeing "only" a 5% drop in efficiency in my SVE2 version, which seems to me quite a good result, given this requirement of "distributing" packed input data into even lanes of vector registers.
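For reference, here is a tiny Python model of the even/odd-lane constraint I mean (lane count and values purely illustrative):

```python
# Scalar Python model of the SVE2 even/odd ("bottom"/"top") widening
# multiplies; lane count and values are purely illustrative.
def umullb(a, b):
    # UMULLB: multiply even-numbered lanes, widening to double width
    return [x * y for x, y in zip(a[0::2], b[0::2])]

def umullt(a, b):
    # UMULLT: multiply odd-numbered lanes, widening to double width
    return [x * y for x, y in zip(a[1::2], b[1::2])]

# To multiply two blocks' limbs held in one register, block 0's limbs
# must sit in the even lanes and block 1's in the odd lanes, which is
# the extra GPR->SVE2 shuffling mentioned above:
a = [11, 22, 33, 44]    # interleaved: block 0 = (11, 33), block 1 = (22, 44)
b = [ 2,  3,  4,  5]
assert umullb(a, b) == [11*2, 33*4]
assert umullt(a, b) == [22*3, 44*5]
```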
you start from IN01_2, not IN01_0
The reason is not IN01, but H. The goal was to consume the Hn-s in the order they are calculated, with the intention of providing the out-of-order execution logic with every possible hint about dependencies. It probably makes no sense past a certain point, defined by how many physical registers are available for renaming. I mean, if the amount of physical registers is not sufficient, the pipeline can "hiccup." Well, even then it's not given that it would actually be a significant problem, because the bottleneck can be elsewhere... In other words, it's not about security, and the alternative order should be fine on high-end processors.
Ohhh, I see... I guess since the Hn's are calculated in the reduction phase and used starting from the middle of the loop (so they are separated by about 40 instructions), I didn't think there could be an issue there... But I understand now, thank you very much!
Btw, according to perf record on Graviton 4, the biggest bottlenecks are at the load instructions in the main loop (and, for some reason, at the neighbouring umlal).
Would it be OK if I recommend you as a reviewer when I PR to OpenSSL btw?
Neon->SVE2 porting of the 2-way (i.e. 128-bit) vectorisation
Does it mean that the implementation won't automatically adjust to wider registers? If so, then what's the point? It's not like SVE2-capable processors don't support NEON. Or is it going to change? But even then, an auto-adjustable/scalable implementation ought to be the goal. There is no point in targeting exclusively a 128-bit vector unit. Even if it turns out that a scalable implementation is slower than a dedicated one, it's unlikely to be so much slower as to be irrelevant. In the worst case one can choose to execute NEON on 128-bit processors and SVE2 on wider ones...
Would it be OK if I recommend you as a reviewer when I PR to OpenSSL btw?
I can't prevent anybody from tagging me, but I won't make any commitments.
You are right, of course. I guess my motivation is to provide a "blueprint" for SVE/2 implementation that could be (I'd be happy to do it myself, I think I know how) extended to say 256-bit vector registers, once there is hardware that supports them (apparently Fujitsu Monaka will have them). I was thinking of trying to PR this into OpenSSL, but I do realize that similar considerations might lead to the reviewers rejecting it.
My current understanding is that, because one has to re-formulate the polynomial for each level of parallelisation (Horner's method), involving a different number of growing powers of r and also different "tails", a fully general implementation is not possible. The main loop, however, should be general and could be factored out to be re-used (see the sketch below).
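As a Python model of what I mean (a sketch with hypothetical naming, ignoring Poly1305's padding and clamping details), the main loop is identical for any lane count n and only the tail powers are lane-dependent:

```python
P = (1 << 130) - 5

def poly_nway(m, r, n):
    # lane j processes blocks j, j+n, j+2n, ... with the same r^n per step
    assert len(m) % n == 0 and len(m) >= n
    rn = pow(r, n, P)
    h = m[:n]                            # first block of each lane
    for k in range(n, len(m), n):        # generic, width-independent loop
        h = [(h[j] * rn + m[k + j]) % P for j in range(n)]
    # lane-dependent tail: lane j's final multiply is by r^(n-j)
    return sum(h[j] * pow(r, n - j, P) for j in range(n)) % P

def horner(m, r):                        # plain Horner for reference
    h = 0
    for b in m:
        h = (h + b) * r % P
    return h

import random
r = random.randrange(P)
m = [random.randrange(1 << 128) for _ in range(24)]
for n in (2, 4, 8):
    assert poly_nway(m, r, n) == horner(m, r)
```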
Also, I thought it might be interesting to keep an SVE/2 implementation anyway, in case future Arm processors are more optimized for it than for NEON, although currently this doesn't seem to be the case.
I by no means want to trouble you by tagging you in my PR if you would rather not be involved (and of course I am not asking for any commitments). I am grateful enough for your answers here!
I mean, of course it IS possible in principle... I just thought it would be very cumbersome in pure assembly... But I am starting to doubt my initial assessment of how hard it is...
In https://github.com/dot-asm/cryptogams/issues/17#issuecomment-3053181340 I've suggested reading the manual, but you chose to google things :-) You can't google things that were never done! SVE (yes, pre-SVE2) does have multiplication instructions that allow base 2^44 implementation. It's not given that they will deliver better results, because it takes more than just these instructions. But the possibility should be evaluated...
"blueprint" for SVE/2 implementation that could be extended to say 256-bit vector registers,
No, the suggestion is to make it actually width-agnostic. Well, it's unrealistic to make it literally unbounded, but at least it shouldn't be limited by a 2025 prospect value. I've just committed a poly1305-riscv implementation that utilizes up to 2048 bits. Tested only up to qemu's limit of 1024 though. It doesn't calculate all the powers, but only powers of 2. This translates to more multiplications in the tail, but I consider the compromise acceptable. At the very least because it won't require adjustments for many years. Besides, the most common usage pattern is a single call per key, and you always pre-compute the powers, so there is always overhead... Formally speaking, the implementation should adjust to both register widths and input length. I mean, the next step is: if the input length is not sufficiently large, I can utilize 1/2 of a register to minimize the overhead for short inputs.
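To illustrate the idea, a simplified scalar Python model of the powers-of-two folding (an illustration only, not the actual poly1305-riscv code): with n lane accumulators, fold by repeated halving, so only r, r^2, r^4, ... are ever pre-computed, at the cost of log2(n) tail rounds.

```python
P = (1 << 130) - 5

def fold(h, r):
    n = len(h)
    while n > 1:
        n //= 2
        rn = pow(r, n, P)           # always a power-of-two power of r
        h = [(h[j] * rn + h[j + n]) % P for j in range(n)]
    return h[0] * r % P             # final multiply by r^1

# lane j of an n-way split needs the weight r^(n-j); the fold produces
# exactly that sum without ever forming the odd powers r^3, r^5, ...
import random
r = random.randrange(P)
h = [random.randrange(P) for _ in range(8)]
assert fold(h, r) == sum(h[j] * pow(r, 8 - j, P) for j in range(8)) % P
```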
In https://github.com/dot-asm/cryptogams/issues/17#issuecomment-3053181340 I've suggested reading the manual, but you chose to google things :-) You can't google things that were never done! SVE (yes, pre-SVE2) does have multiplication instructions that allow base 2^44 implementation. It's not given that they will deliver better results, because it takes more than just these instructions. But the possibility should be evaluated...
I am sorry, I still need to wrap my head around what implementing in a different base entails. Could you please point me at the instructions you meant?
For the record, I do try to read the manual as much as I can, just not all 14.5k pages from scratch... :) I attribute my inefficient googling attempts to my being new to the field and lacking much of the context.
I've just committed a poly1305-riscv implementation that utilizes up to 2048 bits. Tested only up to qemu's limit of 1024 though. It doesn't calculate all the powers, but only powers of 2. This translates to more multiplications in the tail, but I consider the compromise acceptable.
Wow, interesting (and thank you)! Now I have to learn riscv assembly as well (or just do the maths myself) :D
I am sorry,
And where is smiley? I mean we're just kidding around with regard to reading the manual, right?
I still need to wrap my head around what implementing in a different base entails. Could you please point me at the instructions you meant?
mul and umulh are specified with .d qualifier, so you have 64-bit widening multiplication. One can build either base 2^64 or base 2^44. Again, not saying that it's guaranteed path, but it needs to be evaluated...
For the record, I do try to read the manual as much as I can, just not all 14.5k pages from scratch... :)
This is no excuse, the SVE part is only 1900 pages :-)
And where is smiley? I mean we're just kidding around with regard to reading the manual, right?
Of course :))
mul and umulh are specified with .d qualifier, so you have 64-bit widening multiplication. One can build either base 2^64 or base 2^44. Again, not saying that it's guaranteed path, but it needs to be evaluated...
I think I am starting to understand... I reformulate the 130-bit accumulator in terms of fewer, higher-base limbs, then use 44x44 -> 88-bit multiplication, doing MUL/UMULH back-to-back to get the lower and higher bits of the result, right? Then do some (simpler) reduction... Bummer that SVE/1 doesn't allow for unpredicated versions of these...
This is no excuse, the SVE part is only 1900 pages :-)
Indeed... :)
I think I am starting to understand... I reformulate the 130-bit accumulator in terms of fewer, higher-base limbs, then use 44x44 -> 88-bit multiplication, doing MUL/UMULH back-to-back to get the lower and higher bits of the result, right? Then do some (simpler) reduction...
Right. And 44 is because 3*44 = 132 is the closest multiple to 130 while still being larger.
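And for the record, one way the multiply-and-reduce can look in that base, as a scalar Python model with the customary 44+44+42 limb split (a sketch, not lifted from any particular implementation):

```python
# Base-2^44 Poly1305 multiply-and-reduce, modelled with Python integers.
# On SVE, each 44x44 -> 88-bit product would come from a MUL/UMULH .d
# pair giving the low/high 64 bits of the full product.
P = (1 << 130) - 5
M44 = (1 << 44) - 1

def mul_reduce(h, r):
    h0, h1, h2 = h                      # 44 + 44 + 42-bit limbs
    r0, r1, r2 = r
    s1, s2 = 20 * r1, 20 * r2           # 2^132 = 4 * 2^130 ≡ 4*5 = 20 (mod P)
    d0 = h0*r0 + h1*s2 + h2*s1          # weight 2^0  (wrapped 2^132 terms)
    d1 = h0*r1 + h1*r0 + h2*s2          # weight 2^44 (wrapped 2^176 term)
    d2 = h0*r2 + h1*r1 + h2*r0          # weight 2^88
    # carry propagation; a carry out of the 42-bit top limb has weight
    # 2^130 ≡ 5, so it wraps into d0 multiplied by 5:
    d1 += d0 >> 44; d0 &= M44
    d2 += d1 >> 44; d1 &= M44
    d0 += 5 * (d2 >> 42); d2 &= (1 << 42) - 1
    d1 += d0 >> 44; d0 &= M44
    return d0, d1, d2

# check against plain modular arithmetic:
import random
def to_limbs(x):   return (x & M44, (x >> 44) & M44, x >> 88)
def from_limbs(l): return l[0] + (l[1] << 44) + (l[2] << 88)
h, r = random.randrange(P), random.randrange(P)
assert from_limbs(mul_reduce(to_limbs(h), to_limbs(r))) % P == h * r % P
```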
I wish somebody had written a (white) paper with the exact formulas :) But I guess I want too much.
I will study your riscv implementation, thanks for this!