Porting X25519 to OpenSSL?
Hi,
This is not really an issue, but I figured it might be the best place to ask.
I can see there exists an armv8 assembly implementation of the X25519 curve here, but it is absent in OpenSSL. I was wondering if there is a license (or some other) issue that prevents porting it, and if not, I'd be happy to do the porting (and benchmarking on modern Arm archs). As far as I understand, (most of?) the other algorithms whose implementations are present in this repo have also been ported to OpenSSL.
Many thanks! Iakov
I'd be happy to do the porting (and benchmarking on modern Arm archs).
Sure, go ahead. However, I wouldn't be surprised if it turns out that the benefits are less impressive than expected, and hence unjustifiable on newer processors. Just in case, for reference: the point is to reduce the number of multiplications, but that comes at the cost of longer dependency chains, which might be a problem. As an extreme example, on POWER8 and onward the radix-2^64 subroutines are even slower than the radix-2^51 ones. Go figure...

Either way. If you're set on squeezing out more performance, you can consider implementing the ladder step [in scalar multiplication] in assembly, but it makes less sense on ARM64, because the assembly subroutines in question don't allocate stack frames, so you won't be eliminating many of the loads and stores. You might find it tempting to implement sqr_n_mul to assist the inversion, but one has to recognize that there are faster algorithms than the one in question. I can see if I can adapt one I have for the 2^255-19 modulus...
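For orientation, here is the ladder step in question written out as a C sketch in RFC 7748 terms. The fe type and the fe_* helpers are placeholders for whichever field routines a port would use (radix 2^51 or 2^64), not actual subroutine names, and the constant-time swap on the current scalar bit is assumed to happen outside the step:

```c
#include <stdint.h>

/* Placeholder field-element type and arithmetic modulo 2^255-19;
 * stand-ins for whatever radix-2^51 or radix-2^64 routines are used. */
typedef uint64_t fe[5];                      /* e.g. five 51-bit limbs  */
void fe_add(fe r, const fe a, const fe b);   /* r = a + b               */
void fe_sub(fe r, const fe a, const fe b);   /* r = a - b               */
void fe_mul(fe r, const fe a, const fe b);   /* r = a * b               */
void fe_sqr(fe r, const fe a);               /* r = a^2                 */
void fe_mul_a24(fe r, const fe a);           /* r = 121665 * a          */

/* One X25519 Montgomery ladder step, RFC 7748 notation. */
static void ladder_step(fe x2, fe z2, fe x3, fe z3, const fe x1)
{
    fe a, aa, b, bb, e, c, d, da, cb, t;

    fe_add(a, x2, z2);      /* A  = x2 + z2              */
    fe_sqr(aa, a);          /* AA = A^2                  */
    fe_sub(b, x2, z2);      /* B  = x2 - z2              */
    fe_sqr(bb, b);          /* BB = B^2                  */
    fe_sub(e, aa, bb);      /* E  = AA - BB              */
    fe_add(c, x3, z3);      /* C  = x3 + z3              */
    fe_sub(d, x3, z3);      /* D  = x3 - z3              */
    fe_mul(da, d, a);       /* DA = D * A                */
    fe_mul(cb, c, b);       /* CB = C * B                */

    fe_add(t, da, cb);
    fe_sqr(x3, t);          /* x3 = (DA + CB)^2          */
    fe_sub(t, da, cb);
    fe_sqr(t, t);
    fe_mul(z3, x1, t);      /* z3 = x1 * (DA - CB)^2     */

    fe_mul(x2, aa, bb);     /* x2 = AA * BB              */
    fe_mul_a24(t, e);
    fe_add(t, aa, t);
    fe_mul(z2, e, t);       /* z2 = E * (AA + a24 * E)   */
}
```

That's 5 multiplications, 4 squarings and a multiplication by a24 per scalar bit, all on one fairly tight dependency chain, which is where the latency concern above comes from.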
Thank you very much!
Yes, I already suspected that in its current state there might not be much performance gain from simply porting the existing assembly implementation, judging by a comparison with x86_64 performance. According to my results, speed X25519 runs at 199k cycles per encaps (cpe) on Graviton4 and at 219k cpe on Intel Sapphire Rapids (and at 198.5k cpe on a brand-new Arrow Lake laptop), where the Intel processors make use of the assembly implementation and the Arm ones don't. However, I looked at the stack trace (of the full speed X25519, though, so not only encaps) and realised that for x86_64 the assembly functions (fe51_mul and fe51_sq) were called only about 50% of the time - the rest of the time is mostly taken by the C-implemented fe_mul and fe_sq (called from ge_scalarmult_base). Anyway, I suppose this is specific to OpenSSL and not worth discussing here.
Thank you for your suggestions - I currently lack a deep understanding of the algorithms involved, but I will definitely keep them in mind if I take the route of trying to optimise further. Regarding the ladder step - is it by any chance what's described here? I also found this relatively recent work on the subject, but haven't studied it yet.
I am currently looking into porting the Poly1305 NEON implementation to SVE/SVE2, to try and make use of wider registers (where available). This is because, while very performant, it still underperforms compared to the AVX2 and AVX512 implementations (of which I am sure you are aware). I don't hope to squeeze any more performance out of 128-bit registers, but Graviton3 and A64FX, for example, have 256-bit SVE registers, so I thought it's worth a shot.
First of all, originally the module in question was primarily about improving DH key exchange. This doesn't necessarily translate to other algorithms. Well, it demonstrably doesn't, since you mention the generic fe_mul and fe_sqr. One can naturally expand the scope to utilize the faster field operations in other algorithms, so maybe you should concentrate on that first. Though this will naturally improve performance across the board, which might not align with your goals, who knows ;-)
for x86_64 the assembly functions (fe51_mul and fe51_sq)
Then something is wrong, fe64 subroutines should have been called; those provide significantly better performance on x86_64. As for compiler-generated fe51 on ARM64: the compiler has no problem generating adequate code, because it doesn't have to juggle as limited a set of registers as on x86_64. The reason for going to assembly for fe64 is that it's a problem to make the compiler build carry chains. But, as already mentioned, it's not a given that it's the best option universally. It's a balancing act...
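To illustrate the carry-chain point, below is the kind of word-by-vector multiply-accumulate a radix-2^64 field multiplication is assembled from (names made up for the example). In hand-written fe64 the running carry stays in the flags, MUL/ADC or MULX/ADCX/ADOX on x86_64, MUL/UMULH/ADCS on ARM64; coaxing a compiler into emitting an equivalent flag-carried chain from C like this is the hard part:

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* assumes a 64-bit compiler with __int128 */

/* r[0..3] += a[0..3] * b, returning the carry-out word.  Four such
 * passes (plus reduction) make up a radix-2^64 multiplication modulo
 * 2^255-19.  The sum below never overflows 128 bits, but whether the
 * compiler keeps the carry in the flags is another matter. */
static uint64_t mul1_acc(uint64_t r[4], const uint64_t a[4], uint64_t b)
{
    u128 acc = 0;

    for (int i = 0; i < 4; i++) {
        acc += (u128)a[i] * b + r[i];
        r[i] = (uint64_t)acc;
        acc >>= 64;
    }
    return (uint64_t)acc;
}
```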
By "ladder" I meant the inner loop in the x25519_scalar_mul[x].
Then something is wrong, fe64 subroutines should have been called
I am sorry, you are absolutely right - the fe51 routines were called on Arm (which has no assembly implementation) - I looked at the wrong perf report!
One can naturally expand the scope to utilize the faster field operations in other algorithms, so maybe you should concentrate on that first. Though this will naturally improve performance across the board, which might not align with your goals, who knows ;-)
That would be perfectly in line with my goals as long as it's on Arm :) Thank you, I will at some point definitely look into this!
The compiler has no problem generating adequate code, because it doesn't have to juggle as limited a set of registers as on x86_64. The reason for going to assembly for fe64 is that it's a problem to make the compiler build carry chains.
I see... Thank you very much for the valuable insights!
As for Poly1305. It looks like Graviton3 has 4 128-bit vector units and it simply takes a pair of them to execute a 256-bit instruction. This means that a 128-bit implementation that fully utilizes 4 128-bit units won't run faster if ported to 256-bit SVE. The natural question in this context is whether the NEON Poly1305 can utilize 4 128-bit units. It can. Well, there is more to it than noting how many dependency chains the implementation has; instruction latency can mess up the assessment. What do you measure with the current NEON implementation on Graviton3?
Just in case: this is not to say that targeting SVE makes no sense, just that it's not a given that it will do magic specifically on Graviton3.
It looks like Graviton3 has 4 128-bit vector units and it simply takes a pair of them to execute a 256-bit instruction.
Oh, I was unaware of that... It makes sense... Yes, from what I learned about your NEON implementation, it should be able to make use of up to 5 units... Moreover, to my current understanding, refactoring the implementation to make use of 256-bit vectors would incur a higher overhead (as one would need to use a 4-fold parallelisation of Horner's rule (well, an unrolled version of it)), and if Graviton3 effectively has half as many 256-bit registers, that might further reduce the potential efficiency gains? But it could still be an interesting experiment, and there is also A64FX to try (although I won't be surprised if the situation is similar there), so, as you say, targeting SVE is still meaningful.
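Just to spell out the 4-fold parallelisation I mean (the standard interleaving, so please correct me if the NEON code does something different): for $n$ message blocks $m_1,\dots,m_n$ with $4 \mid n$, the Poly1305 sum can be regrouped into four independent Horner chains, each stepped by $r^4$:

$$
h \equiv \sum_{i=1}^{n} m_i\, r^{\,n+1-i} \equiv H_1 r^4 + H_2 r^3 + H_3 r^2 + H_4 r \pmod{2^{130}-5},
\qquad
H_j = \sum_{k=0}^{n/4-1} m_{j+4k}\,\bigl(r^4\bigr)^{\,n/4-1-k}.
$$

The extra cost is in precomputing the powers of $r$ up to $r^4$, in the final recombination, and in the longer tail handling - that's the overhead I had in mind.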
What do you measure with the current NEON implementation on Graviton3?
I must say I've been lazy so far and simply measured the performance (in cpb) of speed -evp ChaCha20-Poly1305 minus that of speed -evp ChaCha20 - but I got 1.4 cpb on ThunderX2, a very similar result to your 1.36 number on the same arch. You might be interested to know that it improves to 0.6 cpb on Graviton3, but this still loses to 0.4 cpb on Arrow Lake (AVX2) and 0.25 cpb on Sapphire Rapids (AVX512), although it might be that this subtraction method is not accurate enough - I should write a dedicated benchmark... Still, I was thinking I should address SVE/SVE2 since there are implementations for AVX2 and AVX512.
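Something like the sketch below is what I have in mind for a dedicated measurement: a minimal program (assuming OpenSSL 3.x's EVP_MAC interface, built with -lcrypto) that runs raw Poly1305 over a fixed 64 KB buffer and reports GB/s; cpb is then the core clock in GHz divided by that figure, which of course assumes a fixed clock frequency:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <openssl/evp.h>

#define BUF_SIZE (64 * 1024)
#define ITERS    20000

int main(void)
{
    static unsigned char buf[BUF_SIZE];
    unsigned char key[32], tag[16];
    size_t taglen = 0;
    struct timespec t0, t1;

    EVP_MAC *mac = EVP_MAC_fetch(NULL, "POLY1305", NULL);
    EVP_MAC_CTX *ctx = EVP_MAC_CTX_new(mac);
    if (mac == NULL || ctx == NULL)
        return 1;

    memset(buf, 0xa5, sizeof(buf));
    memset(key, 0x42, sizeof(key));   /* Poly1305 key is 32 bytes (r || s) */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        /* one-shot MAC per iteration; re-keying is negligible vs. 64 KB */
        EVP_MAC_init(ctx, key, sizeof(key), NULL);
        EVP_MAC_update(ctx, buf, sizeof(buf));
        EVP_MAC_final(ctx, tag, &taglen, sizeof(tag));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)BUF_SIZE * ITERS;

    /* cycles/byte = clock [GHz] / throughput [GB/s] */
    printf("%.2f GB/s\n", bytes / secs / 1e9);

    EVP_MAC_CTX_free(ctx);
    EVP_MAC_free(mac);
    return 0;
}
```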
What do you measure with the current NEON implementation on Graviton3?
0.6 cpb on Graviton3,
Thanks. Apple M1 and Snapdragon X deliver ~0.5; 0.6 is a tad worse, but it sounds like it engages more than 3 units...
but this still loses to 0.4 cpb on Arrow Lake (AVX2) and 0.25 cpb on Sapphire Rapids (AVX512),
Intel AVX2 processors customarily have 3 full-width 256-bit units, 50% more compute power in comparison, so 0.6 vs. 0.4 is totally expected. AVX512 [implementation] on the other hand utilizes a larger radix, which means it takes a smaller number of operations, so it's apples vs. oranges.
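For the record, the back-of-the-envelope arithmetic behind that, under the (rough) assumption that the loop is bound by vector multiply-accumulate throughput: Graviton3 pushes 4 × 128 = 512 bits per cycle through its vector units, a 3-unit AVX2 core pushes 3 × 256 = 768, and 0.6 cpb × 512/768 = 0.4 cpb, which is what you measure.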
Intel AVX2 processors customarily have 3 full-width 256-bit units, 50% more compute power in comparison, so 0.6 vs. 0.4 is totally expected.
Though Arrow Lake has 4 256-bit units, so it's 2x more compute power in comparison. Either way, the difference should not be a surprise, nor can you expect a processor with fewer resources to be as fast.
Right... Going by this analysis I am indeed not very hopeful of seeing any better numbers with SVE on existing Arm architectures, but I guess it would still be interesting to see, and maybe useful in the future...
AVX512 [implementation] on the other hand utilizes a larger radix, which means it takes a smaller number of operations, so it's apples vs. oranges.
Do you think one should be doing a similar thing with SVE? I still need to study the AVX2 and AVX512 implementations to understand this better though.
AVX512 [implementation] on the other hand utilizes a larger radix, which means it takes a smaller number of operations, so it's apples vs. oranges.
Do you think one should be doing a similar thing with SVE?
The larger radix in the AVX512 implementation is enabled by new instructions, specifically 52-bit IFMA. In other words, it's not about what you want, but about what you actually can do. Maybe SVE2 can facilitate an even larger radix? 1900 pages... You tell me :-)
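To give a rough sense of what the larger radix buys: the pre-IFMA vector code works in base 2^26, five 26-bit limbs for the 130-bit accumulator, since a 26×26-bit lane product leaves headroom in a 64-bit lane for the additions, so one h·r step costs on the order of 5×5 = 25 lane multiplications. VPMADD52LUQ/VPMADD52HUQ take 52-bit inputs and accumulate the low/high 52 bits of the 104-bit product, which permits three limbs of 44+44+42 bits (base 2^44) and on the order of 3×3 = 9 limb products per step. In other words, "larger radix" means fewer limbs and fewer partial products; the question for SVE2 is whether any of its multiply instructions let you drop below five limbs.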
Right, I can see the specialized routines using vpmadd52luq and vpmadd52huq... Quick googling suggests that SVE2 lacks direct equivalents of those :/ So maybe no luck there.
Another bummer - apparently SVE lacks widening multiplies to 64-bit elements (umull/umlal equivalents) - only SVE2 introduces those :( So there is probably no point porting to SVE, only to SVE2, for which there are (so far) no 256-bit hardware implementations.