ARM assembly support?
I'm attempting to work on a PoC using mobile devices and wondering if anyone has attempted to translate the AVX2 code to something compatible with ARM64 CPUs?
Tom Leavy [email protected] wrote:
I'm attempting to work on a PoC using mobile devices and wondering if anyone has attempted to translate the AVX2 code to something compatible with ARM64 CPUs?
Not yet as far as I know, but an ARM-NEON implementation is certainly on our TODO list.
Don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process. I'm assuming an instruction-by-instruction port would be fine, unless there are any particular nuances I should know about?
Tom Leavy [email protected] wrote:
Don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process.
That would of course be great and highly appreciated!
I'm assuming an instruction-by-instruction port would be fine, unless there are any particular nuances I should know about?
That won't quite work. The AVX2 implementation uses 256-bit vector registers, whereas ARM-NEON registers are only 128 bits long. Also, the instructions operating on those registers are somewhat different, in particular the multiplication instructions.
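For illustration (a minimal sketch, not from either code base): AVX2 offers a direct 16-bit multiply-high instruction, while on NEON the same result has to be assembled from widening multiplies, or approximated with the doubling vqdmulhq_s16 discussed later in this thread.

```c
#include <arm_neon.h>

/* AVX2 has vpmulhw (_mm256_mulhi_epi16): sixteen signed 16-bit products,
 * keeping the high halves. NEON has no direct 128-bit equivalent; one way
 * to emulate it is to widen to 32 bits and narrow back: */
static inline int16x8_t mulhi_s16(int16x8_t a, int16x8_t b)
{
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));   /* lanes 0-3 */
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b)); /* lanes 4-7 */
    return vcombine_s16(vshrn_n_s32(lo, 16), vshrn_n_s32(hi, 16));
}
```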
Hi @cryptojedi and @tomleavy,
My NEON implementation is at https://github.com/cothan/kyber/tree/round3. The NTT part is sped up by 3x; the overall speedup is 1.5x. I'm close to finishing the NEON implementation; the critical component is Keccak. I have NEON code written for Keccak, but my preliminary results show only a marginal speedup.
This is my NEON benchmark on ARMv8, a Raspberry Pi 4 (8 GB, 1.9 GHz, overclocked), covering just the polynomial multiplication module in Kyber.
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 115 cycles/ticks
kyber_keypair:
median: 7162 cycles/ticks
average: 7185 cycles/ticks
kyber_encaps:
median: 8207 cycles/ticks
average: 8247 cycles/ticks
kyber_decaps:
median: 8313 cycles/ticks
average: 8342 cycles/ticks
And this is the reference code on the same platform:
NTT:
median: 281 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 11160 cycles/ticks
average: 11198 cycles/ticks
kyber_encaps:
median: 12835 cycles/ticks
average: 12877 cycles/ticks
kyber_decaps:
median: 14646 cycles/ticks
average: 14696 cycles/ticks
Good stuff @cothan. Will gladly use your version when it's ready.
cothan [email protected] wrote:
Hi @cryptojedi and @tomleavy,
Hi @cothan,
My NEON implementation is at https://github.com/cothan/kyber/tree/round3. The NTT part is sped up by 6x; the overall speedup is 1.5x. I'm close to finishing the NEON implementation; the critical component is Keccak. I have NEON code written for Keccak, but my preliminary results show only a marginal speedup.
This sounds great! Would you be willing to send a PR to upstream?
Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?
This is my NEON benchmark on ARMv8, a Raspberry Pi 4 (8 GB, 1.9 GHz), covering just the polynomial multiplication module in Kyber.
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 115 cycles/ticks
kyber_keypair:
median: 7162 cycles/ticks
average: 7185 cycles/ticks
kyber_encaps:
median: 8207 cycles/ticks
average: 8247 cycles/ticks
kyber_decaps:
median: 8313 cycles/ticks
average: 8342 cycles/ticks
And this is the reference code on the same platform:
NTT:
median: 281 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 11160 cycles/ticks
average: 11198 cycles/ticks
kyber_encaps:
median: 12835 cycles/ticks
average: 12877 cycles/ticks
kyber_decaps:
median: 14646 cycles/ticks
average: 14696 cycles/ticks
Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?
@cryptojedi
It's my bad, I've edited the message above. The NTT speedup is only 3x and the overall speedup is only 1.5x.
I'm at the very last step for Kyber: finishing the NEON rejection sampling (reject_sampling). There are a few if-else conditions, and I'm not sure how to handle the permutation part.
Anyway, when I finish the last step, I'm happy to create a PR to upstream.
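For context, the scalar rejection sampling looks roughly like the reference-style sketch below (not cothan's actual code); the data-dependent r[ctr++] writes are exactly the if-else conditions that a NEON version has to replace with a table-driven permutation/compaction step.

```c
#include <stdint.h>
#define KYBER_Q 3329

/* Reference-style rejection sampling: parse two 12-bit candidates from
 * every 3 bytes of SHAKE output and keep those below q. */
static unsigned int rej_uniform(int16_t *r, unsigned int len,
                                const uint8_t *buf, unsigned int buflen)
{
    unsigned int ctr = 0, pos = 0;
    while (ctr < len && pos + 3 <= buflen) {
        uint16_t val0 = ((buf[pos+0] >> 0) | ((uint16_t)buf[pos+1] << 8)) & 0xFFF;
        uint16_t val1 = ((buf[pos+1] >> 4) | ((uint16_t)buf[pos+2] << 4)) & 0xFFF;
        pos += 3;
        if (val0 < KYBER_Q)
            r[ctr++] = val0;               /* data-dependent write */
        if (ctr < len && val1 < KYBER_Q)
            r[ctr++] = val1;               /* data-dependent write */
    }
    return ctr;
}
```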
So I've mostly finished my implementation. There are two functions left, poly_tomsg and poly_frommsg. They won't affect the total execution time much, but they still need to be implemented.
I'm happy if anyone can help.
Okay, let's see the benchmark for ./test_kyber1024 in neon/ vs ref.
NEON
gen_a:
median: 2950 cycles/ticks
average: 2983 cycles/ticks
neon_poly_getnoise_eta1_2x/2:
median: 138 cycles/ticks
average: 140 cycles/ticks
neon_poly_getnoise_eta2:
median: 73 cycles/ticks
average: 73 cycles/ticks
poly_tomsg:
median: 16 cycles/ticks
average: 17 cycles/ticks
poly_frommsg:
median: 6 cycles/ticks
average: 6 cycles/ticks
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 116 cycles/ticks
kyber_keypair:
median: 5917 cycles/ticks
average: 5960 cycles/ticks
kyber_encaps:
median: 7137 cycles/ticks
average: 7168 cycles/ticks
kyber_decaps:
median: 7171 cycles/ticks
average: 7206 cycles/ticks
kex_uake_initA:
median: 13062 cycles/ticks
average: 13141 cycles/ticks
kex_uake_sharedB:
median: 14374 cycles/ticks
average: 14437 cycles/ticks
kex_uake_sharedA:
median: 7224 cycles/ticks
average: 7265 cycles/ticks
kex_ake_initA:
median: 13059 cycles/ticks
average: 13128 cycles/ticks
kex_ake_sharedB:
median: 21521 cycles/ticks
average: 21622 cycles/ticks
kex_ake_sharedA:
median: 14408 cycles/ticks
average: 14487 cycles/ticks
C_REF
gen_a:
median: 3748 cycles/ticks
average: 3774 cycles/ticks
poly_getnoise_eta1:
median: 69 cycles/ticks
average: 68 cycles/ticks
poly_getnoise_eta2:
median: 69 cycles/ticks
average: 69 cycles/ticks
poly_tomsg:
median: 61 cycles/ticks
average: 60 cycles/ticks
poly_frommsg:
median: 6 cycles/ticks
average: 5 cycles/ticks
NTT:
median: 280 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 10500 cycles/ticks
average: 10537 cycles/ticks
kyber_encaps:
median: 12226 cycles/ticks
average: 12280 cycles/ticks
kyber_decaps:
median: 13788 cycles/ticks
average: 13833 cycles/ticks
kex_uake_initA:
median: 22755 cycles/ticks
average: 22854 cycles/ticks
kex_uake_sharedB:
median: 26199 cycles/ticks
average: 26321 cycles/ticks
kex_uake_sharedA:
median: 13870 cycles/ticks
average: 13944 cycles/ticks
kex_ake_initA:
median: 22750 cycles/ticks
average: 22843 cycles/ticks
kex_ake_sharedB:
median: 38417 cycles/ticks
average: 38825 cycles/ticks
kex_ake_sharedA:
median: 27555 cycles/ticks
average: 27771 cycles/ticks
Benchmark on ARMv8, Raspberry Pi 4 (8 GB, 1.9 GHz, overclocked).
My SHA3 implementation can be seen here: https://github.com/cothan/NEON-SHA3_2x. Too bad, the speedup is minimal: NEON Keccak-2x is only 2%-5% faster than the C reference implementation (which is compiled without -fno-tree-vectorize, i.e. with auto-vectorization enabled).
https://github.com/cothan/kyber/tree/round3
Next week I will create a PR.
cothan [email protected] wrote:
So I've mostly finished my implementation. There are two functions left, poly_tomsg and poly_frommsg. They won't affect the total execution time much, but they still need to be implemented.
Is there a reason for not using the implementations from the "ref" implementation?
Hi Peter,
Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).
Cheers, Gregor
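For context, the reference poly_tomsg under discussion is essentially a round-to-one-bit loop over the coefficients; a sketch along the lines of the round-3 reference code (assuming coefficients are already reduced to [0, q)):

```c
#include <stdint.h>
#define KYBER_N 256
#define KYBER_Q 3329

typedef struct { int16_t coeffs[KYBER_N]; } poly;

/* Round each coefficient to a single bit and pack 8 bits per byte.
 * The AVX2 code performs this rounding 16 coefficients at a time,
 * hence the roughly 10x gap mentioned above. */
static void poly_tomsg(uint8_t msg[KYBER_N/8], const poly *a)
{
    for (unsigned int i = 0; i < KYBER_N/8; i++) {
        msg[i] = 0;
        for (unsigned int j = 0; j < 8; j++) {
            uint16_t t = ((((uint16_t)a->coeffs[8*i+j] << 1) + KYBER_Q/2)
                          / KYBER_Q) & 1;
            msg[i] |= t << j;
        }
    }
}
```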
Gregor Seiler [email protected] wrote:
Hi Peter,
Hi Gregor,
Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).
Oh, didn't mean to say that we don't want an optimized implementation eventually; I was just wondering if there is any reason that the existing C implementation wouldn't work.
Cheers,
Peter
Hello @tomleavy @cryptojedi and @gregorseiler,
We have made a fork of the Kyber repo and made an ARM32 and ARM64 implementation of Kyber: https://github.com/BeechatNetworkSystemsLtd/Kyber-ARM
We have also made a Java wrapper for Kyber: https://github.com/BeechatNetworkSystemsLtd/JNIPQC
And we are currently working on Python bindings for Kyber as well.
We hope this helps and contributes to the Kyber project and to its implementation on multiple platforms.
Cheers,
The Beechat Network team
After iterating on the code, my NEON ARMv8 implementation is finally complete. I also have some benchmarks on the Apple M1; the speedup there is a bit better.
Kyber M1 NEON (cycles) | neon Level 1 | neon Level 3 | neon Level 5 |
---|---|---|---|
gen_matrix | 7,680 | 17,944 | 30,743 |
neon_ntt | 413 | 413 | 413 |
neon_invntt | 428 | 428 | 428 |
crypto_kem_keypair | 22,958 | 36,342 | 55,951 |
crypto_kem_enc | 32,522 | 49,134 | 71,579 |
crypto_kem_dec | 29,412 | 45,100 | 67,026 |
Kyber M1 REF (cycles) | ref Level 1 | ref Level 3 | ref Level 5 |
---|---|---|---|
gen_matrix | 11,989 | 26,894 | 47,558 |
ref_ntt | 3,171 | 3,223 | 3,217 |
ref_invntt | 5,171 | 5,118 | 5,148 |
crypto_kem_keypair | 59,622 | 105,058 | 163,075 |
crypto_kem_enc | 76,513 | 120,766 | 175,568 |
crypto_kem_dec | 90,254 | 138,813 | 198,509 |
EDIT: the unit is cycles.
I think this NEON version is ready. What else should I do to make a pull request?
@cothan Could you also report the cycle counts for your latest code running on your Raspberry Pi?
Kyber A72 NEON (cycles) | neon Level 1 | neon Level 3 | neon Level 5 |
---|---|---|---|
gen_matrix | 27,152 | 59,800 | 108,518 |
neon_poly_getnoise_eta1_2x | 9,247 | 4,422 | 4,416 |
neon_poly_getnoise_eta2 | 2,679 | 2,679 | 2,683 |
poly_tomsg | 976 | 979 | 980 |
poly_frommsg | 810 | 816 | 815 |
neon_ntt | 1,496 | 1,473 | 1,476 |
neon_invntt | 1,659 | 1,661 | 1,657 |
crypto_kem_keypair | 72,015 | 116,390 | 184,965 |
crypto_kem_enc | 95,287 | 150,959 | 223,786 |
crypto_kem_dec | 94,069 | 149,845 | 220,671 |
Kyber A72 REF (cycles) | ref Level 1 | ref Level 3 | ref Level 5 |
---|---|---|---|
gen_matrix | 28,921 | 70,117 | 127,776 |
ref_poly_getnoise_eta1 | 4,397 | 3,317 | 3,315 |
ref_poly_getnoise_eta2 | 3,311 | 3,314 | 3,317 |
poly_tomsg | 976 | 979 | 979 |
poly_frommsg | 810 | 815 | 817 |
ref_ntt | 8,496 | 8,500 | 8,551 |
ref_invntt | 12,530 | 12,533 | 12,409 |
crypto_kem_keypair | 136,934 | 237,601 | 371,906 |
crypto_kem_enc | 184,533 | 298,928 | 440,645 |
crypto_kem_dec | 223,359 | 349,122 | 503,820 |
Yes, sure. I collected the results using a patch of PAPI for the Cortex-A72: https://github.com/cothan/PAPI_ARMv8_Cortex_A72
Note: neon_poly_getnoise_eta1_2x includes the NEON SHA3_2x implementation on the Cortex-A72; when comparing with ref_poly_getnoise_eta1 at Level 3 and Level 5, you have to divide by 2 (the _2x version samples two noise polynomials at once).
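For illustration, a minimal sketch of how cycle counts can be read with PAPI (the actual benchmarking harness in the repo may differ):

```c
#include <papi.h>
#include <stdio.h>

int main(void)
{
    /* Initialize PAPI; PAPI_library_init returns the version on success. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    long long t0 = PAPI_get_real_cyc();  /* cycle counter before */
    /* ... code under test, e.g. the NTT, would run here ... */
    long long t1 = PAPI_get_real_cyc();  /* cycle counter after */

    printf("cycles: %lld\n", t1 - t0);
    return 0;
}
```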
@cothan which OS and compiler did you use on the Raspberry Pi?
Hi @marco-palumbi,
Here is the configuration on my RPi 4:
cothan@manjaro-rp4 ~> uname -a
Linux manjaro-rp4 5.10.20-1-MANJARO-ARM #1 SMP PREEMPT Thu Mar 4 14:36:16 CST 2021 aarch64 GNU/Linux
cothan@manjaro-rp4 ~> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --enable-languages=c,c++,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC)
cothan@manjaro-rp4 ~> clang -v
clang version 11.1.0
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Selected GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
I use Clang across all my results.
Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm; it's great to see more interest and progress in this area.
Note that you can further improve performance by using the fixed-point doubling multiply-high VQDMULH to implement @gregorseiler's approach to Montgomery multiplication from Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography (use the doubling VQDMULH for the high multiply, VMUL for the low multiply, and the halving subtract VHSUB to compensate for the doubling). This reduces the Montgomery multiplication sequence from 9 to 5 instructions and should be a drop-in replacement for your fqmul macro. If you further precompute twisted constants following NTTRU: Truly Fast NTRU Using NTT, you'll get to 4 instructions.
Finally, you can use the fixed-point multiply-high-accumulate VQRDMLAH (v8.1-A onwards) to get to 3 instructions, but this is quite a bit trickier because of the carry-in mentioned in @gregorseiler's paper, as well as the rounding, the doubling, and especially the impacted range analysis in the case of the Kyber prime.
If you publish or present your code with those modifications, a mention would be appreciated. A citable paper on the use of fixed-point instructions for modular arithmetic (which also applies to MVE, SVE, and SVE2) will hopefully appear shortly.
Hi @hanno-arm,
Yes, you're right; that is the multiply instruction I was looking for. I'm fully aware that a specialty of AVX2 is that it can multiply high or low in a single instruction. I looked for a similar instruction in the past but, due to time limits, didn't learn much about the rounding and saturating variants.
I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you have solved my bottleneck in fqmul. Please let me know about your paper after you publish it; I am happy to cite it.
I have a precise Kyber range analysis that improves upon Bas Westerbaan's paper When to Barrett reduce in the inverse NTT, and I hope I can apply it to the new approach.
Thank you very much. I will definitely mention you and cite your work.
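For reference, the scalar Barrett reduction that this range analysis concerns looks roughly like the following sketch of the reference-style code (v approximates 2^26/q):

```c
#include <stdint.h>
#define KYBER_Q 3329

/* Barrett reduction: given 16-bit a, return a representative congruent
 * to a mod q. t approximates round(a/q), so a - t*q is close to a mod q. */
static int16_t barrett_reduce(int16_t a)
{
    const int16_t v = ((1 << 26) + KYBER_Q/2)/KYBER_Q;  /* rounded 2^26/q */
    int16_t t = ((int32_t)v*a + (1 << 25)) >> 26;       /* t ~ round(a/q) */
    return a - t*KYBER_Q;
}
```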
I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you have solved my bottleneck in fqmul.
Great! Here's a complete patch, by the way:
```c
#define fqmul(out, in, zeta, t)                                               \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);      /* (2*a)_H */          \
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);         /* a_L */              \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);         /* a_L = a_L * QINV */ \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq); /* (2*a_L*Q)_H */ \
    out = vhsubq_s16(t.val[0], t.val[3]);  /* ((2*a)_H - (2*a_L*Q)_H)/2 */
```
This, of course, doesn't yet implement constant merging, though.
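As a hypothetical usage illustration (the names butterfly, neon_qinv, and neon_kyberq are assumptions, not part of the patch), the macro would slot into a Cooley-Tukey butterfly like this:

```c
#include <arm_neon.h>

/* Hypothetical CT butterfly built on the fqmul macro above; neon_qinv and
 * neon_kyberq are assumed to be preloaded constant vectors. */
static inline void butterfly(int16x8_t *a, int16x8_t *b, int16x8_t zeta,
                             int16x8_t neon_qinv, int16x8_t neon_kyberq)
{
    int16x8x4_t t;
    int16x8_t u;
    fqmul(u, *b, zeta, t);     /* u = (*b) * zeta * R^{-1} mod q */
    *b = vsubq_s16(*a, u);     /* b' = a - u */
    *a = vaddq_s16(*a, u);     /* a' = a + u */
}
```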
It would be interesting to see what difference it actually makes on various CPUs. Since the bottleneck may be the boilerplate around the multiplication sequence rather than the sequence itself, the gain might not be as large as one would expect from the mere instruction count.
Please let me know your paper after you publish it, I am happy to cite it.
Yes, will do!
I'm fully aware that a specialty of AVX2 is that it can multiply high or low in a single instruction.
Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber.
When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddles into a single vector and using lane-indexed instructions? This should work for layers 0-4 and save you some instructions (and free up some vectors).
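For illustration, a hypothetical lane-indexed variant of the fqmul patch above (assuming the same neon_qinv/neon_kyberq constants; lane must be a compile-time constant 0..7):

```c
#include <arm_neon.h>

/* Hypothetical sketch: eight twiddles packed into z and selected by lane,
 * instead of duplicating each twiddle across a whole vector. */
#define fqmul_lane(out, in, z, lane, t)                                       \
    t.val[0] = (int16x8_t)vqdmulhq_laneq_s16(in, z, lane); /* (2*a)_H */      \
    t.val[1] = (int16x8_t)vmulq_laneq_s16(in, z, lane);    /* a_L */          \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);             /* a_L * QINV */   \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq); /* (2*a_L*Q)_H */ \
    out = vhsubq_s16(t.val[0], t.val[3]);  /* ((2*a)_H - (2*a_L*Q)_H)/2 */
```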
@cothan hello, is this "neon" version sensitive to big-endian vs little-endian mode? https://github.com/cothan/kyber/tree/round3
Hi @Ruofei-Li,
I have only tested my code on the A72 and Apple M1, so it uses the default endianness on those platforms (whatever that is). Feel free to open an issue over there if you find a bug, or to add support for the other endianness.