ARM assembly support?
I'm attempting to work on a PoC using mobile devices and wondering if anyone has attempted to translate the AVX2 code to something compatible with ARM64 CPUs?
Tom Leavy [email protected] wrote:
I'm attempting to work on a PoC using mobile devices and wondering if anyone has attempted to translate the AVX2 code to something compatible with ARM64 CPUs?
Not yet as far as I know, but an ARM-NEON implementation is certainly on our TODO list.
Don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process. I'm assuming an instruction-by-instruction port would be fine, unless there are any particular nuances I should know about?
Tom Leavy [email protected] wrote:
Don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process.
That would of course be great and highly appreciated!
I'm assuming an instruction-by-instruction port would be fine, unless there are any particular nuances I should know about?
That won't quite work. The AVX2 implementation uses 256-bit vector registers, whereas ARM-NEON registers are only 128 bits long. Also, the instructions operating on those registers are somewhat different, in particular the multiplication instructions.
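For illustration (a minimal sketch, not from either code base): AVX2 offers a direct 16-bit multiply-high instruction, while on NEON the same result has to be assembled from widening multiplies, or approximated with the doubling vqdmulhq_s16 discussed later in this thread.

```c
#include <arm_neon.h>

/* AVX2 has vpmulhw (_mm256_mulhi_epi16): sixteen signed 16-bit products,
 * keeping the high halves. NEON has no direct 128-bit equivalent; one way
 * to emulate it is to widen to 32 bits and narrow back: */
static inline int16x8_t mulhi_s16(int16x8_t a, int16x8_t b)
{
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));   /* lanes 0-3 */
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b)); /* lanes 4-7 */
    return vcombine_s16(vshrn_n_s32(lo, 16), vshrn_n_s32(hi, 16));
}
```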
Hi @cryptojedi and @tomleavy,
My NEON implementation is at https://github.com/cothan/kyber/tree/round3. The NTT part is sped up by 3x; the overall speedup is 1.5x. I'm close to finishing the NEON implementation; the critical component is Keccak. I have NEON code written for Keccak, but my preliminary results show only a marginal speedup.
This is my NEON benchmark on ARMv8, a Raspberry Pi 4 (8 GB, 1.9 GHz, overclocked), covering just the polynomial multiplication module in Kyber.
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 115 cycles/ticks
kyber_keypair:
median: 7162 cycles/ticks
average: 7185 cycles/ticks
kyber_encaps:
median: 8207 cycles/ticks
average: 8247 cycles/ticks
kyber_decaps:
median: 8313 cycles/ticks
average: 8342 cycles/ticks
And this is the reference code on the same platform:
NTT:
median: 281 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 11160 cycles/ticks
average: 11198 cycles/ticks
kyber_encaps:
median: 12835 cycles/ticks
average: 12877 cycles/ticks
kyber_decaps:
median: 14646 cycles/ticks
average: 14696 cycles/ticks
Good stuff @cothan. Will gladly use your version when it's ready.
cothan [email protected] wrote:
Hi @cryptojedi and @tomleavy,
Hi @cothan,
My NEON implementation is at https://github.com/cothan/kyber/tree/round3. The NTT part is sped up by 6x; the overall speedup is 1.5x. I'm close to finishing the NEON implementation; the critical component is Keccak. I have NEON code written for Keccak, but my preliminary results show only a marginal speedup.
This sounds great! Would you be willing to send a PR to upstream?
Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?
This is my NEON benchmark on ARMv8, a Raspberry Pi 4 (8 GB, 1.9 GHz), covering just the polynomial multiplication module in Kyber.
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 115 cycles/ticks
kyber_keypair:
median: 7162 cycles/ticks
average: 7185 cycles/ticks
kyber_encaps:
median: 8207 cycles/ticks
average: 8247 cycles/ticks
kyber_decaps:
median: 8313 cycles/ticks
average: 8342 cycles/ticks
And this is the reference code on the same platform:
NTT:
median: 281 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 11160 cycles/ticks
average: 11198 cycles/ticks
kyber_encaps:
median: 12835 cycles/ticks
average: 12877 cycles/ticks
kyber_decaps:
median: 14646 cycles/ticks
average: 14696 cycles/ticks
Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?
@cryptojedi
It's my bad, I've edited the message above. The NTT speedup is only 3x and the overall speedup is only 1.5x.
I'm at the very last step for Kyber: finishing the NEON rejection sampling (reject_sampling). There are a few if-else conditions, and I'm not sure how to handle the permutation part.
Anyway, when I finish the last step, I'm happy to create a PR to upstream.
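For context, the scalar rejection sampling looks roughly like the reference-style sketch below (not cothan's actual code); the data-dependent r[ctr++] writes are exactly the if-else conditions that a NEON version has to replace with a table-driven permutation/compaction step.

```c
#include <stdint.h>
#define KYBER_Q 3329

/* Reference-style rejection sampling: parse two 12-bit candidates from
 * every 3 bytes of SHAKE output and keep those below q. */
static unsigned int rej_uniform(int16_t *r, unsigned int len,
                                const uint8_t *buf, unsigned int buflen)
{
    unsigned int ctr = 0, pos = 0;
    while (ctr < len && pos + 3 <= buflen) {
        uint16_t val0 = ((buf[pos+0] >> 0) | ((uint16_t)buf[pos+1] << 8)) & 0xFFF;
        uint16_t val1 = ((buf[pos+1] >> 4) | ((uint16_t)buf[pos+2] << 4)) & 0xFFF;
        pos += 3;
        if (val0 < KYBER_Q)
            r[ctr++] = val0;               /* data-dependent write */
        if (ctr < len && val1 < KYBER_Q)
            r[ctr++] = val1;               /* data-dependent write */
    }
    return ctr;
}
```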
So I've mostly finished my implementation. There are two functions left, poly_tomsg and poly_frommsg. They won't affect the total execution time much, but they still need to be implemented.
I'm happy if anyone can help.
Okay, let's see the benchmark for ./test_kyber1024 in neon/ vs ref.
NEON
gen_a:
median: 2950 cycles/ticks
average: 2983 cycles/ticks
neon_poly_getnoise_eta1_2x/2:
median: 138 cycles/ticks
average: 140 cycles/ticks
neon_poly_getnoise_eta2:
median: 73 cycles/ticks
average: 73 cycles/ticks
poly_tomsg:
median: 16 cycles/ticks
average: 17 cycles/ticks
poly_frommsg:
median: 6 cycles/ticks
average: 6 cycles/ticks
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 116 cycles/ticks
kyber_keypair:
median: 5917 cycles/ticks
average: 5960 cycles/ticks
kyber_encaps:
median: 7137 cycles/ticks
average: 7168 cycles/ticks
kyber_decaps:
median: 7171 cycles/ticks
average: 7206 cycles/ticks
kex_uake_initA:
median: 13062 cycles/ticks
average: 13141 cycles/ticks
kex_uake_sharedB:
median: 14374 cycles/ticks
average: 14437 cycles/ticks
kex_uake_sharedA:
median: 7224 cycles/ticks
average: 7265 cycles/ticks
kex_ake_initA:
median: 13059 cycles/ticks
average: 13128 cycles/ticks
kex_ake_sharedB:
median: 21521 cycles/ticks
average: 21622 cycles/ticks
kex_ake_sharedA:
median: 14408 cycles/ticks
average: 14487 cycles/ticks
C_REF
gen_a:
median: 3748 cycles/ticks
average: 3774 cycles/ticks
poly_getnoise_eta1:
median: 69 cycles/ticks
average: 68 cycles/ticks
poly_getnoise_eta2:
median: 69 cycles/ticks
average: 69 cycles/ticks
poly_tomsg:
median: 61 cycles/ticks
average: 60 cycles/ticks
poly_frommsg:
median: 6 cycles/ticks
average: 5 cycles/ticks
NTT:
median: 280 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 10500 cycles/ticks
average: 10537 cycles/ticks
kyber_encaps:
median: 12226 cycles/ticks
average: 12280 cycles/ticks
kyber_decaps:
median: 13788 cycles/ticks
average: 13833 cycles/ticks
kex_uake_initA:
median: 22755 cycles/ticks
average: 22854 cycles/ticks
kex_uake_sharedB:
median: 26199 cycles/ticks
average: 26321 cycles/ticks
kex_uake_sharedA:
median: 13870 cycles/ticks
average: 13944 cycles/ticks
kex_ake_initA:
median: 22750 cycles/ticks
average: 22843 cycles/ticks
kex_ake_sharedB:
median: 38417 cycles/ticks
average: 38825 cycles/ticks
kex_ake_sharedA:
median: 27555 cycles/ticks
average: 27771 cycles/ticks
Benchmark on ARMv8, Raspberry Pi 4 (8 GB, 1.9 GHz, overclocked).
My SHA3 implementation can be seen here: https://github.com/cothan/NEON-SHA3_2x. Too bad, the speedup is minimal: NEON Keccak-2x is only 2%-5% faster than the C reference implementation (which is compiled without -fno-tree-vectorize, i.e. with auto-vectorization enabled).
https://github.com/cothan/kyber/tree/round3
Next week I will create a PR.
cothan [email protected] wrote:
So I've mostly finished my implementation. There are two functions left, poly_tomsg and poly_frommsg. They won't affect the total execution time much, but they still need to be implemented.
Is there a reason for not using the implementations from the "ref" implementation?
Hi Peter,
Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).
Cheers, Gregor
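For context, the reference poly_tomsg under discussion is essentially a round-to-one-bit loop over the coefficients; a sketch along the lines of the round-3 reference code (assuming coefficients are already reduced to [0, q)):

```c
#include <stdint.h>
#define KYBER_N 256
#define KYBER_Q 3329

typedef struct { int16_t coeffs[KYBER_N]; } poly;

/* Round each coefficient to a single bit and pack 8 bits per byte.
 * The AVX2 code performs this rounding 16 coefficients at a time,
 * hence the roughly 10x gap mentioned above. */
static void poly_tomsg(uint8_t msg[KYBER_N/8], const poly *a)
{
    for (unsigned int i = 0; i < KYBER_N/8; i++) {
        msg[i] = 0;
        for (unsigned int j = 0; j < 8; j++) {
            uint16_t t = ((((uint16_t)a->coeffs[8*i+j] << 1) + KYBER_Q/2)
                          / KYBER_Q) & 1;
            msg[i] |= t << j;
        }
    }
}
```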
Gregor Seiler [email protected] wrote:
Hi Peter,
Hi Gregor,
Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).
Oh, didn't mean to say that we don't want an optimized implementation eventually; I was just wondering if there is any reason that the existing C implementation wouldn't work.
Cheers,
Peter
Hello @tomleavy @cryptojedi and @gregorseiler,
We have made a fork of the Kyber repo and made an ARM32 and ARM64 implementation of Kyber: https://github.com/BeechatNetworkSystemsLtd/Kyber-ARM
We have also made a Java wrapper for Kyber: https://github.com/BeechatNetworkSystemsLtd/JNIPQC
And we are currently working on Python bindings for Kyber as well.
We hope this helps and contributes to the Kyber project and to its implementation on multiple platforms.
Cheers,
The Beechat Network team
After iterating on the code, my NEON ARMv8 implementation is finally complete. I also have some benchmarks on the Apple M1; the speedup there is a bit better.
Kyber M1 NEON (cycles) | neon Level 1 | neon Level 3 | neon Level 5 |
---|---|---|---|
gen_matrix | 7,680 | 17,944 | 30,743 |
neon_ntt | 413 | 413 | 413 |
neon_invntt | 428 | 428 | 428 |
crypto_kem_keypair | 22,958 | 36,342 | 55,951 |
crypto_kem_enc | 32,522 | 49,134 | 71,579 |
crypto_kem_dec | 29,412 | 45,100 | 67,026 |
Kyber M1 REF (cycles) | ref Level 1 | ref Level 3 | ref Level 5 |
---|---|---|---|
gen_matrix | 11,989 | 26,894 | 47,558 |
ref_ntt | 3,171 | 3,223 | 3,217 |
ref_invntt | 5,171 | 5,118 | 5,148 |
crypto_kem_keypair | 59,622 | 105,058 | 163,075 |
crypto_kem_enc | 76,513 | 120,766 | 175,568 |
crypto_kem_dec | 90,254 | 138,813 | 198,509 |
EDIT: the unit is cycles.
I think this NEON version is ready. What else should I do to make a pull request?
@cothan Could you also report the cycle counts for your latest code running on your Raspberry Pi?
Kyber A72 NEON (cycles) | neon Level 1 | neon Level 3 | neon Level 5 |
---|---|---|---|
gen_matrix | 27,152 | 59,800 | 108,518 |
neon_poly_getnoise_eta1_2x | 9,247 | 4,422 | 4,416 |
neon_poly_getnoise_eta2 | 2,679 | 2,679 | 2,683 |
poly_tomsg | 976 | 979 | 980 |
poly_frommsg | 810 | 816 | 815 |
neon_ntt | 1,496 | 1,473 | 1,476 |
neon_invntt | 1,659 | 1,661 | 1,657 |
crypto_kem_keypair | 72,015 | 116,390 | 184,965 |
crypto_kem_enc | 95,287 | 150,959 | 223,786 |
crypto_kem_dec | 94,069 | 149,845 | 220,671 |
Kyber A72 REF (cycles) | ref Level 1 | ref Level 3 | ref Level 5 |
---|---|---|---|
gen_matrix | 28,921 | 70,117 | 127,776 |
ref_poly_getnoise_eta1 | 4,397 | 3,317 | 3,315 |
ref_poly_getnoise_eta2 | 3,311 | 3,314 | 3,317 |
poly_tomsg | 976 | 979 | 979 |
poly_frommsg | 810 | 815 | 817 |
ref_ntt | 8,496 | 8,500 | 8,551 |
ref_invntt | 12,530 | 12,533 | 12,409 |
crypto_kem_keypair | 136,934 | 237,601 | 371,906 |
crypto_kem_enc | 184,533 | 298,928 | 440,645 |
crypto_kem_dec | 223,359 | 349,122 | 503,820 |
Yes, sure. I collected the results using a patch of PAPI for the Cortex-A72: https://github.com/cothan/PAPI_ARMv8_Cortex_A72
Note: neon_poly_getnoise_eta1_2x includes the NEON SHA3_2x implementation on the Cortex-A72; when comparing with ref_poly_getnoise_eta1 at Level 3 and Level 5, you have to divide by 2 (the _2x version samples two noise polynomials at once).
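For illustration, a minimal sketch of how cycle counts can be read with PAPI (the actual benchmarking harness in the repo may differ):

```c
#include <papi.h>
#include <stdio.h>

int main(void)
{
    /* Initialize PAPI; PAPI_library_init returns the version on success. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    long long t0 = PAPI_get_real_cyc();  /* cycle counter before */
    /* ... code under test, e.g. the NTT, would run here ... */
    long long t1 = PAPI_get_real_cyc();  /* cycle counter after */

    printf("cycles: %lld\n", t1 - t0);
    return 0;
}
```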
@cothan which OS and compiler did you use on the Raspberry Pi?
Hi @marco-palumbi,
Here is the configuration on my RPi 4:
cothan@manjaro-rp4 ~> uname -a
Linux manjaro-rp4 5.10.20-1-MANJARO-ARM #1 SMP PREEMPT Thu Mar 4 14:36:16 CST 2021 aarch64 GNU/Linux
cothan@manjaro-rp4 ~> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --enable-languages=c,c++,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC)
cothan@manjaro-rp4 ~> clang -v
clang version 11.1.0
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Selected GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
I use Clang across all my results.
Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm; it's great to see more interest and progress in this area.
Note that you can further improve performance by using the fixed-point doubling multiply-high VQDMULH to implement @gregorseiler's approach to Montgomery multiplication from Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography (use the doubling VQDMULH for the high multiply, VMUL for the low multiply, and the halving subtract VHSUB to compensate for the doubling). This reduces the Montgomery multiplication sequence from 9 to 5 instructions and should be a drop-in replacement for your fqmul macro. If you further precompute twisted constants following NTTRU: Truly Fast NTRU Using NTT, you'll get to 4 instructions.
Finally, you can use the fixed-point multiply-high-accumulate VQRDMLAH (v8.1-A onwards) to get to 3 instructions, but this is quite a bit trickier because of the carry-in mentioned in @gregorseiler's paper, as well as the rounding, the doubling, and especially the impacted range analysis in the case of the Kyber prime.
If you publish or present your code with those modifications, a mention would be appreciated. A citable paper on the use of fixed-point instructions for modular arithmetic (which also applies to MVE, SVE, and SVE2) will hopefully appear shortly.
Hi @hanno-arm,
Yes, you're right; that is the multiply instruction I was looking for. I'm fully aware that a specialty of AVX2 is that it can multiply high or low in a single instruction. I looked for a similar instruction in the past but, due to time limits, didn't learn much about the rounding and saturating variants.
I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you have solved my bottleneck in fqmul. Please let me know about your paper after you publish it; I am happy to cite it.
I have a precise Kyber range analysis that improves upon Bas Westerbaan's paper When to Barrett reduce in the inverse NTT, and I hope I can apply it to the new approach.
Thank you very much. I will definitely mention you and cite your work.
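For reference, the scalar Barrett reduction that this range analysis concerns looks roughly like the following sketch of the reference-style code (v approximates 2^26/q):

```c
#include <stdint.h>
#define KYBER_Q 3329

/* Barrett reduction: given 16-bit a, return a representative congruent
 * to a mod q. t approximates round(a/q), so a - t*q is close to a mod q. */
static int16_t barrett_reduce(int16_t a)
{
    const int16_t v = ((1 << 26) + KYBER_Q/2)/KYBER_Q;  /* rounded 2^26/q */
    int16_t t = ((int32_t)v*a + (1 << 25)) >> 26;       /* t ~ round(a/q) */
    return a - t*KYBER_Q;
}
```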
I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you have solved my bottleneck in fqmul.
Great! Here's a complete patch, by the way:
```c
#define fqmul(out, in, zeta, t)                                               \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);      /* (2*a)_H */          \
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);         /* a_L */              \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);         /* a_L = a_L * QINV */ \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq); /* (2*a_L*Q)_H */ \
    out = vhsubq_s16(t.val[0], t.val[3]);  /* ((2*a)_H - (2*a_L*Q)_H)/2 */
```
This, of course, doesn't yet implement constant merging, though.
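As a hypothetical usage illustration (the names butterfly, neon_qinv, and neon_kyberq are assumptions, not part of the patch), the macro would slot into a Cooley-Tukey butterfly like this:

```c
#include <arm_neon.h>

/* Hypothetical CT butterfly built on the fqmul macro above; neon_qinv and
 * neon_kyberq are assumed to be preloaded constant vectors. */
static inline void butterfly(int16x8_t *a, int16x8_t *b, int16x8_t zeta,
                             int16x8_t neon_qinv, int16x8_t neon_kyberq)
{
    int16x8x4_t t;
    int16x8_t u;
    fqmul(u, *b, zeta, t);     /* u = (*b) * zeta * R^{-1} mod q */
    *b = vsubq_s16(*a, u);     /* b' = a - u */
    *a = vaddq_s16(*a, u);     /* a' = a + u */
}
```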
It would be interesting to see what difference it actually makes on various CPUs. Since the bottleneck may be the boilerplate around the multiplication sequence rather than the sequence itself, the gain might not be as large as one would expect from the mere instruction count.
Please let me know your paper after you publish it, I am happy to cite it.
Yes, will do!
I'm fully aware that a specialty of AVX2 is that it can multiply high or low in a single instruction.
Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber.
When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddles into a single vector and using lane-indexed instructions? This should work for layers 0-4 and save you some instructions (and free up some vectors).
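For illustration, a hypothetical lane-indexed variant of the fqmul patch above (assuming the same neon_qinv/neon_kyberq constants; lane must be a compile-time constant 0..7):

```c
#include <arm_neon.h>

/* Hypothetical sketch: eight twiddles packed into z and selected by lane,
 * instead of duplicating each twiddle across a whole vector. */
#define fqmul_lane(out, in, z, lane, t)                                       \
    t.val[0] = (int16x8_t)vqdmulhq_laneq_s16(in, z, lane); /* (2*a)_H */      \
    t.val[1] = (int16x8_t)vmulq_laneq_s16(in, z, lane);    /* a_L */          \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);             /* a_L * QINV */   \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq); /* (2*a_L*Q)_H */ \
    out = vhsubq_s16(t.val[0], t.val[3]);  /* ((2*a)_H - (2*a_L*Q)_H)/2 */
```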
@cothan hello, is this "neon" version sensitive to big-endian vs little-endian mode? https://github.com/cothan/kyber/tree/round3
Hi @Ruofei-Li,
I have only tested my code on the A72 and Apple M1, so it uses the default endianness on those platforms (whatever that is). Feel free to open an issue over there if you find a bug, or to add support for the other endianness.