constantine
Perf: Assembly code generator for ARM and ARM64
https://github.com/mratsim/constantine/pull/69 introduced an assembly code generator for x86 and x86-64 at https://github.com/mratsim/constantine/blob/7d29cb9/constantine/platforms/isa/macro_assembler_x86.nim
We need the same for ARM to be efficient on Raspberry Pi, phones, Apple Silicon and other resource-restricted devices.
Efficient multiplication on ARM:
- slides: http://arith24.arithsymposium.org/slides/s2-liu.pdf
- paper 1: https://orbilu.uni.lu/bitstream/10993/34104/1/ARMv8_KJ_zhe.pdf
- paper 2: https://core.ac.uk/download/pdf/275655534.pdf
Multiprecision Multiplication on ARMv8
Related papers:
https://eprint.iacr.org/2021/185.pdf
No Silver Bullet: Optimized Montgomery Multiplication on Various 64-bit ARM Platforms
Abstract
In this paper, we firstly presented optimized implementations of Montgomery multiplication on 64-bit ARM processors by taking advantages of Karatsuba algorithm and efficient multiplication instruction sets for ARM64 architectures. The implementation of Montgomery multiplication can improve the performance of (pre-quantum and post-quantum) public key cryptography (e.g. CSIDH, ECC, and RSA) implementations on ARM64 architectures, directly. Last but not least, the performance of Karatsuba algorithm does not ensure the fastest speed record on various ARM architectures, while it is determined by the clock cycles per multiplication instruction of target ARM architectures. In particular, recent Apple processors based on ARM64 architecture show lower cycles per instruction of multiplication than that of ARM Cortex-A series. For this reason, the schoolbook method shows much better performance than the sophisticated Karatsuba algorithm on Apple processors. With this observation, we can determine the proper approach for multiplication of cryptography library (e.g. Microsoft-SIDH) on Apple processors and ARM Cortex-A process
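For readers unfamiliar with the operation the abstract refers to, here is a minimal Python sketch of word-level Montgomery multiplication in the CIOS (coarsely integrated operand scanning) style. It is only a reference model of the arithmetic an ARM64 code generator would have to emit, not Constantine's implementation and not the paper's code. The names `mont_mul`, `to_limbs` and `from_limbs`, the 64-bit limb width, and the example prime 2^255 - 19 are all illustrative assumptions.

```python
W = 64                       # assumed limb width: one 64-bit ARM64 register
MASK = (1 << W) - 1


def to_limbs(x, n):
    """Split x into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))


def mont_mul(a, b, p, n):
    """Return a*b*R^-1 mod p with R = 2^(W*n), for odd p and a, b < p."""
    A, B, P = to_limbs(a, n), to_limbs(b, n), to_limbs(p, n)
    m0 = (-pow(p, -1, 1 << W)) & MASK        # -p^-1 mod 2^W (Python 3.8+)
    t = [0] * (n + 2)
    for i in range(n):
        # Multiplication step: t += A * B[i]
        carry = 0
        for j in range(n):
            s = t[j] + A[j] * B[i] + carry   # always fits in two limbs
            t[j], carry = s & MASK, s >> W
        s = t[n] + carry
        t[n], t[n + 1] = s & MASK, s >> W
        # Reduction step: add m*P so the low limb becomes zero, then
        # shift the accumulator right by one limb.
        m = (t[0] * m0) & MASK
        carry = (t[0] + m * P[0]) >> W
        for j in range(1, n):
            s = t[j] + m * P[j] + carry
            t[j - 1], carry = s & MASK, s >> W
        s = t[n] + carry
        t[n - 1], carry = s & MASK, s >> W
        t[n] = t[n + 1] + carry
    r = from_limbs(t[: n + 1])
    return r - p if r >= p else r            # final conditional subtraction


# Usage: multiply 3 and 5 in Montgomery form modulo an example prime.
p, n = 2**255 - 19, 4
R = 1 << (W * n)
x = mont_mul(3 * R % p, 5 * R % p, p, n)     # x is 15 in Montgomery form
```

Each inner-loop line `t[j] + A[j] * B[i] + carry` corresponds to one widening multiply plus carry-propagating additions in assembly, which is why the relative cost of multiplication and addition on a given core matters so much.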
Relevant:
- https://eprint.iacr.org/2022/439.pdf - Efficient Multiplication of Somewhat Small Integers using Number-Theoretic Transforms
- https://eprint.iacr.org/2021/1355.pdf - Curve448 on 32-bit ARM Cortex-M4
- https://tches.iacr.org/index.php/TCHES/article/view/9295/8861 - Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
- https://eprint.iacr.org/2021/561.pdf - Kyber on ARM64
- https://eprint.iacr.org/2019/721.pdf - Optimized SIKE Round 2 on 64-bit ARM
- https://github.com/Mbed-TLS/mbedtls/issues/5666 - Improve Montgomery multiplication strategy with UMAAL instruction for fused {C|D} <- A*B + C + D
- https://github.com/Mbed-TLS/mbedtls/issues/5360 - Improve inline assembly for Cortex-M + DSP
- https://eprint.iacr.org/2018/700.pdf - SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange
- slides: https://ches.iacr.org/2018/slides/ches2018-session5-talk3-slides.pdf
- https://eprint.iacr.org/2016/645.pdf - FourQNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors
- https://rielac.cujae.edu.cu/index.php/rieac/article/download/797/420 - Speeding up elliptic curve arithmetic on ARM processors using NEON instructions
- https://eprint.iacr.org/2015/465.pdf - Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation
- https://eprint.iacr.org/2014/760.pdf - Montgomery Modular Multiplication on ARM-NEON Revisited
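The fused operation `{C|D} <- A*B + C + D` from the Mbed-TLS UMAAL issue listed above can be modelled in a few lines. This is a sketch of the instruction's arithmetic on 32-bit words, written in Python for clarity; the helper names `umaal` and `mul_acc_row` are hypothetical and do not come from Mbed-TLS or Constantine.

```python
MASK32 = 0xFFFFFFFF


def umaal(a, b, c, d):
    """Model of UMAAL on 32-bit words: return (hi, lo) of a*b + c + d."""
    for x in (a, b, c, d):
        assert 0 <= x <= MASK32
    t = a * b + c + d
    return t >> 32, t & MASK32


def mul_acc_row(acc, a, bi):
    """Compute acc + a*bi over little-endian 32-bit limbs.

    Returns (result_limbs, carry_out). One umaal per limb handles the
    product, the accumulator limb and the running carry together.
    """
    carry = 0
    out = []
    for aj, tj in zip(a, acc):
        carry, lo = umaal(aj, bi, tj, carry)
        out.append(lo)
    return out, carry


# Worst case: all inputs at their maximum still fit in 64 bits, because
# (2^32 - 1)^2 + 2*(2^32 - 1) = 2^64 - 1.
hi, lo = umaal(MASK32, MASK32, MASK32, MASK32)   # (0xFFFFFFFF, 0xFFFFFFFF)
```

Because the result can never overflow two words, the inner loop of a multiply-accumulate row needs no separate carry-handling instructions, which is what makes the instruction attractive for Montgomery multiplication on 32-bit ARM.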
https://eprint.iacr.org/2021/185.pdf is particularly interesting regarding general ARM CPUs and Apple CPUs:
Multiplications are 3x slower than additions on the Raspberry Pi 4 but run at roughly the same speed on Apple CPUs.
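To see why that ratio changes which algorithm is preferable, here is a deliberately crude cost model in Python. The operation counts are assumptions made for illustration (one multiplication and two additions per partial product, one level of Karatsuba, carries and pipeline effects ignored). The function names are hypothetical, and the numbers are not measurements from the paper or from any CPU.

```python
def limb_op_counts(n, karatsuba):
    """Return (multiplications, additions) for an n x n limb product.

    Toy model: each partial product costs 1 multiplication and 2
    additions to accumulate; Karatsuba is applied for one level only.
    """
    if not karatsuba:
        return n * n, 2 * n * n
    assert n % 2 == 0, "this sketch only splits even limb counts"
    h = n // 2
    muls = 3 * h * h              # three half-size schoolbook products
    adds = 3 * 2 * h * h          # their accumulation
    adds += 2 * h                 # a0 + a1 and b0 + b1
    adds += 4 * h                 # middle term minus both outer products
    adds += 2 * h                 # add the middle term into the result
    return muls, adds


def cost(n, karatsuba, mul_cost, add_cost=1):
    """Weighted cost in units of one addition."""
    muls, adds = limb_op_counts(n, karatsuba)
    return muls * mul_cost + adds * add_cost


# 4 limbs = 256 bits with 64-bit words.
rpi4_like = (cost(4, False, 3), cost(4, True, 3))    # (80, 76)
apple_like = (cost(4, False, 1), cost(4, True, 1))   # (48, 52)
```

Under these assumed counts, Karatsuba is slightly ahead for 4 limbs when a multiplication costs three additions, and schoolbook is ahead when both cost the same. Other limb counts and real pipelines can behave differently, so any ARM backend should be benchmarked per target rather than tuned from a model like this.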