Assembly for Arm v8.5-A ISA
I'm sure everyone has noticed that Apple's M-series chips are roughly as fast as state-of-the-art x86 processors (see GMP's benchmark results). I therefore think we should implement assembly routines for them as well.
These are the current routines that should be implemented:
- [x] Hard(ish)coded multiplication (treated in #1808, works as a full replacement for mpn_mul_basecase)
- [x] Hardcoded squaring (treated in #1912)
- [x] Hardcoded high multiplication (treated in #1912)
- [x] Hardcoded high squaring (treated in #1912)
- [x] High multiplication, basecase (treated in #1912)
- [ ] High squaring, basecase
- [ ] Hardcoded low multiplication
- [ ] Hardcoded low squaring
- [ ] Low multiplication, basecase
- [ ] Low squaring, basecase
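For anyone picking up the unchecked "low" items, here is a minimal portable C sketch of what a low multiplication is expected to compute: only the low `n` limbs of the product of two `n`-limb operands, i.e. the product modulo 2^(64n). The names `limb_t` and `ref_mullow_n` are hypothetical and are not FLINT's API. The sketch assumes 64-bit limbs and a compiler that provides `unsigned __int128` (GCC or Clang). It is meant as a reference to test an assembly routine against, not as a fast implementation.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: 64-bit limbs, least significant limb first */

/* Hypothetical reference routine: writes the low n limbs of
   {ap, n} * {bp, n} to {rp, n}. rp must not overlap ap or bp. */
static void ref_mullow_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;

    for (size_t i = 0; i < n; i++)
    {
        limb_t carry = 0;

        /* Only partial products that land below limb n are needed;
           anything that would carry into limb n or above is dropped. */
        for (size_t j = 0; i + j < n; j++)
        {
            /* Cannot overflow 128 bits: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
            unsigned __int128 t = (unsigned __int128) ap[j] * bp[i] + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;
            carry = (limb_t) (t >> 64);
        }
    }
}
```

A low squaring routine would compute the same thing with `bp == ap`, which is where the symmetric cross products can be shared.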
Useful links:
- https://dougallj.github.io/applecpu/firestorm.html
- https://dougallj.github.io/applecpu/firestorm-int.html
- https://dougallj.github.io/applecpu/firestorm-simd.html
- https://developer.arm.com/architectures/instruction-sets/intrinsics/
- https://developer.arm.com/documentation/ddi0602/2023-12?lang=en
- https://github.com/corsix/amx
- https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1
Currently on my arm_assembly branch:
mpn_mul vs flint_mpn_mul
m = 1: 4.67
m = 2: 4.68 3.61
m = 3: 4.01 3.30 3.04
m = 4: 2.89 2.39 2.27 2.18
m = 5: 3.03 2.21 1.95 2.02 2.04
m = 6: 2.64 1.97 1.82 1.89 2.18 2.05
m = 7: 2.32 1.79 1.99 1.68 1.76 1.79 1.83
m = 8: 2.13 1.69 1.61 1.59 1.70 1.74 1.81 1.79
m = 9: 1.96 1.63 1.57 1.53 1.63 1.64 1.64 1.71 1.77
m = 10: 1.81 1.49 1.48 1.47 1.51 1.63 1.60 1.69 1.73 1.75
m = 11: 1.75 1.50 1.45 1.46 1.45 1.48 1.51 1.53 1.57 1.56 1.58
m = 12: 1.63 1.37 1.39 1.47 1.51 1.57 1.69 1.78 1.67 1.58 1.58 1.61
Tested on cfarm103 (Apple M1)