Assembly for Arm v8.5-A ISA

Open albinahlback opened this issue 1 year ago • 1 comment

I'm sure everyone has noticed that Apple's M-series chips are roughly as fast as state-of-the-art x86 processors (see GMP's benchmark results). I therefore think we should implement assembly routines for them as well.

These are the routines that should be implemented:

[x] Hard(ish)coded multiplication (treated in #1808, works as a full replacement for mpn_mul_basecase)
[x] Hardcoded squaring (treated in #1912)
[x] Hardcoded high multiplication (treated in #1912)
[x] Hardcoded high squaring (treated in #1912)
[x] High multiplication, basecase (treated in #1912)
[ ] High squaring, basecase
[ ] Hardcoded low multiplication
[ ] Hardcoded low squaring
[ ] Low multiplication, basecase
[ ] Low squaring, basecase
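
For reference, the terms in the list above can be illustrated with a portable C sketch. This is not the assembly from #1808 or #1912, only a plain schoolbook model under some assumptions: limbs are 64 bits wide, the compiler provides `unsigned __int128` (GCC and Clang do), and the names `limb_t`, `ref_mul_basecase`, `ref_mullow_n` and `ref_mulhigh_n_exact` are hypothetical. "Low" here means the lowest n limbs of an n x n product and "high" the top n limbs. The high variant below is exact; the routines in this issue may define the high product differently (for example, by allowing a small error in the lowest returned limbs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t on a 64-bit target. */
typedef uint64_t limb_t;

/* Full product: rp[0 .. m+n-1] = ap[0 .. m-1] * bp[0 .. n-1].
   rp must not overlap ap or bp. */
void ref_mul_basecase(limb_t *rp, const limb_t *ap, size_t m,
                      const limb_t *bp, size_t n)
{
    assert(m >= 1 && n >= 1);

    for (size_t i = 0; i < m + n; i++)
        rp[i] = 0;

    for (size_t j = 0; j < n; j++) {
        limb_t carry = 0;
        for (size_t i = 0; i < m; i++) {
            /* a*b + r + c <= 2^128 - 1, so the 128-bit sum cannot overflow. */
            unsigned __int128 t =
                (unsigned __int128) ap[i] * bp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;
            carry = (limb_t) (t >> 64);
        }
        rp[m + j] = carry;
    }
}

/* Low product: rp[0 .. n-1] = lowest n limbs of ap[0 .. n-1] * bp[0 .. n-1].
   Partial products that only affect limb n and above are never computed. */
void ref_mullow_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    assert(n >= 1);

    for (size_t i = 0; i < n; i++)
        rp[i] = 0;

    for (size_t j = 0; j < n; j++) {
        limb_t carry = 0;
        for (size_t i = 0; i + j < n; i++) {
            unsigned __int128 t =
                (unsigned __int128) ap[i] * bp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;
            carry = (limb_t) (t >> 64);
        }
        /* The carry out of limb n-1 is discarded on purpose. */
    }
}

/* Exact high product: rp[0 .. n-1] = top n limbs of the 2n-limb product.
   scratch must have room for 2n limbs. */
void ref_mulhigh_n_exact(limb_t *rp, const limb_t *ap, const limb_t *bp,
                         size_t n, limb_t *scratch)
{
    ref_mul_basecase(scratch, ap, n, bp, n);
    for (size_t i = 0; i < n; i++)
        rp[i] = scratch[n + i];
}
```

A squaring routine computes the same result as calling the corresponding multiplication with both inputs equal; a dedicated version can save work because the cross products `ap[i] * ap[j]` with `i != j` each occur twice. A reference model like this is mainly useful as a test oracle for hand-written assembly.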

Useful links:

https://dougallj.github.io/applecpu/firestorm.html
https://dougallj.github.io/applecpu/firestorm-int.html
https://dougallj.github.io/applecpu/firestorm-simd.html
https://developer.arm.com/architectures/instruction-sets/intrinsics/
https://developer.arm.com/documentation/ddi0602/2023-12?lang=en
https://github.com/corsix/amx
https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1

Feb 27 '24 10:02 albinahlback

Currently on my arm_assembly branch:

mpn_mul vs flint_mpn_mul

m =   1: 4.67
m =   2: 4.68 3.61
m =   3: 4.01 3.30 3.04
m =   4: 2.89 2.39 2.27 2.18
m =   5: 3.03 2.21 1.95 2.02 2.04
m =   6: 2.64 1.97 1.82 1.89 2.18 2.05
m =   7: 2.32 1.79 1.99 1.68 1.76 1.79 1.83
m =   8: 2.13 1.69 1.61 1.59 1.70 1.74 1.81 1.79
m =   9: 1.96 1.63 1.57 1.53 1.63 1.64 1.64 1.71 1.77
m =  10: 1.81 1.49 1.48 1.47 1.51 1.63 1.60 1.69 1.73 1.75
m =  11: 1.75 1.50 1.45 1.46 1.45 1.48 1.51 1.53 1.57 1.56 1.58
m =  12: 1.63 1.37 1.39 1.47 1.51 1.57 1.69 1.78 1.67 1.58 1.58 1.61

Tested on cfarm103 (Apple M1)

Mar 01 '24 09:03 albinahlback