Optimizations for ARM64
I ended up needing more performance from ed25519.Verify on ARM64, so here's a bunch of optimizations.
Unfortunately, both feSquareGeneric and feMulGeneric still spill to the stack, but it should be possible to avoid that.
Full results for different platforms...
Results for Raspberry Pi 4:
goos: linux
goarch: arm64
pkg: filippo.io/edwards25519
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
EncodingDecoding-4 163.8µ ± 0% 154.8µ ± 0% -5.48% (p=0.000 n=10) 143.0µ ± 0% -12.68% (p=0.000 n=10) 137.1µ ± 0% -16.31% (p=0.000 n=10)
MultiScalarMultSize8-4 3.447m ± 0% 3.352m ± 0% -2.76% (p=0.000 n=10) 3.175m ± 0% -7.89% (p=0.000 n=10) 3.130m ± 0% -9.20% (p=0.000 n=10)
ScalarAddition-4 58.62n ± 0% 58.61n ± 0% ~ (p=0.490 n=10) 58.66n ± 0% ~ (p=0.305 n=10) 58.76n ± 0% +0.24% (p=0.037 n=10)
ScalarMultiplication-4 420.8n ± 0% 420.4n ± 0% ~ (p=0.305 n=10) 420.8n ± 0% ~ (p=0.897 n=10) 421.2n ± 0% ~ (p=0.288 n=10)
ScalarInversion-4 119.8µ ± 0% 120.0µ ± 0% ~ (p=0.239 n=10) 119.7µ ± 0% ~ (p=0.912 n=10) 119.8µ ± 0% ~ (p=1.000 n=10)
ScalarBaseMult-4 281.7µ ± 0% 274.7µ ± 0% -2.48% (p=0.000 n=10) 261.2µ ± 0% -7.25% (p=0.000 n=10) 256.7µ ± 0% -8.85% (p=0.000 n=10)
ScalarMult-4 1017.4µ ± 0% 981.4µ ± 0% -3.54% (p=0.000 n=10) 915.5µ ± 0% -10.01% (p=0.000 n=10) 898.8µ ± 0% -11.66% (p=0.000 n=10)
VarTimeDoubleScalarBaseMult-4 1014.3µ ± 0% 976.8µ ± 0% -3.69% (p=0.000 n=10) 909.1µ ± 0% -10.37% (p=0.000 n=10) 890.4µ ± 0% -12.21% (p=0.000 n=10)
geomean 68.50µ 66.95µ -2.26% 64.29µ -6.15% 63.41µ -7.44%
pkg: filippo.io/edwards25519/field
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
Add-4 43.05n ± 0% 43.34n ± 0% +0.69% (p=0.000 n=10) 35.22n ± 0% -18.19% (p=0.000 n=10) 35.21n ± 0% -18.20% (p=0.000 n=10)
Multiply-4 432.9n ± 0% 419.5n ± 0% -3.10% (p=0.000 n=10) 404.6n ± 0% -6.54% (p=0.000 n=10) 396.4n ± 0% -8.44% (p=0.000 n=10)
Square-4 277.9n ± 0% 259.3n ± 0% -6.68% (p=0.000 n=10) 241.8n ± 0% -13.01% (p=0.000 n=10) 232.9n ± 1% -16.19% (p=0.000 n=10)
Invert-4 76.01µ ± 0% 71.93µ ± 0% -5.36% (p=0.000 n=10) 66.29µ ± 0% -12.79% (p=0.000 n=10) 63.34µ ± 0% -16.66% (p=0.000 n=10)
Mult32-4 68.66n ± 0% 68.64n ± 0% ~ (p=0.928 n=10) 68.64n ± 0% ~ (p=0.542 n=10) 68.70n ± 0% ~ (p=0.323 n=10)
geomean 485.7n 471.4n -2.94% 435.6n -10.32% 426.7n -12.14%
Results for Mac M1:
goos: darwin
goarch: arm64
pkg: filippo.io/edwards25519
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
EncodingDecoding-10 8.904µ ± 0% 9.286µ ± 0% +4.28% (p=0.000 n=10) 7.211µ ± 2% -19.02% (p=0.000 n=10) 7.022µ ± 1% -21.14% (p=0.000 n=10)
MultiScalarMultSize8-10 147.4µ ± 0% 147.0µ ± 0% -0.29% (p=0.001 n=10) 136.2µ ± 0% -7.58% (p=0.000 n=10) 134.4µ ± 0% -8.83% (p=0.000 n=10)
ScalarAddition-10 3.323n ± 0% 3.323n ± 0% ~ (p=0.470 n=10) 3.401n ± 0% +2.32% (p=0.000 n=10) 3.332n ± 1% +0.24% (p=0.024 n=10)
ScalarMultiplication-10 14.75n ± 0% 14.75n ± 0% ~ (p=0.511 n=10) 14.76n ± 2% ~ (p=0.084 n=10) 14.75n ± 0% ~ (p=0.995 n=10)
ScalarInversion-10 6.844µ ± 0% 6.847µ ± 0% ~ (p=0.323 n=10) 6.851µ ± 0% +0.09% (p=0.003 n=10) 6.843µ ± 0% ~ (p=0.809 n=10)
ScalarBaseMult-10 12.43µ ± 0% 12.42µ ± 0% -0.08% (p=0.012 n=10) 11.59µ ± 0% -6.73% (p=0.000 n=10) 11.38µ ± 0% -8.39% (p=0.000 n=10)
ScalarMult-10 42.15µ ± 0% 42.12µ ± 0% -0.06% (p=0.041 n=10) 37.60µ ± 0% -10.79% (p=0.000 n=10) 36.85µ ± 0% -12.56% (p=0.000 n=10)
VarTimeDoubleScalarBaseMult-10 40.23µ ± 0% 40.16µ ± 1% ~ (p=0.123 n=10) 36.09µ ± 0% -10.28% (p=0.000 n=10) 35.30µ ± 1% -12.26% (p=0.000 n=10)
geomean 3.133µ 3.147µ +0.45% 2.922µ -6.73% 2.877µ -8.15%
pkg: filippo.io/edwards25519/field
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
Add-10 3.636n ± 0% 3.698n ± 0% +1.72% (p=0.000 n=10) 3.457n ± 0% -4.92% (p=0.000 n=10) 3.457n ± 1% -4.94% (p=0.000 n=10)
Multiply-10 19.07n ± 2% 19.65n ± 0% +3.01% (p=0.000 n=10) 16.01n ± 1% -16.07% (p=0.000 n=10) 16.91n ± 0% -11.35% (p=0.000 n=10)
Square-10 15.54n ± 1% 16.02n ± 0% +3.06% (p=0.000 n=10) 12.49n ± 0% -19.66% (p=0.000 n=10) 11.97n ± 0% -22.97% (p=0.000 n=10)
Invert-10 4.134µ ± 0% 4.312µ ± 0% +4.31% (p=0.000 n=10) 3.303µ ± 0% -20.11% (p=0.000 n=10) 3.225µ ± 0% -21.98% (p=0.000 n=10)
Mult32-10 4.511n ± 0% 4.512n ± 0% ~ (p=0.807 n=10) 4.512n ± 0% ~ (p=0.420 n=10) 4.518n ± 0% +0.16% (p=0.030 n=10)
geomean 28.88n 29.58n +2.41% 25.27n -12.52% 25.22n -12.69%
Results for amd64 (ThreadRipper 2950x), with -tags purego:
goos: windows
goarch: amd64
pkg: filippo.io/edwards25519
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
EncodingDecoding-32 13.985µ ± 1% 13.522µ ± 1% -3.31% (p=0.000 n=10) 9.834µ ± 1% -29.68% (p=0.000 n=10) 9.093µ ± 1% -34.98% (p=0.000 n=10)
MultiScalarMultSize8-32 273.4µ ± 1% 264.3µ ± 2% -3.32% (p=0.000 n=10) 222.0µ ± 1% -18.78% (p=0.000 n=10) 220.0µ ± 2% -19.52% (p=0.000 n=10)
ScalarAddition-32 4.779n ± 1% 4.829n ± 1% +1.04% (p=0.004 n=10) 4.788n ± 1% ~ (p=0.381 n=10) 4.821n ± 1% +0.88% (p=0.009 n=10)
ScalarMultiplication-32 29.26n ± 1% 29.22n ± 1% ~ (p=0.343 n=10) 29.21n ± 1% ~ (p=0.813 n=10) 29.51n ± 1% ~ (p=0.325 n=10)
ScalarInversion-32 9.063µ ± 1% 9.071µ ± 1% ~ (p=0.469 n=10) 9.062µ ± 1% ~ (p=0.516 n=10) 9.006µ ± 1% -0.63% (p=0.027 n=10)
ScalarBaseMult-32 21.93µ ± 1% 21.25µ ± 2% -3.10% (p=0.000 n=10) 18.25µ ± 1% -16.80% (p=0.000 n=10) 18.25µ ± 0% -16.81% (p=0.000 n=10)
ScalarMult-32 78.24µ ± 1% 75.75µ ± 1% -3.18% (p=0.000 n=10) 62.37µ ± 0% -20.29% (p=0.000 n=10) 61.99µ ± 1% -20.76% (p=0.000 n=10)
VarTimeDoubleScalarBaseMult-32 77.42µ ± 1% 74.45µ ± 1% -3.84% (p=0.000 n=10) 60.90µ ± 1% -21.34% (p=0.000 n=10) 60.67µ ± 1% -21.63% (p=0.000 n=10)
geomean 5.322µ 5.216µ -1.99% 4.575µ -14.04% 4.525µ -14.97%
pkg: filippo.io/edwards25519/field
│ A │ B │ C │ D │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
Add-32 5.654n ± 1% 5.706n ± 1% +0.92% (p=0.034 n=10) 4.804n ± 3% -15.04% (p=0.000 n=10) 4.800n ± 1% -15.11% (p=0.000 n=10)
Multiply-32 29.94n ± 1% 29.64n ± 2% -1.00% (p=0.027 n=10) 23.30n ± 1% -22.18% (p=0.000 n=10) 21.60n ± 1% -27.86% (p=0.000 n=10)
Square-32 24.46n ± 1% 23.62n ± 0% -3.45% (p=0.000 n=10) 16.45n ± 1% -32.76% (p=0.000 n=10) 15.43n ± 1% -36.91% (p=0.000 n=10)
Invert-32 6.286µ ± 0% 6.093µ ± 1% -3.07% (p=0.000 n=10) 4.474µ ± 2% -28.83% (p=0.000 n=10) 4.100µ ± 1% -34.78% (p=0.000 n=10)
Mult32-32 5.167n ± 1% 5.244n ± 1% +1.50% (p=0.001 n=10) 5.168n ± 1% ~ (p=0.540 n=10) 5.170n ± 1% ~ (p=0.542 n=10)
geomean 42.24n 41.80n -1.04% 33.56n -20.55% 32.07n -24.08%
Nice! Always nice getting pure Go optimizations.
Since these are parts of the library that track upstream, would you consider mailing them as Go CLs? If not, have you signed the Google CLA and are ok with me submitting them on your behalf after we merge them here, to keep upstream in sync?
Sure, I can send them as Go CLs. And yeah, I have signed the CLA.
PS: If I'm submitting these against Go, I'll first try to add a *19 rule, which would make one of the CLs unnecessary (see https://github.com/golang/go/issues/67575).
Unfortunately, the general *19 optimization turned out to be more complicated: the performance varies with the device and context. Anyway, I submitted the CLs with the manual *19 optimization: https://go-review.googlesource.com/c/go/+/650277
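For readers unfamiliar with the trick, a manual *19 optimization is usually written as a shift-and-add decomposition, since 19 = 1 + 2*(1 + 8). The sketch below is only an illustration of that idea and is not copied from the CL; the helper name `mul19` and the exact expression are assumptions. Whether this beats a plain multiply depends on the device and context, as noted above.

```go
package main

import "fmt"

// mul19 is a hypothetical helper (the name is an assumption) that computes
// v*19 with two shifts and two adds instead of a multiply instruction:
//
//	v + 2*(v + 8*v) = v + 2*9v = 19v
//
// All arithmetic wraps modulo 2^64, exactly as v*19 does for uint64.
func mul19(v uint64) uint64 {
	return v + (v+v<<3)<<1
}

func main() {
	// Compare the decomposed form against the plain multiplication.
	for _, v := range []uint64{0, 1, 3, 1 << 51, ^uint64(0)} {
		fmt.Println(v, mul19(v), mul19(v) == v*19)
	}
}
```

In field arithmetic modulo 2^255 - 19, multiplication by 19 appears when folding high limbs back into low ones, which is why a cheaper *19 can show up in the Multiply and Square benchmarks.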