perf: Consider getting rid of `field/uint128`
I'm not sure how bad the compiler behavior is on non-amd64 (due to lack of access to targets), and this only impacts non-amd64/arm64 (due to dedicated assembly), but https://github.com/golang/go/issues/29571 is (also) costing you a good amount of performance.
Unfortunately having giant walls of math/bits calls is less readable that the wrapper type, so depending on how you want to balance "readability" vs "going fast" this might be ok.
name old time/op new time/op delta
Add-4 24.1ns ± 0% 24.1ns ± 0% +0.17%
Multiply-4 135ns ± 0% 127ns ± 0% -6.13%
Mult32-4 26.1ns ± 0% 26.1ns ± 0% +0.11%
Square-4 102ns ± 0% 96ns ± 0% -5.84%
Invert-4 27.3µs ± 0% 25.8µs ± 0% -5.71%
name old time/op new time/op delta
MultiScalarMultSize8-4 1.23ms ± 0% 1.17ms ± 0% -4.26%
ScalarBaseMult-4 101µs ± 0% 96µs ± 0% -4.45%
ScalarMult-4 360µs ± 0% 342µs ± 0% -4.90%
VarTimeDoubleScalarBaseMult-4 351µs ± 0% 332µs ± 0% -5.46%
nb: Only did one iteration on an amd64 target with go 1.17beta1 + purego, so there's some noise in the comparison, but the difference is statistically significant and noticeable.
Hmm, this is a hard one, thank you for trying it out. I value the readability of those functions a lot, for both maintainability and education purposes. I'll make a PR, but probably won't merge it and will instead use it to push golang/go#29571. Hopefully by Go 1.18 it won't be a problem anymore.
this only impacts non-amd64/arm64 (due to dedicated assembly)
Note that the arm64 assembly is just a tiny carryPropagate core, not the full Square and Multiply.