ARROW-18010: [Go] Add ARM64 Neon impl for Casting
This also adds Casting benchmarks. I ran them on Ursa's ARM64 Mac mini and got the following improvement with the Neon implementation:
name old speed new speed delta
Casting/sz=32768/nullprob=0.00/from=int64/to=int32/safe=true-8 3.86GB/s ± 0% 4.29GB/s ± 3% +11.34% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=int32/safe=true-8 2.60GB/s ± 0% 2.80GB/s ± 0% +7.83% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=int32/safe=true-8 1.34GB/s ± 0% 1.46GB/s ± 0% +8.33% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=int64/to=int32/safe=true-8 2.82GB/s ± 0% 3.05GB/s ± 1% +8.03% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=int64/to=int32/safe=true-8 9.05GB/s ± 4% 11.52GB/s ± 8% +27.22% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.00/from=int64/to=int32/safe=false-8 10.8GB/s ± 0% 13.3GB/s ± 7% +23.12% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=int32/safe=false-8 10.5GB/s ± 0% 13.1GB/s ± 1% +24.47% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.50/from=int64/to=int32/safe=false-8 10.3GB/s ± 7% 13.1GB/s ± 1% +26.95% (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.90/from=int64/to=int32/safe=false-8 10.5GB/s ± 0% 13.1GB/s ± 1% +24.11% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=1.00/from=int64/to=int32/safe=false-8 10.3GB/s ± 7% 13.0GB/s ± 0% +26.81% (p=0.016 n=5+4)
Casting/sz=65536/nullprob=0.00/from=uint32/to=int32/safe=true-8 2.06GB/s ± 1% 2.37GB/s ± 1% +14.99% (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.10/from=uint32/to=int32/safe=true-8 1.34GB/s ± 1% 1.44GB/s ± 1% +7.89% (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.50/from=uint32/to=int32/safe=true-8 674MB/s ± 0% 698MB/s ± 0% +3.58% (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.90/from=uint32/to=int32/safe=true-8 1.44GB/s ± 1% 1.57GB/s ± 1% +8.98% (p=0.008 n=5+5)
Casting/sz=65536/nullprob=1.00/from=uint32/to=int32/safe=true-8 5.05GB/s ± 3% 6.97GB/s ± 0% +37.95% (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.00/from=int64/to=float64/safe=true-8 3.46GB/s ± 1% 3.76GB/s ± 0% +8.88% (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.10/from=int64/to=float64/safe=true-8 2.40GB/s ± 0% 2.51GB/s ± 1% +4.77% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=float64/safe=true-8 1.28GB/s ± 1% 1.37GB/s ± 1% +7.00% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=int64/to=float64/safe=true-8 2.58GB/s ± 1% 2.71GB/s ± 1% +5.05% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=int64/to=float64/safe=true-8 6.61GB/s ± 1% 7.21GB/s ± 0% +9.06% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.00/from=int64/to=float64/safe=false-8 7.38GB/s ± 0% 8.04GB/s ± 6% +8.93% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=float64/safe=false-8 6.82GB/s ± 7% 7.48GB/s ± 8% ~ (p=0.056 n=5+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=float64/safe=false-8 6.97GB/s ± 1% 7.61GB/s ± 0% +9.25% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.90/from=int64/to=float64/safe=false-8 6.97GB/s ± 0% 7.65GB/s ± 0% +9.84% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=1.00/from=int64/to=float64/safe=false-8 7.00GB/s ± 0% 7.66GB/s ± 0% +9.48% (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.00/from=float64/to=int32/safe=true-8 5.67GB/s ± 1% 6.82GB/s ± 0% +20.21% (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.10/from=float64/to=int32/safe=true-8 2.90GB/s ± 1% 3.22GB/s ± 1% +11.21% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=float64/to=int32/safe=true-8 1.45GB/s ± 0% 1.56GB/s ± 0% +8.24% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=float64/to=int32/safe=true-8 2.70GB/s ± 1% 3.02GB/s ± 1% +11.78% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=float64/to=int32/safe=true-8 9.21GB/s ± 1% 11.44GB/s ± 8% +24.27% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.00/from=float64/to=int32/safe=false-8 10.7GB/s ± 0% 13.5GB/s ± 7% +26.37% (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=float64/to=int32/safe=false-8 10.0GB/s ± 7% 12.9GB/s ± 9% +29.26% (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=float64/to=int32/safe=false-8 10.1GB/s ± 7% 13.2GB/s ± 1% +30.49% (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.90/from=float64/to=int32/safe=false-8 10.3GB/s ± 1% 12.7GB/s ± 8% +23.12% (p=0.016 n=4+5)
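For readers unfamiliar with how throughput figures like the ones above are produced, here is a minimal sketch of a Go casting benchmark that reports bytes per second via `b.SetBytes`. It is an illustration only: `castInt64ToInt32` and `benchmarkCast` are hypothetical names, the loop is a plain scalar stand-in rather than the Neon kernel or the Arrow compute API, the element count of 4096 is an assumption, and null handling (the `nullprob` dimension) is left out.

```go
package main

import (
	"fmt"
	"math"
	"testing"
)

// castInt64ToInt32 is a hypothetical scalar stand-in for the kernel under
// test; it is NOT the Arrow or Neon implementation. With safe=true it
// returns an error for values that do not fit in int32; with safe=false
// it truncates silently.
func castInt64ToInt32(in []int64, out []int32, safe bool) error {
	if len(out) < len(in) {
		return fmt.Errorf("output too small: %d < %d", len(out), len(in))
	}
	for i, v := range in {
		if safe && (v < math.MinInt32 || v > math.MaxInt32) {
			return fmt.Errorf("value %d at index %d overflows int32", v, i)
		}
		out[i] = int32(v)
	}
	return nil
}

// benchmarkCast (hypothetical helper) times the cast over n elements.
// b.SetBytes records the input bytes processed per iteration so that the
// result includes a throughput figure.
func benchmarkCast(n int, safe bool) testing.BenchmarkResult {
	in := make([]int64, n)
	for i := range in {
		in[i] = int64(i)
	}
	out := make([]int32, n)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n) * 8) // 8 bytes per int64 input value
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if err := castInt64ToInt32(in, out, safe); err != nil {
				b.Fatal(err)
			}
		}
	})
}

func main() {
	const n = 4096 // assumed element count, chosen for illustration
	for _, safe := range []bool{true, false} {
		r := benchmarkCast(n, safe)
		fmt.Printf("from=int64/to=int32/safe=%t\t%s\n", safe, r.String())
	}
}
```

In a real benchmark suite this would normally live in a `_test.go` file and run under `go test -bench`, with old and new runs compared using benchstat; `testing.Benchmark` is used here only so the sketch runs as a standalone program.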
CC @guyuqi I figured out how to make it work! :smile:
Benchmark runs are scheduled for baseline = 959a9d5deec05f5767be583e9c7bb6b2c1875887 and contender = f3327d2c37c375abdcd6299d4ea2cdbdcbc4cb62. f3327d2c37c375abdcd6299d4ea2cdbdcbc4cb62 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0%] test-mac-arm
[Failed :arrow_down:0.27% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.21% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f3327d2c ec2-t3-xlarge-us-east-2
[Failed] f3327d2c test-mac-arm
[Failed] f3327d2c ursa-i9-9960x
[Finished] f3327d2c ursa-thinkcentre-m75q
[Finished] 959a9d5d ec2-t3-xlarge-us-east-2
[Failed] 959a9d5d test-mac-arm
[Failed] 959a9d5d ursa-i9-9960x
[Finished] 959a9d5d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java