arrow ARROW-18010: [Go] Add ARM64 Neon impl for Casting

This also adds Casting benchmarks, which I ran on Ursa's ARM64 Mac mini. The Neon implementation shows the following improvements:

name                                                               old speed      new speed       delta
Casting/sz=32768/nullprob=0.00/from=int64/to=int32/safe=true-8     3.86GB/s ± 0%   4.29GB/s ± 3%  +11.34%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=int32/safe=true-8     2.60GB/s ± 0%   2.80GB/s ± 0%   +7.83%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=int32/safe=true-8     1.34GB/s ± 0%   1.46GB/s ± 0%   +8.33%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=int64/to=int32/safe=true-8     2.82GB/s ± 0%   3.05GB/s ± 1%   +8.03%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=int64/to=int32/safe=true-8     9.05GB/s ± 4%  11.52GB/s ± 8%  +27.22%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.00/from=int64/to=int32/safe=false-8    10.8GB/s ± 0%   13.3GB/s ± 7%  +23.12%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=int32/safe=false-8    10.5GB/s ± 0%   13.1GB/s ± 1%  +24.47%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.50/from=int64/to=int32/safe=false-8    10.3GB/s ± 7%   13.1GB/s ± 1%  +26.95%  (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.90/from=int64/to=int32/safe=false-8    10.5GB/s ± 0%   13.1GB/s ± 1%  +24.11%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=1.00/from=int64/to=int32/safe=false-8    10.3GB/s ± 7%   13.0GB/s ± 0%  +26.81%  (p=0.016 n=5+4)
Casting/sz=65536/nullprob=0.00/from=uint32/to=int32/safe=true-8    2.06GB/s ± 1%   2.37GB/s ± 1%  +14.99%  (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.10/from=uint32/to=int32/safe=true-8    1.34GB/s ± 1%   1.44GB/s ± 1%   +7.89%  (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.50/from=uint32/to=int32/safe=true-8     674MB/s ± 0%    698MB/s ± 0%   +3.58%  (p=0.008 n=5+5)
Casting/sz=65536/nullprob=0.90/from=uint32/to=int32/safe=true-8    1.44GB/s ± 1%   1.57GB/s ± 1%   +8.98%  (p=0.008 n=5+5)
Casting/sz=65536/nullprob=1.00/from=uint32/to=int32/safe=true-8    5.05GB/s ± 3%   6.97GB/s ± 0%  +37.95%  (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.00/from=int64/to=float64/safe=true-8   3.46GB/s ± 1%   3.76GB/s ± 0%   +8.88%  (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.10/from=int64/to=float64/safe=true-8   2.40GB/s ± 0%   2.51GB/s ± 1%   +4.77%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=float64/safe=true-8   1.28GB/s ± 1%   1.37GB/s ± 1%   +7.00%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=int64/to=float64/safe=true-8   2.58GB/s ± 1%   2.71GB/s ± 1%   +5.05%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=int64/to=float64/safe=true-8   6.61GB/s ± 1%   7.21GB/s ± 0%   +9.06%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.00/from=int64/to=float64/safe=false-8  7.38GB/s ± 0%   8.04GB/s ± 6%   +8.93%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=int64/to=float64/safe=false-8  6.82GB/s ± 7%   7.48GB/s ± 8%     ~     (p=0.056 n=5+5)
Casting/sz=32768/nullprob=0.50/from=int64/to=float64/safe=false-8  6.97GB/s ± 1%   7.61GB/s ± 0%   +9.25%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.90/from=int64/to=float64/safe=false-8  6.97GB/s ± 0%   7.65GB/s ± 0%   +9.84%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=1.00/from=int64/to=float64/safe=false-8  7.00GB/s ± 0%   7.66GB/s ± 0%   +9.48%  (p=0.029 n=4+4)
Casting/sz=32768/nullprob=0.00/from=float64/to=int32/safe=true-8   5.67GB/s ± 1%   6.82GB/s ± 0%  +20.21%  (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.10/from=float64/to=int32/safe=true-8   2.90GB/s ± 1%   3.22GB/s ± 1%  +11.21%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=float64/to=int32/safe=true-8   1.45GB/s ± 0%   1.56GB/s ± 0%   +8.24%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.90/from=float64/to=int32/safe=true-8   2.70GB/s ± 1%   3.02GB/s ± 1%  +11.78%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=1.00/from=float64/to=int32/safe=true-8   9.21GB/s ± 1%  11.44GB/s ± 8%  +24.27%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.00/from=float64/to=int32/safe=false-8  10.7GB/s ± 0%   13.5GB/s ± 7%  +26.37%  (p=0.016 n=4+5)
Casting/sz=32768/nullprob=0.10/from=float64/to=int32/safe=false-8  10.0GB/s ± 7%   12.9GB/s ± 9%  +29.26%  (p=0.008 n=5+5)
Casting/sz=32768/nullprob=0.50/from=float64/to=int32/safe=false-8  10.1GB/s ± 7%   13.2GB/s ± 1%  +30.49%  (p=0.016 n=5+4)
Casting/sz=32768/nullprob=0.90/from=float64/to=int32/safe=false-8  10.3GB/s ± 1%   12.7GB/s ± 8%  +23.12%  (p=0.016 n=4+5)
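
For readers unfamiliar with the `safe=true` / `safe=false` split in the benchmark names, here is a minimal scalar sketch of the difference, assuming that a "safe" cast range-checks each value and an "unsafe" cast simply truncates. The function names are hypothetical and this is not the code from this PR or the Neon kernel; it only illustrates why the checked path could be expected to run slower.

```go
package main

import (
	"fmt"
	"math"
)

// castInt64ToInt32Unsafe is a hypothetical helper: it truncates every value
// to 32 bits without any range check. out must be at least as long as in.
func castInt64ToInt32Unsafe(in []int64, out []int32) {
	for i, v := range in {
		out[i] = int32(v)
	}
}

// castInt64ToInt32Safe is a hypothetical helper: it stops with an error at
// the first value that does not fit in an int32. The extra comparison per
// element is the assumed cost of the "safe" mode.
func castInt64ToInt32Safe(in []int64, out []int32) error {
	for i, v := range in {
		if v < math.MinInt32 || v > math.MaxInt32 {
			return fmt.Errorf("value %d at index %d overflows int32", v, i)
		}
		out[i] = int32(v)
	}
	return nil
}

func main() {
	in := []int64{1, -2, 1 << 40}
	out := make([]int32, len(in))

	castInt64ToInt32Unsafe(in, out)
	fmt.Println("unsafe:", out)

	err := castInt64ToInt32Safe(in, out)
	fmt.Println("safe:", err)
}
```

The sketch ignores null handling entirely, so it says nothing about the `nullprob` dimension of the benchmarks.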

Oct 12 '22 16:10 zeroshade

CC @guyuqi I figured out how to make it work! :smile:

Oct 12 '22 16:10 zeroshade

Benchmark runs are scheduled for baseline = 959a9d5deec05f5767be583e9c7bb6b2c1875887 and contender = f3327d2c37c375abdcd6299d4ea2cdbdcbc4cb62. f3327d2c37c375abdcd6299d4ea2cdbdcbc4cb62 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0%] test-mac-arm
[Failed :arrow_down:0.27% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.21% :arrow_up:0.0%] ursa-thinkcentre-m75q

Buildkite builds:
[Finished] f3327d2c ec2-t3-xlarge-us-east-2
[Failed] f3327d2c test-mac-arm
[Failed] f3327d2c ursa-i9-9960x
[Finished] f3327d2c ursa-thinkcentre-m75q
[Finished] 959a9d5d ec2-t3-xlarge-us-east-2
[Failed] 959a9d5d test-mac-arm
[Failed] 959a9d5d ursa-i9-9960x
[Finished] 959a9d5d ursa-thinkcentre-m75q

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Oct 14 '22 17:10 ursabot