Added scaffolding for Oryon arch as in Snapdragon X Elite
All I know is that this builds and works fine with clangarm64 on my laptop. Unsure about performance improvement, but certainly no performance regression.
I am not an assembly wizard, so this still uses the neoverse kernels. I imagine there is much optimization to be had. Feel free to edit if I missed a spot.
https://www.hwcooling.net/en/oryon-arm-core-in-snapdragon-x-cpus-architecture-analysis/ for cache reference
Thanks - do you get markedly better performance with this change, compared to the default approach in 0.3.30 of autodetecting this CPU as a regular NEOVERSEN1? I would prefer to avoid the code and library size explosion from adding any and all arm64 design variants, so unless the exact model-specific cost tables make a serious difference to the compiler output, I'd like to avoid mere duplication.
I need to do some benchmarking, so I'll report back on that. I have to imagine the significant difference in cache layout here is going to do something.
Extremely unscientific runthrough:
Stock Rblas.dll:
& "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : 1863.67 MFlops 0.030000 sec
256x256 : 8945.61 MFlops 0.050000 sec
384x384 : 10063.81 MFlops 0.150000 sec
512x512 : 7613.29 MFlops 0.470000 sec
640x640 : 10277.59 MFlops 0.680000 sec
768x768 : 11286.52 MFlops 1.070000 sec
896x896 : 11911.28 MFlops 1.610000 sec
1024x1024 : 10641.62 MFlops 2.690000 sec
1152x1152 : 12426.35 MFlops 3.280000 sec
1280x1280 : 12424.46 MFlops 4.500000 sec
1408x1408 : 12677.39 MFlops 5.870000 sec
1536x1536 : 11379.58 MFlops 8.490000 sec
1664x1664 : 11822.37 MFlops 10.390000 sec
1792x1792 : 12099.15 MFlops 12.680000 sec
1920x1920 : 12672.70 MFlops 14.890000 sec
2048x2048 : 11780.23 MFlops 19.440000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : 209.72 MFlops 0.020000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : Inf MFlops 0.000000 sec
512x512 : 13421.77 MFlops 0.020000 sec
640x640 : 10485.76 MFlops 0.050000 sec
768x768 : 12942.42 MFlops 0.070000 sec
896x896 : 11988.72 MFlops 0.120000 sec
1024x1024 : 11302.55 MFlops 0.190000 sec
1152x1152 : 10920.17 MFlops 0.280000 sec
1280x1280 : 10754.63 MFlops 0.390000 sec
1408x1408 : 10150.22 MFlops 0.550000 sec
1536x1536 : 9663.68 MFlops 0.750000 sec
1664x1664 : 9803.07 MFlops 0.940000 sec
1792x1792 : 10185.11 MFlops 1.130000 sec
1920x1920 : 10039.56 MFlops 1.410000 sec
2048x2048 : 10475.53 MFlops 1.640000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : 2236.96 MFlops 0.020000 sec
384x384 : 5033.16 MFlops 0.030000 sec
512x512 : 5113.06 MFlops 0.070000 sec
640x640 : 4993.22 MFlops 0.140000 sec
768x768 : 5252.00 MFlops 0.230000 sec
896x896 : 5184.31 MFlops 0.370000 sec
1024x1024 : 4936.74 MFlops 0.580000 sec
1152x1152 : 5226.75 MFlops 0.780000 sec
1280x1280 : 5038.20 MFlops 1.110000 sec
1408x1408 : 5133.44 MFlops 1.450000 sec
1536x1536 : 4831.84 MFlops 2.000000 sec
1664x1664 : 4818.24 MFlops 2.550000 sec
1792x1792 : 4721.71 MFlops 3.250000 sec
1920x1920 : 4683.47 MFlops 4.030000 sec
2048x2048 : 4379.83 MFlops 5.230000 sec
OpenBLAS 0.3.30.dev:
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : 2795.50 MFlops 0.020000 sec
256x256 : 7454.68 MFlops 0.060000 sec
384x384 : 13723.38 MFlops 0.110000 sec
512x512 : 8132.37 MFlops 0.440000 sec
640x640 : 19413.22 MFlops 0.360000 sec
768x768 : 20468.77 MFlops 0.590000 sec
896x896 : 27010.08 MFlops 0.710000 sec
1024x1024 : 17039.26 MFlops 1.680000 sec
1152x1152 : 38451.36 MFlops 1.060000 sec
1280x1280 : 36071.01 MFlops 1.550000 sec
1408x1408 : 41806.91 MFlops 1.780000 sec
1536x1536 : 28249.30 MFlops 3.420000 sec
1664x1664 : 46528.19 MFlops 2.640000 sec
1792x1792 : 44858.84 MFlops 3.420000 sec
1920x1920 : 51556.42 MFlops 3.660000 sec
2048x2048 : 32073.90 MFlops 7.140000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : 5662.31 MFlops 0.020000 sec
512x512 : Inf MFlops 0.000000 sec
640x640 : Inf MFlops 0.000000 sec
768x768 : 45298.48 MFlops 0.020000 sec
896x896 : Inf MFlops 0.000000 sec
1024x1024 : Inf MFlops 0.000000 sec
1152x1152 : 152882.38 MFlops 0.020000 sec
1280x1280 : 419430.40 MFlops 0.010000 sec
1408x1408 : 279130.93 MFlops 0.020000 sec
1536x1536 : 241591.91 MFlops 0.030000 sec
1664x1664 : 230372.15 MFlops 0.040000 sec
1792x1792 : 287729.25 MFlops 0.040000 sec
1920x1920 : 283115.52 MFlops 0.050000 sec
2048x2048 : 343597.38 MFlops 0.050000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : 2236.96 MFlops 0.020000 sec
384x384 : Inf MFlops 0.000000 sec
512x512 : 17895.70 MFlops 0.020000 sec
640x640 : 34952.53 MFlops 0.020000 sec
768x768 : 120795.96 MFlops 0.010000 sec
896x896 : 95909.75 MFlops 0.020000 sec
1024x1024 : 57266.23 MFlops 0.050000 sec
1152x1152 : 58240.91 MFlops 0.070000 sec
1280x1280 : 111848.11 MFlops 0.050000 sec
1408x1408 : 93043.64 MFlops 0.080000 sec
1536x1536 : 87851.60 MFlops 0.110000 sec
1664x1664 : 122865.15 MFlops 0.100000 sec
1792x1792 : 127879.67 MFlops 0.120000 sec
1920x1920 : 134816.91 MFlops 0.140000 sec
2048x2048 : 95443.72 MFlops 0.240000 sec
OpenBLAS NeoverseN1 kernel w/ oryon cache sizes:
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : 14909.35 MFlops 0.030000 sec
384x384 : 6038.29 MFlops 0.250000 sec
512x512 : 7454.68 MFlops 0.480000 sec
640x640 : 17471.90 MFlops 0.400000 sec
768x768 : 22785.99 MFlops 0.530000 sec
896x896 : 29964.30 MFlops 0.640000 sec
1024x1024 : 16643.00 MFlops 1.720000 sec
1152x1152 : 35442.12 MFlops 1.150000 sec
1280x1280 : 33279.80 MFlops 1.680000 sec
1408x1408 : 38758.49 MFlops 1.920000 sec
1536x1536 : 26182.28 MFlops 3.690000 sec
1664x1664 : 46705.11 MFlops 2.630000 sec
1792x1792 : 44085.41 MFlops 3.480000 sec
1920x1920 : 51415.94 MFlops 3.670000 sec
2048x2048 : 31242.52 MFlops 7.330000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : 11324.62 MFlops 0.010000 sec
512x512 : Inf MFlops 0.000000 sec
640x640 : 26214.40 MFlops 0.020000 sec
768x768 : Inf MFlops 0.000000 sec
896x896 : Inf MFlops 0.000000 sec
1024x1024 : 107374.18 MFlops 0.020000 sec
1152x1152 : 152882.38 MFlops 0.020000 sec
1280x1280 : 209715.20 MFlops 0.020000 sec
1408x1408 : 279130.93 MFlops 0.020000 sec
1536x1536 : 241591.91 MFlops 0.030000 sec
1664x1664 : 307162.86 MFlops 0.030000 sec
1792x1792 : 383639.01 MFlops 0.030000 sec
1920x1920 : 471859.20 MFlops 0.030000 sec
2048x2048 : 245426.70 MFlops 0.070000 sec
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : Inf MFlops 0.000000 sec
512x512 : Inf MFlops 0.000000 sec
640x640 : Inf MFlops 0.000000 sec
768x768 : 120795.96 MFlops 0.010000 sec
896x896 : 95909.75 MFlops 0.020000 sec
1024x1024 : 57266.23 MFlops 0.050000 sec
1152x1152 : 81537.27 MFlops 0.050000 sec
1280x1280 : 93206.76 MFlops 0.060000 sec
1408x1408 : 124058.19 MFlops 0.060000 sec
1536x1536 : 80530.64 MFlops 0.120000 sec
1664x1664 : 122865.15 MFlops 0.100000 sec
1792x1792 : 109611.14 MFlops 0.140000 sec
1920x1920 : 125829.12 MFlops 0.150000 sec
2048x2048 : 88101.89 MFlops 0.260000 sec
Gonna be completely honest here: I can't quite tell. Looks like there are some sizes for which it performs better and some for which it is worse.
Any recs for drilling down a bit deeper?
edit: just saw the openblas_loops setting, bear with
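A note on reading the tables above: the "Inf MFlops" rows look like a timer-resolution artifact rather than a real result. Here is a minimal sketch, assuming the benchmark counts 2*n^3 flops per n-by-n DGEMM call. That count reproduces the logged dgemm numbers, but the function name and the explicit zero check are illustrative and not taken from the R scripts.

```python
def dgemm_mflops(n, loops, seconds):
    """MFlops for `loops` n-by-n DGEMM calls, assuming 2*n^3 flops per call."""
    flops = 2.0 * n**3 * loops
    if seconds == 0.0:
        # A run shorter than the timer resolution is reported as 0 sec,
        # so the rate is undefined; the logs show this as "Inf".
        return float("inf")
    return flops / seconds * 1e-6


# Times copied from the stock Rblas dgemm log above (Loops=1).
print(round(dgemm_mflops(2048, 1, 1.64), 2))
print(dgemm_mflops(256, 1, 0.0))
```

Raising the loop count stretches each measurement, so fewer rows should fall below the timer resolution.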
0.3.30.dev
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : 6988.76 MFlops 0.080000 sec
256x256 : 9516.61 MFlops 0.470000 sec
384x384 : 12902.32 MFlops 1.170000 sec
512x512 : 8580.92 MFlops 4.170000 sec
640x640 : 21503.87 MFlops 3.250000 sec
768x768 : 25159.53 MFlops 4.800000 sec
896x896 : 31697.78 MFlops 6.050000 sec
1024x1024 : 18209.90 MFlops 15.720000 sec
1152x1152 : 40475.12 MFlops 10.070000 sec
1280x1280 : 36187.75 MFlops 15.450000 sec
1408x1408 : 39269.82 MFlops 18.950000 sec
1536x1536 : 26629.71 MFlops 36.280000 sec
1664x1664 : 46178.36 MFlops 26.600000 sec
1792x1792 : 45282.54 MFlops 33.880000 sec
1920x1920 : 51683.51 MFlops 36.510000 sec
2048x2048 : 31657.13 MFlops 72.340000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : 1048.58 MFlops 0.040000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : Inf MFlops 0.000000 sec
512x512 : 268435.46 MFlops 0.010000 sec
640x640 : 524288.00 MFlops 0.010000 sec
768x768 : 181193.93 MFlops 0.050000 sec
896x896 : 239774.38 MFlops 0.060000 sec
1024x1024 : 214748.36 MFlops 0.100000 sec
1152x1152 : 277967.97 MFlops 0.110000 sec
1280x1280 : 299593.14 MFlops 0.140000 sec
1408x1408 : 293822.03 MFlops 0.190000 sec
1536x1536 : 301989.89 MFlops 0.240000 sec
1664x1664 : 307162.86 MFlops 0.300000 sec
1792x1792 : 287729.25 MFlops 0.400000 sec
1920x1920 : 307734.26 MFlops 0.460000 sec
2048x2048 : 330382.10 MFlops 0.520000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : 5592.41 MFlops 0.010000 sec
256x256 : 44739.24 MFlops 0.010000 sec
384x384 : 50331.65 MFlops 0.030000 sec
512x512 : 39768.22 MFlops 0.090000 sec
640x640 : 63550.06 MFlops 0.110000 sec
768x768 : 71056.44 MFlops 0.170000 sec
896x896 : 79924.79 MFlops 0.240000 sec
1024x1024 : 60921.52 MFlops 0.470000 sec
1152x1152 : 86741.78 MFlops 0.470000 sec
1280x1280 : 91678.78 MFlops 0.610000 sec
1408x1408 : 99246.55 MFlops 0.750000 sec
1536x1536 : 82595.52 MFlops 1.170000 sec
1664x1664 : 110689.32 MFlops 1.110000 sec
1792x1792 : 117141.68 MFlops 1.310000 sec
1920x1920 : 124173.47 MFlops 1.520000 sec
2048x2048 : 98311.13 MFlops 2.330000 sec
Oryon-modded cache size
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : 6988.76 MFlops 0.080000 sec
256x256 : 10909.28 MFlops 0.410000 sec
384x384 : 12792.98 MFlops 1.180000 sec
512x512 : 8813.41 MFlops 4.060000 sec
640x640 : 19967.88 MFlops 3.500000 sec
768x768 : 23449.66 MFlops 5.150000 sec
896x896 : 30058.24 MFlops 6.380000 sec
1024x1024 : 18026.42 MFlops 15.880000 sec
1152x1152 : 35878.91 MFlops 11.360000 sec
1280x1280 : 34049.98 MFlops 16.420000 sec
1408x1408 : 29910.09 MFlops 24.880000 sec
1536x1536 : 28052.44 MFlops 34.440000 sec
1664x1664 : 48455.40 MFlops 25.350000 sec
1792x1792 : 47118.32 MFlops 32.560000 sec
1920x1920 : 52989.75 MFlops 35.610000 sec
2048x2048 : 32345.71 MFlops 70.800000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : Inf MFlops 0.000000 sec
384x384 : 37748.74 MFlops 0.030000 sec
512x512 : 134217.73 MFlops 0.020000 sec
640x640 : 174762.67 MFlops 0.030000 sec
768x768 : 301989.89 MFlops 0.030000 sec
896x896 : 205520.90 MFlops 0.070000 sec
1024x1024 : 238609.29 MFlops 0.090000 sec
1152x1152 : 277967.97 MFlops 0.110000 sec
1280x1280 : 322638.77 MFlops 0.130000 sec
1408x1408 : 310145.48 MFlops 0.180000 sec
1536x1536 : 315119.88 MFlops 0.230000 sec
1664x1664 : 317754.69 MFlops 0.290000 sec
1792x1792 : 280711.47 MFlops 0.410000 sec
1920x1920 : 314572.80 MFlops 0.450000 sec
2048x2048 : 272696.34 MFlops 0.630000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : 22369.62 MFlops 0.020000 sec
384x384 : 50331.65 MFlops 0.030000 sec
512x512 : 35791.39 MFlops 0.100000 sec
640x640 : 63550.06 MFlops 0.110000 sec
768x768 : 71056.44 MFlops 0.170000 sec
896x896 : 79924.79 MFlops 0.240000 sec
1024x1024 : 60921.52 MFlops 0.470000 sec
1152x1152 : 90596.97 MFlops 0.450000 sec
1280x1280 : 94786.53 MFlops 0.590000 sec
1408x1408 : 103381.83 MFlops 0.720000 sec
1536x1536 : 84031.97 MFlops 1.150000 sec
1664x1664 : 120456.02 MFlops 1.020000 sec
1792x1792 : 118042.77 MFlops 1.300000 sec
1920x1920 : 123361.88 MFlops 1.530000 sec
2048x2048 : 93495.89 MFlops 2.450000 sec
I think there's definitely something here, judging by the decent improvement at certain matrix sizes, but this is not it judging by the degraded performance at other matrix sizes.
It may be worth having it as a full clone of Neoverse N1 (i.e. removing the cache changes I made here) pending further investigation.
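To see which sizes improve and which degrade without eyeballing the tables, the pasted logs can be compared row by row. This is a rough sketch: the regular expression is written against the row format shown in this thread, and the helper names are made up for illustration.

```python
import re

# Matches rows such as "1408x1408 : 39269.82 MFlops 18.950000 sec".
ROW = re.compile(r"^\s*(\d+)x\d+\s*:\s*(\S+)\s+MFlops\s+(\S+)\s+sec")


def parse_rows(text):
    """Return {n: (mflops, seconds)} for every benchmark row in `text`."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m.group(1))] = (float(m.group(2)), float(m.group(3)))
    return rows


def percent_change(baseline, candidate):
    """Per-size MFlops change in percent, skipping rows timed at 0 sec."""
    out = {}
    for n in sorted(baseline.keys() & candidate.keys()):
        (base_mf, base_t), (cand_mf, cand_t) = baseline[n], candidate[n]
        if base_t == 0.0 or cand_t == 0.0:
            continue  # "Inf MFlops" rows carry no information
        out[n] = (cand_mf / base_mf - 1.0) * 100.0
    return out


# Two deig rows from each Loops=10 run above.
n1 = parse_rows("""
1408x1408 : 39269.82 MFlops 18.950000 sec
1664x1664 : 46178.36 MFlops 26.600000 sec
""")
tuned = parse_rows("""
1408x1408 : 29910.09 MFlops 24.880000 sec
1664x1664 : 48455.40 MFlops 25.350000 sec
""")
for n, pct in percent_change(n1, tuned).items():
    print(f"{n}: {pct:+.1f}%")
```

Feeding in the full logs instead of two rows gives the complete per-size picture for each benchmark.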
.....I had an idea.
This is an 8-wide chip, neoverse is 5-wide.
I wonder what happens if I run the VORTEX target (which is 7-wide and should be otherwise compatible).
Because I get the feeling the optimization here isn't so much in the cache definitions as it is in the kernels.
Scratch that, it would do nothing, as there's no difference.
Yes, right now VORTEX is also just ARMV8 with a bunch of NEOVERSEN1 kernels on top. Without dedicated kernels, I think the easiest fix would be to put the proper L1 and L2 cache sizes in cpuid_arm64.c when we're on Windows, to guide the block sizes for GEMM etc.
Unless the cost tables etc. requested by -mcpu=oryon have a dramatic influence on compilation - but I don't expect that, given that it should mainly affect the generic C parts of the (setup and interfacing) code
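To illustrate why the cache sizes recorded at detection time matter for GEMM blocking, here is a textbook-style sketch. It is not OpenBLAS's actual parameter selection. It picks the largest square tile such that three double-precision tiles (one each of A, B and C) fit in a given cache. The function name is hypothetical, and the 64 KiB and 96 KiB L1D figures for Neoverse N1 and Oryon are assumptions to check against the article linked above.

```python
import math


def max_square_tile(cache_bytes, elem_bytes=8, tiles=3):
    """Largest b such that `tiles` b-by-b blocks of elements fit in the cache."""
    return math.isqrt(cache_bytes // (tiles * elem_bytes))


# L1D sizes below are assumed values, used only for illustration.
for name, l1d_kib in (("N1 (assumed 64 KiB L1D)", 64),
                      ("Oryon (assumed 96 KiB L1D)", 96)):
    print(name, max_square_tile(l1d_kib * 1024))
```

A larger cache permits a larger tile, which is the sense in which correct sizes can guide the block sizes. Real kernels also have to account for register blocking and the other cache levels.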
Yeah, and even if there is optimization here (and there almost certainly is), I don't even know that the cache sizes are an improvement.
Probably needs larger loops to get more stable benchmark results. I do have an Oryon system on loan from Qualcomm, it's just that I'm away from it at the moment but I'll try to run some experiments myself when I have more time for OpenBLAS again - hopefully soon
Oh, I can absolutely just run them on my laptop. How large are we talking?
I'd guess a hundred instead of ten should help
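The reason more loops should help can be sketched with some arithmetic. Assuming the reported times are rounded to 0.01 sec (an inference from the logs, where every time is a multiple of 0.01), the rounding alone contributes a relative uncertainty of roughly resolution / elapsed. The helper name is made up for illustration.

```python
def timer_uncertainty(seconds, resolution=0.01):
    """Relative uncertainty from timer rounding; infinite for a 0 sec reading."""
    if seconds <= 0.0:
        return float("inf")
    return resolution / seconds


# Elapsed times copied from dgemm rows posted in this thread.
for label, t in (("384x384, Loops=10", 0.03),
                 ("2048x2048, Loops=10", 0.52),
                 ("2048x2048, Loops=100", 33.78)):
    print(f"{label}: +/- {timer_uncertainty(t) * 100:.2f}%")
```

This only covers timer rounding. Thermal throttling and scheduling noise come on top and are not modeled here.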
Will report back. With bonus ArmPL for comparison.
So, it turns out the issue was mostly that running BLAS on 12 cores well exceeds the heat capacity of my laptop. Fixed that one. Anyway:
Seems that there's a thousand to a few thousand megaflops difference in favor of the cache-tuned build at all sizes, which is more what I would have expected. Funnily enough, ArmPL seems to be on par with the n1 build and similarly behind the tuned build. Guess that does make sense, they did optimize for their own cores. Do we know if QC has an optimized implementation?
N1
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 10165.47 MFlops 0.550000 sec
256x256 : 17540.41 MFlops 2.550000 sec
384x384 : 25328.39 MFlops 5.960000 sec
512x512 : 13662.64 MFlops 26.190000 sec
640x640 : 31047.35 MFlops 22.510000 sec
768x768 : 33140.99 MFlops 36.440000 sec
896x896 : 40733.12 MFlops 47.080000 sec
1024x1024 : 27487.96 MFlops 104.140000 sec
1152x1152 : 46153.82 MFlops 88.310000 sec
1280x1280 : 45388.92 MFlops 123.180000 sec
1408x1408 : 48692.21 MFlops 152.830000 sec
1536x1536 : 37665.73 MFlops 256.500000 sec
1664x1664 : 51015.21 MFlops 240.780000 sec
1792x1792 : 49826.97 MFlops 307.900000 sec
1920x1920 : 51755.81 MFlops 364.590000 sec
2048x2048 : 38978.70 MFlops 587.520000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 20971.52 MFlops 0.020000 sec
256x256 : 41943.04 MFlops 0.080000 sec
384x384 : 47185.92 MFlops 0.240000 sec
512x512 : 48806.45 MFlops 0.550000 sec
640x640 : 49932.19 MFlops 1.050000 sec
768x768 : 52067.22 MFlops 1.740000 sec
896x896 : 51749.87 MFlops 2.780000 sec
1024x1024 : 50174.85 MFlops 4.280000 sec
1152x1152 : 50539.63 MFlops 6.050000 sec
1280x1280 : 50655.85 MFlops 8.280000 sec
1408x1408 : 50613.04 MFlops 11.030000 sec
1536x1536 : 50719.09 MFlops 14.290000 sec
1664x1664 : 50967.29 MFlops 18.080000 sec
1792x1792 : 50835.56 MFlops 22.640000 sec
1920x1920 : 51159.29 MFlops 27.670000 sec
2048x2048 : 50858.11 MFlops 33.780000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 11184.81 MFlops 0.050000 sec
256x256 : 26317.20 MFlops 0.170000 sec
384x384 : 30198.99 MFlops 0.500000 sec
512x512 : 25205.21 MFlops 1.420000 sec
640x640 : 35848.75 MFlops 1.950000 sec
768x768 : 36940.66 MFlops 3.270000 sec
896x896 : 38595.47 MFlops 4.970000 sec
1024x1024 : 32135.93 MFlops 8.910000 sec
1152x1152 : 39542.81 MFlops 10.310000 sec
1280x1280 : 39550.25 MFlops 14.140000 sec
1408x1408 : 40741.61 MFlops 18.270000 sec
1536x1536 : 36315.96 MFlops 26.610000 sec
1664x1664 : 41663.32 MFlops 29.490000 sec
1792x1792 : 41373.85 MFlops 37.090000 sec
1920x1920 : 42490.70 MFlops 44.420000 sec
2048x2048 : 38126.65 MFlops 60.080000 sec
Oryon
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 10751.94 MFlops 0.520000 sec
256x256 : 19617.57 MFlops 2.280000 sec
384x384 : 27101.83 MFlops 5.570000 sec
512x512 : 14878.36 MFlops 24.050000 sec
640x640 : 31019.79 MFlops 22.530000 sec
768x768 : 32568.97 MFlops 37.080000 sec
896x896 : 40672.65 MFlops 47.150000 sec
1024x1024 : 28059.16 MFlops 102.020000 sec
1152x1152 : 45749.74 MFlops 89.090000 sec
1280x1280 : 44817.69 MFlops 124.750000 sec
1408x1408 : 48803.98 MFlops 152.480000 sec
1536x1536 : 37978.15 MFlops 254.390000 sec
1664x1664 : 50716.11 MFlops 242.200000 sec
1792x1792 : 49471.88 MFlops 310.110000 sec
1920x1920 : 52463.78 MFlops 359.670000 sec
2048x2048 : 39061.81 MFlops 586.270000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 20971.52 MFlops 0.020000 sec
256x256 : 41943.04 MFlops 0.080000 sec
384x384 : 53926.77 MFlops 0.210000 sec
512x512 : 45497.53 MFlops 0.590000 sec
640x640 : 48998.88 MFlops 1.070000 sec
768x768 : 53607.67 MFlops 1.690000 sec
896x896 : 54084.45 MFlops 2.660000 sec
1024x1024 : 52377.65 MFlops 4.100000 sec
1152x1152 : 52627.33 MFlops 5.810000 sec
1280x1280 : 52626.15 MFlops 7.970000 sec
1408x1408 : 52369.78 MFlops 10.660000 sec
1536x1536 : 51769.70 MFlops 14.000000 sec
1664x1664 : 51914.85 MFlops 17.750000 sec
1792x1792 : 51518.22 MFlops 22.340000 sec
1920x1920 : 51569.31 MFlops 27.450000 sec
2048x2048 : 50410.41 MFlops 34.080000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 18641.35 MFlops 0.030000 sec
256x256 : 26317.20 MFlops 0.170000 sec
384x384 : 32126.58 MFlops 0.470000 sec
512x512 : 25935.79 MFlops 1.380000 sec
640x640 : 38199.49 MFlops 1.830000 sec
768x768 : 39092.54 MFlops 3.090000 sec
896x896 : 40899.68 MFlops 4.690000 sec
1024x1024 : 33255.65 MFlops 8.610000 sec
1152x1152 : 40768.63 MFlops 10.000000 sec
1280x1280 : 40554.06 MFlops 13.790000 sec
1408x1408 : 41583.75 MFlops 17.900000 sec
1536x1536 : 37011.40 MFlops 26.110000 sec
1664x1664 : 42178.22 MFlops 29.130000 sec
1792x1792 : 41859.14 MFlops 36.660000 sec
1920x1920 : 42385.74 MFlops 44.530000 sec
2048x2048 : 38107.62 MFlops 60.110000 sec
ArmPL for comparison
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 10549.07 MFlops 0.530000 sec
256x256 : 19114.55 MFlops 2.340000 sec
384x384 : 26253.43 MFlops 5.750000 sec
512x512 : 14971.73 MFlops 23.900000 sec
640x640 : 31102.62 MFlops 22.470000 sec
768x768 : 29455.06 MFlops 41.000000 sec
896x896 : 35925.73 MFlops 53.380000 sec
1024x1024 : 24104.04 MFlops 118.760000 sec
1152x1152 : 40918.02 MFlops 99.610000 sec
1280x1280 : 42426.83 MFlops 131.780000 sec
1408x1408 : 48951.66 MFlops 152.020000 sec
1536x1536 : 40124.85 MFlops 240.780000 sec
1664x1664 : 55783.12 MFlops 220.200000 sec
1792x1792 : 53676.17 MFlops 285.820000 sec
1920x1920 : 51701.92 MFlops 364.970000 sec
2048x2048 : 38894.62 MFlops 588.790000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : Inf MFlops 0.000000 sec
256x256 : 41943.04 MFlops 0.080000 sec
384x384 : 47185.92 MFlops 0.240000 sec
512x512 : 51622.20 MFlops 0.520000 sec
640x640 : 54050.31 MFlops 0.970000 sec
768x768 : 54576.49 MFlops 1.660000 sec
896x896 : 52697.67 MFlops 2.730000 sec
1024x1024 : 51871.59 MFlops 4.140000 sec
1152x1152 : 52900.48 MFlops 5.780000 sec
1280x1280 : 53294.84 MFlops 7.870000 sec
1408x1408 : 54624.45 MFlops 10.220000 sec
1536x1536 : 53449.54 MFlops 13.560000 sec
1664x1664 : 55477.94 MFlops 16.610000 sec
1792x1792 : 55279.40 MFlops 20.820000 sec
1920x1920 : 55951.68 MFlops 25.300000 sec
2048x2048 : 54730.39 MFlops 31.390000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
SIZE Flops Time
128x128 : 11184.81 MFlops 0.050000 sec
256x256 : 23546.97 MFlops 0.190000 sec
384x384 : 32126.58 MFlops 0.470000 sec
512x512 : 23091.22 MFlops 1.550000 sec
640x640 : 38199.49 MFlops 1.830000 sec
768x768 : 36828.04 MFlops 3.280000 sec
896x896 : 41973.63 MFlops 4.570000 sec
1024x1024 : 28519.04 MFlops 10.040000 sec
1152x1152 : 41856.91 MFlops 9.740000 sec
1280x1280 : 39217.43 MFlops 14.260000 sec
1408x1408 : 42803.29 MFlops 17.390000 sec
1536x1536 : 34686.56 MFlops 27.860000 sec
1664x1664 : 43050.16 MFlops 28.540000 sec
1792x1792 : 40921.49 MFlops 37.500000 sec
1920x1920 : 43792.04 MFlops 43.100000 sec
2048x2048 : 32144.95 MFlops 71.260000 sec
Hmm. I'm still not that convinced - looks like there is still a lot of noise in the data, and where it looks like there is an improvement from using the correct cache sizes, it is around 2 percent at most?
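One way to put a single number on runs like these is the geometric mean of the per-size ratios. Here is a minimal sketch using the dgemm columns from the Loops=100 "N1" and "Oryon" logs above. Picking dgemm and the geometric mean is an assumption made for illustration, not something settled in this thread.

```python
from statistics import geometric_mean

# dgemm MFlops for sizes 128..2048 (step 128), copied from the Loops=100 logs.
n1 = [20971.52, 41943.04, 47185.92, 48806.45, 49932.19, 52067.22,
      51749.87, 50174.85, 50539.63, 50655.85, 50613.04, 50719.09,
      50967.29, 50835.56, 51159.29, 50858.11]
oryon = [20971.52, 41943.04, 53926.77, 45497.53, 48998.88, 53607.67,
         54084.45, 52377.65, 52627.33, 52626.15, 52369.78, 51769.70,
         51914.85, 51518.22, 51569.31, 50410.41]

# Ratio > 1 means the cache-tuned build was faster at that size.
ratios = [o / b for o, b in zip(oryon, n1)]
speedup = geometric_mean(ratios)
print(f"geometric-mean speedup: {speedup:.4f}")
```

For these numbers the result comes out at roughly 1.02, which is in line with the 2 percent estimate. The per-size ratios still vary a fair amount around that figure.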
By noise do you mean the fluctuating MFlops as size increases? That's actually fairly reproducible.
And yes, around 2%. I think the bottleneck here isn't so much cache locality as much as it is the difference in execution pipeline size (5-wide vs 8-wide).
edit: looking at the block diagrams, it appears the correct way of looking at it is 2 NEON/FP units on the N1 and 4 on Oryon.
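Taking those unit counts at face value, a back-of-the-envelope peak can be sketched as follows. This assumes every NEON/FP unit can issue one 128-bit fused multiply-add per cycle, which may not hold on either core, and it ignores clock speed entirely. The function name is hypothetical.

```python
def dp_flops_per_cycle(fp_units, vector_bits=128):
    """Peak double-precision flops per cycle, assuming one FMA per unit per cycle."""
    lanes = vector_bits // 64      # 64-bit doubles per NEON vector
    return fp_units * lanes * 2    # an FMA counts as two flops


print("N1:   ", dp_flops_per_cycle(2))
print("Oryon:", dp_flops_per_cycle(4))
```

If the assumption holds, kernels scheduled for two FP pipes could leave half of the wider core's FP throughput unused, which would fit the observation that cache sizes alone change little.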
Hi @theAeon
Out of curiosity, are you going to add, modify, or optimize any kernels for this arch in the future?
Unfortunately this is not exactly my strong suit, so while I will take a look, I am... not expecting to, no.
I can add a small hack to the cpu detection code to put the correct cache sizes in the config file, as that bit of performance gain is low-hanging (if fairly small) fruit. But frankly I expect the upcoming X2 Elite cpu with its SVE+SME capability to be a markedly more attractive platform for any kind of numerical workload, and it should be quite adequately covered by the ARMV9SME target already.
That sounds like the way to go.