Andrew Robbins

Results 113 comments of Andrew Robbins

Extremely unscientific runthrough: Stock Rblas.dll:
```
& "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
SIZE Flops Time
128x128...
```

Gonna be completely honest here: I can't quite tell. Looks like there are some sizes for which it performs better and some for which it is worse. Any recs for drilling down...

3.30.0dev
```
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=10
SIZE Flops Time
128x128 : 6988.76 MFlops 0.080000 sec
256x256 : 9516.61 MFlops 0.470000 sec
384x384 :...
```

I think there's definitely _something_ here, judging by the decent improvement at certain matrix sizes, but this is not it, judging by the degraded performance at *other* matrix sizes. May...
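For drilling down on "better at some sizes, worse at others", a per-size ratio between two runs is easier to read than two raw tables. A minimal sketch in Python, assuming rows shaped like the ones in the 3.30.0dev output above; `parse_rows` and `compare` are hypothetical helper names, not part of the benchmark scripts:

```python
import re

# Matches rows shaped like "128x128 : 6988.76 MFlops 0.080000 sec".
ROW = re.compile(r"(\d+)x(\d+)\s*:\s*([\d.]+)\s*MFlops\s*([\d.]+)\s*sec")

def parse_rows(text):
    """Return {size: mflops} for every benchmark row found in text."""
    return {int(m.group(1)): float(m.group(3)) for m in ROW.finditer(text)}

def compare(baseline, candidate):
    """Return {size: candidate / baseline} for sizes present in both runs.

    A ratio above 1.0 means the candidate build was faster at that size.
    """
    return {n: candidate[n] / baseline[n] for n in sorted(baseline) if n in candidate}

# The two complete rows visible in the 3.30.0dev output.
sample = "128x128 : 6988.76 MFlops 0.080000 sec 256x256 : 9516.61 MFlops 0.470000 sec"
print(parse_rows(sample))
```

Feeding the stock and patched outputs through `parse_rows` and then `compare` gives one number per matrix size, which makes the sizes that regress stand out.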

.....I had an idea. This is an 8-wide chip, Neoverse is 5-wide. I wonder what happens if I run the VORTEX target (which is 7-wide and should be otherwise compatible....

Scratch that, it would do nothing, as there's no difference.

Yeah, and even if there is optimization here (and there almost certainly is), I don't even know that the cache sizes are an improvement.

> Probably needs larger loops to get more stable benchmark results.
I do have an Oryon system on loan from Qualcomm, it's just that I'm away from it at the...

> I'd _guess_ a hundred instead of ten should help
Will report back. With bonus ArmPL for comparison.
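To illustrate why more loops help: with only a handful of repetitions a single slow iteration dominates, while the median over many repetitions is much steadier. A rough sketch in pure Python, not the R benchmark scripts; `matmul` and `bench` are hypothetical names, and the 2·n³ operation count is the usual convention for a matrix product, which is an assumption about how the scripts count flops:

```python
import statistics
import time

def matmul(a, b):
    """Naive dense matrix product of two square lists-of-lists."""
    bt = list(zip(*b))  # transpose b so its columns can be iterated as rows
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def bench(n, loops):
    """Time `loops` n-by-n products; return (median MFlops, per-loop times)."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    times = []
    for _ in range(loops):
        t0 = time.perf_counter()
        matmul(a, b)
        times.append(time.perf_counter() - t0)
    median = statistics.median(times)
    # Assumed flop count: 2 * n^3 (one multiply and one add per inner step).
    mflops = 2.0 * n ** 3 / median / 1e6
    return mflops, times

if __name__ == "__main__":
    for loops in (10, 100):
        mflops, times = bench(32, loops)
        spread = max(times) / min(times)
        print(f"loops={loops:3d}  median={mflops:8.2f} MFlops  max/min={spread:.2f}")
```

The `max/min` column shows how noisy individual iterations are; the median is what you would compare between builds.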

So, it turns out the issue was mostly that running BLAS on 12 cores _well_ exceeds the heat capacity of my laptop. Fixed that one. Anyway: Seems that there's a...