Mikhail Ablakatov

16 comments by Mikhail Ablakatov

@kunalspathak @dotnet/arm64-contrib @a74nh

> Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.

@theRealAph, thank...

> > I can re-check and post the performance numbers here per a request.
>
> Please do. Please also post the code.

@theRealAph, you may find the performance numbers and...

> You only need one load, add, and multiply per iteration.
> You don't need to add across columns until the end.
>
> This is an example of how...

> A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency.

@theRealAph, hmph, could you elaborate on which spec you are referring to here?

> You only need one load, add, and multiply per iteration.
> You don't need to add across columns until the end.

@theRealAph, I've tried to follow the suggested...
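
For readers following along, here is a minimal sketch of the idea in the quoted suggestion: keep one accumulator per lane inside the loop and merge the lanes only once, after the loop. It is a plain scalar Java simulation, not the code from the patch under discussion. It assumes the hash being computed is the 31-based polynomial hash of `Arrays.hashCode(int[])` and a vector width of four `int` lanes; the class name `LaneHash`, the method `laneHash`, and the constants are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Scalar simulation of a 4-lane polynomial hash (illustrative only).
final class LaneHash {
    private static final int LANES = 4;                 // assumed vector width
    private static final int POW4 = 31 * 31 * 31 * 31;  // 31^LANES

    static int laneHash(int[] a) {
        // Lane 3 starts at 1 so that the initial value used by
        // Arrays.hashCode is carried through the lane arithmetic.
        int[] acc = {0, 0, 0, 1};
        int i = 0;
        int bound = a.length - (a.length % LANES);

        // Main loop: the inner loop stands in for one vector load,
        // one vector multiply by 31^4, and one vector add.
        for (; i < bound; i += LANES) {
            for (int j = 0; j < LANES; j++) {
                acc[j] = acc[j] * POW4 + a[i + j];
            }
        }

        // Merge the lanes once, after the loop, with weights 31^3..31^0.
        int h = acc[0] * (31 * 31 * 31) + acc[1] * (31 * 31) + acc[2] * 31 + acc[3];

        // Scalar tail for the remaining (length % 4) elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
        // Prints "true": the lane-wise result matches the sequential hash.
        System.out.println(laneHash(data) == Arrays.hashCode(data));
    }
}
```

Because Java `int` arithmetic wraps modulo 2^32, regrouping the polynomial this way gives the same result as the sequential loop. A real vectorized version would replace the inner loop with vector instructions.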

Hi @theRealAph, following your suggestions I've got this working for ints and can confirm that it improves the performance. I don't have enough time at the moment to finish...

Just a note so this isn't missed later: the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457

I'm finishing up a patch, hopefully I'll push it later today.