easyaspi314
By the way, I did some research on the A64FX, and it appears that NEON has been *severely* performance-deprecated (as in everything but the trivial instructions having 6-12 cycles...
- Just fix it.
- Fix it. But also change major version to indicate breaking change / incompatibility in semver way.
- We're still in 0.x though.
- Don't fix...
I think that for now we should only do SVE-512. Looking at the optimization guide, c7g is a tradeoff because while SVE can process 2x the data, NEON always has...
> Yes, I agree on it. SVE don't improve a lot performance on SVE-128 & SVE-256.
>
> On SVE-256 (V1 core), I tried to tune assembly code. The latest...
Yes, and the reason it is favorable is that instead of requiring the `uzp1/uzp2` setup, it can be done with `rev64`. The complicated shuffle is what makes NEON less efficient...
Ah, you are confused because the uzp trick is for two vectors at once. This is for only one. Come to think of it, this would actually have literally zero...
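To make the one-vector versus two-vector distinction concrete, here is a minimal scalar sketch in portable C of what the instructions do on 32-bit elements. It is only a model of the instruction semantics, not the code under discussion: the `model_*` function names are made up for illustration, and real code would use the NEON instructions or intrinsics directly.

```c
#include <stdint.h>

/* Scalar model of `rev64` on four 32-bit elements: swap the two 32-bit
 * halves inside each 64-bit lane. Needs only ONE input vector. */
static void model_rev64_u32(const uint32_t in[4], uint32_t out[4])
{
    out[0] = in[1];
    out[1] = in[0];
    out[2] = in[3];
    out[3] = in[2];
}

/* Scalar model of `uzp1` on 32-bit elements: gather the even-indexed
 * elements of TWO input vectors (a first, then b). */
static void model_uzp1_u32(const uint32_t a[4], const uint32_t b[4],
                           uint32_t out[4])
{
    out[0] = a[0];
    out[1] = a[2];
    out[2] = b[0];
    out[3] = b[2];
}

/* Scalar model of `uzp2` on 32-bit elements: gather the odd-indexed
 * elements of TWO input vectors (a first, then b). */
static void model_uzp2_u32(const uint32_t a[4], const uint32_t b[4],
                           uint32_t out[4])
{
    out[0] = a[1];
    out[1] = a[3];
    out[2] = b[1];
    out[3] = b[3];
}
```

In this model, `rev64` puts the high half of each 64-bit lane where the low half was using a single input, while the `uzp1`/`uzp2` pair separates low and high halves only when two vectors are available to be processed together.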
That difference might come solely from it being handwritten assembly. However, even if it didn't, I'd say that even when interleaved with scalar code it clearly isn't going to...
I'd say yes, although I would recommend the following priority:

1. C intrinsics if possible: the limitation to SVE512 or larger can probably improve performance due to fewer checks...
I'll investigate. It is quite possible that this is due to MMX.
OK, this is not related to MMX. After some testing, it seems that this is a GCC bug specific to GCC 12 that has been fixed in GCC 12.2.1. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322...
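For anyone who wants to flag builds made with an affected compiler, a minimal sketch of a version check follows. It assumes that every GCC 12 release before 12.2.1 is affected, which is an extrapolation from "specific to GCC 12, fixed in 12.2.1" rather than something verified here. The helper and macro names are hypothetical; only `__GNUC__`, `__GNUC_MINOR__`, and `__GNUC_PATCHLEVEL__` are real predefined macros.

```c
/* Hypothetical helper: returns 1 if the given GCC version falls in the
 * range ASSUMED to be affected (12.0.0 up to, but not including, 12.2.1). */
static int gcc_version_is_affected(int major, int minor, int patch)
{
    if (major != 12) {
        return 0;
    }
    return minor < 2 || (minor == 2 && patch < 1);
}

/* Hypothetical macro: evaluates to nonzero when this translation unit is
 * being compiled by a GCC version in the assumed-affected range.
 * Clang also defines __GNUC__, so it is excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__)
#  define EXAMPLE_BUILT_WITH_AFFECTED_GCC \
       gcc_version_is_affected(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)
#else
#  define EXAMPLE_BUILT_WITH_AFFECTED_GCC 0
#endif
```

What to do when the check fires (warn, lower the optimization level, or pick a different code path) depends on the actual bug and is left open here.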