Robert Muir
Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the `SDOT` dot product instruction, given the correct flags:...
I haven't benchmarked; it just seems `SDOT` is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle. My cheap 2021 phone...
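For readers who haven't seen the code shape in question, here is a minimal sketch of a signed 8-bit dot product written as a plain scalar loop. The function name `dot8s` is hypothetical and not taken from the PR, and the exact flags used in the comment above are cut off in the source. As an assumption, an ARM build might pass something like `-O3 -march=armv8.2-a+dotprod`; whether a given GCC version actually emits `SDOT` for this loop has to be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name is illustrative, not from the PR).
 * Plain scalar loop: each int8 pair is widened to int32, multiplied,
 * and accumulated. This is the reduction pattern a vectorizing compiler
 * may map to a dot product instruction when the target supports one. */
int32_t dot8s(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

Compiling with `-S` (or pasting into a compiler explorer) and searching the output for the dot product instruction is a quick way to confirm what a given compiler and flag combination produces.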
> With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have...
> I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate. It should work well with any modern gcc...
Here is my proposal visually: https://godbolt.org/z/6fcjPWojf As you can see, by passing `-march=cascadelake` it generates VNNI instructions. IMO, no need for any intrinsics anywhere, for x86 or ARM. Just a...
And I see, from playing around with compiler versions, the advantage of the intrinsics approach, although I worry how many variants we'd maintain. It would give stability when releasing Lucene without...
I definitely want to play around more with @goankur's PR here and see what performance looks like across machines, but will be out of town for a bit. There...
go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am...
Attached is a patch to get x86 support working. It makes some changes to the build: specifically the java code statically picks the best MethodHandle (SVE, Neon, Generic), and its...
TODO: need to examine the avx256 difference between auto-vectorized C and the Java Vector API for the integers here. This isn't nearly as bad as the ARM case (where we understand...