[raymath] Considering adding some SIMD optimizations
Recently a new software renderer backend (rlsw) was added to raylib, and some performance considerations arose around improving its speed.
rlsw added support for some SIMD optimizations and I'm considering adding something similar to raymath, at least for the most critical math functions.
As explained in a previous PR (#4599), using SIMD has some implications: the target device's hardware MUST support the required SIMD vector instructions (SSE, AVX, RVV...), and only users aware of their target systems should use those vector extensions. Still, I think it could be really useful for the special cases where software rendering is required.
It could really improve performance considerably.
The TL;DR: Any SIMD dependence in raymath should come through alternative, similarly named functions, not by replacing existing ones. There should also be a function, probably platform-dependent, for detecting whether a processor has SIMD features or not. This approach may need to be extended to other functions of raylib.
That makes it possible for developers to craft agnostic releases if that is important for the intended distribution of a raymath-dependent product.
- - - - My analysis - - - -
There is a worrisome pattern here, and a potentially promising, but careful, solution.
The limitation of Windows 11 to computers that have SIMD extensions has created a considerable problem. At least there was a way to learn whether a computer was eligible for upgrade or not. The end of Windows 10 support is not going well, however.
Not so up-front were two difficulties I observed. Ableton Live 12 and the Live Lite 12 version both require SIMD extensions. They would install on upgrade-eligible systems (that is, those of people who already owned the version 11 product) but then failed to run. They finally fixed things so 12 would not install. But the explanation to ordinary users was and is not helpful. Apparently, no one wants to simply say that Windows 11 (eligibility) is required.
The same happened with the social world browser, Firestorm, for Second Life. That approach was to issue two versions of recent updates. One that used the extensions, and reported their absence, and one that continues to not use them. The explanation to users is also inept, but installing the wrong one will lead to a report. The problem is users needed to know more about the processor on their computer in terms that most users will not understand (the usual problem of developers thinking users should be like themselves).
Someone else may be doing the obvious, but challenging thing, but I am not aware of it. For building a redistributable consumer-facing project, the obvious solution is for installers (or start-up) to detect the presence or absence of SIMD features and adjust the running code appropriately.
Thinking about raymath, that's tricky at a fine-grained level. It's relatively easier to swap between shared libraries, both of which are installed. It's also easier at an object-oriented level with run-time (i.e., binary) binding. So C++ (or the C level equivalent of COM APIs) works better.
I think for raymath, it does not serve to switch it to dependence on SIMD, or even have two versions of it. Perhaps the solution is to have both kinds of implementations, functions using SIMD and functions that do not, with distinct (but similar) names, and also provide some raymath code for detecting the presence of SIMD features.
The challenge, at a higher level, is for someone who wants to exploit SIMD but not require it: they must avoid making a commitment at compile time and do something more creative instead. This will make a testing setup a bit more difficult.
And if one wants to go SIMD-required in a redistributed executable, there must be a careful way to detect the absence of SIMD and fail in a responsible, soft manner that a user can understand.
Well, SSE2 is required for every single x86_64 processor so it's a decent baseline.
Allowing compilers to use these instruction sets in autovectorization with -march=xyz or -mavx2 etc means the compilers will emit these instructions THROUGHOUT your code, even for mundane things like memcpy, and absolutely anything in your code can crash on old processors.
The only way to make it work at runtime while retaining compatibility on older processors is to probe what's supported at runtime by using the cpuid instruction, and manually write intrinsics or assembly gated behind this.
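A minimal sketch of that probe-and-gate pattern, assuming x86 with GCC/Clang or MSVC. The names `Vec4`, `Vec4Dot`, `Vec4DotSSE41`, `Vec4DotAuto` and `HasSSE41` are made up for illustration and are not part of raymath. Only the one function marked with the target attribute is compiled with SSE4.1 enabled, so the rest of the program stays safe on older processors.

```c
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <intrin.h>
    #include <immintrin.h>
    #define HAVE_X86 1
    #define SIMD_TARGET_SSE41           // MSVC allows intrinsics without a per-function attribute
    // Returns 1 when CPUID leaf 1 reports SSE4.1 (ECX bit 19)
    static int HasSSE41(void)
    {
        int regs[4] = { 0 };
        __cpuid(regs, 1);
        return (regs[2] >> 19) & 1;
    }
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    #include <cpuid.h>
    #include <immintrin.h>
    #define HAVE_X86 1
    #define SIMD_TARGET_SSE41 __attribute__((target("sse4.1")))
    // Returns 1 when CPUID leaf 1 reports SSE4.1 (ECX bit 19)
    static int HasSSE41(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
        return (int)((ecx >> 19) & 1);
    }
#else
    #define HAVE_X86 0
    // Not x86 (or unknown compiler): report the feature as absent
    static int HasSSE41(void) { return 0; }
#endif

typedef struct Vec4 { float x, y, z, w; } Vec4;    // Stand-in for raymath's Vector4

// Portable scalar version, safe on every processor
static float Vec4Dot(Vec4 a, Vec4 b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}

#if HAVE_X86
// SSE4.1 version: only this function is built with SSE4.1 code generation enabled
static SIMD_TARGET_SSE41 float Vec4DotSSE41(Vec4 a, Vec4 b)
{
    __m128 va = _mm_setr_ps(a.x, a.y, a.z, a.w);
    __m128 vb = _mm_setr_ps(b.x, b.y, b.z, b.w);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));  // Multiply all 4 lanes, sum into lane 0
}
#endif

// Runtime gate: take the SIMD path only when the CPU reports support
// NOTE: A real implementation would cache the probe result instead of repeating it per call
static float Vec4DotAuto(Vec4 a, Vec4 b)
{
#if HAVE_X86
    if (HasSSE41()) return Vec4DotSSE41(a, b);
#endif
    return Vec4Dot(a, b);
}
```

Calling `Vec4DotAuto()` gives the same result on either path; the caller never needs to know which one ran.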
Other ISAs like arm have similar instructions, but they can only be used in kernel mode (why???) Instead, the support for things like neon and sve needs to be probed using OS-specific functions:
Windows: IsProcessorFeaturePresent
Linux: getauxval
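For the Arm side, a rough sketch of what that OS-level probing could look like. `HasNeon` is a hypothetical helper name; the constants are the ones I believe the Windows SDK and the Linux kernel headers define. On any other platform the sketch simply reports the feature as absent.

```c
#if defined(_WIN32)
    #include <windows.h>
#elif defined(__linux__)
    #include <sys/auxv.h>               // getauxval(), AT_HWCAP
    #if defined(__aarch64__) || defined(__arm__)
        #include <asm/hwcap.h>          // HWCAP_ASIMD (arm64), HWCAP_NEON (arm32)
    #endif
#endif

// Returns 1 if the OS reports NEON (Advanced SIMD) support, 0 if absent or unknown
static int HasNeon(void)
{
#if defined(_WIN32) && (defined(_M_ARM64) || defined(_M_ARM))
    return IsProcessorFeaturePresent(PF_ARM_NEON_INSTRUCTIONS_AVAILABLE)? 1 : 0;
#elif defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD)? 1 : 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON)? 1 : 0;
#else
    return 0;   // Not an Arm target this sketch knows about
#endif
}
```

The result would then gate a NEON code path the same way a cpuid result gates an SSE/AVX path on x86.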
GCC has features to make some of this easier: https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html It can do the dispatch automatically for you.
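As a small illustration of that GCC feature, here is a sketch using the `target_clones` attribute. It assumes GCC on x86-64 with glibc (the attribute relies on ifunc support); on anything else the macro expands to nothing and the function is compiled once, normally. `ScaleArray` and `MULTIVERSION` are made-up names.

```c
#include <stdio.h>      // Included first so __GLIBC__ is visible on glibc systems

#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
    // GCC builds one clone per listed target plus a baseline,
    // and the right clone is selected at load time based on the CPU
    #define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
    #define MULTIVERSION    // Fallback: single, normally compiled version
#endif

// Plain C loop: the compiler is free to vectorize each clone for its own target
MULTIVERSION
void ScaleArray(float *values, int count, float factor)
{
    for (int i = 0; i < count; i++) values[i] *= factor;
}
```

The calling code just calls `ScaleArray()`; no manual cpuid check or dispatch table is written by hand.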
I think clang cannot. And MSVC doesn't support this at all.
I think any usage of simd should be opt-in, then there's no need to have a "careful way to detect the absence of SIMD" and the programmers who go out of their way to enable it should be experienced enough to know the consequences.
@Peter0x44
I think clang cannot. And MSVC doesn't support this at all.
I think any usage of simd should be opt-in, then there's no need to have a "careful way to detect the absence of SIMD" and the programmers who go out of their way to enable it should be experienced enough to know the consequences.
I missed this comment earlier. VC/C++ and the Microsoft Library have two intrinsics for cpuid and there's a demonstration of it in C++ at https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex. I'm going to compile it to see if it produces what I expect on an older i7-intel and a newer one on a Surface Laptop.
PS: I did get it running with no difficulty. Also, in my rummaging around, it seems that running in a virtual machine may present problems for cpuid. I'm not going to worry about that.
MSVC does have a CPUID intrinsic. What it cannot do is the automatic dispatch with __attribute__((target)) that I linked in gcc's documentation.
You have to write the dispatching code yourself.
I believe clang's version of the attribute works at compile time, depending on whatever features you enable with -march or -mcpu. MSVC doesn't support this attribute in any form. That's all I meant.
I think any usage of simd should be opt-in, then there's no need to have a "careful way to detect the absence of SIMD" and the programmers who go out of their way to enable it should be experienced enough to know the consequences.
Yes, I think that's the best approach in case some SIMD functionality is added to raymath.
@Peter0x44 @raysan5
MSVC does have a CPUID intrinsic. What it cannot do is the automatic dispatch with __attribute__((target)) that I linked in gcc's documentation. You have to write the dispatching code yourself.
Ah, I get it. This is not an ISO C Language provision either, I presume. So this really needs to be done outside of raylib somehow.
I did find this informative: https://johnnysswlab.com/cpu-dispatching-make-your-code-both-portable-and-fast/. It seems one can't count on optimizers so, even if it is done "manually" in a C Library, there needs to be a way (say, with intrinsics) to make the code actually use the extended processor features when they are present.
I don't quite see how this can work at the raymath.h level. I guess accelerated functions would need different names, as well as depending on some platform-dependent features for the extended operations.
This is not an ISO C Language provision either, I presume.
Neither is __cpuid or cpuid.h
Though of course, something that is purely a compiler extension is less portable than the concept itself. I just wanted to provide an idea of the sorts of facilities the compilers provide in case you want to utilize the simd in your own code.
@Peter0x44 Thanks. I am unlikely to use extended instruction sets. I got curious about how one could have them as an elective rather than a hard dependency in an app, because I have been seeing messes created around depending on them in commercial code.
Funny enough, I have been doing the manual equivalent in some code involving testing of random number generators.
@Peter0x44 @raysan5
This is not an ISO C Language provision either, I presume.
Neither is __cpuid or cpuid.h
I was looking around at the ISO C11 specification, wanting to see if something like the stdint.h option for Fast cases would make sense around functions that could be accelerated depending on the processor instruction set, but at run-time, not compile time.
I was reminded that the standards are about conforming programs, and conforming processors are treated more liberally.
From the section on Conformance,
"A strictly conforming program shall use only those features of the language and library specified in this International Standard. It shall not produce output dependent on any unspecified, undefined, or implementation-defined behavior, and shall not exceed any minimum implementation limit."
And further down
"A conforming hosted implementation shall accept any strictly conforming program. ... A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program."
Of course, explicit platform dependencies do create interoperability issues, and it is a bit like what raylib has done to integrate with different back-ends.
This led me to think about having things like ...Fast alternatives to some important functions, including in raymath.h, but then there should not be a header-only implementation. So the ...Fast ones need to be in a different header file and need to be backed up with platform-specific source code, something like raymathX-Win32.c, and so on. Of course, they are permitted to simply use the raymath implementations, and would if the particular feature were not present at run-time or it is not implemented (yet).
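To make that concrete, here is a rough sketch of how such an ...X function could be wired up. Everything in it is hypothetical: `Vec3`, `Vec3Lerp`, `Vec3LerpX`, `Vec3LerpAccel` and `PlatformHasAccel` are names invented for illustration, and the two platform hooks are stubs standing in for what a file like raymathX-Win32.c would supply.

```c
#include <stddef.h>

typedef struct Vec3 { float x, y, z; } Vec3;    // Stand-in for raymath's Vector3

// Portable reference version, always available (the role raymath.h plays today)
static Vec3 Vec3Lerp(Vec3 a, Vec3 b, float t)
{
    Vec3 result = { a.x + t*(b.x - a.x), a.y + t*(b.y - a.y), a.z + t*(b.z - a.z) };
    return result;
}

// Hypothetical platform hooks: stubbed here as "feature not present / not implemented yet"
static int PlatformHasAccel(void) { return 0; }
static Vec3 Vec3LerpAccel(Vec3 a, Vec3 b, float t) { return Vec3Lerp(a, b, t); }

typedef Vec3 (*Vec3LerpFn)(Vec3, Vec3, float);
static Vec3LerpFn vec3LerpImpl = NULL;          // Resolved on first use

// Same API as Vec3Lerp(), but bound at run-time to the best available implementation
// NOTE: The lazy initialization below is not thread-safe; a real library would resolve
// the pointer once in an explicit init function
Vec3 Vec3LerpX(Vec3 a, Vec3 b, float t)
{
    if (vec3LerpImpl == NULL) vec3LerpImpl = PlatformHasAccel()? Vec3LerpAccel : Vec3Lerp;
    return vec3LerpImpl(a, b, t);
}
```

The probe runs once, every later call costs one indirect call, and when the feature is missing the ...X function is simply the plain raymath behavior.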
I am not in love with ...Fast as a suffix. ...F maybe. Or ...X. (Not ...Ex because the API isn't different.)
Thoughts?
C++ has an std::simd upcoming https://en.cppreference.com/w/cpp/experimental/simd.html
gcc has some WIP implementations of it. Unsure of status for other compilers (either way - not relevant to raymath, but could provide API design ideas)
@Peter0x44
C++ has an std::simd upcoming https://en.cppreference.com/w/cpp/experimental/simd.html
That's interesting. It might supply some ideas about what the different functions are.
My impression is that these are all compile-time functions. They would be handy but do not address the business of some sort of raylibX.h that provides the best available at run-time.
For some of those functions, the non-SIMD alternative could be pretty awful. That requires some deep thinking.
I also learned that VC/C++ compiler option /Oi gives permission to use intrinsics wherever possible. My impression is that this already happens for X64 compiles with math.h functions.