
Avoid 2 memcpies in UpdateClipMatrix

fixgoats opened this issue 3 years ago · 3 comments

Small optimization of UpdateClipMatrix. There is of course the possibility that at -O3 the compiler can already tell those copies are superfluous and optimize them away anyway. I haven't done rigorous tests, but comparing the uncapped framerates of builds compiled with and without the change, the new version does seem generally a bit faster depending on the scene. On the startup screen of Pokemon Black the new version seemed to be a couple percent faster than the old one (I say "seemed" because I may be influenced by confirmation bias), but about the same on starter selection. Since this is an optimization to an already small part of overall performance, you're not gonna 10x the speed; a few percent at best seems like as much of a speedup as you could reasonably hope for. MatrixMult4x4 was also rewritten for better reuse, but unfortunately it's no longer in style with the other matrix multiplication functions.
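The shape of the change can be sketched like this (a simplified stand-in, not melonDS's actual code — the function names, signature, and element ordering here are assumptions): instead of memcpying one matrix into the destination and multiplying in place, multiply into a distinct destination directly.

```cpp
#include <cstdint>

typedef int32_t s32;
typedef int64_t s64;

// Simplified stand-in for a 4x4 multiply in 20.12 fixed point.
// Writing into a separate destination (dst must not alias a or b)
// removes the need to memcpy a source matrix into a temporary first.
void MatrixMult4x4To(s32* dst, const s32* a, const s32* b)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
        {
            s64 sum = 0;
            for (int k = 0; k < 4; k++)
                sum += (s64)a[k*4 + j] * b[i*4 + k];
            // shift the 64-bit accumulator back down to 20.12
            dst[i*4 + j] = (s32)(sum >> 12);
        }
}

// Hypothetical shape of the copy-free UpdateClipMatrix:
// clip = proj * pos, written straight into the clip matrix.
void UpdateClipMatrix(s32* clip, const s32* proj, const s32* pos)
{
    MatrixMult4x4To(clip, proj, pos);
}
```

Multiplying by the 20.12 identity (diagonal entries of 0x1000) reproduces the input matrix, which makes for a quick sanity check of the shift placement.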

Fun observation: because a double has 52 mantissa bits and 64 - 12 = 52, it's possible to convert these integers to doubles, do the arithmetic there, and convert back without losing any precision. This may sound completely goofy, but compilers are better at vectorizing floating-point arithmetic, as you can see on e.g. Compiler Explorer, so it could conceivably improve performance in some scenarios. I tried it, and alas, even though the improved vectorization from using floating point made a valorous effort, it just wasn't enough to make up for the extra copies and conversions. I even tried BLAS, but that didn't do any better (as you might expect; BLAS may be great but it's not magic). That does raise an interesting thought, though: if the compilers don't vectorize the integer instructions much, could the matrix multiplications be sped up with explicit SIMD intrinsics? Possibly with the vectorclass library? Another thing I haven't tried is getting rid of the memcpy in the multiplications. That would probably require heap-allocating the arrays so that pointers could be reassigned, and the extra heap faffery would probably more than negate the improvement from dropping the memcpy. With some intermediate arrays there might be a way to shuffle stack-allocated arrays around and avoid the heap.
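The mantissa observation is easy to check in isolation (a standalone illustration, unrelated to melonDS code): a double has 52 explicit mantissa bits, 53 with the implicit leading bit, so every integer of magnitude up to 2^53 converts to double and back without change, while anything needing more bits gets rounded.

```cpp
#include <cstdint>

// Does a 64-bit integer survive a round trip through double?
// Exact for all |x| <= 2^53 (52 explicit mantissa bits + implicit bit);
// beyond that the conversion to double has to round.
bool RoundTripsExactly(int64_t x)
{
    return (int64_t)(double)x == x;
}
```

For example, `(1LL << 52) - 1` and `1LL << 53` round-trip exactly, but `(1LL << 53) + 1` needs 54 significant bits and comes back changed.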

Another thing I tried was multithreading, although I only tried the standard threading library. Since these matrices aren't that big, I suspected up front that the threading overhead would negate or overwhelm any possible speedup, and indeed that approach ground the emulator to a near halt, even on a computer that should be fairly overspecced. But bear in mind that I'm a total multithreading n00b and was only using std::thread; a multithreading whiz using something like OpenMP or pthreads could maybe come up with something that's at least on par with the single-threaded version.

fixgoats avatar Aug 30 '22 13:08 fixgoats

  • The reason the memcpy was there is so the function is still correct when both source and destination point to the same address. At a glance this seems to never happen, but yeah.
  • Doubles can't be used, because they don't round correctly. Because it's the cheapest option, the DS GPU always rounds towards minus infinity (which is what an arithmetic right shift does), while most floating-point environments by default round to nearest, ties to even. We could of course change this, but it sounds like a hassle which isn't really worth it (FP <-> integer conversions are not free either).
  • BLAS libraries are the wrong tool for this job. They are usually optimised for big matrices which can vary in size, not tiny fixed-size ones like you find in computer graphics.
  • The same applies to multithreading: the synchronisation overhead for such a tiny matrix multiplication will kill any advantage it brings.
  • To my knowledge, the fastest option in a situation like this is to just write a custom SIMD implementation, which is what I did in the Switch port, utilising NEON vector instructions via intrinsics (it used to be inline assembly because GCC used to generate bad code for it, but now it only inserts a few unnecessary movs, so I changed it to intrinsics). This also avoids the need for the memcpy, because it first loads both matrices into SIMD registers.
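The rounding mismatch is concrete enough to demonstrate in a few lines (an illustration, not melonDS code; for small inputs the double product itself is exact, so the divergence here comes from the final conversion back to an integer truncating toward zero rather than flooring):

```cpp
#include <cstdint>

// Hardware-style path: 64-bit product, arithmetic right shift by 12.
// The arithmetic shift rounds toward minus infinity (floor).
int64_t FixedMul(int32_t a, int32_t b)
{
    return ((int64_t)a * (int64_t)b) >> 12;
}

// Double path: the product is exact for these magnitudes, but the
// cast back to integer truncates toward zero, so small negative
// results come out one off compared to the shift.
int64_t FixedMulViaDouble(int32_t a, int32_t b)
{
    return (int64_t)(((double)a * (double)b) / 4096.0);
}
```

With raw 20.12 values a = -1 (i.e. -1/4096) and b = 1, `FixedMul` yields floor(-1/4096) = -1 while `FixedMulViaDouble` yields trunc(-0.000244...) = 0.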

Also if you haven't already join us on IRC or Discord: https://melonds.kuribo64.net/board/thread.php?id=3

RSDuck avatar Aug 30 '22 22:08 RSDuck

Welp, I tried implementing the multiplications with AVX2 intrinsics and performance was still pretty much the same (code in case the implementation is just suboptimal). AVX-512 is kind of interesting because then you can fit all 16 ints in a single register, but I still doubt that would really help. Those NEON intrinsics are gorgeous compared to x86.

Edit: OK, comparing the assembly, the intrinsic version has considerably fewer instructions. I guess it's just such a small part of overall performance that it doesn't really show up. And anyway, actually integrating intrinsics into the project would require some debate about portability and probably isn't really worth it.

fixgoats avatar Sep 01 '22 15:09 fixgoats

I'm not familiar enough with SSE or AVX to say how optimal your solution is. NEON (which is what I do know pretty well) is indeed a lot nicer though, especially for integer stuff.

The reason I created the NEON version, though, was that profiling on Switch did reveal it to be an issue, and vectorising it got it down from 1 ms in some games to around 0.1 ms or so.

RSDuck avatar Sep 01 '22 16:09 RSDuck