Add `bounds_t` with pre-aligned `mins` and `maxs`
This is not urgent at all. This is complementary to:
- https://github.com/DaemonEngine/Daemon/pull/1142
It also improves over:
- https://github.com/DaemonEngine/Daemon/pull/1043
The BoxOnPlaneSide() function is known to be a hot spot, being called by multiple recursive functions. Right now we spend a lot of time in _mm_loadu_ps, we have to call sseLoadVec3Unsafe() explicitly because we can't guess if the data is aligned or not. This comes with multiple downside:
-
_mm_loadu_psis said to be slower than_mm_load_ps, and that fits my experience¹. - The compiler doesn't optimize
_mm_loadu_psand will always call it if we ask for it explicitely, even if the data is already aligned, and by experience, even if no copy is needed.
So the idea is to introduce some nice bounds_t struct that uses aligned mins and maxs, same for cplane_t with an aligned normal. When doing that, we can write an explicit _mm_load_ps that is faster than _mm_loadu_ps, and most of the cases, because the compiler knows the data is const and already aligned, the compiler just removes the _mm_load_ps and process the data without any copy.
See also:
- https://github.com/Unvanquished/Unvanquished/pull/3265
¹ Some times ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit _mm_loadu_ps call. I even noticed copying an unaligned vec3 to an aligned vec3 before calling _mm_load_ps could make the code faster when the compiler notices the data is already aligned and skips the copy and calls _mm_load_ps (or even doesn't need to call it at all).
For now there is a bug somewhere I have not found yet. The bug can be seen by loading the plat23 map and looking to the left (some surfaces will disappear).
In my first iteration of the branch I made a stupid mistakes by passing the bounds_t by value instead of by reference, meaning functions modifying it would not modify it. I assume this is now fixed but I guess I left somewhere another mistake as stupid as that.
Note: the percentage of time spent is not reliable in those screenshots because I use a viewpos on some acid tubes spamming acid gas and the amount of particles is very variable.
Before (C), a lot of time is spent in sseLoadVec3Unsafe():
After (C), the first loads are just completely skipped:
Before (Asm), we see three movups instructions:
After (Asm), se see only two movaps instructions:
So, as I said, this is not urgent. But I would appreciate people re-reading what I did in hope someone finds the stupid mistake I may have done. The code change should be straightforward because it's basically rewriting with a different data layout, there is no algorithm change to be expected. Unfortunately this data struct is used in many places so the patch is massive.
There are many functions in CM not using the new structure yet that can be migrated, but that can be done later as the patch is already heavy as it is. Migrating other functions would be mostly code clean-up and only about code purity, it is not required to make it work. Though maybe it can also help the compiler to optimize in other places. For the same reasons, I only modified the game to be compatible with the new structs and shared functions, using some wrapping around the new struct, to keep the patch light on game side (porting the game to the new struct for the sake of purity can be done later as well).
I also added some constness to some function input, which may help the compiler to optimize a bit more some other places of the code. I doubt this extra-constness is the cause of the bug I'm facing because the compiler doesn't complain I'm writing to some const structs, so hopefully I only added some const keywords to structs that are only read and not written.
So I reread everything and found the bug I was looking for. I missed the conversion of one line (I just did not re-added the new one, that's why there was no compile type error).
It works.
I found another bug affecting BSP movers like doors. One can test it in the Vega map. The doors disappear according to the viewing angle / point of view.
Any fps difference?
I think you can make a drop-in aligned replacement for vec3_t like this:
struct alignas(16) alignedVec3_t : public std::array<float, 3> {};
This can be used in arrays so then you don't need bounds_t with the ugly at function.
I think you can make a drop-in aligned replacement for vec3_t like this:
struct alignas(16) alignedVec3_t : public std::array<float, 3> {};This can be used in arrays so then you don't need
bounds_twith the uglyatfunction.
Interesting, how such type can be used and where?
As a direct replacement for vec3_t. alignedVec3_t localBounds[ 2 ]; This would automatically work with indexing operations and functions like VectorCopy. But you would need a .data() or whatever to pass it to functions taking a float* (this includes functions with vec3_t parameter)
I like the idea of having have an alignedVec3_t but I don't see the benefit of making bounds_t an array of it. There is only a few usage of at(), while using .data() everywhere would be much verbose. Also I like the idea of explicit mins and maxs naming, I guess we may be able to do some mins() and maxs() and I tried to implement that with an alignedVec3_t but that because much more complex.
I just rewrote at() in a way it is now guaranteed to be branchless.
I like the idea of having have an
alignedVec3_tbut I don't see the benefit of makingbounds_tan array of it.
I'm suggesting that no bounds_t is needed. As I wrote in the previous comment, it could be used like alignedVec3_t localBounds[ 2 ];
But I like having a explicit bounds_t! It being implemented the way I did it or doing it like using bounds_t = alignedVec3_t[2], I want to do it…