Daemon
cg_api: implement `trap_CM_BatchMarkFragments` to process all impact marks in one call
Engine counterpart of:
- https://github.com/Unvanquished/Unvanquished/pull/2981
Hmm, something very annoying is happening: the code is much slower with a single trap call than with many.
Is performance strongly tied to the size of IPC messages?
I currently send `array<MAX>`; should I send `vector(num)` instead?
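(For illustration, a minimal sketch of the size difference being asked about, assuming generic length-prefixed serialization rather than Daemon's actual IPC code; `MAX_POINTS` and `Point` are hypothetical stand-ins:)

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr std::size_t MAX_POINTS = 384; // hypothetical MAX; the real constant may differ

struct Point { float x, y, z; };

// Fixed-size message: always pays for MAX_POINTS elements, used or not.
std::size_t FixedWireSize() {
    return MAX_POINTS * sizeof(Point);
}

// Length-prefixed message: a 32-bit count plus only the live elements.
std::size_t VectorWireSize(const std::vector<Point>& points) {
    return sizeof(uint32_t) + points.size() * sizeof(Point);
}

int main() {
    std::vector<Point> typical(8); // a mark usually has only a few points
    std::cout << "array<MAX>:  " << FixedWireSize() << " bytes\n";         // 4608
    std::cout << "vector(num): " << VectorWireSize(typical) << " bytes\n"; // 100
}
```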
The PR is ready, commits will be squashed, no need to review them one by one.
Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.
There can definitely be further optimizations done here. I have verified on my end that using fixed-size arrays for the input and output of the single trap call makes the engine go from 500 fps to 300 fps when rendering the ATCHS default alien spectator scene in 2K on my system, which is even slower than doing a single trap call for every mark individually. So, measured in terms of slowness: single trap call with large arrays > many trap calls with small data > single trap call with small arrays.
So yes, IPC message size is critical, and anything reducing message size is welcome.
Note: the rewriting of re.MarkFragments to no longer uselessly expect global constants as input, like 4, MAX_MARK_POINTS, and MAX_MARK_FRAGMENTS, will be done in a future PR, after NUKING the legacy trap_CM_MarkFragments function where the constants were written in the game and sent to the engine instead.
> Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.
The content of fragmentBuffer_t is the return of re.MarkFragments; optimizing the size of this return will also be done after the rewriting of re.MarkFragments, which should itself happen after NUKING the legacy trap_CM_MarkFragments function, and I expect such a NUKING PR to come later.
We can also rewrite re.MarkFragments to take markPoints.data() as input instead of markPoints.data()[ 0 ].data(), but yet again this would only happen after the legacy trap_CM_MarkFragments is NUKED.
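(A sketch of the signature change being described; the function names and parameters are hypothetical, only the pointer types mirror the discussion:)

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Current shape: the callee takes a flat float*, so the caller must pass
// the first component of the first point.
void MarkFragmentsFlat(const float* points, std::size_t numPoints);

// Rewritten shape: the callee takes typed points, so the caller can pass
// the vector's data pointer directly.
void MarkFragmentsTyped(const std::array<float, 3>* points, std::size_t numPoints);

void Caller(const std::vector<std::array<float, 3>>& markPoints) {
    // today: MarkFragmentsFlat(markPoints.data()[ 0 ].data(), markPoints.size());
    // later: MarkFragmentsTyped(markPoints.data(), markPoints.size());
}
```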
This works well for me; there's no longer a significant performance difference between /cg_marks on and /cg_marks off (it still depends on the amount of alien buildings in view of course, but that's expected).
Yeah, this is still dependent on the amount of marks, because as soon as the array is too large to fit into the kernel buffer the message will be split, but the split only happens when the packet is outsized. This means a usual scene will likely do only 1 or 2 messages instead of 1 message per mark.
Every player blob shadow is a mark, every alien creep is a mark, every bullet impact is a mark, every grenade and flamer burnt area is a mark. The number of marks increases very quickly in a game, and this greatly reduces the amount of IPC messages done for them!
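(A rough illustration with assumed numbers: 100 active marks at one request/reply pair per mark is ~200 IPC messages per frame, versus 2 for one batched call. And if each mark serializes to ~100 bytes as a length-prefixed vector, the whole batch is ~10 KB, well within a typical 64 KiB socket buffer, so the common case needs no splitting at all.)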
It would be better to entirely avoid the engine–client discussion there by fully moving the related code into the cgame, but this kind of vectorization is already a big win while not being too intrusive, and we still rely on well-tested code.
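(A minimal sketch of the batching pattern under discussion; the types and CM_* functions below are hypothetical stand-ins for the trap calls, each of which crosses the engine/cgame IPC boundary:)

```cpp
#include <vector>

// Hypothetical request/result types; the real ones carry origin points,
// projection axes, fragment buffers, etc.
struct markRequest_t { float origin[3]; float radius; };
struct markResult_t  { int numFragments; };

// Hypothetical stand-ins for the trap calls.
markResult_t CM_MarkFragments(const markRequest_t&) { return { 1 }; }
std::vector<markResult_t> CM_BatchMarkFragments(const std::vector<markRequest_t>& in) {
    std::vector<markResult_t> out;
    out.reserve(in.size());
    for (const markRequest_t& r : in) out.push_back(CM_MarkFragments(r));
    return out;
}

// Before: N marks -> N IPC round trips.
void AddMarksOneByOne(const std::vector<markRequest_t>& marks) {
    for (const markRequest_t& mark : marks) {
        markResult_t result = CM_MarkFragments(mark); // one round trip per mark
        (void)result; // ... build polys from result here ...
    }
}

// After: N marks -> 1 IPC round trip carrying all requests and results.
void AddMarksBatched(const std::vector<markRequest_t>& marks) {
    std::vector<markResult_t> results = CM_BatchMarkFragments(marks);
    (void)results; // ... build polys from results here ...
}
```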
This sounds like it could use the shmem command queue too in the future so maybe everything could fit in one area?
> Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.

> There can definitely be further optimizations done here. I have verified on my end that using fixed-size arrays for the input and output of the single trap call makes the engine go from 500 fps to 300 fps when rendering the ATCHS default alien spectator scene in 2K on my system, which is even slower than doing a single trap call for every mark individually. So, measured in terms of slowness: single trap call with large arrays > many trap calls with small data > single trap call with small arrays. So yes, IPC message size is critical, and anything reducing message size is welcome.
The obvious thing to do would be

```cpp
struct markMsgOutput_t
{
    std::vector<std::array<float, 3>> markPoints;
    std::vector<markFragment_t> markFragments;
};
```
To be lazy and avoid writing a serializer you could write it as an std::pair instead of a struct.
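(What that pair form could look like — a sketch, with markFragment_t stubbed out since its real definition lives in the engine headers:)

```cpp
#include <array>
#include <utility>
#include <vector>

struct markFragment_t { int firstPoint; int numPoints; }; // stand-in definition

// Same payload as the struct above, expressed as a pair so a generic
// pair serializer can handle it without a hand-written one.
using markMsgOutput_t = std::pair<std::vector<std::array<float, 3>>,
                                  std::vector<markFragment_t>>;
```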
> To be lazy and avoid writing a serializer you could write it as an `std::pair` instead of a struct.
Done.
> This sounds like it could use the shmem command queue too in the future so maybe everything could fit in one area?
There are two forward paths:
- Implement `trap_CM_BatchMarkFragments` in game, never querying the engine. It may be possible that the cgame has all the knowledge required to do this.
- Implement this and the other code called after it in the engine, and send a single async message.
It looks like the return of `trap_CM_BatchMarkFragments` is used in a computation whose result is then sent again to the engine with `trap_R_AddPolyToScene`.
For now this batched call looks good enough, as it is less intrusive and less error-prone to implement while already providing a large speed-up. It is a good first step that is more reachable and brings almost all of the benefit for a smaller development cost (time, effort…).
I squashed the commits. This looks ready to me.
As mentioned in previous comments, there are further improvements doable, including some that reduce the message size, but they require rewriting parts of R_MarkFragments, which is also used by the legacy trap_CM_MarkFragments, as well as parts of the legacy trap_CM_MarkFragments function itself. So those further improvements are expected to be done after trap_CM_MarkFragments is NUKED, and I plan to NUKE trap_CM_MarkFragments after this is merged and the submodules are synced.
Absolutely, we should merge this if @slipher doesn't have any more comments. It seems OK to me.
Since the last review, along with the implementation of std::pair, I did things like this to improve the readability of the code:

```cpp
const markMsgInput_t& input = markMsgInput[ m ];
const markOriginalPoints_t& originalPoints = input.first;
const markProjection_t& projection = input.second;
```
This avoids writing meaningless things like markMsgInput[ m ].first and gives names like originalPoints instead. I assume the compiler knows how to optimize all of this away and avoids any useless pointer copies and dereferences.
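(For reference, C++17 structured bindings could express the same unpacking in a single line; a sketch reusing the names from the snippet above:)

```cpp
// Binds both names directly to the pair's members; the compiler generates
// the same references as the explicit .first/.second version.
const auto& [ originalPoints, projection ] = markMsgInput[ m ];
```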
> > To be lazy and avoid writing a serializer you could write it as an `std::pair` instead of a struct.
>
> Done.
That's not what I meant. If you have a struct of POD types, it serializes automatically, and replacing it with a pair just makes it less readable for no reason. The point is that you can use vectors instead of arrays of length 300 or whatever. (And if the struct contains vectors it won't serialize automatically.) Probably there are only a couple of triangles most of the time, so these large arrays are really wasteful in terms of I/O. This is important if the goal is to reduce the number of socket writes.
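(A minimal sketch of what a hand-written serializer for a vector-bearing struct could look like; this is generic byte-buffer code under assumed types, not Daemon's actual serialization API:)

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct markFragment_t { int firstPoint; int numPoints; }; // stand-in definition

struct markMsgOutput_t {
    std::vector<std::array<float, 3>> markPoints;
    std::vector<markFragment_t> markFragments;
};

// Append the raw bytes of a trivially copyable value to an outgoing buffer.
template <typename T>
void Put(std::vector<char>& buf, const T& v) {
    const char* p = reinterpret_cast<const char*>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Length-prefixed vector: a 32-bit count followed by only the live elements,
// instead of a fixed-size array that always costs MAX * sizeof(T) bytes.
template <typename T>
void PutVector(std::vector<char>& buf, const std::vector<T>& v) {
    Put(buf, static_cast<uint32_t>(v.size()));
    for (const T& e : v) Put(buf, e);
}

void Serialize(std::vector<char>& buf, const markMsgOutput_t& msg) {
    PutVector(buf, msg.markPoints);
    PutVector(buf, msg.markFragments);
}
```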
Then I don't know what you meant about using pairs. I don't know what POD means.
> The point is that you can use vectors instead of arrays of length 300 or whatever.
I don't want to change this data layout before NUKING the legacy code, and this PR is already large enough. We will have plenty of time to improve on this later. Using vectors where arrays remain is something I want to do, but doing it right now would overcomplicate the PR. I'm well aware we can use vectors instead of arrays; it's just too early to do it.
I reverted the pairs (to be squashed).
I tweaked it to use variable-size vectors on the slipher/batch-marks branches in this repo and Unvanquished.
I rebased and squashed the vector change.
LGTM