
cg_api: implement `trap_CM_BatchMarkFragments` to process all impact marks in one call

Open illwieckz opened this issue 1 year ago • 21 comments

Engine counterpart of:

  • https://github.com/Unvanquished/Unvanquished/pull/2981

illwieckz avatar May 04 '24 08:05 illwieckz

Hmm, something very annoying is happening: the code is much slower with a single trap call than with many.

Is performance strongly tied to the size of IPC messages?

I currently send an array<MAX>; should I send a vector(num) instead?

illwieckz avatar May 05 '24 05:05 illwieckz

The PR is ready; the commits will be squashed, so there's no need to review them one by one.

illwieckz avatar May 05 '24 06:05 illwieckz

Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.

There can definitely be further optimizations done here. I have verified on my end that using fixed-size arrays for the input and output of the single trap call made the engine go from 500 fps down to 300 fps when rendering the ATCHS default alien spectator scene in 2K on my system, which is even slower than doing a single trap call for every mark individually. So, measured in terms of slowness: single trap call with a large array > many trap calls with small data > single trap call with a small array.

So yes, IPC message size is critical; anything that reduces message size is welcome.
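As a rough sketch of why the fixed-size layout costs so much (all the constants below are assumptions for illustration, not the engine's real values): a fixed array is serialized in full every time, while a length-prefixed vector only pays for the entries actually used.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes, for illustration only: one mark point is 3 floats, and
// the hypothetical fixed buffer always holds MAX_MARK_POINTS entries.
constexpr std::size_t POINT_BYTES = 3 * sizeof(float);
constexpr std::size_t MAX_MARK_POINTS = 384;

// Fixed-size array: the wire size is constant no matter how many points
// are actually used.
constexpr std::size_t fixedPayloadBytes()
{
	return MAX_MARK_POINTS * POINT_BYTES;
}

// Length-prefixed vector: the wire size scales with the actual count,
// plus a small length header (assumed to be 4 bytes here).
constexpr std::size_t vectorPayloadBytes( std::size_t numPoints )
{
	return sizeof(uint32_t) + numPoints * POINT_BYTES;
}
```

With these assumed numbers, a typical mark with a handful of points serializes to a few dozen bytes instead of several kilobytes.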

illwieckz avatar May 05 '24 22:05 illwieckz

Note: rewriting re.MarkFragments so that it no longer uselessly expects global constants like 4, MAX_MARK_POINTS, and MAX_MARK_FRAGMENTS as input will be done in a future PR, after NUKING the legacy trap_CM_MarkFragments function, where those constants were written in the game and sent to the engine instead.

illwieckz avatar May 05 '24 23:05 illwieckz

Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.

The content of fragmentBuffer_t is the return value of re.MarkFragments. Optimizing the size of this return will also be done after rewriting re.MarkFragments, which itself should happen after NUKING the legacy trap_CM_MarkFragments function first; I expect such a NUKING PR to come later.

illwieckz avatar May 05 '24 23:05 illwieckz

We can also rewrite re.MarkFragments to take markPoints.data() as input instead of markPoints.data()[ 0 ].data(), but again, this would only happen after the legacy trap_CM_MarkFragments is NUKED.

illwieckz avatar May 05 '24 23:05 illwieckz

This works well for me, there's no longer a significant performance difference between /cg_marks on and /cg_marks off (still depends on the amount of alien buildings in view of course, but that's expected).

VReaperV avatar May 09 '24 08:05 VReaperV

Yeah, this is still dependent on the number of marks, because as soon as the array is too large to fit into the kernel buffer the message will be split, but that split only happens when the packet is oversized. This means a usual scene will likely only need 1 or 2 messages instead of 1 message per mark.
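That splitting can be sketched as follows (the 64 KiB kernel buffer and the 64-byte per-mark payload below are assumed numbers for illustration; real socket buffers and mark sizes vary):

```cpp
#include <cstddef>

// Number of socket writes needed to push totalBytes through a kernel
// buffer of kernelBufferBytes: ceil(totalBytes / kernelBufferBytes).
constexpr std::size_t socketWrites( std::size_t totalBytes, std::size_t kernelBufferBytes )
{
	return ( totalBytes + kernelBufferBytes - 1 ) / kernelBufferBytes;
}
```

With an assumed 64-byte mark payload and a 64 KiB buffer, a batch of 50 marks fits in a single write, whereas the per-mark approach would issue 50 of them.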

Every player blob shadow is a mark, every alien creep is a mark, every bullet impact is a mark, and every grenade or flamer burnt area is a mark. The number of marks increases very quickly in a game, so this greatly reduces the number of IPC messages done for them!

It would be better to entirely avoid the engine-client round trip there by fully moving the related code into the cgame, but doing this kind of vectorization is already a big win without being too intrusive, and we still rely on well-tested code.

illwieckz avatar May 09 '24 08:05 illwieckz

This sounds like it could use the shmem command queue too in the future so maybe everything could fit in one area?

DolceTriade avatar May 09 '24 08:05 DolceTriade

Is having a 4KB large fragmentBuffer_t struct really faster than using variable-size arrays? Allocation can be slow but having to split up the socket reads/writes (since the amount of data is bigger than the kernel buffer) could be slower.

There can definitely be further optimizations done here. I have verified on my end that using fixed-size arrays for the input and output of the single trap call made the engine go from 500 fps down to 300 fps when rendering the ATCHS default alien spectator scene in 2K on my system, which is even slower than doing a single trap call for every mark individually. So, measured in terms of slowness: single trap call with a large array > many trap calls with small data > single trap call with a small array.

So yes, IPC message size is critical; anything that reduces message size is welcome.

The obvious thing to do would be:

struct markMsgOutput_t
{
	std::vector<std::array<float, 3>> markPoints;
	std::vector<markFragment_t> markFragments;
};

To be lazy and avoid writing a serializer you could write it as an std::pair instead of a struct.
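A minimal sketch of that suggestion, assuming the IPC serializer already understands std::pair, std::vector, and std::array but not arbitrary user structs (markFragment_t here is a stand-in, not the engine's real definition):

```cpp
#include <array>
#include <utility>
#include <vector>

// Stand-in for the real markFragment_t, for illustration only.
struct markFragment_t
{
	int firstPoint;
	int numPoints;
};

// Same payload as the struct above, but spelled as std::pair so a
// serializer that already handles pairs and vectors can process it
// without a hand-written (de)serialization function.
using markMsgOutput_t =
	std::pair<std::vector<std::array<float, 3>>, // markPoints
	          std::vector<markFragment_t>>;      // markFragments
```

The trade-off is readability: `.first` and `.second` say nothing about what they hold, which is why a named struct is nicer once a serializer for it exists.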

slipher avatar May 10 '24 02:05 slipher

To be lazy and avoid writing a serializer you could write it as an std::pair instead of a struct.

Done.

illwieckz avatar May 10 '24 06:05 illwieckz

This sounds like it could use the shmem command queue too in the future so maybe everything could fit in one area?

There are two forward paths:

  • Implement trap_CM_BatchMarkFragments in the game, never querying the engine. It may be that the cgame has all the knowledge required to do this.
  • Implement this, and the other code called after it, in the engine, sending a single async message. It looks like the return of trap_CM_BatchMarkFragments is used in a computation whose result is then sent back to the engine with trap_R_AddPolyToScene.

For now this batched call looks good enough, as it is less intrusive and less error-prone to achieve while already providing a large speed-up. This is a good first step that is more reachable, and it brings almost all of the benefit for a smaller development cost (time, effort…).

illwieckz avatar May 10 '24 07:05 illwieckz

I squashed the commits. This looks ready to me.

illwieckz avatar May 10 '24 07:05 illwieckz

As mentioned in previous comments, further improvements are doable, including some that reduce the message size, but they require rewriting parts of R_MarkFragments, which is also used by the legacy trap_CM_MarkFragments, as well as parts of the legacy trap_CM_MarkFragments function itself. So those further improvements are expected to be done after trap_CM_MarkFragments is NUKED, and I plan to NUKE trap_CM_MarkFragments after this is merged and the submodules are synced.

illwieckz avatar May 10 '24 07:05 illwieckz

Absolutely, we should merge this if @slipher doesn't have any more comments. It seems OK to me.

DolceTriade avatar May 10 '24 07:05 DolceTriade

Since the last review, along with the implementation of std::pair, I did things like this to improve the readability of the code:

const markMsgInput_t& input = markMsgInput[ m ];
const markOriginalPoints_t& originalPoints = input.first;
const markProjection_t& projection = input.second;

This avoids writing meaningless things like markMsgInput[ m ].first in favor of names like originalPoints. I assume the compiler knows how to optimize all of this away and avoid any useless pointer copies and dereferences.

illwieckz avatar May 10 '24 07:05 illwieckz

To be lazy and avoid writing a serializer you could write it as an std::pair instead of a struct.

Done.

That's not what I meant. If you have a struct of POD types, it serializes automatically, and replacing it with a pair just makes it less readable for no reason. The point is that you can use vectors instead of arrays of length 300 or whatever. (And if the struct contains vectors, it won't serialize automatically.) Probably there are only a couple of triangles most of the time, so these large arrays are really wasteful in terms of I/O. This matters if the goal is to reduce the number of socket writes.

slipher avatar May 10 '24 14:05 slipher

Then I don't know what you meant about using pairs. I don't know what POD means.

The point is that you can use vectors instead of arrays of length 300 or whatever.

I don't want to change this data layout before NUKING the legacy code, and this PR is already large enough. We will have plenty of time to improve on this later. Using vectors where arrays remain is something I want to do, but doing it right now would overcomplicate the PR. I'm well aware we can use vectors instead of arrays; it's just too early to do it.

illwieckz avatar May 10 '24 15:05 illwieckz

I reverted the pairs (to be squashed).

illwieckz avatar May 10 '24 15:05 illwieckz

I tweaked it to use variable-size vectors on the slipher/batch-marks branches in this repo and Unvanquished.

slipher avatar May 10 '24 17:05 slipher

I rebased and squashed the vector change.

illwieckz avatar May 10 '24 18:05 illwieckz

LGTM

slipher avatar May 11 '24 19:05 slipher