godot icon indicating copy to clipboard operation
godot copied to clipboard

Fix race conditions in PhysicsServers wrappers

Open LeaoLuciano opened this issue 2 years ago • 9 comments

When Physics runs in a separated thread, a race condition may occur in GodotCollisionObjectXD::_update_shapes(), causing two nodes of the same shape to be created in BVH tree.

This PR solves this by adding PhysicsServerXDWrapMT::physics_server_xd->body_test_motion() in the PhysicsServerXDWrapMT::command_queue, avoiding _update_shapes() to run in the main thread.

Fixes #70491

LeaoLuciano avatar Feb 01 '23 04:02 LeaoLuciano

I agree the current code is unsafe.

Now, I have a couple of thoughts:

  • I don't get why keeping limiting caller threads to the main one, now the method is totally thread-safe.
  • I wonder why these methods were excluded from being queued in the first place. Maybe the forced sync was too big an enemy of the intended performance gains of threaded physics?

RandomShaper avatar Jun 20 '23 11:06 RandomShaper

  • I don't get why keeping limiting caller threads to the main one, now the method is totally thread-safe.

Good question, I've tracked the origin of ERR_FAIL_COND_V(main_thread != Thread::get_caller_id(), false); and it is there since the creation of the file (9df77d276593ef7).

Maybe this is to ensure the function is only called by the gdscript.

  • I wonder why these methods were excluded from being queued in the first place. Maybe the forced sync was too big an enemy of the intended performance gains of threaded physics?

My guess is the author of 9df77d276593ef7 didn't manage to make FUNC3R() call ERR_FAIL_COND_V(main_thread != Thread::get_caller_id(), false);.

bool body_test_motion(RID p_body, const MotionParameters &p_parameters, MotionResult *r_result = nullptr) override {
	ERR_FAIL_COND_V(main_thread != Thread::get_caller_id(), false);
	return physics_server_2d->body_test_motion(p_body, p_parameters, r_result);
}

LeaoLuciano avatar Jun 23 '23 00:06 LeaoLuciano

My guess is the author of 9df77d2 didn't manage to make FUNC3R() call ERR_FAIL_COND_V(main_thread != Thread::get_caller_id(), false);.

I'm not very fond of that hypothesis. First of all, the author is @reduz, to whom I'm assume enough skills to do so. Second, that check would be pointless with FUNC3R(), since, as discussed above, the call would be thread-safe.

That said, I can't think of any reasonable alternative approach to fixing this issue that what this PR does. However, the thread check (i.e., all the WRITE_ACTION stuff) is not needed. Once that change is done, I approve on my part, but I'd still like to know the opinion of a physics maintainer.

RandomShaper avatar Jun 23 '23 11:06 RandomShaper

@RandomShaper You promised to approve on your part? 🙃 Is everything like you think it should be now?

YuriSizov avatar Oct 30 '23 16:10 YuriSizov

Posting when tired never a good idea: :smile:

BVH is not compiled thread safe on master as far as I remember. So any test_motion and access to change the BVH has to happen on the same thread.

Update

Sorry looking at this with fresh eyes this morning, and it makes more sense:

This does seem okay as far as safety because FUNC3R either flushes the command queue and directly calls the function if it the server thread, or else pushes to the command queue and syncs, then returns the result if not. So this should hopefully fulfill the requirement that all the physics processing happens on the physics thread. I'm assuming that when running physics not in multithread mode, the server_thread is the main thread.

As far as performance is concerned, this PR looks like it will flush the command queue before calling body_test_motion, whereas the old version didn't flush the command queue. This may or may not be desirable (really needs a physics guy).

I'm guessing the previous version allowed test_body_motion() to be run from the main thread, while the server was simultaneously doing something else on its own thread.

Summary

So to summarise, as far as I can see, it looks safe and may fix the bug, which is good. But it will flush the command queue in normal mode (which didn't happen before). This is probably fine and in some ways makes more sense (although I wonder why it didn't flush before maybe for some performance reason?).

It will also cause a sync in multithread mode (but that is nearly always the penalty of reading back from servers, and the physics server does this a lot). But I agree with @RandomShaper that some care has to be taken here that the benefit of running multithread physics isn't wiped out by the need to keep synchronising the main and physics thread (so one is waiting for the other etc).

So probably fine imo but would welcome further input. I'm only commenting here because the race condition may have been in the BVH which is my area, but is simply caused by using single threaded code from multiple threads.

If a physics maintainer is not available who is familiar with threading model, this may be worth checking with @reduz as he is most familiar with how much physics server is susceptible to stalls, to check this looks right as it will probably only take a couple mins.

lawnjelly avatar Oct 30 '23 19:10 lawnjelly

This needs a proper fix, as this this function is eventually intended to work multithreaded, hence this won't do. Need to discuss with other physics maintainers how to properly implement this functionality.

reduz avatar Jan 04 '24 18:01 reduz

I tested this fix, and confirm it fixes both the repro.zip MRP in #70491 and the GDQuest TPS demo that I reported crashing in #90689.

This needs a proper fix, as this this function is eventually intended to work multithreaded, hence this won't do. Need to discuss with other physics maintainers how to properly implement this functionality.

Would be good to make this discussion happen :)

akien-mga avatar Apr 15 '24 10:04 akien-mga

I had a similar issue and tested this fix, which did stop the crashing, but it absolutely killed performance.

I ended up restructuring the system to not rely on multi threaded physics but this change seemed to have a massive impact which shouldn't be overlooked.

Sorry I don't have a repro or numbers on hand, I can dig through Git if anyone needs.

cridenour avatar Apr 15 '24 14:04 cridenour

New thoughts:

  • Should maybe these functions have the same constraint as body_get_direct_state()?, that is, make it only usable from physics process. That way, the fact that the command queue has to be flushed is known.
  • In case we want to keep the fast path of running it without a flush, we could do the following (but is that what we want?):
    • A) In safe threading mode, just run the query, pretty much as it's happening with the current code. (By the way, I have a PR I'm working on that will remove the queue for single-safe threading.)
    • B) In separate thread mode, add some mechanism to stop the thread, run the query and re-start the thread.
  • Not directly related, but I'm wondering if MotionParameters should be RefCounted to avoid potentially expensive copies.

RandomShaper avatar Apr 24 '24 10:04 RandomShaper