Spin lock on volatile variable without an explicit barrier is extremely unsafe

Open jamesdolan opened this issue 5 years ago • 5 comments

https://github.com/NVIDIAGameWorks/PhysX/blob/4050bbfdc2699dfab7edbf0393df8ff96bbe06c5/physx/source/foundation/src/windows/PsWindowsMutex.cpp#L148

Both the Unix and Windows ReadWriteLock implementations from 3.4 to 4.1 spin on a volatile variable without an explicit barrier, which is incredibly unsafe.

Volatile only guarantees the compiler won't optimize the access away; it says nothing about thread safety. Some compilers will insert fences for you, but that's a really fragile assumption.
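
To illustrate what I mean by an explicit barrier, here is a minimal sketch with std::atomic (the names are made up, this is not the PhysX code):

```cpp
#include <atomic>

// Illustrative only; names are hypothetical, not the PhysX ones.
std::atomic<int> readerCount{0};

void waitForReadersToDrain()
{
    // The acquire load makes writes released by the readers visible once
    // the count reaches zero. A plain "volatile int" gives no such
    // guarantee in ISO C++; any fence would be a compiler-specific extra.
    while (readerCount.load(std::memory_order_acquire) != 0)
    {
        // spin
    }
}
```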

Spinning is also pretty bad in cases where read locks are held for more than trivial amounts of time.

It would almost certainly be better to use the Windows slim reader/writer lock (SRWLOCK) and pthread_rwlock instead of the current implementation.
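
As a rough sketch of what I'm suggesting (illustrative only, not the PhysX API), a thin wrapper over the native primitives could look like this:

```cpp
// Illustrative wrapper, not the PhysX API.
#ifdef _WIN32
#include <windows.h>

class ReadWriteLock
{
    SRWLOCK mLock;
public:
    ReadWriteLock()    { InitializeSRWLock(&mLock); }
    void lockRead()    { AcquireSRWLockShared(&mLock); }
    void unlockRead()  { ReleaseSRWLockShared(&mLock); }
    void lockWrite()   { AcquireSRWLockExclusive(&mLock); }
    void unlockWrite() { ReleaseSRWLockExclusive(&mLock); }
};
#else
#include <pthread.h>

class ReadWriteLock
{
    pthread_rwlock_t mLock;
public:
    ReadWriteLock()    { pthread_rwlock_init(&mLock, nullptr); }
    ~ReadWriteLock()   { pthread_rwlock_destroy(&mLock); }
    void lockRead()    { pthread_rwlock_rdlock(&mLock); }
    void unlockRead()  { pthread_rwlock_unlock(&mLock); }
    void lockWrite()   { pthread_rwlock_wrlock(&mLock); }
    void unlockWrite() { pthread_rwlock_unlock(&mLock); }
};
#endif
```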

I bring this up because we are seeing periodic hangs when trying to acquire scene locks where both lockRead() and lockWrite() on the PxScene stall for several hundred milliseconds, and it seems to be linked to CPU context switches.

jamesdolan avatar Apr 26 '19 01:04 jamesdolan

This isn't the first time we've had this questioned by users.

While it has been a while since we benchmarked, the Slim reader/writer lock on Windows was dramatically slower than this implementation the last time we measured it. The context in which we benchmarked was our expected use case for game developers: lots of short-held read/write locks on a scene coming from multiple threads. In that context, SRWLock added an enormous amount of overhead.

The story very likely changes if you have threads holding read/write locks for extended periods of time - at that point, the frequency at which locks change hands is low and the overhead of the locking mechanism becomes irrelevant. If the number of SW threads is significantly higher than the number of available HW threads and a lot of threads attempt to acquire a write lock, the spin lock could cause problems and the forced context switches from the more heavyweight locks may become beneficial.

However, only one writer will acquire the mutex and spin - the others will be stuck waiting for that mutex. Similarly, no further read locks will be successfully acquired (those threads will context switch away on the mutex), leaving us waiting for the remaining read threads to complete and yield.
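
To make the scheme concrete, here is a simplified sketch of what I'm describing - not the actual PhysX source, just the shape of it: readers briefly take the mutex to bump a counter, while a writer takes the mutex and then spins until the outstanding readers have drained.

```cpp
#include <atomic>
#include <mutex>

// Simplified illustration of the mechanism described above.
class SpinningReadWriteLock
{
    std::mutex       mMutex;          // stand-in for the platform mutex
    std::atomic<int> mReaderCount{0};
public:
    void lockRead()
    {
        // A reader is blocked (and context-switched out) only while a
        // writer holds the mutex; otherwise it just bumps the counter.
        std::lock_guard<std::mutex> lock(mMutex);
        mReaderCount.fetch_add(1, std::memory_order_relaxed); // mutex provides ordering
    }

    void unlockRead()
    {
        // Releasing the read lock is just an atomic decrement.
        mReaderCount.fetch_sub(1, std::memory_order_release);
    }

    void lockWrite()
    {
        // Only one writer ever spins here; everyone else waits on the mutex.
        mMutex.lock();
        while (mReaderCount.load(std::memory_order_acquire) != 0)
        {
            // spin until all outstanding readers have released
        }
    }

    void unlockWrite()
    {
        mMutex.unlock();
    }
};
```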

Do you know what the cause of the 100ms+ stall is? Is there a thread genuinely holding a read lock for ages? How many SW threads vs HW threads are being used/available on your target platform? Are you using variable thread priorities and limiting thread affinities? That is often a recipe for these nasty spikes, depending on how the particular OS scheduler works. Have you tried putting explicit Ps::yield() calls in the while loop (perhaps after a minimum number of times around the loop)? This should yield the remainder of this thread's time slice to the next thread.
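
If it helps, the kind of yield I mean would look something like this (std::this_thread::yield() standing in for Ps::yield(), and the spin threshold is arbitrary):

```cpp
#include <atomic>
#include <thread>

// Hypothetical spin loop that yields after a fixed number of iterations;
// the threshold of 1024 is arbitrary and would need tuning.
void spinUntilReadersDrain(std::atomic<int>& readerCount)
{
    int spins = 0;
    while (readerCount.load(std::memory_order_acquire) != 0)
    {
        if (++spins > 1024)
        {
            // Give up the rest of this time slice so a reader stuck behind
            // us on the scheduler can finish and release its lock.
            std::this_thread::yield();
        }
    }
}
```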

You are absolutely right that volatile memory operations do not guarantee thread safety on all platforms. However, in this particular context, I don't think there is a risk. The read lock operation includes a mutex acquisition and a mutex release. These operations should include memory fences on all platforms AFAIK, so any data written by another thread should be visible to the reading thread and, if a read lock is acquired, no further writes are permitted. It is true that the code releasing the read lock does not include a memory fence and just atomically decrements a counter. However, the thread holding the read lock was not allowed to modify shared state, so there shouldn't be a risk with this. Only the thread holding a write lock is permitted to modify shared data, and this is guarded using mutexes which should guarantee writes are visible before the mutex is released.

kstorey-nvidia avatar Apr 26 '19 08:04 kstorey-nvidia

We are still trying to fully understand the problem; needless to say, we have a very atypical setup in both hardware and software.

HW-wise: basically a dual-socket CPU with a large number of cores.

SW-wise: a large number of threads (core count - 2) performing a large number of scene queries and thus holding read locks for about 10ms each. The write lock is typically held for a very short period of time, since the scene is effectively static except for a single moving object - typically less than 1ms.

But once in a blue moon, both the read and write locks take 100-500ms.

I set up some code to automatically capture a 30-frame window of capture data when we detect the issue, so I'm hoping to learn more soon.

jamesdolan avatar Apr 26 '19 19:04 jamesdolan

It would be interesting to know what is going on that causes this behaviour.

From an implementation stand-point, only one thread should ever be in the spin-lock waiting for all the read locks to be released. We could potentially introduce yields to force a context switch to free up this thread, but I doubt this would make much of a difference in your problem case.

What you're describing sounds like a thread holding a read lock has either context switched out for 100+ ms or is doing something extremely expensive. At the same time, the thread acquiring the write lock is spinning waiting for all read locks to be released. While this is happening, no further read locks can be acquired, so effectively everything is blocked waiting for one of the threads to complete some work and release the read lock.

As the thread acquiring the write lock has already acquired the mutex, no further read locks can be acquired; these threads should be context switched out by the OS while waiting for the platform mutex, so, theoretically, this shouldn't be a case of there being a thread paged out for 100+ms waiting for a processing resource to become available.

kstorey-nvidia avatar Apr 29 '19 08:04 kstorey-nvidia

Regarding this comment

You are absolutely right that volatile memory operations do not guarantee thread safety on all platforms. However, in this particular context, I don't think there is a risk.

I don't have the expertise to say you are mistaken (actually I'm fairly sure you're right), but I'd like to draw your attention to this quote from the Visual Studio documentation:

The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic<T> from the C++ Standard Library.

To make volatile slightly scarier for MSVC users, there's the fact that Microsoft recently changed its meaning in their compiler (to the actual meaning from the standard) - except that this only applies if you specify a compiler flag, except when you target ARM.
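
For reference, the flags in question are /volatile:ms and /volatile:iso; the comments below summarize my understanding of them and may need checking against the current documentation:

```cpp
#include <atomic>

// Illustration only. My understanding of the MSVC flags:
//   /volatile:ms   volatile loads get acquire semantics and volatile stores
//                  get release semantics (the traditional Microsoft
//                  extension; still the default for x86/x64 targets).
//   /volatile:iso  volatile keeps only its ISO C++ meaning, i.e. no
//                  inter-thread ordering at all (the default when
//                  targeting ARM).
volatile int     flagVolatile = 0;  // ordering depends on flag and target
std::atomic<int> flagAtomic{0};     // ordering guaranteed by the standard

void publish()
{
    flagVolatile = 1;                               // release only under /volatile:ms
    flagAtomic.store(1, std::memory_order_release); // release everywhere
}
```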

These two facts were scary enough to make me reevaluate the, probably invalid, use of volatile in my own code.

shaneasd avatar May 11 '19 14:05 shaneasd

I think we hit the same issue as @jamesdolan reported. We were doing some raytrace and overlap tests from 128 threads concurrently on an AMD 3990X. Switching to the Windows SRWLock resulted in a 10x performance boost.

zhaijialong avatar Nov 02 '21 06:11 zhaijialong