gamescope icon indicating copy to clipboard operation
gamescope copied to clipboard

steamcompmgr, wlserver: fix various data races found w/ thread sanitizer

Open sharkautarch opened this issue 7 months ago • 8 comments

For testing w/ TSAN, I added on the meson flags -Db_sanitize=thread --buildtype=debugoptimized when compiling gamescope

I've only done tests w/ running gamescope in nested mode on my X11 host desktop, so this PR might need testing on a wayland host w/ the gamescope wayland backend.

I used the following command to test the latest commit from master w/ TSAN: TSAN_OPTIONS="suppressions=${HOME}/gamescope/suppress_mangoapp" build/src/gamescope -- vkcube |& tee gamescope_tsan.log note that I had to suppress mangoapp_update() (the suppress_mangoapp files is only one line: race:mangoapp_update) to avoid the log getting filled with messages about it

I observed at least three unique data race messages, occurring after gamescope + vkcube were actively running (so after initialization), that would reliably get triggered whenever I did the following in order:

  1. Cause gamescope's window(s) to go out-of-focus
  2. wait
  3. bring the window back into focus
  4. move the cursor around a bit

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:337 in bump_input_counter

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:2398 in wlserver_mousebutton(int, bool, unsigned int)

SUMMARY: ThreadSanitizer: data race ../src/wlserver.cpp:2379 in wlserver_mousewarp(double, double, unsigned int, bool)

Full log: https://gist.github.com/sharkautarch/4a8cf292d8b70d133904708ac29465ed

While there were a lot of TSAN warnings which could be false positives and/or non-issues, the data races addressed by this PR seemed like a bigger deal

~~EDIT:~~ ~~I just realized that if (bPrevState & suspended) should be if (!bPrevState & suspended) OOPSIES~~ ~~Will fix that soon~~ ^FIXED

Also, some explanations for some of the weird things in this PR:

  • I had to make bCursorHidden be std::atomic<char> in order to be able to use atomic fetch_or. This should be safe because I made sure that bCursorHidden is always initialized and assigned bool values which means that the variable will always be either 1 or 0
  • ~~I use atomic_ref w/ inputCounter instead of a normal atomic because inputCounter is passed into an Xlib function. That means that the Xlib function is doing a non-atomic write to inputCounter, but TSAN doesn’t seem to complain, so either it’s a false negative or the threads just never actually race at that spot… I’ve been using gcc’s thread sanitizer, so maybe this might warrant checking w/ clang’s thread sanitizer…!~~
  • This PR uses a lot of acquire/release memory ordering. This is just to minimize any performance impact, since this PR adds a significant amount of atomic ops. I’m pretty sure that only two threads could simultaneously access memory (at least in the areas that this PR deals with) so, from what I understand, using acquire/release should be fine

~~EDIT: for the second bullet point, now that I think about it, Xlib is a shared library, which means that the only way for TSAN to catch races from Xlib function calls would be to rebuild Xlib w/ TSAN. So I think I’ll need to edit this PR again~~ ~~^FIXED~~ ^Just realized that XChangeProperty doesn't modify its input data (or at least I don't think it does). I updated the PR again

~~last update: to deal with the one remaining edge case I saw, I changed all instances of~~ ~~wlserver.bCursorHidden.store(!wlserver.bCursorHasImage, std::memory_order_release);~~ to: ~~std::atomic<char> atomicTrue{true};~~ ~~atomicTrue.compare_exchange_strong(wlserver.bCursorHidden, wlserver.bCursorHasImage);~~

~~which makes the line a single CAS RMW operation~~

~~Not sure if the change is actually necessary,~~ ~~and the compare-and-exchange is definitely slower than a load+store.~~ ~~So let me know whether you think I should undo that...~~ ~~(edit: for now I've decided to change compare_exchange_strong to compare_exchange_weak, tho this only makes a difference w/ non-x86 architectures)~~ ^ I undid the CAS thing because I realized it didn't work the way I was thinking it would

sharkautarch avatar Jul 04 '24 22:07 sharkautarch