XRT icon indicating copy to clipboard operation
XRT copied to clipboard

Fix RC when accessing singleton's sessions vector

Open Qizot opened this issue 2 years ago • 3 comments

Problem solved by the commit

There are 2 threads xma_thread1 and xma_thread2 reading from a vector of sessions inside the g_xma_singleton. The access was not guarded by any mutex while other threads could access the sessions vector and push a new element, this sometimes was causing a race condition where the thread was looping over elements and a new elements was being added causing a vector resize. This automatically invalidated any existing vector iterator (which the threads were using for reading from vector) which resulted in a segfault when trying to dereference it.

Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered

The bug has been discovered when running a lot of quite short-lived sessions which caused the application to exit due to segfaults.

How problem was solved, alternative solutions (if any) and why they were rejected

Locking a mutex before accessing a list of sessions.

Risks (if any) associated with the changes in the commit

Probably none, the mutex could be replaced with std::shared_mutex as there are probably more reads than writes but with that low frequency of locking it, it is probably not worth it.

What has been tested and how, request additional testing if necessary

Documentation impact (if any)

Qizot avatar Sep 19 '22 07:09 Qizot

Can one of the admins verify this patch?

gbuildx avatar Sep 19 '22 07:09 gbuildx

@vboggara-xilinx This is causing segfaults for us and blocking us from using Xilinx in prod. Do you happen to know when this may be merged and then released?

Thanks!

vstreame avatar Oct 06 '22 13:10 vstreame

@dbenusov-xilinx I've moved the lock acquisition before the cpu_mode read as you wished. But when reading the comment of the XMA_CPU_MODE2 flag being #define XMA_CPU_MODE2 2 //High cpu load + high performance I'm not sure if that is good idea to lock and then sleep for 3 milliseconds. Maybe it should lock, check the flag, unlock, sleep and lock afterwards?

Qizot avatar Oct 10 '22 05:10 Qizot

That is a really good point I did not notice that, good catch. Try making the cpu_check an atomic variable and moving the lock_guard back to its original spot. Hopefully everything should work the same :)

dbenusov avatar Oct 19 '22 18:10 dbenusov

retest this please.

dayeh-xilinx avatar Oct 26 '22 20:10 dayeh-xilinx

Build Passed!

gbuildx avatar Oct 27 '22 01:10 gbuildx

changes looks good to me. we can merge this change.

vboggara-xilinx avatar Nov 01 '22 09:11 vboggara-xilinx

Can this finally be merged if all tests are passing?

Qizot avatar Nov 17 '22 10:11 Qizot