inquiry into mutex locking changes in `Device` and `Queue`
I'm unsure of some mutex locking changes in commit https://github.com/luxonis/depthai-core/commit/55066c5078f47eea588433945c0fec2840cbe12a
`dai::DeviceBase::closed` changed from `atomic<bool>` to a pair of `std::mutex` + `bool`
The two places I find these vars used, `DeviceBase::isClosed()` and `DeviceBase::close()`, correctly implement the mutex+bool pair.
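For context, here is a minimal before/after sketch of the two shapes (simplified, not the actual depthai-core code; I'm assuming the atomic variant used an exchange so the close code runs exactly once):

```cpp
#include <atomic>
#include <mutex>

// Simplified sketch, not the actual depthai-core code.
struct ClosedAtomic {                                  // previous shape
    std::atomic<bool> closed{false};

    bool isClosed() const { return closed; }           // lock-free, never blocks
    void close() {
        if(!closed.exchange(true)) {                   // exactly one thread wins
            // closeImpl();                            // ...and runs the close code
        }
    }
};

struct ClosedMutex {                                   // shape after the commit in question
    std::mutex closedMtx;
    bool closed{false};

    bool isClosed() {
        std::unique_lock<std::mutex> lock(closedMtx);  // blocks while close() holds the lock
        return closed;
    }
    void close() {
        std::unique_lock<std::mutex> lock(closedMtx);
        if(!closed) {
            // closeImpl();                            // runs with the lock held
            closed = true;
        }
    }
};
```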
My concern is that there is no known use case for this code change and no functional bug that it fixes. It adds code complexity, CPU overhead, and memory use that are not needed.
And finally, it makes a substantial behavior change: it introduces blocking where the original code had none. Multiple threads calling these functions will now block rather than return quickly. Since this changes blocking behavior, it is a risky change for no known need.
When I refactored this code months ago, I had my head around the locking needs of the multiple threads running. At that time I felt the `atomic<bool>` was the better choice for simplicity and its non-blocking behavior. It didn't matter which thread closed a device; it only mattered that one thread ran the close code. I acknowledge that I can be wrong and write bugs. But that isn't the case here as there is no known bug and no imagined case.
```cpp
int DeviceBase::getXLinkChunkSize() {
    checkClosed();
    return pimpl->rpcClient->call("getXLinkChunkSize").as<int>();
}
```
`checkClosed()` there is useless. It calls `isClosed()`, which locks the mutex and returns the bool value of `closed`. If it is `true`, then `checkClosed()` throws, which prevents the call to `pimpl->rpcClient->call()`. But the checking of the value and the throw are disconnected, making the test useless.
| Time | Thread 1 | Thread 2 |
|---|---|---|
| 1 | `DeviceBase::getXLinkChunkSize()` | |
| 2 | `checkClosed()` returns false | |
| 3 | …os pauses thread… | `DeviceBase::close()` |
| 4 | …os pauses thread… | `std::unique_lock<std::mutex> lock(closedMtx);` |
| 5 | …os pauses thread… | `if(!closed) {` |
| 6 | …os pauses thread… | `closeImpl();` |
| 7 | …os pauses thread… | `closed = true;` |
| 8 | `pimpl->rpcClient->call("getXLinkChunkSize").as<int>()` | |
| 9 | `call(…)` | |
| 10 | 💥 or something fails/bad/errors/etc | |
Changing the behavior of the closed/closedMtx to be blocking doesn't help any of the cases that I can readily find. Instead, the above commit causes mutex locking and blocking of threads in the multithreaded world of depthai+xlink sending/receiving packets and making RPC calls. I think this change should be reverted.
`dai::XLinkConnection::closed` changed from `atomic<bool>` to a pair of `std::mutex` + `bool`
Same everything as above. No known need, more complex, and behavior change to blocking.
`XLinkConnection::isClosed()` is now blocking; therefore `XLinkConnection::checkClosed()` is now blocking. Surprisingly, I didn't find any code calling `XLinkConnection::checkClosed()`, so I don't have a code example.
I also think this change should be reverted.
`dai::LockingQueue::destructed` changed from `std::atomic<bool>` to a pair of `std::mutex` + `bool`, plus redundant `if()` tests
Same everything as above. No known need, more complex, and (less concerning) behavior change to blocking.
I intentionally removed the `if(destructed)` and you agreed in my commit fc4b3148cf9be85d0e3d82217d344417de2327b1.
Comments start at https://github.com/luxonis/depthai-core/issues/366#issuecomment-1031481519
I don't now have the entirety of that codebase in my head. I would need time to recreate (ugh) my knowledge.
But I trust the decision we both made in Feb...unless you have a specific bug or imagined scenario. 🤔
Thanks for all the details here!
Hi @diablodale
I'll only address the last one currently: https://github.com/luxonis/depthai-core/pull/509#issuecomment-1166161764 Let me know if that describes the issue at hand.
Will revisit first two cases later today
I agree on the first point. `checkClosed()` should take the mutex as well & then throw if it's closed, to work as expected.
Yes, both 1st and 2nd cases could be reverted. The imagined scenario doesn't actually require this, could be done with the previous impl (which is a known time after close actually closes).
Hi. Revisiting this as I've returned from my travels. I don't see there was activity in the codebase related to this. Can I make a PR to revert some of these changes?
Hi @diablodale
You may - the first and second cases don't have any specific rationale behind them apart from only reporting closed after it has actually closed
@themarpe , hi. I've got time this week to work on this. I see `checkClosed()` has spread and in the majority of cases it is useless. None of those cases are thread-safe and the check/throw has a random outcome. I give one of many possible threading scenarios in the OP above.
I recommend the code be fixed as any safety intended by `checkClosed()` doesn't work. Often, the authors intended to implement... "I need to run some code, but my code can only run if the device is open, so check whether the device is closed and throw if it is." To implement that...a mutex lock is needed. A lock needs to be held during the entire block of code that has the restriction "...my code can't run if the device is closed...". All call locations may not have this same intention. Each call location needs to be analyzed to see if there are other intentions. I see ~50 locations. 🫤
I need to look more, but I have a gut worry around the restriction: "the device is closed" or is it "the device is not open". This might be coded differently in the underlying booleans and mutex. I have to review the code.
If that worry is false, then I might be able to reduce refactoring by using a movable lock like `unique_lock`, changing `checkClosed()` to return that lock as an r-value, and marking it `[[nodiscard]]`. The latter is C++17 only, so this is error prone...even more since core devs have gotten used to errant usage. Since you want to support C++14, it is probably better to delete the bad `checkClosed()` function and create a new one like `getClosedLock()` that returns the r-value and heavily hints there is a return.
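A minimal sketch of the `getClosedLock()` idea (the name and the throw are from my description above; placement and details are hypothetical):

```cpp
#include <mutex>
#include <stdexcept>

class DeviceSketch {
public:
    // Hypothetical replacement for checkClosed(): returns a held lock so the caller
    // keeps the device from closing for the duration of its critical section.
    // [[nodiscard]] needs C++17; on C++14 the attribute is unavailable, which is why
    // a distinct name that hints at the returned lock matters.
    [[nodiscard]] std::unique_lock<std::mutex> getClosedLock() {
        std::unique_lock<std::mutex> lock(closedMtx);
        if(closed) throw std::runtime_error("Device already closed or disconnected");
        return lock;  // moved out; caller holds it until the end of its scope
    }

    int getXLinkChunkSize() {
        auto lock = getClosedLock();  // device cannot transition to closed while held
        // return pimpl->rpcClient->call("getXLinkChunkSize").as<int>();
        return 0;                     // placeholder so the sketch compiles stand-alone
    }

private:
    std::mutex closedMtx;
    bool closed{false};
};
```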
Comments before I start coding?
Also problematic is that `isClosed()` is a public function on several classes; and it is exposed to both C++ and Python. `isClosed()` is not thread-safe and I recommend removing it completely. Until you approve the removal, I'll doc/mark the API strongly to discourage usage and provide warnings.
Hi @diablodale
> I need to look more, but I have a gut worry around the restriction: "the device is closed" or is it "the device is not open". This might be coded differently in the underlying booleans and mutex. I have to review the code.
The intention is to error out if the device was already closed/disconnected.
I agree on it not being thread safe; at the time it was mostly a quick add, to see if the device went down already.
I think it would be best that the underlying RPC calls themselves would check for that & properly error out; that way there would be no need to hold the lock until the RPC is finished, etc...
WRT isClosed - the main intention is to check whether or not the device connection is still live. The removal of it is not planned, and it would be better to address the shortcomings of it at the current state

> WRT isClosed - the main intention is to check whether or not the device connection is still live. The removal of it is not planned, and it would be better to address the shortcomings of it at the current state
ok. Callers that spin on `isClosed()` with a `wait(1s)`, as seen in some examples, are not a concern. The concern is the other callers that do active work with the expectation that the device is up and running. Pseudocode...
```
while (not device.isClosed()) {
    device.makeStateChanges()
    device.push(data)
    device.pull(data)
}
```
Imagine that above in python. There's nothing that prevents this scenario...
| python interpreter | some other thread |
|---|---|
| call isClosed() | |
| receive FALSE from isClosed() | |
| | set running=false, deallocate all device/queue resources, etc. |
| logical NOT(FALSE) --> TRUE | |
| while(TRUE) loop | |
| device.makeStateChanges() | |
| 💣 throws, crashes etc. | |
Given `isClosed()` is unreliable/random, I almost think it is better to write the loop as `while(true)` and let the throw/crash cause the loop exit.
I also want to highlight a scenario that I see repeatedly, provide a suggestion, and ask your feedback so that I know what to do throughout the codebase.
https://github.com/luxonis/depthai-core/blob/e5dae8fd2f9e543bd99f9d1a73959398ef5bd0b4/include/depthai/device/DataQueue.hpp#L144-L155
Two issues in this code:

1. `if(!running) throw...` is not thread-safe. The running state can change any nanosecond after getting the low-level boolean: in the negation, in the if evaluation, in between it and the throw, etc.
2. The whole function may not be thread-safe. What is the intention with regards to `DataOutputQueue` state while this function is called? Is it...
   a. "Allow close/deallocating resources while I try to retrieve message T from those resources (e.g. queue)"
   b. "Prevent close/deallocating resources while I try to retrieve message T from those resources (e.g. queue)"

If the intention is 2b...then a mutex lock is needed so that `close()` can't proceed at the same time `tryGet()` is running (sketched below). I don't think this specific scenario will cause a crash since the queue has its own internal mutex locks. Instead, this is more a semantic/logical inquiry so that I understand the intention.
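To illustrate intention 2b only (the class and lock names here are hypothetical; the real `DataOutputQueue` is different), a reader/writer lock would let `tryGet()` proceed while forcing `close()` to wait for in-flight calls:

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>  // std::shared_timed_mutex and std::shared_lock exist in C++14

// Hypothetical illustration of intention 2b: hold a shared lock during tryGet() so
// close() (which takes the lock exclusively) cannot tear down resources mid-call.
class OutputQueueSketch {
public:
    template <class T>
    std::shared_ptr<T> tryGet() {
        std::shared_lock<std::shared_timed_mutex> guard(lifetimeMtx);  // readers may overlap
        if(!running) return nullptr;
        // ... pop from the underlying queue while resources are guaranteed alive ...
        return nullptr;
    }

    void close() {
        std::unique_lock<std::shared_timed_mutex> guard(lifetimeMtx);  // waits for in-flight tryGet()
        running = false;
        // ... deallocate / unblock the underlying queue ...
    }

private:
    std::shared_timed_mutex lifetimeMtx;
    bool running{true};
};
```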
> I think it would be best that the underlying RPC calls themselves would check for that & properly error out; that way there would be no need to hold the lock until the RPC is finished, etc...
Reviewing the `develop` codebase, I think the rpc functionality protects itself and throws when bad things happen. Here's what I've seen in code.
- `DeviceBase::init2()` creates the `pimpl->rpcStream` and `pimpl->rpcClient`
- `pimpl->rpcStream` retains a `std::shared_ptr<dai::XLinkConnection>` which it copied during construct. The XLinkConnection won't ~destroy itself due to this shared ownership. Naturally, a connection can fail due to unexpected problems.
- `pimpl->rpcClient` lambda retains a copy of the `pimpl->rpcStream`. That specific xlinkstream will not ~destroy itself due to this shared ownership. And remember that specific xlinkstream contains the shared XLinkConnection.
- `pimpl->rpcClient` lambda calls `rpcStream->write()` and `->read()`. Both of them will throw if they encounter any low-level XLink errors. So if at any time the XLinkConnection and/or XLinkStream die, then these two function calls should throw. The throw will cascade up the call hierarchy: first out of the lambda and to the code that called `rpcClient->call()`.
- Many (not all) of the places `checkClosed()` is used are within functions that use RPC. For example, `DeviceBase::setTimesync()` calls `pimpl->rpcClient->call("setTimesync", ...`. The throw I describe directly above will cascade up from within the `rpcClient->call()` and then further up. This throw will be `throw XLinkWriteError(status, streamName)` instead of the checkClosed throw `std::invalid_argument("Device already closed or disconnected");`. Personally, I'm ok with the low level throw and I also don't think `invalid_argument` is appropriate since the timesync argument was valid...it was something else that failed. But if you want, it is possible to wrap all the `pimpl->rpcClient->call()` call locations with a try/catch and throw whatever you want (sketched below).
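For example, a wrapper along these lines could translate the low-level throw at each call location (the helper name is hypothetical; only `rpcClient->call()` is from the codebase):

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: run the RPC call and rethrow any low-level (e.g. XLink) error
// with a Device-level message instead of std::invalid_argument. Not the depthai-core API.
template <class Fn>
auto callRpcChecked(const std::string& what, Fn&& fn) -> decltype(std::forward<Fn>(fn)()) {
    try {
        return std::forward<Fn>(fn)();
    } catch(const std::exception& ex) {
        throw std::runtime_error("Device closed or disconnected during " + what + ": " + ex.what());
    }
}

// Usage at a call location might look like:
//   int size = callRpcChecked("getXLinkChunkSize", [&] {
//       return pimpl->rpcClient->call("getXLinkChunkSize").as<int>();
//   });
```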
There is likely an issue with blindly calling `pimpl->rpcClient->call()`. It is the basic issue of calling a function through a nullptr. If some thread or code somewhere calls one of these DeviceBase functions like `DeviceBase::setTimesync()` while something else has `closed()` the Device and the C++ runtime is destructing the `pimpl`, then depending on when `setTimesync()` is called, the `rpcClient` and/or the `pimpl` itself can be invalid. A hack for that is to check if `pimpl != nullptr` before `->rpcClient` and then again before `->call()`. But that's a race condition as another thread can again be destructing `pimpl` or `DeviceBase`. To safely call, I think a mutex is needed to prevent the `pimpl` and `rpcClient` pointers from becoming invalid.
The draft PR passed all test+examples. And it ran continuously for ~4hrs serving depth+color. 👍
I've also been thinking about two things:
Checking for `pimpl != nullptr` is wrong
The only time pimpl could be nullptr is when the DeviceBase is destructed. To create that scenario, a client app has to call an API on a destructed object. ❌ This is a basic C++ bug in the client app, not a depthai-core bug. Even if we wanted to be helpful, a class's code cannot reliably self-inspect a destructed instance.
As an example...one of my own apps. I create a Device (which contains a DeviceBase) and then spawn several threads which call APIs on that Device. When I want to end my app, I stop/join all my threads, and then I allow the Device to be destructed. It is the client app's responsibility to not use a destructed object.
I am removing this pimpl check. Pushing a commit, I'll rebase later.
All those `auto lock = getLockPreventClose();`
The majority are calls to `pimpl->rpcClient->call(...)`.
- `pimpl` and the `pimpl::rpcClient` will always be valid. Same thinking as above. Both are created during `init` and are valid until destruct.
- `call()` has code within it to check and throw when the underlying XLink stream or connection have problems. I wrote of this above https://github.com/luxonis/depthai-core/issues/520#issuecomment-1468461065
If something internally fails during Device construct or during running, then the Device might internally `close()` itself. That `close()` tears down XLink and therefore all the `rpcClient->call()` will correctly throw. If there is a hardware/cable/usb/ether failure, the XLink infrastructure will fail and again will correctly throw. All the throw cases are the responsibility of the client app to catch and manage...or not and let their app crash.
I see no need for this lock or an additional check+throw for "closed" for these cases.
Calls to the logger `pimpl->setLogLevel(...)` and `pimpl->getLogLevel()`
These are the only other API calls that had `checkClosed`. Both of these APIs use `DeviceBase::Impl::logger` and some static code. Therefore, the same thinking applies. `pimpl` and `pimpl::logger` will always be valid until destruct. And the static code can always be run. If a client wants to call these log apis while the DeviceBase is closed...it is no harm. It will correctly adjust a logger that will never be used.
I see no need for this lock or an additional check+throw for "closed" for these cases.
🤔
Is there a test case that I can review or run that helps me see the need for `checkClosed()`? Why did the code have all those?
As an experiment, I removed the lock+check+throw functionality of `getLockPreventClose()`. Recompiled. All test+examples passed 100%.
Thanks for the thorough dive into this
I'll start at the "closed" event. The closed event is thought to be a one-time event that flips "closed" from 0->1. It signifies that underneath, the device is not available anymore. If more operations are done on the Device obj, that is fine; the underlying communication will just error out in that case. However, to not retain a user in some codepath where a device is still thought to be "connected", the check is done to exit from that beforehand.
a. "Allow close/deallocating resources while I try to retrieve message T from those resources (e.g. queue)"
I think the thinking falls more in line with an "event" than a "can deallocate & make things invalid".
So I think the overall thread safety WRT the isClosed isn't "important" from the perspective of being exact with when the comms fall down, but it should still be correct if e.g. an explicit `close` is called and at the same time `isClosed` is checked. Sooner or later any functionality requiring underlying comms would fail, which would address the new state that the device is in.
> Imagine that above in python. There's nothing that prevents this scenario...
And just to apply it to this scenario as well, the final 💣 💥 (`device.makeStateChanges()` -> 💣 throws, crashes etc.) should never be a "boom" per se. If the device is closed / the link falls down, the device itself is still in an okay state to have calls be made against it, etc..., but throws will notify the user that any "comm" ops are now out of the picture as the link is down. I think this is valid / have no other means of how to do otherwise for such an "async" event.
> while something else has closed() the Device and the C++ runtime is destructing the pimpl, then depending on when setTimesync() is called then the rpcClient and/or the pimpl itself can be invalid.
Wouldn't this case also be covered by essentially the Device object staying alive unless a caller has made a mistake of letting it destruct and then calling one of its functions? The pimpl should remain alive?
> I am removing this pimpl check. Pushing a commit, I'll rebase later.
Sounds fair on the reasoning, the pimpl should survive the whole lifetime of the parent obj.
> I see no need for this lock or an additional check+throw for "closed" for these cases.
I agree (I tried picturing this point above)
> Is there a test case that I can review or run that helps me see the need for checkClosed()? Why did the code have all those?
None unfortunately, apart from either a "close" or a cable disconnect while doing calls to comms functions. Reasoning: mostly legacy - just throwing a better error at the end of the comms function - or - as said - catching that one and throwing a more appropriate one in Device, is the way to go here IMO.
(I went through these not in the greatest detail, sorry in advance - let me know if certain things should be expanded upon)
I removed all the `checkClosed()` and my experimental `getLockPreventClose()`. Pushed to the draft PR.
I clearly hear you want `isClosed()` 👂👍. Do you want me to remove the `[[deprecated]]` warning I put on `isClosed()`?
In basic single-threaded code (probably most python scripts), explicitly calling `close()`, then checking `isClosed()`, works. I am concerned about unpredictable hardware/cable failures, the rare multithreaded python script, or multithreaded C++ programs. These do not have a clean `close()` followed by a clean `isClosed()` and then a `sleep()` loop. Instead, they are random in order and could be simultaneous.
|  | synchronous calls, single-threaded | asynchronous calls, hw failures, or multi-threaded |
|---|---|---|
| isClosed() | ✅ | ❌ |
| try/except | ✅ | ✅ |
Well written code should wrap `Device` activity with code to catch exceptions. `isClosed()` only handles basic synchronous clean-close single-threaded scenarios. The try/except handles all cases...it handles the async hardware events and it handles the synchronous scenarios since they will throw. Maybe we should add more doxygen text to `isClosed()` to further warn and maybe give an example?
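As a hedged sketch of what such a doc example might say (assumes the usual `depthai/depthai.hpp` include and an output queue named "preview"; the exact exception type depends on what the queue/XLink code throws):

```cpp
#include <depthai/depthai.hpp>
#include <iostream>

// Instead of polling isClosed() (racy against an async close or cable pull),
// do the work and let the throw end the loop; this covers clean close(),
// hardware/cable failures, and multithreaded teardown alike.
int main() {
    dai::Pipeline pipeline;
    // ... build the pipeline with an XLinkOut stream named "preview" ...
    try {
        dai::Device device(pipeline);
        auto queue = device.getOutputQueue("preview");
        while(true) {
            auto msg = queue->get();  // expected to throw once the device/queue goes down
            // ... use msg ...
        }
    } catch(const std::exception& ex) {
        std::cout << "Device stopped: " << ex.what() << std::endl;
    }
    return 0;
}
```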
> Do you want me to remove the [[deprecated]] warning I put on isClosed()?
Yes, please do.
The distinction made is spot on - a sentence or two can be added to the docs of this function +1.