depthai-core deadlocked in semaphores with OAK-D-Pro-PoE after 100+ connections
See https://github.com/luxonis/depthai-core/issues/415. When that repro case is run with depthai-core v2.27.0 for hundreds of connects, the app eventually hangs and stops responding.
This is an improvement over #415, where the test failed after only 6 connects. One test run deadlocked after 135 connects. Another took 402 connects to deadlock.
The OAK-D-Pro-PoE responds to pings. The device itself could be ok.
The problem appears to be a deadlock in XLink semaphores. Running the test in a debugger, I can see 6+ threads, all waiting indefinitely on XLink semaphores in `sem_wait()`. Something is not signaling them.
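One way to turn this kind of silent hang into a diagnosable failure is to wait with a timeout rather than forever. The following is only a minimal sketch of that idea, assuming a small counting semaphore built on the C++ standard library. `TimedSemaphore` and its methods are hypothetical names for illustration; this is not XLink's semaphore code.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a counting semaphore; NOT XLink's implementation.
class TimedSemaphore {
public:
    explicit TimedSemaphore(unsigned initial = 0) : count_(initial) {}

    // Signal one waiter (similar in spirit to sem_post()).
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Wait up to `timeout` for a signal. Returns false on timeout instead of
    // blocking forever the way an untimed sem_wait() does.
    bool waitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!cv_.wait_for(lock, timeout, [this] { return count_ > 0; })) {
            return false;  // nobody posted in time: a candidate lost-signal bug
        }
        --count_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    unsigned count_;
};
```

A caller that gets `false` back can log which semaphore timed out and on which thread, which is more useful than six threads parked in a wait with no indication of who was supposed to signal them.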
I have isolated and fixed a group of XLink bugs within its Windows implementation for semaphores, pthread conditions, and clocks. After applying multiple fixes, OAK PoE failures declined by an order of magnitude 🌠
A test run with fixes was able to make 2897 connections in 6.5 hours before failure. At the point of failure, VSCode itself failed and therefore I did not have access to the debugger. I am unclear if VSCode failed and killed the test process, or if the test process died and affected VSCode. Still, the test wrote a CSV log and I see its results...2897 successful test runs in 6.5 hours.
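The fact that the CSV log survived the crash is worth designing for. Below is a rough sketch of a test loop that flushes one CSV row per connection attempt, so the log stays readable even if the process dies or hangs. It is an assumption about how such a harness could look, not the actual test code. `runReliabilityTest` and the `attempt` callable are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>

// Run `attempt` up to `maxIterations` times, writing one CSV row per attempt.
// `attempt` is a hypothetical callable that connects, reads data, disconnects,
// and returns true on success. Returns the number of successful attempts.
int runReliabilityTest(const std::function<bool()>& attempt,
                       int maxIterations,
                       std::ostream& csv) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    int successes = 0;

    csv << "iteration,elapsed_ms,result\n";
    for (int i = 1; i <= maxIterations; ++i) {
        const bool ok = attempt();
        const auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                                   Clock::now() - start).count();
        csv << i << ',' << elapsedMs << ',' << (ok ? "ok" : "fail") << '\n';
        csv.flush();  // flush every row so the log survives a crash or hang
        if (!ok) break;  // stop at the first failure so it can be inspected
        ++successes;
    }
    return successes;
}
```

If `attempt` deadlocks, the row for that iteration is never written, so the last row in the file is the last connection that completed.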
The OAK-D-Pro-PoE in the test has the recent bootloader 0.0.28 from https://github.com/luxonis/depthai-core/releases/tag/v2.26.0. Applying this firmware alone did not have any measurable effect on connection reliability. The order-of-magnitude improvement was due to the XLink bug fixes.
There may still be an OAK firmware/bootloader problem. After the test failure, the OAK-D-Pro-PoE did not pass spot testing, even after 2 hours with no client communicating with it.
- can be IP pinged and responds to that ping
- XLink example `list_devices` reports `status: X_LINK_SUCCESS, name: 192.168.2.23, mxid: 18443010318EF50800, state: X_LINK_BOOTLOADER, protocol: X_LINK_TCP_IP, platform: X_LINK_MYRIAD_X`
- But it may not be healthy. depthai-core test `xlink_roundtrip_test` fails with

      C:\njs\depthai-core\build\tests>xlink_roundtrip_test.exe
      Randomness seeded to: 3520474196
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      xlink_roundtrip_test.exe is a Catch2 v3.4.0 host application.
      Run with -? for options
      -------------------------------------------------------------------------------
      Test XLinkIn->XLinkOut passthrough with random 4000x3000 frame
      -------------------------------------------------------------------------------
      C:\njs\depthai-core\tests\src\xlink_roundtrip_test.cpp(60)
      ...............................................................................
      C:\njs\depthai-core\tests\src\xlink_roundtrip_test.cpp(60): FAILED:
      due to unexpected exception with message:
        No available devices (1 connected, but in use)
      ===============================================================================
      test cases: 3 | 2 passed | 1 failed
      assertions: 3 | 2 passed | 1 failed

I stress "may not" because `xlink_roundtrip_test` itself has a bug. The depthai device search method occasionally fails with an OAK PoE device, because the OAK PoE cannot always complete its reboot fast enough for this roundtrip test. I can readily reproduce random failures of this roundtrip test on my OAK PoE sensor, even after power-cycling it. Setting the environment variable `DEPTHAI_SEARCH_TIMEOUT=10000` does seem to help: I was not able to readily reproduce a failure of this roundtrip test with the longer timeout.
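For anyone who wants to apply the longer search timeout from inside a test program rather than from the shell, here is a small sketch. It assumes the variable is set before the library first reads it; I have not verified at which point depthai-core reads `DEPTHAI_SEARCH_TIMEOUT`. `setEnvVar` and `getEnvVar` are hypothetical helper names. `_putenv_s`, `setenv`, and `std::getenv` are the standard CRT/POSIX calls.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Set an environment variable for the current process (portable wrapper).
bool setEnvVar(const std::string& name, const std::string& value) {
#ifdef _WIN32
    return _putenv_s(name.c_str(), value.c_str()) == 0;
#else
    return setenv(name.c_str(), value.c_str(), /*overwrite=*/1) == 0;
#endif
}

// Read an environment variable, returning `fallback` if it is unset.
std::string getEnvVar(const std::string& name, const std::string& fallback = "") {
    const char* value = std::getenv(name.c_str());
    return value != nullptr ? std::string(value) : fallback;
}
```

Calling `setEnvVar("DEPTHAI_SEARCH_TIMEOUT", "10000")` at the top of the test, before any device search, is equivalent to exporting the variable in the shell for that one process.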
That's great to hear @diablodale!
Would you be willing to open a PR to the XLink repository with the changes you've made, so we can verify and mainline the fixes?
No PR. Same answer as in March.
I don't provide code or detailed bug reports to Luxonis anymore. Your team didn't move on my high-quality PRs and bugs, so I retracted/closed many of them and am not doing that anymore. themarpe can bring you up to speed if you need details.
Fixes are passing my reliability tests. The last test ran 4126 iterations of continuous connect, get data, disconnect, repeat with an OAK-D-Pro-PoE. Zero delays, errors, faults, or freezes. All data streams valid. The sensor also continued to work correctly in a few casual manual tests after this 4k run.
- fixed a few more bugs in the xlink platform-wide semaphore code
- added an xlink TRACE log level, and moved the env var read of XLINK_LEVEL from depthai-core to xlink itself.
- tidied some xlink log levels and text to use TRACE, making the logs more usable for deep work like I did this week
- xlink is very bad at returning correct result values from functions. It often incorrectly mixes `xLinkPlatformErrorCode_t`, `XLinkError_t`, POSIX, and native OS result codes. They are not the same integer values and cannot be mixed without conversion. Zero versus non-zero is not good enough, because xlink branches on specific error int values. I found dozens of these bugs when I rewrote the xlink USB code, and found more this week in the semaphore code.
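To illustrate the result-code point, here is a sketch of explicit conversion between error domains. The enums and their integer values are made up for illustration and are NOT the real `xLinkPlatformErrorCode_t` or `XLinkError_t` definitions. `PlatformError`, `LinkError`, `fromPosix`, and `toLinkError` are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>

// Illustrative stand-ins only: NOT XLink's real enums or integer values.
enum class PlatformError { Success = 0, DeviceNotFound = -1, Error = -2, Timeout = -3 };
enum class LinkError { Success = 0, DeviceNotFound = 5, Timeout = 6, Error = 7 };

// Translate a POSIX errno-style code (for example from a failed
// sem_timedwait()) into the platform error domain.
PlatformError fromPosix(int err) {
    switch (err) {
        case 0:         return PlatformError::Success;
        case ETIMEDOUT: return PlatformError::Timeout;
        case ENODEV:    return PlatformError::DeviceNotFound;
        default:        return PlatformError::Error;
    }
}

// Translate a platform-layer code into the public link-layer domain.
LinkError toLinkError(PlatformError err) {
    switch (err) {
        case PlatformError::Success:        return LinkError::Success;
        case PlatformError::DeviceNotFound: return LinkError::DeviceNotFound;
        case PlatformError::Timeout:        return LinkError::Timeout;
        case PlatformError::Error:          return LinkError::Error;
    }
    return LinkError::Error;  // unreachable for valid enum values
}
```

With scoped enums, returning one domain's code where another is expected fails to compile, so a caller comparing against `LinkError::Timeout` can never be handed a raw `ETIMEDOUT` or a platform-layer value by accident.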
This issue should give your team enough info to look and fix your code.