
OAK-D-Pro-PoE fails after 6 connections and requires power-cycle

Open diablodale opened this issue 2 years ago • 9 comments

OAK-D-Pro-PoE fails after 6 connections and requires a power-cycle. I now have repro code. 🙃 When the OAK enters this dead state, it does not respond to pings or appear active on the Ethernet network (though my LAN diagnostic tools are limited).

A PR is forthcoming that holds the repro code. It is a new "performance" test which I would like to eventually merge so that Luxonis and others can instrument/evaluate the OAK hardware CPU+temperature stats with different builds and different permutations of Pipeline settings.

Setup

  • OAK-D-Pro-PoE preproduction with bootloader v0.0.17
  • Ubiquiti POE Injector PoE+ Gigabit 802.3af -- a full Ubi switch is forever not available in Europe 😢
  • TP-Link 1 Gbps Ethernet switch
  • Microsoft Windows [Version 10.0.19044.1526]
  • depthai-core v2.15.0
  • VS2019 v16.11.10 (a c++17 compiler is required for the repro code)

Repro

  1. git clone from branch in forthcoming PR. The PR will hold the repro code
  2. config, build for: x64, Debug, shared libs, with all examples and tests
  3. run the new test build\tests\perf_hw_cputemp.exe

Result

You will see 6 lines, each showing one permutation of device settings as device conf: .... You will also see 1 or 2 lines starting with samples: that give the pathname where the test is saving sample data. Sometime between the 6th and 7th run, the OAK-D-Pro-PoE will die and can no longer be used until power-cycled.

This 7th attempt will hang in code until it finally times out after ~1 minute. You will receive Catch2 errors, messages, exceptions, etc.

device conf: 1646611432,0,400,0,false,false,false,false,5,30.000000,tcp,
samples: C:\njs\depthai-core\build\tests\hw-cputemp.csv
device conf: 1646611451,0,400,0,false,false,false,true,5,30.000000,tcp,
device conf: 1646611474,0,400,0,false,true,false,false,5,30.000000,tcp,
device conf: 1646611496,0,400,0,false,true,false,true,5,30.000000,tcp,
device conf: 1646611519,0,400,0,true,false,false,false,5,30.000000,tcp,
device conf: 1646611542,0,400,0,true,false,false,true,5,30.000000,tcp,
device conf: 1646611565,0,400,0,true,true,false,false,5,30.000000,tcp,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
perf_hw_cputemp.exe is a Catch v2.13.7 host application.
Run with -? for options

-------------------------------------------------------------------------------
Hardware CPU and Temp
-------------------------------------------------------------------------------
..\tests\src\hardware_cputemp_perf.cpp(107)
...............................................................................

..\tests\src\hardware_cputemp_perf.cpp(281): FAILED:
  REQUIRE( std::get<bool>(targetDevice) )
with expansion:
  false
with message:
  epoch,color,depth,mono,imu,subpixel,align,lrcheck,filter_median,fps,protocol 
  1646611565,0,400,0,true,true,false,false,5,30.000000,tcp,

..\tests\src\hardware_cputemp_perf.cpp(281): FAILED:
  REQUIRE_NOTHROW( [&]() { const auto targetDevice = getAvailableDevice(std::chrono::seconds(protocol == X_LINK_TCP_IP ? 
60 : 15), protocol); do { ; Catch::AssertionHandler catchAssertionHandler( "REQUIRE"_catch_sr, ::Catch::SourceLineInfo( "..\\tests\\src\\hardware_cputemp_perf.cpp", static_cast<std::size_t>( 281 ) ), "std::get<bool>(targetDevice)", Catch::ResultDisposition::Normal ); try { __pragma( warning(push) ) catchAssertionHandler.handleExpr( Catch::Decomposer() <= std::get<bool>(targetDevice) ); __pragma( warning(pop) ) } catch(...) { catchAssertionHandler.handleUnexpectedInflightException(); } catchAssertionHandler.complete(); } while( (void)0, (false) && static_cast<bool>( !!(std::get<bool>(targetDevice)) 
) ); device = std::make_unique<dai::Device>(p, std::get<dai::DeviceInfo>(targetDevice)); }() )
due to unexpected exception with messages:
  epoch,color,depth,mono,imu,subpixel,align,lrcheck,filter_median,fps,protocol
  1646611565,0,400,0,true,true,false,false,5,30.000000,tcp,
  Exception translation was disabled by CATCH_CONFIG_FAST_COMPILE

Ignore the misleading "Exception translation was disabled by CATCH_CONFIG_FAST_COMPILE". It is a known Catch2 bug that is not relevant to this depthai issue.
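As a reading aid for the output above: each device conf line appears to follow the column order printed in the Catch2 message (epoch,color,depth,...). Below is a minimal C++17 sketch that pairs the two. The helper names splitCsv and labelDeviceConf are hypothetical and are not part of the test code or depthai-core.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated line into fields. A trailing comma, as seen on the
// "device conf:" lines, does not produce an extra empty field.
std::vector<std::string> splitCsv(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Pair each header name with the value in the same column.
// Returns an empty map if the column counts differ.
std::map<std::string, std::string> labelDeviceConf(const std::string& header,
                                                   const std::string& row) {
    const auto names = splitCsv(header);
    const auto values = splitCsv(row);
    std::map<std::string, std::string> labeled;
    if (names.size() != values.size()) {
        return labeled;
    }
    for (std::size_t i = 0; i < names.size(); ++i) {
        labeled[names[i]] = values[i];
    }
    return labeled;
}
```

Applied to the failing run (1646611565,0,400,0,true,true,false,false,5,30.000000,tcp,), this labels imu and subpixel as true, align and lrcheck as false, filter_median as 5, and protocol as tcp.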

Expected

784 test iterations to run (probably 24+ hours)

Notes

I can run this same test on my OAK-D and it successfully completes all 784 iterations. I have lots of data that I want to compare against PoE. If you want to use a USB sensor, you will need to change this line in the test code: auto protocol = GENERATE(X_LINK_TCP_IP);
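The expanded REQUIRE_NOTHROW in the Result output shows the test waiting up to 60 seconds for a device when the protocol is X_LINK_TCP_IP and 15 seconds otherwise, which matches the ~1 minute hang on the 7th attempt. The sketch below illustrates that wait-with-deadline pattern in plain C++17. The Protocol enum, discoveryTimeout, and waitForDevice are hypothetical stand-ins for illustration; they are not the actual getAvailableDevice helper from the test code.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the XLink protocol enum used by the test.
enum class Protocol { Usb, TcpIp };

// Mirrors the ternary visible in the failed assertion:
// 60 seconds for TCP/IP, 15 seconds otherwise.
std::chrono::seconds discoveryTimeout(Protocol protocol) {
    return std::chrono::seconds(protocol == Protocol::TcpIp ? 60 : 15);
}

// Call probe() repeatedly until it returns true or the timeout expires.
// probe() is always called at least once.
bool waitForDevice(const std::function<bool()>& probe,
                   std::chrono::milliseconds timeout,
                   std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    do {
        if (probe()) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    } while (std::chrono::steady_clock::now() < deadline);
    return false;
}
```

With a dead PoE device the probe never succeeds, so a loop like this returns false only after the full TCP/IP timeout has elapsed.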

This performance test is written for C++17 -- can't be bothered to make it work with earlier standards. The PR also includes updates so that tests can specify the C++ standard they need and the CMake harness will do the right thing.

diablodale avatar Mar 06 '22 20:03 diablodale

@diablodale Could you check what bootloader version is installed and upgrade it to the latest, 0.0.17? It should have some fixes for this problem. You can use this app to upgrade: https://github.com/luxonis/depthai-core/blob/main/examples/bootloader/flash_bootloader.cpp

alex-luxonis avatar Mar 06 '22 20:03 alex-luxonis

My device was at Version: 0.0.15 and, after using that tool, is now at Version: 0.0.17. Unfortunately, that didn't help. It continues to die between the 6th and 7th iteration. Which is a good thing...it is a consistent repro of a fail. I just need to rebase some things for the PR.

diablodale avatar Mar 06 '22 20:03 diablodale

PR is posted. I'm interested to see what you experience with your similar hardware.

diablodale avatar Mar 07 '22 00:03 diablodale

Thanks @diablodale.

I'm able to reproduce the failure on OAK-D-Pro-PoE, with the same behavior as on your side: fails at the 7th iteration, and the device no longer responds to pings until power-cycled.

I also tested on OAK-FFC-PoE-3P and it failed, but after a longer time: 54 mins, 141 iterations. Logs: https://gist.github.com/alex-luxonis/366bfd117cc0bb25632eacd3edecdb01/9a1c7f3510dbd779a3033c6aaeeda5c567149686 It may be the same issue observed on https://github.com/luxonis/depthai-core/issues/406 (CC @chkarl)

We are investigating.

alex-luxonis avatar Mar 07 '22 13:03 alex-luxonis

@diablodale - do you happen to have the exact part number for the TP-Link 1 Gbps Ethernet switch so that we can order the same one (if it's still available)?

It may also help with debugging future issues, as it likely brings out the crash(es) we are seeing here faster.

Thanks again for the help!

Luxonis-Brandon avatar Mar 10 '22 16:03 Luxonis-Brandon

TP-Link 5-port Gigabit Desktop Switch Model: TL-SG1005D Ver: 4.0

I bought this in 2011, and I do not see the exact model available today. The model number is the same, but the look of the device is radically different. Here is mine (image: tp-switch).

Ubiquiti POE Injector PoE+ Gigabit 802.3af PN: U-POE-af

Some of my customers are artists making performances/exhibits; they probably will not have an IT dept with managed racks of PoE switches. Instead, they will use injectors and a friendly consumer Ethernet switch like mine. And there will be some that do have IT depts + racks...for them I'll eventually use a Ubi switch for my testing (when they are available again in Germany).

diablodale avatar Mar 10 '22 17:03 diablodale

Thank you!

Oh also @diablodale we went away from an internal power-supply solution after our first prototypes. We instead re-designed around a totally different approach.

That re-design also led to a new mechanical design. So when you have a second, could you share a photo of the front of your OAK-D-Pro-PoE? We'll be able to tell which version you have based on that.

We'd like to know this because there could be something specific to that initial design (which you might have) that is exacerbating the problem. (And the initial design will never see production, as it is undesirable in a variety of ways.)

Thanks again, Brandon

Luxonis-Brandon avatar Mar 10 '22 17:03 Luxonis-Brandon

My pre-production oak-d-pro-poe image

diablodale avatar Mar 10 '22 17:03 diablodale

Thank you!

Luxonis-Brandon avatar Mar 10 '22 17:03 Luxonis-Brandon

Resolved from v2.17.1 onwards: https://github.com/luxonis/depthai-core/releases/tag/v2.17.1 (at least the exact scenario described in the OP)

themarpe avatar Oct 21 '22 16:10 themarpe

Yay! Will test when I again have access to my OAK PoE in November. :-)

diablodale avatar Oct 21 '22 19:10 diablodale