
OAK POE: 'Couldn't read data from stream: 'right' (X_LINK_ERROR)'

Open chkarl opened this issue 2 years ago • 15 comments

We have stability issues running the OAK-FFC-PoE-3P in our test bench. After some hours, typically 12-18 hours, image delivery will crash with the error message: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'right' (X_LINK_ERROR)'

Catching the exception and initializing the pipeline again works 2-3 times, then a power cycle is always needed to connect again.
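The catch-and-reinitialize approach described above can be sketched as a small restart loop. This is only an illustration, not the actual test script: `run_with_reconnect`, `run_pipeline`, `max_restarts` and `backoff_s` are hypothetical names, and it assumes the communication error surfaces in Python as a `RuntimeError`.

```python
import time


def run_with_reconnect(run_pipeline, max_restarts=3, backoff_s=5.0, sleep=time.sleep):
    """Run run_pipeline(), restarting it after a RuntimeError.

    run_pipeline is a hypothetical callable that builds the pipeline, opens
    the device and grabs frames until it returns or raises.
    Returns the number of restarts that were needed. Once max_restarts is
    exhausted, the last error is re-raised so the caller can power-cycle.
    """
    restarts = 0
    while True:
        try:
            run_pipeline()
            return restarts
        except RuntimeError as err:  # assumption: X_LINK_ERROR arrives as RuntimeError
            if restarts >= max_restarts:
                raise
            restarts += 1
            print(f"restart {restarts}/{max_restarts} after: {err}")
            sleep(backoff_s * restarts)  # wait a little longer before each retry
```

Capping the restarts matches the observation that reinitializing only helps 2-3 times; after that the error is passed on so that something outside the script (for example a managed PoE port) can cut power.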

At boot, it also has trouble finding the device (X_LINK_DEVICE_NOT_FOUND); see the log below.

SW config: Testing with the latest 2.15 release candidate. The test script just grabs 800P images from all 3 cameras (right, left, rgb) at 10 FPS and shows them on the screen. Host OS: Ubuntu 20.04.

HW config: Board connected to standard Arducam cameras (OAK-FFC-OV9282-M12 (Mono, 22 pin) and OAK-FFC-IMX477-M12). The host also has an M.2 card with two additional Movidius chips (not used).

See log below

~/code$ python3 test_cameras.py 
2.15.0.0.dev+777440adcd592f9c5dadb15c86fdda8eade08e8b
2022-02-24 07:42:35.038265
Found:  [<depthai.DeviceInfo object at 0x7fd7bfa431f0>, <depthai.DeviceInfo object at 0x7fd7c0a95fb0>, <depthai.DeviceInfo object at 0x7fd7bfa6b430>]
Exception------------------------------------------
2022-02-24 07:43:13.137641
Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Found:  [<depthai.DeviceInfo object at 0x7fd7bfaa5830>, <depthai.DeviceInfo object at 0x7fd7c0a91530>, <depthai.DeviceInfo object at 0x7fd7bfaaad30>]
Exception------------------------------------------
2022-02-24 07:44:04.565841
Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Found:  [<depthai.DeviceInfo object at 0x7fd7c0ad26f0>, <depthai.DeviceInfo object at 0x7fd7bfa3c8f0>, <depthai.DeviceInfo object at 0x7fd7bfa68370>]
Exception------------------------------------------
2022-02-24 07:44:56.550247
Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Found:  [<depthai.DeviceInfo object at 0x7fd7c0a95fb0>, <depthai.DeviceInfo object at 0x7fd7c0a959b0>, <depthai.DeviceInfo object at 0x7fd7bfa68870>]
No image from Left
2022-02-24 07:45:32.321267
No image from right
2022-02-24 07:45:32.321292
No image from Left
2022-02-24 07:45:32.489247
No image from right
2022-02-24 07:45:32.489298
Exception------------------------------------------
2022-02-24 17:10:24.965580
Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'right' (X_LINK_ERROR)'
Found:  [<depthai.DeviceInfo object at 0x7fd7bfa33570>, <depthai.DeviceInfo object at 0x7fd7bfaa5830>, <depthai.DeviceInfo object at 0x7fd7bfa43ab0>]
Exception------------------------------------------
2022-02-24 17:11:16.450010
Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Found:  [<depthai.DeviceInfo object at 0x7fd7c0ad26f0>, <depthai.DeviceInfo object at 0x7fd7bfae05b0>, <depthai.DeviceInfo object at 0x7fd7bfa33d70>]
No image from Left
2022-02-24 17:11:50.231063
No image from right
2022-02-24 17:11:50.231084
No image from right
2022-02-24 17:11:50.292075
No image from Left
2022-02-24 17:11:50.363400
Exception------------------------------------------
2022-02-24 23:34:23.171944
Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'right' (X_LINK_ERROR)'
Found:  [<depthai.DeviceInfo object at 0x7fd7bfa434f0>, <depthai.DeviceInfo object at 0x7fd7c0a959b0>]
Found:  [<depthai.DeviceInfo object at 0x7fd7c0ad26f0>, <depthai.DeviceInfo object at 0x7fd7bfaaad30>]
Found:  [<depthai.DeviceInfo object at 0x7fd7bfa434f0>, <depthai.DeviceInfo object at 0x7fd7c0a959b0>]
Found:  [<depthai.DeviceInfo object at 0x7fd7c0ad26f0>, <depthai.DeviceInfo object at 0x7fd7bfaaad30>]

It will now forever try to find the OAK device.

chkarl avatar Feb 25 '22 08:02 chkarl

Sorry about the trouble.

Could you monitor the temperature on the device? I'm wondering if there is some over-heating condition given that this is the FFC version and it has just a standalone heatsink (unlike the other products that have an enclosure which effectively provides a bigger heatsink).

Also, how much data is being transferred? This model has an onboard Ethernet switch which does not have a heatsink, and I'm wondering if there is some thermal lockup on it depending on the installation conditions.

Are you able to share any photos of the setup and the powering device/etc.?

Thanks, Brandon

Luxonis-Brandon avatar Feb 25 '22 23:02 Luxonis-Brandon

System information reports about 52-54 degrees for all chip temperatures. Manually measuring the Ethernet switch shows maximum 68 degrees. Ran for about 6 hours before crashing.

Currently we transfer about 400Mbit/s. (800P on all 3 cameras. 10 FPS). Do we need (active) cooling also on the builtin switch?
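As a rough cross-check of the 400 Mbit/s figure, here is a back-of-the-envelope calculation. It assumes 800P means 1280x800, that the two mono streams are sent as uncompressed 8-bit frames, and that the color stream is sent as uncompressed 24-bit frames; none of these formats are stated in the thread.

```python
WIDTH, HEIGHT, FPS = 1280, 800, 10  # assumed 800P frame size, 10 FPS from the report


def stream_mbit_per_s(bytes_per_pixel, width=WIDTH, height=HEIGHT, fps=FPS):
    """Raw payload bandwidth of one uncompressed stream in Mbit/s."""
    return width * height * bytes_per_pixel * 8 * fps / 1e6


mono = stream_mbit_per_s(1)  # assumed 8-bit grayscale (left or right)
rgb = stream_mbit_per_s(3)   # assumed 24-bit interleaved color
total = 2 * mono + rgb

print(f"mono: {mono:.1f} Mbit/s each, rgb: {rgb:.1f} Mbit/s, total: {total:.1f} Mbit/s")
```

Under these assumptions the total lands at roughly 410 Mbit/s before protocol overhead, which is consistent with the reported number.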

Currently we are testing the device both as is, without any casing (in the office), and outdoors in an enclosure (the outside temperature is about 0 degrees Celsius). It is powered by a standard PoE switch.

We now tried also with the 0.017 bootloader. Unfortunately this did not help that much. After some hours we get the same X_LINK_ERROR. A power cycle is still the only way to reconnect with the board after a crash.

chkarl avatar Mar 02 '22 07:03 chkarl

FYI, I have seen one device crash during a long-term run of my OAK-D-Pro-POE. Cause unknown. I lost the data at the time of the crash and only saw looping XLINK errors the next day. I've yet to be able to recreate it (now with better data collection).

For comparison, I just finished a 12-hour test of other sensors. The OAK-D-Lite runs considerably hotter, exceeding the "stable image temperature of the camera sensor" of 70 °C. https://docs.luxonis.com/projects/hardware/en/latest/pages/DM9095.html#operating-temperature.

OAK-D Temperatures - Average: 49.59 °C, CSS: 50.70 °C, MSS 49.59 °C, UPA: 49.59 °C, DSS: 48.47 °C
OAK-D-Lite Temperatures - Average: 78.15 °C, CSS: 80.08 °C, MSS 77.20 °C, UPA: 80.65 °C, DSS: 74.67 °C

diablodale avatar Mar 02 '22 15:03 diablodale

@diablodale - Thanks. On the OAK-D-Lite temperatures: those measurements are of the die temperature internal to the Myriad X, so it is significantly hotter than elsewhere on the device itself, and some ~10°C hotter than even the junction of the Myriad X to the enclosure. So this should not impact the image sensors, as they will be at a much lower temperature, closer to the enclosure temperature, which does not exceed 60°C when operating at room temperature.

@chkarl - sorry about the trouble. I'm not sure what is going wrong here. We are trying to get to the bottom of it though. But since the crashes take so long to repeat, it's a slow go of it. And for our OAK-D-PoE it's taking more than 48 hours and we're not seeing a crash, so we start over and try other permutations.

So I think this crash is accelerated by something that we don't have in our lab. So actually if you can share all the equipment you are using in your setup we can purchase it and replicate the setup so hopefully we can see the crash in less time.

Do we need (active) cooling also on the builtin switch?

It may be worth trying this. Buying one of the little Raspberry Pi stick-on heatsinks and putting it on the switch. That said, I'm thinking this is something more to do with how the networking infrastructure is interacting with our firmware - bringing out the crash.

And so I think the key will be for Luxonis to be able to replicate your networking setup so that we can see the crash as well. We're using UniFi Switches (US-8-150W) internally and not seeing it.

Thoughts?

Thanks, Brandon

Luxonis-Brandon avatar Mar 02 '22 23:03 Luxonis-Brandon

@Luxonis-Brandon Note that we have not seen the crash on the standard OAK-D-PoE, only on the OAK-FFC-PoE-3P so far. The standard OAK-D-PoE runs stable for days without any issues.

We also tried putting a RPi stick-on-heatsink on the switch. This lowered the measured temperature from 69 to about 58 degrees C on the switch. However, despite the heatsink we still see crashes.

At the moment we have tested with a Teltonika industrial POE switch TSW100 and also a D-link DGS-1005 POE+-gigabitswitch.

chkarl avatar Mar 03 '22 08:03 chkarl

@chkarl Interesting observation - I'm also running the OAK-D PoE for ~25h+ now without any hiccups. If the OAK-FFC-PoE-3P does experience more of these issues, then I'll get my hands on that specific model and test it out.

themarpe avatar Mar 03 '22 15:03 themarpe

Thanks @chkarl. It is very good to know that the OAK-FFC-PoE-3P is the only one with the problem. This means something about that design is either (1) directly the problem (a hardware design error, for example; it is a less-qualified/tested design and we've only made some 20pcs of OAK-FFC-PoE-3P vs. thousands of OAK-D-PoE) or (2) not the problem itself, but exacerbates an underlying firmware (or other) problem.

So we'll be digging in to both to figure this out.

Sorry again about the trouble on this.

Luxonis-Brandon avatar Mar 03 '22 21:03 Luxonis-Brandon

Hi @chkarl ,

I'm trying to reproduce the issue with OAK-FFC-PoE-3P, on the 2.15.0.0 release, running your test script https://gist.github.com/chkarl/26aff3fbbfaf4f4b28b26ee59224e3e3 I started 2 devices yesterday:

  1. Still running, for 24 hours now. No exceptions happened
  2. Was manually stopped after 21 hours (accidentally disconnected), still no exceptions happened up to that point. The test was restarted.

I only rarely saw the frame-drop message No image from <camera>.

For the issue at startup you're observing (Failed to find device after booting), could you try increasing some timeouts and see if it works better? Export these environment variables (values in milliseconds):

DEPTHAI_WATCHDOG_INITIAL_DELAY=60000
DEPTHAI_BOOTUP_TIMEOUT=60000

And besides that, could you start a test where additionally the device watchdog is disabled, with:

DEPTHAI_WATCHDOG=0

That would be for the case where some network congestion happens and the host can't ping the device for 4 seconds (a bit unlikely though). Note though, if the connection is not properly closed, a power-cycle is needed in this case:

[warning] Watchdog disabled! In case of unclean exit, the device needs reset or power-cycle for next run

The test would be to check if this issue still happens: Communication exception - possible device error/misconfiguration
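If exporting the variables in the shell is inconvenient, the same settings can be applied from the test script itself. A minimal sketch, assuming the library reads these variables when it is imported or when the device is opened, so they have to be set before that point; `depthai_env` is a hypothetical helper name.

```python
import os


def depthai_env(disable_watchdog=False, timeout_ms=60000):
    """Build the environment settings suggested above (values in milliseconds)."""
    env = {
        "DEPTHAI_WATCHDOG_INITIAL_DELAY": str(timeout_ms),
        "DEPTHAI_BOOTUP_TIMEOUT": str(timeout_ms),
    }
    if disable_watchdog:
        # With the watchdog off, an unclean exit requires a reset or power-cycle.
        env["DEPTHAI_WATCHDOG"] = "0"
    return env


# Apply before depthai is imported and before any device is created (assumption).
os.environ.update(depthai_env(disable_watchdog=False))
# import depthai as dai
```

Keeping the watchdog switch as an explicit argument makes it harder to leave `DEPTHAI_WATCHDOG=0` enabled by accident after the experiment.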

alex-luxonis avatar Mar 04 '22 16:03 alex-luxonis

Hi @alex-luxonis

When I set these

DEPTHAI_WATCHDOG_INITIAL_DELAY=60000
DEPTHAI_BOOTUP_TIMEOUT=60000

The bootup is now much smoother. Thanks for that.

I have now also started another test with DEPTHAI_WATCHDOG=0 set. Additionally, I started a test booting this board from USB to see if this changes the behavior.

Thanks for the support so far

chkarl avatar Mar 04 '22 16:03 chkarl

Thanks, both! Very curious to get to the bottom of this one!

Luxonis-Brandon avatar Mar 04 '22 22:03 Luxonis-Brandon

Hi,

Status update: Running with the watchdog disabled, I did not see any "Communication exception - possible device error/misconfiguration" message. However, I still got the crash, after 8 hours this time.

Booting the OAK-FFC-PoE-3P from USB was stable for the whole weekend (72 hours+). Not even a dropped frame.

I have now also started another test with a different PoE switch (Planet IPOE-162).

chkarl avatar Mar 07 '22 10:03 chkarl

Thanks for the data points here @chkarl . We're investigating.

Luxonis-Brandon avatar Mar 08 '22 02:03 Luxonis-Brandon

Hi,

Status update: The test setup using a different PoE switch/injector (Planet) seems to improve the situation. We have now run the boards for 72h+ without any crash. On one of the boards we lost the left and right camera streams after about 48 h; RGB is still going strong so far. Have you seen issues with different PoE switch brands? We tried both Teltonika and D-Link before testing the Planet switch. Which PoE standard is the OAK-FFC-PoE-3P negotiating, 802.3af or 802.3at?

chkarl avatar Mar 10 '22 06:03 chkarl

Thanks @chkarl !

This points in the direction of my (personal, no-data-behind-it) hunch that actually there is some underlying bug in our firmware which is only brought out/revealed with certain Ethernet switches/equipment. Or more specifically, I think some switches/network gear just cause the bug to happen faster.

And I think that's why there's been such a huge variance between our internal testing (where sometimes we can never get a crash) and others' testing (yourself included, and also @diablodale in https://github.com/luxonis/depthai-core/issues/415#issue-1160703415 ).

So we're getting additional network equipment to try to find which equipment causes the crash the fastest. Likely we should get the equipment @diablodale uses, as his crash seems the fastest and most reliable.

And also we just got confirmation on Discord of a similar thing:

Heads up to anyone that had stability issues. I upgraded my PoE switch and my devices have stopped crashing https://www.bhphotovideo.com/c/product/1519056-REG/ubiquiti_networks_usw_pro_24_poe_configurable_gigabit_layer2_and.html

before:

It was this cisco switch: https://www.cisco.com/c/en/us/support/switches/catalyst-cdb-8u-switch/model.html

Internally we all use Ubiquiti switches, and this is likely why we haven't seen the issue (Ubiquiti switches usually have internal chipsets which are higher-end than most switches on the market, and specifically have significantly better hardware offload and high-speed caches - which we know as many on our team are from Ubiquiti, and still huge fans).

Which PoE standard is the OAK-FFC-PoE-3P negotiating, 802.3af or 802.3at?

I think it is representing itself as 802.3at since it's pass-through, so when daisy-chain is used it can take close to full 802.3at. But nonetheless, if only using one of them, an 802.3af switch should be sufficient. The thing to look out for is the total PoE budget of the switch though. Some 802.3af switches have quite low PoE power budgets, so if running multiple devices on the switch the power can be exceeded.
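The budget concern can be checked with simple arithmetic. The sketch below assumes the switch reserves the full per-port maximum of the negotiated standard (15.4 W for 802.3af, 30 W for 802.3at at the switch side); many switches allocate by device class or measured draw instead, so this is a worst case. The 60 W budget is a made-up example, and `max_devices` is a hypothetical helper.

```python
# Maximum power a switch port supplies under each standard, in watts.
PSE_WATTS_PER_PORT = {"802.3af": 15.4, "802.3at": 30.0}


def max_devices(switch_budget_w, standard):
    """Worst-case number of powered ports that fit in the switch's PoE budget."""
    return int(switch_budget_w // PSE_WATTS_PER_PORT[standard])


budget_w = 60.0  # hypothetical small PoE switch
for standard in PSE_WATTS_PER_PORT:
    print(f"{standard}: up to {max_devices(budget_w, standard)} devices on {budget_w:.0f} W")
```

If the device presents itself as 802.3at, as suspected above, a switch that reserves by standard would count each camera against the larger 30 W figure.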

Anyway, we're getting more switches to try to reproduce this faster - given that with the UniFi/Ubiquiti switches (that we have) the crash seems to practically not happen.

Thanks, Brandon

Luxonis-Brandon avatar Mar 10 '22 16:03 Luxonis-Brandon

From offline communication on this:

Status update: We have now run multiple OAK-FFC-PoE-3P for over a week without any crashes. The only thing we changed in the setup was the PoE injector/switch.

We have now confirmed good performance with these two units:

IPOE-470 / IPOE-470-12V - PoE Injector/Splitter/Extender - PLANET Technology
IPOE-162 - PoE Injector/Splitter/Extender - PLANET Technology

Luxonis-Brandon avatar Mar 18 '22 23:03 Luxonis-Brandon