
Corrupted JPEG Stream

Open jmachowinski opened this issue 3 years ago • 41 comments

Describe the bug

We are experiencing a rather strange bug with the color stream of the Kinect. On an AMD EPYC 7302 machine, the color stream is corrupted. This is reproducible using the k4aviewer and the ROS wrapper.

I patched in a better debug message in case the MJPEG decoding fails, and it gave me this:

DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xfe 0x77
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xf8 0x54
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: premature end of data segment
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 1283 extraneous bytes before marker 0xd3
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 161 extraneous bytes before marker 0xd4
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xfb 0xbf
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 9005 extraneous bytes before marker 0xd6
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: found marker 0xd5 instead of RST3
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 841 extraneous bytes before marker 0xd6
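The "Not a JPEG file: starts with ..." messages mean the buffer did not begin with the JPEG start-of-image marker (0xFF 0xD8). A minimal sketch of a cheap pre-check that could be run on a received frame before handing it to the decoder is given here. The function name `check_jpeg_frame` is hypothetical and not part of the SDK, and the end-of-image check assumes the device does not pad frames after the final marker, which may not hold for every camera.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def check_jpeg_frame(buf: bytes) -> str:
    """Return "ok", or a short reason why buf does not look like a complete JPEG.

    Hypothetical helper for illustration only; it inspects the first and
    last two bytes and does not validate the data in between.
    """
    if len(buf) < 4:
        return "too short"
    if buf[:2] != SOI:
        return f"bad start: 0x{buf[0]:02x} 0x{buf[1]:02x}"
    if buf[-2:] != EOI:
        # Assumption: no trailing padding after the EOI marker.
        return "missing end-of-image marker"
    return "ok"
```

A check like this only catches frames that are damaged at the start or the end. Corruption in the middle of the frame (the "extraneous bytes before marker" cases) still needs the full decoder to detect.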

This then leads to the following error, as there is no color image available for the depth measurement:

replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 57300211 type:Depth

Note: usbfs_memory_mb is set to 1024.

To Reproduce

On an AMD EPYC 7302:

  1. Start k4aviewer
  2. Enable color camera
  3. Disable all other streams
  4. Choose any resolution / fps (720p / 5 FPS also fails)
  5. Start
  6. Log shows a bunch of MJPEG decode failed, dropping image messages
  7. After a few seconds a popup 'Camera failure : timed out!' appears

Desktop

  • OS with Version: Ubuntu 18.04
  • Kernel : 5.4.0-47-generic
  • SDK Version: 1.4.1
  • Firmware version: 1.6.110079014
  • GPU : NVidia 2070 Super
  • CPU: AMD EPYC 7302
  • Mainboard : Supermicro Mainboard H11DSi-NT

Additional context

Note that on our normal desktop machines everything is working fine.

We also tried using an external PCIe USB host controller (DeLock 90492), but this did not work either. The controller also gave us a bunch of weird kernel warnings, so it might be a driver issue with this card.

Is there any PCIe USB host controller with at least two ports that is known to work under Linux?

Grabbing only the depth stream seems to work. Out of curiosity, is there some sort of CRC on the depth data to detect corruption?

jmachowinski avatar Sep 15 '20 09:09 jmachowinski

We debugged this issue further. It seems the Azure Kinect is extremely sensitive to the USB host controller used. So far we tested:

ASMedia 3142

  • Not working

Renesas/NEC - µPD720202

  • Not working

Renesas/NEC - µPD720201

  • Working
  • Only one Azure Kinect per PCIe card
  • Minor frame drops
  • ROS driver failure after ~20 minutes

Fresco Logic FL1100

  • Working
  • Two Azure Kinects per PCIe card
    • ROS driver failure after ~10 minutes

jmachowinski avatar Sep 17 '20 12:09 jmachowinski

Yes, it definitely is. The choice of USB cable also matters. I had cables labeled as USB 3.0 that failed after a few minutes. The cable that comes with the Kinect itself works best, but it is a bit short.

Have you already seen this page? https://docs.microsoft.com/en-us/azure/kinect-dk/troubleshooting

"For the Azure Kinect DK on Windows, Intel, Texas Instruments (TI), and Renesas are the only host controllers that are supported. The Azure Kinect SDK on Windows platforms relies on a unified container ID, and it must span USB 2.0 and 3.0 devices so that the SDK can find the depth, color, and audio devices that are physically located on the same device. On Linux, more host controllers may be supported as that platform relies less on the container ID and more on device serial numbers."

RoseFlunder avatar Sep 17 '20 13:09 RoseFlunder

Yes, only select USB controllers are supported per text quoted by @RoseFlunder.

Also make sure you are using Sensor SDK 1.4.1 as we made some changes to improve resilience to dropped color frames. See issue #1194.

qm13 avatar Sep 17 '20 20:09 qm13

I was aware of the troubleshooting page. But let's just say the information regarding Linux is sparse...

We are using the latest firmware and SDK 1.4.1

As for our problem, we are currently testing the Fresco Logic FL1100 with only one camera per host controller. This ran stable for 3 hours with one camera, yesterday. We are trying multiple cards with a camera each now.

One thing I noticed while testing is that only the color stream is unstable. The depth and IMU streams seem to be stable all the time.

I also suspect a bug in the ROS node, as it does not detect a broken color stream.

jmachowinski avatar Sep 18 '20 09:09 jmachowinski

I haven't measured the time on our PCs running the Kinects, but I think we kept them running for 24h without an error on Ubuntu, with the default cable, connected to USB ports on the mainboard with an Intel chipset. We don't use the ROS node, but we don't do anything different from it when getting captures: we just call "getCapture" with an infinite timeout. We also use only one Kinect per PC, and each PC sends its data to a central PC for processing.

RoseFlunder avatar Sep 18 '20 09:09 RoseFlunder

@jmachowinski I have passed your ROS node issue to the Microsoft ROS team.

qm13 avatar Sep 18 '20 19:09 qm13

@jmachowinski When you say the ROS node fails after x amount of time, can you post the failure? Does it crash or start throwing errors?

ooeygui avatar Sep 18 '20 19:09 ooeygui

@ooeygui The node does not crash, but it stops publishing any data. Because the color stream broke down completely, the log is full of:

[2020-09-18 08:52:14.087] [error] [t=8620] /__w/1/s/extern/Azure-Kinect-Sensor-SDK/src/capturesync/capturesync.c (142): replace_sample(). capturesync_drop, releasing capture early due to full queue TS:3874968522 type:Depth
[2020-09-18 08:52:14.119] [error] [t=8620] /__w/1/s/extern/Azure-Kinect-Sensor-SDK/src/capturesync/capturesync.c (142): replace_sample(). capturesync_drop, releasing capture early due to full queue TS:3875001866 type:Depth

I already looked into the code, and my guess (only a guess, not verified yet) is that if k4a::device::get_capture() is called with an infinite timeout, the function never returns once the color stream has broken down.

I just looked at the code again, and now I am pretty sure this is the case. With an infinite timeout, we wait for a push to the capture queue. As the color stream broke down, this never happens and the ROS node just stalls. I think a good workaround would be to use a timeout of double the frame period, and to shut down the node if X consecutive captures time out.
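The workaround described above can be sketched as follows. This is only an illustration of the control flow, not the ROS node's actual code: `get_capture` here is a hypothetical stand-in that takes a timeout in seconds and returns None on timeout, and `max_misses` plays the role of X.

```python
from typing import Callable, Optional


def capture_loop(get_capture: Callable[[float], Optional[object]],
                 fps: float,
                 max_misses: int,
                 handle: Callable[[object], None]) -> int:
    """Poll with a finite timeout and stop after max_misses consecutive timeouts.

    Returns the number of captures that were handled before giving up.
    get_capture is a hypothetical callable: it returns a capture, or None
    if nothing arrived within the given timeout (in seconds).
    """
    timeout_s = 2.0 / fps  # double the frame period
    misses = 0
    handled = 0
    while misses < max_misses:
        capture = get_capture(timeout_s)
        if capture is None:
            misses += 1  # another timeout in a row
            continue
        misses = 0  # any successful capture resets the counter
        handle(capture)
        handled += 1
    return handled
```

When the loop returns, the caller knows the stream has stalled and can shut down or restart the device instead of blocking forever.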

jmachowinski avatar Sep 21 '20 08:09 jmachowinski

As for the long-running tests: 2 of 3 cameras stopped sending color images after some hours. I'll patch in some additional debug messages to determine the exact cause of this.

I also noticed that this depends on CPU load. Without subscribing to the topics of the node, the color stream did not drop out (I verified in the code that the capture from the camera actually happens in this case). This behavior puzzles me, as I run the tests on a 64-core machine, and every thread basically has its own CPU.

jmachowinski avatar Sep 21 '20 08:09 jmachowinski

@jmachowinski Thank you for the investigation! I need to profile the code to see where the slowdown is occurring. We do know that there are some problems with the OpenCV version being used with Melodic, and there are some inefficiencies in the Kinect ROS node. Additionally, since it is using ROS pub/sub without a quality of service monitor, the publishing rate itself may be too fast for the network stack - We may need to throttle upstream of it.

I'll post findings when I am able to carve off time to work on this ROS node.

ooeygui avatar Sep 21 '20 17:09 ooeygui

To answer my own question:

> Out of curiosity, is there some sort of CRC on the depth data to detect corruption?

The depth data is transmitted in USB bulk mode, so it should always be complete.

I debugged this further and I am pretty sure we are dealing with a firmware issue here. My findings so far: the color stream is a USB isochronous transfer. Depth and IMU are transmitted as USB bulk transfers.

The isochronous transfer is started in the libuvc part and keeps on going. Using Wireshark, one can see that the USB transfers are sent out to the device. At the start, the color device correctly fills in the payload of the transfer, and everything works. At some point, the color device just stops doing this. Also note that no error is transmitted to the USB host. If one adds debug output to this function, the behavior is visible: https://github.com/microsoft/libuvc/blob/5fc483d596c63f1bcd36be35d512468c0b75c5f3/src/stream.c#L601

Any ideas?

jmachowinski avatar Sep 22 '20 17:09 jmachowinski

As it might be useful to someone: we have now found a 4x FL1100 PCIe card from Basler (2000036233). This one works flawlessly under Linux if only the depth and IMU streams are acquired.

We performed a one day test with 5 cameras, without encountering any problems.

jmachowinski avatar Sep 23 '20 14:09 jmachowinski

@qm13 can you reproduce my findings? If not, can I do anything to help speed this up? Provide test cases, etc.?

jmachowinski avatar Sep 28 '20 14:09 jmachowinski

Do you have a workaround to this problem using the above USB Host controllers? A workaround meaning K4AViewer runs without stopping the stream? DecodeMJPEGtoBGRA32 errors are to be expected when USB is congested.

Renesas/NEC - µPD720201 Fresco Logic FL1100

I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

I am also curious how your setup runs with other PC's. As you have determined, we don't have a lot of data on USB Host controllers on Ubuntu. It would be great if you could test with a mother board that has TI/Intel on it.

If that doesn't work then, you will need to dig into LibUvc more unfortunately, as we can't repro this.

> The isochronous transfer is started in the libuvc part and keeps on going. Using Wireshark, one can see that the USB transfers are sent out to the device.

If you can see the ISOCH transfer in Wireshark, then I wonder if the firmware is fine and the bug is in LibUvc. At the completion of each packet you should be able to see LibUvc calling libusb_submit_transfer; that has to keep happening for new packets to be received from the driver.

wes-b avatar Oct 02 '20 15:10 wes-b

> Do you have a workaround to this problem using the above USB Host controllers? A workaround meaning K4AViewer runs without stopping the stream? DecodeMJPEGtoBGRA32 errors are to be expected when USB is congested.

No. Using the FL1100 controller made this more stable, but after around 40 minutes the color stream drops out. We use a setup where each camera has its own USB host controller, so the load per bus is negligible (~1.8 MB/sec for depth and around 3.8 MB/sec for color). Therefore the dropouts in the color stream are already strange.

> I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

We'll perform tests on Monday and will report back.

> I am also curious how your setup runs with other PC's. As you have determined, we don't have a lot of data on USB Host controllers on Ubuntu. It would be great if you could test with a mother board that has TI/Intel on it.

We performed tests with a 10th gen Intel USB controller (Q470), and it ran stable for more than a day. We also have a production system using a 7th gen Intel USB controller (Q170) that has been running stable with 2 cameras for weeks now.

As for the TI-based controllers, are you referring to the TUSB7340 host controller? This one seems to be out of production, and we can't buy TI-based PCIe cards anywhere.

For our current project we are bound to the AMD platform and sadly there are no Intel based PCIe extension cards, that we are aware of.

> If you can see the ISOCH transfer in Wireshark, then I wonder if the firmware is fine and the bug is in LibUvc. At the completion of each packet you should be able to see LibUvc calling libusb_submit_transfer; that has to keep happening for new packets to be received from the driver.

I added debug code that reports any error, especially in the error path where the transfer would NOT be resubmitted. I also added debug code that writes a message to cout for every 10000 resubmitted transfers.

The behavior I get from this is that there are no errors, and I get a continuous stream of messages saying that the transfers get resubmitted. At some point in time, I only get transfers back where pkt->actual_length is zero. This is consistent with the data shown in Wireshark. This led me to suspect that it might be the firmware.

I must also say that I have only limited knowledge of USB internals. If I understood it correctly, the bus is completely host controlled. An endpoint may only submit data after it was granted a 'timeslot' by the host. This is done by sending an IN packet to the endpoint. My wild guess here would be that the frequency of the polling is unstable, and that this somehow upsets the device.
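The symptom described above (transfers keep completing without error, but every packet has an actual length of zero) could be detected with a small counter like the sketch below. The class name and the threshold are hypothetical and this is not libuvc code. A threshold is used on the assumption that a healthy isochronous stream can also return some empty packets, so a single empty transfer is not treated as a stall.

```python
from typing import Sequence


class IsoStallDetector:
    """Flags a stream as stalled after `threshold` consecutive transfers
    in which every packet came back with zero bytes.

    Hypothetical helper for illustration; not part of libuvc or libusb.
    """

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.empty_run = 0  # consecutive all-empty transfers seen so far

    def on_transfer(self, packet_lengths: Sequence[int]) -> bool:
        """Feed the actual length of each packet of one completed transfer.

        Returns True once the stall threshold has been reached.
        """
        if any(length > 0 for length in packet_lengths):
            self.empty_run = 0  # data arrived, stream is alive
        else:
            self.empty_run += 1
        return self.empty_run >= self.threshold
```

Calling `on_transfer` from the transfer completion callback would give the application an explicit stall signal, since no USB error is reported in this situation.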

jmachowinski avatar Oct 02 '20 17:10 jmachowinski

Your results are in line with what we have seen on Windows, where we have seen a wider variety of controllers. The unfortunate fact (perhaps fortunate, depending on how you look at it) is that using an unsupported host controller typically fails miserably right away. In this case the controller seems to run really well until the error hits.

wes-b avatar Oct 02 '20 20:10 wes-b

> I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

We performed the tests with the older firmware. The behavior is the same as with the newer firmware.

Is there a particular reason why the color image is transmitted in isochronous mode? The bulk transfer seems to be stable, and using it would likely solve this issue.

jmachowinski avatar Oct 06 '20 15:10 jmachowinski

I started a second test yesterday, just letting a webcam viewer run. It turns out that using the in-kernel UVC driver (e.g. opening /dev/video0) seems stable. The viewer has now been running for ~12 hours and the video stream is still working.

jmachowinski avatar Oct 07 '20 07:10 jmachowinski

Any update on this issue?

jmachowinski avatar Oct 28 '20 11:10 jmachowinski

We are experiencing similar behaviour in the following configuration:

  • Windows 10
  • PCIe expansion card with 2xUSB type C port with ASM3142 host controller
  • Connection via a single USB Type C cable

In this configuration:

  • Power and USB 3.0 data connection are fine (solid white led on the back)
  • The devices are enumerated correctly in device manager under USB devices
  • Azure Kinect is correctly detected by the Viewer
  • When I start the streams, no error is displayed in the log window and the streams are correctly received

The problem is that the RGB stream is unstable and produces drops in FPS and the image sometimes is corrupted (some part of the RGB image is displayed shifted with respect to the correct position).

Does anyone know whether the unstable behaviour of the RGB stream is caused exclusively by a USB host controller that is not officially supported by the Azure Kinect DK, as stated under host controller issues here: https://docs.microsoft.com/it-it/azure/kinect-dk/troubleshooting

@jmachowinski which host controllers did you use in the tests that showed the instability of the RGB stream?

SimoSR avatar Nov 22 '20 09:11 SimoSR

We tested almost every USB controller that you can buy as an extension card at the moment. The color stream fails with all of them. But as I said above, I don't believe this is an issue of the USB Controllers, but rather an issue with the Azure.

jmachowinski avatar Nov 25 '20 16:11 jmachowinski

Happy new year! Any news on this issue?

jmachowinski avatar Jan 11 '21 07:01 jmachowinski

@jmachowinski we have been investigating this issue using an ASMedia 3142 card and can see a bunch of device-level issues in the USB traces. However, whilst these issues result in dropped frames, they should not result in k4aviewer stopping. We would like to get a little more coverage on this and ask you to perform the following test.

Repeat your test using the built-in webcam application instead of k4aviewer. Does the webcam application fail after a short time?

qm13 avatar Feb 10 '21 22:02 qm13

@qm13 Are you referring to webcam by Gerd Knorr? If yes, shall I use a special configuration or just the default one?

For my last test I used guvcview

jmachowinski avatar Feb 12 '21 12:02 jmachowinski

I just repeated the test using guvcview, the camera fails after 5-10 minutes.

Kernel : 5.4.0-65-generic RGB camera firmware: 1.6.110 Resolution: 4K

jmachowinski avatar Feb 12 '21 12:02 jmachowinski

I also repeated the test a second time using guvcview, with the resolution reduced to 720P. This one has now been running for 3 hours without problems.

jmachowinski avatar Feb 12 '21 16:02 jmachowinski

The 720P test ran over the weekend (~60 Hours in total) and is still fine.

jmachowinski avatar Feb 15 '21 08:02 jmachowinski

@jmachowinski this seems to point to low level issues with some USB controllers under high bandwidth utilization. Some apps do better at handling the issues than others. How does the k4aviewer run with 720P?

qm13 avatar Feb 16 '21 03:02 qm13

@qm13 I still don't think the USB controller is the issue here. The bandwidth usage is actually very low: usbtop shows a maximum of 8 MB/sec if the 4K stream is active, plus ~1.7 MB/sec for depth and ~1.7 MB/sec for IMU and microphone. I can copy data at 500 MB/sec the whole day long from an SSD without issues.
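As a rough check of the numbers above, the sketch below adds up the reported stream rates and compares them with the payload ceiling of a 5 Gbit/s USB 3.0 link (500 MB/s after 8b/10b encoding, before protocol overhead). The function name is hypothetical, and the calculation looks at average throughput only: it does not model how isochronous bandwidth is reserved per service interval, so it cannot rule out scheduling problems on the bus.

```python
USB3_GEN1_PAYLOAD_MB_S = 500.0  # 5 Gbit/s line rate with 8b/10b encoding


def bus_utilization(streams_mb_s: dict) -> tuple:
    """Return (total MB/s, fraction of the USB 3.0 payload ceiling)."""
    total = sum(streams_mb_s.values())
    return total, total / USB3_GEN1_PAYLOAD_MB_S


# Rates as reported by usbtop in the comment above.
total, fraction = bus_utilization(
    {"color 4K": 8.0, "depth": 1.7, "imu + microphone": 1.7}
)
print(f"total {total:.1f} MB/s, {fraction:.1%} of the link ceiling")
```

With these figures the three streams together use only a few percent of the link's nominal capacity.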

Also note that we are using a Basler USB card (https://www.baslerweb.com/de/produkte/vision-komponenten/pc-karten/usb-3-0-interface-card-pcie-fresco-fl1100-4hc-x4-4ports/). This one works flawlessly under Linux for capturing from 12 MP Basler cameras.

Also, if the USB controller had an issue, I would expect all data streams on the bus to stall or break. This is not what we are seeing. Only the color stream breaks down; IMU, microphone and depth are fine.

I'll try to perform some tests with the k4aviewer next week.

jmachowinski avatar Feb 17 '21 10:02 jmachowinski

@jmachowinski the dev investigating this issue has not been able to repro it. He does not have access to a Linux box and investigated the issue on his Windows box. I am working on getting him a Linux box. In the meantime, would you be able to attempt to repro the issue on Windows?

qm13 avatar Feb 24 '21 01:02 qm13