[Bug] dataman_client unable to write to sd storage

Open gmesm opened this issue 9 months ago • 1 comments

Describe the bug

Starting in PX4 1.15.0, presumably with the introduction of https://github.com/PX4/PX4-Autopilot/commit/208552fdab773e19a39f49b8910031f617e11a94, dataman_client is unable to write to SD storage on my tested hardware (EchoPilot AI, https://echomav.com/product/echopilot-ai/). Judging by the dmesg logs and dataman status, dataman itself reports a good status, but dataman_client times out after 5000 ms. This issue does not occur on PX4 1.14.4 (before the dataman_client refactor). I have tried restarting dataman from a MAVLink console after the system has started, but this does not fix the issue. The issue results in errors about being unable to write to SD storage throughout Mission Planner and QGroundControl. The SD card can be read and written successfully when tested via the MAVLink console (e.g. echo thisisatest > /fs/microsd/filename and cat /fs/microsd/filename).

CC: @Igor-Misic

To Reproduce

No response

Expected behavior

No response

Screenshot / Media

No response

Flight Log

No response

Software Version

1.15.0-1.15.4 (latest stable tag)

Flight controller

No response

Vehicle type

None

How are the different components wired up (including port information)

No response

Additional context

No response

gmesm avatar Mar 25 '25 18:03 gmesm

I am facing a similar issue on a Pixhawk 6X with PX4 1.16.0 c0. I added a custom module implemented as a work queue item, with a DatamanClient declared locally in a module method (the method that is supposed to read all mission items synchronously from the dataman). This made it impossible to read the mission: every attempt results in px4_poll timeouts, visible in dmesg. In SITL simulation, however, the issue disappears once DatamanClient is declared as a class field. I suppose the DatamanClient constructor has to do with it.

rosiakpiotr avatar Jun 12 '25 08:06 rosiakpiotr

@gmesm @rosiakpiotr, can you please post logs? Make sure you are collecting logs from boot rather than the default mode "starting by arming"; we need to get to the bottom of this and make sure it's not a regression.

@Igor-Misic, are you aware of any issues or have any tips you can share to help us debug?

mrpollo avatar Jun 23 '25 15:06 mrpollo

@mrpollo Not really, @bkueng was the author of this architecture and he pushed the whole PR across the finish line.

Igor-Misic avatar Jun 23 '25 17:06 Igor-Misic

Thanks!

@bkueng any tips for debugging here?

mrpollo avatar Jun 23 '25 17:06 mrpollo

> @gmesm @rosiakpiotr, can you please post logs? Make sure you are collecting logs from boot rather than the default mode "starting by arming"; we need to get to the bottom of this and make sure it's not a regression.

Unfortunately, I no longer have access to the hardware that displayed this issue. Oddly enough, I couldn't get the issue to occur on a board from a different batch (no hardware changes), so I'm not really sure what is going on. One of our users has the problem on their board, but I don't have a GitHub account for them. I'll let them know via another method and maybe they can assist.

gmesm avatar Jun 25 '25 14:06 gmesm

Hi everyone, I was the user gmesm was referring to. I'll have to look back to find the logs, but I'm fairly confident I still have them. I should be able to find time in a couple of weeks to go digging for this.

ctitus1 avatar Jul 03 '25 16:07 ctitus1

That generally sounds like it's hardware-related. Checking the perf counters would be interesting (perf on the shell).

> I suppose the DatamanClient constructor has to do with it.

@rosiakpiotr Yes, DatamanClient needs to be instantiated on the same thread where it is used. For work queues specifically, it is somewhat easy to make that mistake, as the module instantiation happens on a temporary thread.
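
To make the thread constraint concrete, here is a minimal, self-contained C++ sketch. It deliberately does not use PX4 headers: ThreadBoundClient, EagerItem, LazyItem and runOnWorker are hypothetical names that only model the situation, not the real DatamanClient or work queue API. The sketch contrasts a client built as a class member (on whatever thread creates the item) with one built lazily on first use inside the run method.

```cpp
#include <memory>
#include <thread>

// Hypothetical stand-in for a client that only works on the thread that constructed it.
class ThreadBoundClient {
public:
    ThreadBoundClient() : _owner(std::this_thread::get_id()) {}

    // True only when called from the constructing thread.
    bool usedOnOwnerThread() const { return std::this_thread::get_id() == _owner; }

private:
    std::thread::id _owner;
};

// Client is a plain member: it is constructed on the thread that instantiates the item.
class EagerItem {
public:
    bool run() { return _client.usedOnOwnerThread(); }

private:
    ThreadBoundClient _client;
};

// Client is created on first use, so it is constructed on the thread that calls run().
class LazyItem {
public:
    bool run() {
        if (!_client) {
            _client = std::make_unique<ThreadBoundClient>();
        }
        return _client->usedOnOwnerThread();
    }

private:
    std::unique_ptr<ThreadBoundClient> _client;
};

// The item was instantiated by the caller; run() executes on a separate worker thread.
template <typename Item>
bool runOnWorker(Item &item) {
    bool ok = false;
    std::thread worker([&item, &ok] { ok = item.run(); });
    worker.join();
    return ok;
}
```

In this model, runOnWorker reports a thread mismatch for an EagerItem created by the caller and a match for a LazyItem. Whether lazy construction is the right fix for a particular PX4 module is an assumption to verify against the real DatamanClient.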

bkueng avatar Jul 09 '25 08:07 bkueng

I can reliably reproduce this on the Pixhawk V5 board. First, I connect the board to QGC on my PC via USB. Then I unplug the USB cable and connect through a WiFi module attached to any available serial port on the board (with the PC's WiFi turned on so that it automatically connects to the WiFi module). This reproduces the MAVLink timeout shown in the following figure. Through simple debug output, I've determined that the timeout occurs specifically when QGC requests mission, fence, and rally point information upon connecting to the board. At that point the dataman client is completely unresponsive and times out after 5 seconds. This ultimately causes QGC to freeze for approximately 1 minute, and I'm unable to read or edit the board's flight plan in QGC until the board is restarted.

This can be reliably reproduced in both v1.15 and v1.16.

The code _dataman_request_pub.publish(request); in DatamanClient::syncHandler doesn't seem to be taking effect.

Adding uint8 ORB_QUEUE_LENGTH = 8 to the end of DatamanRequest.msg may solve this problem, but I'm unsure of any potential consequences.
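
One possible reading of why a longer queue might help, offered as an assumption and not confirmed anywhere in this thread: if several requests are published before the consumer reads the topic, a topic that keeps only the newest message drops the earlier ones. The C++ sketch below models that with a hypothetical BoundedTopic class and a hypothetical deliveredAfterBurst helper. It is not the uORB implementation and says nothing about whether this is the actual cause here.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical model of a topic that retains only the newest `depth` messages.
class BoundedTopic {
public:
    // A depth of 0 is treated as 1 so the topic can always hold one message.
    explicit BoundedTopic(std::size_t depth) : _depth(depth == 0 ? 1 : depth) {}

    // Publishing never blocks: when the buffer is full, the oldest unread message is dropped.
    void publish(int request_id) {
        if (_buffer.size() == _depth) {
            _buffer.pop_front();
            ++_dropped;
        }
        _buffer.push_back(request_id);
    }

    // Copies the oldest unread message into `out`; returns false if nothing is pending.
    bool take(int &out) {
        if (_buffer.empty()) {
            return false;
        }
        out = _buffer.front();
        _buffer.pop_front();
        return true;
    }

    std::size_t dropped() const { return _dropped; }

private:
    std::size_t _depth;
    std::deque<int> _buffer;
    std::size_t _dropped{0};
};

// Publishes `count` requests back to back, then counts how many the consumer still receives.
std::size_t deliveredAfterBurst(std::size_t depth, int count) {
    BoundedTopic topic(depth);
    for (int i = 0; i < count; ++i) {
        topic.publish(i);
    }
    std::size_t delivered = 0;
    int id = 0;
    while (topic.take(id)) {
        ++delivered;
    }
    return delivered;
}
```

In this model, a burst of three requests (for example mission, fence, and rally point) against a depth of 1 delivers only the last request, while a depth of 8 delivers all three. A deeper queue costs extra memory per topic in this model; the real trade-offs in uORB would need to be checked separately.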

Image

Coekjin avatar Nov 15 '25 12:11 Coekjin