[Bug] dataman_client unable to write to sd storage
Describe the bug
Starting in PX4 1.15.0, presumably with the introduction of https://github.com/PX4/PX4-Autopilot/commit/208552fdab773e19a39f49b8910031f617e11a94, dataman_client is unable to write to SD storage on my tested hardware (EchoPilot AI https://echomav.com/product/echopilot-ai/). According to the dmesg logs and dataman status, dataman itself reports a good status, but dataman_client times out after 5000 ms. This issue does not occur on PX4 1.14.4 (before the dataman_client refactor). I have tried restarting dataman from a MAVLink console after the system has started, but this does not fix the issue. The issue causes "unable to write to SD storage" errors throughout Mission Planner and QGroundControl. The SD card can be read and written successfully when tested via the MAVLink console (e.g. echo thisisatest > /fs/microsd/filename and cat /fs/microsd/filename).
CC: @Igor-Misic
To Reproduce
No response
Expected behavior
No response
Screenshot / Media
No response
Flight Log
No response
Software Version
1.15.0-1.15.4 (latest stable tag)
Flight controller
No response
Vehicle type
None
How are the different components wired up (including port information)
No response
Additional context
No response
I am facing a similar issue on a Pixhawk 6X with PX4 1.16.0 c0. I added a custom module implemented as a work queue item, with a DatamanClient declared inside the module method that is supposed to read all mission items synchronously from dataman. Reading the mission this way is impossible: it results in px4_poll timeouts, visible in dmesg. In SITL simulation, however, the issue disappears once the DatamanClient is declared as a class field. I suppose the DatamanClient constructor has to do with it.
@gmesm @rosiakpiotr, can you please post logs? Make sure you are collecting logs from boot rather than the default mode "starting by arming"; we need to get to the bottom of this and make sure it's not a regression.
@Igor-Misic, are you aware of any issues or have any tips you can share to help us debug?
@mrpollo Not really, @bkueng was the author of this architecture and he pushed the whole PR across the finish line.
Thanks!
@bkueng any tips for debugging here?
Unfortunately, I no longer have access to the hardware that displayed this issue. Oddly enough, I couldn't get the issue to occur on a board from a different batch (no hardware changes), so I'm not really sure what is going on. One of our users has the problem on their board, but I don't have a GitHub handle for them. I'll let them know via another method and maybe they can assist.
Hi everyone, I was the user gmesm was referring to. I'll have to look back to find the logs, but I'm fairly confident I still have them. I should be able to find time in a couple of weeks to go digging for this.
That generally sounds like it's hardware-related. Checking the perf counters would be interesting (perf on the shell).
I suppose the DatamanClient constructor has to do with it.
@rosiakpiotr Yes, DatamanClient needs to be instantiated on the same thread where it is used. With work queues in particular, it is fairly easy to make that mistake, because the module instantiation happens on a temporary thread.
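To illustrate the thread-affinity point above, here is a minimal sketch in plain C++ (no PX4 headers). Everything in it is hypothetical: ThreadBoundClient is only a stand-in that remembers which thread constructed it, and EagerModule / LazyModule are made-up names modelling a work item whose constructor runs on one thread while run() executes on another. It is not the actual DatamanClient implementation; it only shows the pattern of deferring construction until the first call on the thread that will use the object.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical stand-in for a client that owns per-thread resources.
// This is NOT the PX4 DatamanClient, only a model of the constraint.
class ThreadBoundClient {
public:
    ThreadBoundClient() : _owner(std::this_thread::get_id()) {}

    // Succeeds only when called from the thread that constructed the object.
    bool readSync() const { return std::this_thread::get_id() == _owner; }

private:
    std::thread::id _owner;
};

// The client is a plain member, so it is built on whichever thread
// constructs the module (the "temporary" startup thread in this model).
class EagerModule {
public:
    bool run() { return _client.readSync(); }

private:
    ThreadBoundClient _client;
};

// The client is built on the first run() call, i.e. on the worker thread.
class LazyModule {
public:
    bool run() {
        if (!_client) {
            _client.reset(new ThreadBoundClient());
        }
        return _client->readSync();
    }

private:
    std::unique_ptr<ThreadBoundClient> _client;
};

// Construct the module on the calling thread, then run it on another thread.
template <typename Module>
bool constructHereRunElsewhere() {
    Module module;
    bool ok = false;
    std::thread worker([&] { ok = module.run(); });
    worker.join();
    return ok;
}
```

In this model, constructHereRunElsewhere<EagerModule>() fails the ownership check while constructHereRunElsewhere<LazyModule>() passes it, because only the lazy variant creates the client on the thread that later calls it.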
I can reliably reproduce this phenomenon on the Pixhawk V5 board. First, I connect the board to QGC on my PC via USB. Then, I unplug the USB cable and switch to using a WiFi module to connect the board through any available serial port (with the PC's WiFi turned on to automatically connect to the WiFi module). This allows me to reproduce the MAVLink timeout shown in the following figure. Through simple debug output, I've determined that the timeout occurs specifically when QGC requests mission, fence, and rally point information upon connecting to the board. At this point, the dataman client is completely unresponsive and triggers a timeout after 5 seconds. This ultimately causes QGC to freeze for approximately 1 minute, and I'm unable to read or edit the board's flight plan in QGC until the board is restarted.
This phenomenon can be reliably reproduced in both v1.15 and v1.16.
The code _dataman_request_pub.publish(request); in DatamanClient::syncHandler doesn't seem to be taking effect.
Adding uint8 ORB_QUEUE_LENGTH = 8 to the end of DatamanRequest.msg may solve this problem, but I'm unsure of any potential consequences.
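One possible reading of why a longer queue might help, sketched as a toy model in plain C++: if a topic keeps only the newest message and several requests are published before the subscriber gets to run, the earlier ones are overwritten and never answered. This is an assumption about the failure mode, not a confirmed diagnosis, and BoundedTopic and deliveredAfterBurst are hypothetical names that do not reproduce the uORB API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model of a topic that keeps only the newest `depth` messages.
// Assumes depth >= 1. Hypothetical, not the uORB implementation.
class BoundedTopic {
public:
    explicit BoundedTopic(std::size_t depth) : _depth(depth) {}

    void publish(int request_id) {
        if (_queue.size() == _depth) {
            _queue.pop_front(); // the oldest unread message is overwritten
        }
        _queue.push_back(request_id);
    }

    // The subscriber reads everything that is still buffered.
    std::size_t drain() {
        const std::size_t delivered = _queue.size();
        _queue.clear();
        return delivered;
    }

private:
    std::size_t _depth;
    std::deque<int> _queue;
};

// Publish `burst` requests before the subscriber runs, then count how many it sees.
std::size_t deliveredAfterBurst(std::size_t depth, int burst) {
    BoundedTopic topic(depth);
    for (int id = 0; id < burst; ++id) {
        topic.publish(id);
    }
    return topic.drain();
}
```

With three back-to-back requests (for example mission, fence, and rally point), a depth of 1 delivers only the last one in this model, while a depth of 8 delivers all three. The trade-off of a deeper queue is additional memory per topic.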