ros2_controllers icon indicating copy to clipboard operation
ros2_controllers copied to clipboard

Fix segfault in JTC and simplify checks

Open JafarAbdi opened this issue 2 years ago • 4 comments

This PR fixes a segfault in JTC, we were getting a segfault every time we deactivate and then re-activate the controllers.

Attaching a debugger to the node shows read_state_from_hardware(state_desired_) causing the segfault, for some reason the variable in empty, digging more I don't see this variable being modified much, but I suspect it's related to the call to sample which set it to the default value output_state = trajectory_msgs::msg::JointTrajectoryPoint()

JafarAbdi avatar Aug 23 '22 21:08 JafarAbdi

Are you changing interfaces between reactivation cycles?

No, we don't, it's the same controller

JafarAbdi avatar Aug 24 '22 14:08 JafarAbdi

I am not convinced that this actually solves the cause of the issue. I would say it's more about hiding the real issue. I would wait to reproduce this in the tests and then know what is causing this...

destogl avatar Aug 26 '22 06:08 destogl

This pull request is in conflict. Could you fix it @JafarAbdi?

mergify[bot] avatar Sep 07 '22 08:09 mergify[bot]

Codecov Report

Merging #423 (154bbd1) into master (e7f9962) will decrease coverage by 3.31%. The diff coverage is 26.60%.

@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
- Coverage   35.78%   32.48%   -3.31%     
==========================================
  Files         189        7     -182     
  Lines       17570      665   -16905     
  Branches    11592      357   -11235     
==========================================
- Hits         6287      216    -6071     
+ Misses        994      157     -837     
+ Partials    10289      292    -9997     
Flag Coverage Δ
unittests 32.48% <26.60%> (-3.31%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ontroller/test/test_load_diff_drive_controller.cpp 12.50% <0.00%> (ø)
diff_drive_controller/src/odometry.cpp 42.16% <11.11%> (ø)
diff_drive_controller/src/speed_limiter.cpp 46.55% <11.11%> (ø)
...ive_controller/test/test_diff_drive_controller.cpp 17.62% <12.08%> (ø)
...troller/include/diff_drive_controller/odometry.hpp 20.00% <20.00%> (ø)
...iff_drive_controller/src/diff_drive_controller.cpp 39.22% <35.50%> (ø)
...de/diff_drive_controller/diff_drive_controller.hpp 100.00% <100.00%> (ø)
...ontrollers/src/joint_group_velocity_controller.cpp
...test/test_load_joint_group_velocity_controller.cpp
...include/joint_trajectory_controller/tolerances.hpp
... and 191 more

codecov-commenter avatar Sep 07 '22 11:09 codecov-commenter

I'm sorry, I forgot to mention that the test I added has an issue. I had a call with Denis, and he asked me to push my WIP test, so he could take a look and see if he can help finish it!

JafarAbdi avatar Dec 16 '22 15:12 JafarAbdi

I tried to merge this with the old reactivation test, where actual trajectory execution is done. The complexity there is that we change the controller to use input time and period in the update method, so their parameters should be adjusted.

First the test need to replicate the described issue and after that show that fix is working. The position, velocity and acceleration filed don't need to have the same size, i.e., they may be uninitialized if not used. But they should definitely be allocated in the configure method. The question here is when after reactivation we loose the initialization in the memory of those filed. That's confusing for me – this is the reason I think we have much significant issue then this fix proposes.

destogl avatar Jan 24 '23 22:01 destogl

@swiz23 noticed a segfault in the JTC specifically when aborting a motion, followed by deactivating, followed by activating where the segfault would occur. After looking at this PR, I think it very much has to do with the line output_state = trajectory_msgs::msg::JointTrajectoryPoint(). I think the flow of events occur as follows:

  1. When aborting, set_hold_position() is called. That essentially adds and writes an empty trajectory message in order to disregard the current one.
  2. When sample() is called in the update() loop, state_desired_ gets overwritten with that empty trajectory point meaning it has an empty position, velocity, and acceleration array.
  3. The controllers get deactivated
  4. The controllers get reactivated. When this happens, read_state_frome_hardware is called on state_desired_. Within that function, we try to index the position array within state_deisred_ as if it has the same size as dof_ when it in fact no longer does. This causes the segault.

The solution is to simply move the line defining the output_state within the sample() function in trajectory.cpp to after the last possible place that the function can return false -> here. I tested this out at least for the case that @swiz23 had issues with and verified that there was no longer a segfault.

mechwiz avatar Feb 06 '23 23:02 mechwiz

Thanks @mechwiz for the fix. I'm closing this since someone else who currently can reproduce the issue have a fix

JafarAbdi avatar Feb 07 '23 07:02 JafarAbdi