rcl
rcl copied to clipboard
Goal status corruption when goal takes longer to finishes than ActionServer's result_timeout
Bug report
Required Info:
- Operating System:
- Ubuntu 20.04
- Installation type:
- binaries
- Version or commit hash:
- ros-foxy-rcl/focal,now 1.1.13-1focal.20220204.170046 amd64
- DDS implementation:
- CycloneDDS
- Client library (if applicable):
- rclpy
Steps to reproduce issue
- Create an
ActionServer
with a certainresult_timeout
(default: 900s) and anexecute_callback
that callsgoal_handle.succeed()
after a period greater thanresult_timeout
. - Create an
ActionClient
, send a goal and wait for the result, e.g.,action_client.send_goal_async(goal).add_done_callback(_on_goal_response)
with:
def _on_goal_response(future):
goal_handle = future.result()
if goal_handle.accepted:
goal_handle.get_result_async().add_done_callback(_on_goal_result)
def _on_goal_result(future):
status = future.result().status
print(f"Goal result status: {status}")
- Keep sending a new goal when the previous one finishes.
Expected behavior
All goal result status values are equal to action_msgs.msg.GoalStatus.STATUS_SUCCEEDED
.
Actual behavior
Some goal status values are equal to action_msgs.msg.GoalStatus.STATUS_SUCCEEDED
and some have an invalid value, e.g., -59
or -91
.
Additional information
In rcl_action_expire_goals()
(action_server.c
), inactive goals are checked against their creation timestamp and considered expired if they are created longer than result_timeout
ago. If this function is called directly after the goal succeeds, but before sending the result to the ActionClient
, the goal might be considered as expired, its handle deallocated and bogus result data sent back to the ActionClient
. Whether the function is called at this point or not seems to depend on the exact course of events and their timing.
I believe the intended behavior is to cache the goal result after the goal finishes (succeed()
, abort()
, canceled()
) to give the client some time to retrieve the result.
Here is some debug log data of one of our production nodes - an ActionServer
- to illustrate the course of events. I did not include all lines to avoid clutter, but I'd be happy to send the full log if so desired. Several times, STATUS_SUCCEEDED
is returned (4
) and sometimes invalid values (-59
, -91
):
[1647335999.015429264] [action_server_node]: Executing goal with ID [215 184 120 184 118 145 70 116 178 10 54 132 216 44 110 194]
[1647336013.871788332] [action_server_node]: Goal reached!
[1647336013.873899139] [rcl]: Updated timer period from '900000000000ns' to '885129670410ns'
[1647336013.878504394] [action_server_node]: Goal with ID [215 184 120 184 118 145 70 116 178 10 54 132 216 44 110 194] finished with state 4
[1647336432.312907051] [action_server_node]: Executing goal with ID [ 90 235 184 166 112 213 67 229 171 43 38 9 89 154 96 65]
[1647336913.873975199] [rcl]: Timer in wait set is ready
[1647337229.639817858] [action_server_node]: Goal reached!
[1647337229.641596384] [rcl]: Updated timer period from '885129670410ns' to '102661039871ns'
[1647337229.645905871] [action_server_node]: Goal with ID [ 90 235 184 166 112 213 67 229 171 43 38 9 89 154 96 65] finished with state 4
[1647337879.914539520] [action_server_node]: Executing goal with ID [ 4 220 198 104 215 48 76 63 155 0 191 39 45 65 21 48]
[1647338114.771408273] [rcl]: Timer in wait set is ready
[1647338781.127302593] [action_server_node]: Goal reached!
[1647338781.128931152] [rcl]: Updated timer period from '102661039871ns' to '0ns'
[1647338781.133443158] [action_server_node]: Goal with ID [ 4 220 198 104 215 48 76 63 155 0 191 39 45 65 21 48] finished with state 4
[1647338883.790117503] [rcl]: Timer in wait set is ready
[1647339426.485650098] [action_server_node]: Executing goal with ID [210 78 147 224 212 228 78 138 142 42 122 80 252 208 240 141]
[1647340502.939447051] [action_server_node]: Goal reached!
[1647340502.941234100] [rcl]: Updated timer period from '0ns' to '0ns'
[1647340502.942209879] [rcl]: Timer in wait set is ready
[1647340502.947943806] [action_server_node]: Goal with ID [210 78 147 224 212 228 78 138 142 42 122 80 252 208 240 141] finished with state -91
[1647341152.600432900] [action_server_node]: Executing goal with ID [ 70 12 15 101 50 107 69 114 157 101 137 177 149 8 239 36]
[1647342335.974547175] [action_server_node]: Goal reached!
[1647342335.976116271] [rcl]: Updated timer period from '0ns' to '0ns'
[1647342335.976885865] [rcl]: Timer in wait set is ready
[1647342335.983043025] [action_server_node]: Goal with ID [ 70 12 15 101 50 107 69 114 157 101 137 177 149 8 239 36] finished with state -59
[1647342922.961496135] [action_server_node]: Executing goal with ID [161 156 97 122 246 126 72 167 159 114 98 74 115 120 1 203]
[1647343358.867554586] [rcl]: Updated timer period from '0ns' to '464087590402ns'
[1647343358.868359240] [rcl]: Timer in wait set is ready
[1647343358.869276426] [rcl]: Updated timer period from '464087590402ns' to '464085861582ns'
[1647343358.873525096] [action_server_node]: Goal with ID [161 156 97 122 246 126 72 167 159 114 98 74 115 120 1 203] finished with state 5
[1647343705.921281623] [action_server_node]: Executing goal with ID [ 36 17 229 73 24 216 74 170 148 57 118 169 88 245 57 207]
[1647343822.957017623] [rcl]: Timer in wait set is ready
[1647344134.444931055] [action_server_node]: Goal reached!
[1647344134.446515204] [rcl]: Updated timer period from '464085861582ns' to '471466247058ns'
[1647344134.450559400] [action_server_node]: Goal with ID [ 36 17 229 73 24 216 74 170 148 57 118 169 88 245 57 207] finished with state 4
[1647344598.532509457] [rcl]: Timer in wait set is ready
[1647344598.533403778] [rcl]: Updated timer period from '471466247058ns' to '7379354687ns'
[1647345069.999721681] [rcl]: Timer in wait set is ready
[1647345902.897012440] [action_server_node]: Executing goal with ID [106 242 246 207 113 219 74 13 142 76 224 184 220 234 180 216]
[1647346828.269364764] [action_server_node]: Goal reached!
[1647346828.271057242] [rcl]: Updated timer period from '7379354687ns' to '0ns'
[1647346828.275420165] [action_server_node]: Goal with ID [106 242 246 207 113 219 74 13 142 76 224 184 220 234 180 216] finished with state 4
[1647346835.650637372] [rcl]: Timer in wait set is ready
[1647347243.613745355] [action_server_node]: Executing goal with ID [ 79 212 168 155 71 58 77 135 155 13 216 146 157 219 37 100]
[1647347451.555608172] [action_server_node]: Goal reached!
[1647347451.557238031] [rcl]: Updated timer period from '0ns' to '692046631481ns'
[1647347451.558081235] [rcl]: Timer in wait set is ready
[1647347451.559103617] [rcl]: Updated timer period from '692046631481ns' to '692044761724ns'
[1647347451.563838790] [action_server_node]: Goal with ID [ 79 212 168 155 71 58 77 135 155 13 216 146 157 219 37 100] finished with state 4
[1647348143.605808335] [rcl]: Timer in wait set is ready
[1647348392.123199612] [action_server_node]: Executing goal with ID [166 131 71 148 187 142 66 184 160 139 82 226 88 39 73 94]
[1647349471.575331372] [action_server_node]: Goal reached!
[1647349471.577114966] [rcl]: Updated timer period from '692044761724ns' to '0ns'
[1647349471.581504490] [action_server_node]: Goal with ID [166 131 71 148 187 142 66 184 160 139 82 226 88 39 73 94] finished with state 4
[1647350113.120376174] [action_server_node]: Executing goal with ID [239 228 125 68 244 214 78 184 128 218 235 19 192 41 125 88]
[1647350163.621988713] [rcl]: Timer in wait set is ready
[1647351415.236826018] [action_server_node]: Goal reached!
[1647351415.238423224] [rcl]: Updated timer period from '0ns' to '0ns'
[1647351415.239265006] [rcl]: Timer in wait set is ready
[1647351415.245248589] [action_server_node]: Goal with ID [239 228 125 68 244 214 78 184 128 218 235 19 192 41 125 88] finished with state -91
Whenever I observe Updated timer period from '0ns' to '0ns'
after the goal is reached, an invalid state is returned.
Yes, I think the correct thing to do is to expire goal results based on the time the result is ready (not the goal creation time).
I think one way we could implement this would be to add a member to rcl_action_goal_handle_impl_s to record the result timestamp. We could then note the time when we are notified that a goal has reached a terminal state, e.g. change the signature of this function to include a goal UUID: https://github.com/ros2/rcl/blob/392f0d3ae6a29d4b88689b954b14f19716e44233/rcl_action/include/rcl_action/action_server.h#L648-L650
Changing the signature of rcl_action_notify_goal_done
is not a backwards compatible fix, though I'm not sure we want to port a fix to an existing release anyways since the change in behavior could be disruptive enough.
At least the workaround is simple enough: pass a custom result_timeout
when creating the ActionServer
. Choose it larger than the expected largest time duration to reach a terminal goal state and to retrieve that result, and shorter than the time duration expected to cause memory issues when keeping too many goal results in memory (unlikely).
@jacobperron Without changing the timing behavior or function signatures, do you think it's possible to create and backport a fix for the memory issue that causes the invalid returned goal state? I was hoping that the goal handle could be properly cleaned up before sending the result value. This would result in a STATUS_UNKNOWN
for an 'unknown'/expired goal I guess.
I'll admit I don't know whether this can be solved in rcl
, or that rclpy
is (also) causing it.
I'm asking because it took me quite some time to debug this issue. It's not trivial for unaware users that observing invalid goal states has something to do with the result_timeout
, whether it's server or client related, etc.
It was also hard to reproduce at first, given the default timeout of 900s.
Aside from the timing bug, there is still the race you pointed out. E.g. if we set the expiry time to zero (expire goals immediately), then I guess we'll run into the same invalid memory issue. IMO, this is a bug that should be fixed and backported.
I suspect it is something that needs to be fixed at the rclcpp
/ rclpy
layer, but I don't know for sure without taking some more time to dig into the code.