rcl Change the starting time of the goal expiration timeout

Address one issue mentioned #1103

The expire timeout is being calculated from the time the goal was accepted. According to the design https://design.ros2.org/articles/actions.html#result-caching, it should be calculated from the time the goal is completed.

Nov 29 '23 03:11 Barry-Xu-2018

This patch doesn't introduce new interfaces. As a result, the current modification is not very clear for setting goal terminal timestamp.
To achieve a clearer implementation, I am considering adding a new interface to set the goal terminal timestamp for a specific goal handle. However, this would require modifications in both rclcpp and rclpy.
Would you please share your opinions?

Nov 29 '23 04:11 Barry-Xu-2018

@SteveMacenski

How is goal_handle->impl->goal_terminal_timestamp set initially?

If my understanding of your code is correct (I don't live in rcl) when calling _recalculate_expire_timer, which I assume is periodically called during an action server's normal operations, we reset to now the clock ticking down when its expired.

I don't understand the goal_terminal_timestamp == INT64_MAX condition then --> shouldn't we always be resetting the clock if the action is still active? That would make the most sense to me; while the goal is still actively being processed, we stave off the timeout to remove it by another full timeout length. Then when the last update / sending of the result comes so we don't call it anymore, the full countdown is considered before removal.

First, let me explain when _recalculate_expire_timer() is called.
Currently, there are two scenarios in which this function is called.

While goal state is changed to terminal state (success or cancel or abort), rcl_action_notify_goal_done() will be called. And _recalculate_expire_timer() will be called in rcl_action_notify_goal_done() for setting a timer for the next expiring goal.
While timer for goal expire is triggered, rcl_action_expire_goals() will be called. And _recalculate_expire_timer() will be called for setting a timer for the next expiring goal.

I use scenario 1 to set goal_terminal_timestamp.
In order to avoid scenario 2, the condition to set goal_terminal_timestamp include

The state of goal should be one of success or cancel and abort.

So the code to set goal_terminal_timestamp should be in this condition. This can avoid wrong setting goal_terminal_timestamp in scenario 2.

https://github.com/ros2/rcl/blob/a17bc95988910ef372c84b311fb30cac6accd91e/rcl_action/src/rcl_action/action_server.c#L447

goal_terminal_timestamp isn't set before.

I want to use INT64_MAX as invalid value (unset state). So, only if goal_terminal_timestamp is INT64_MAX, goal_terminal_timestamp can be updated to current time. But invalid value is unsuitable (From Fujita-san' review comment). I will change it to 0.

Dec 07 '23 04:12 Barry-Xu-2018

CI:

Linux
Linux-aarch64
Windows

Dec 07 '23 16:12 fujitatomoya

setting a timer for the next expiring goal.

In this case, why is there any timer still set for expiring the goals? That seems like the root cause of the problem; our goals are being expired before they're done due to a terminal condition if they're long-running.

In this design, that seems to me should be completely removed and this check in terminal states (success or cancel or abort) is all that should be required. This is where the timer to count down the expiration of the goal handle should be initiated.

Alternatively, the timer for expiring the next goal should be adjusted during action server operations from checking to resetting the clock to the full timeout. That way the timeout would only trigger when its become stale. I think the first way is better though. That removes alot of unnecessary complication that would be removed with starting the countdown where you've made changes alone. God knows simplifying actions would reduce the likelihood of future issues

Dec 07 '23 19:12 SteveMacenski

@Barry-Xu-2018 can you rebase your development branch? it is failing build https://ci.ros2.org/job/ci_linux/19950/console since it does not include some changes in the mainline. https://github.com/Barry-Xu-2018/rcl/compare/review/topic-use-goal-terminal-timestamp-to-expire-check...ros2%3Arcl%3Arolling

Dec 07 '23 19:12 fujitatomoya

can you rebase your development branch? it is failing build https://ci.ros2.org/job/ci_linux/19950/console since it does not include some changes in the mainline. https://github.com/Barry-Xu-2018/rcl/compare/review/topic-use-goal-terminal-timestamp-to-expire-check...ros2:rcl:rolling

Rebase was done. @fujitatomoya

Dec 08 '23 01:12 Barry-Xu-2018

@SteveMacenski

setting a timer for the next expiring goal.

In this case, why is there any timer still set for expiring the goals? That seems like the root cause of the problem; our goals are being expired before they're done due to a terminal condition if they're long-running.

Sorry, I'm not quite clear about the scenarios you described. Could you please describe them again? The setting of the expire timer begins when a goal enters the terminal state. It is not related to how long the goal has been running.

https://github.com/ros2/rcl/pull/1121#issuecomment-1844405572 In the two scenarios described in my comment, it may be necessary to reset the expire timer:

When the state of a goal enters the terminal state, if the expire timer has not started yet (e.g., for the first terminal goal, or if the previous expire timer was triggered earlier but there was no subsequent expire time, so the timer was canceled), then the expire timer should be started. If the expire timer already started, nothing need to be done.
When the expire timer is triggered, at this point, we should calculate the nearest upcoming expire time and set the expire timer accordingly.

Dec 08 '23 01:12 Barry-Xu-2018

CI:

Linux
Linux-aarch64
Windows

Dec 08 '23 02:12 fujitatomoya

There is a failure in windows test. projectroot.src.core.ddsc.tests.CUnit_ddsc_whc_check_end_state
After checking the log, it seems that the failure is unrelated to this PR.

Dec 08 '23 09:12 Barry-Xu-2018

@sloretz @iuhilnehc-ynos either of you, can you take a look at this? i would like to have another review on this.

Dec 08 '23 17:12 fujitatomoya

Your two situations this method is called:

While goal state is changed to terminal state (success or cancel or abort), rcl_action_notify_goal_done() will be called. And _recalculate_expire_timer() will be called in rcl_action_notify_goal_done() for setting a timer for the next expiring goal.

I'll shorthand this as achieving a terminal state

While timer for goal expire is triggered, rcl_action_expire_goals() will be called. And _recalculate_expire_timer() will be called for setting a timer for the next expiring goal.

I'll shorthand the timer for goal expiring as "the ticking clock"

My assertion is that you found a clever place to make modifications to the goal handle's state: at achieving the terminal state. We know that once we achieve this state, the goal is done processing and anytime before this point, the goal is still processing (or was dropped on the floor from a users' crappy application, which no matter what is going to cause problems).

If we have a place (e.g. here) we can modify a ticking clock, that calls into question the need for the ticking clock prior to this point at all. The issue we want to solve is that ticking clock expiring still-processing goals -- thus why even have the ticking clock have the ability to expire still-processing goals? I'd argue that it shouldn't even be possible. We should be waiting until the terminal state to trigger the clock to start its countdown until the handle / result are deleted. Of which, the default of 10s or what-have-you is totally appropriate and fully solves the problem.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

Dec 08 '23 19:12 SteveMacenski

@SteveMacenski

Thank you for the further explanation.

If we have a place (e.g. here) we can modify a ticking clock, that calls into question the need for the ticking clock prior to this point at all. The issue we want to solve is that ticking clock expiring still-processing goals -- thus why even have the ticking clock have the ability to expire still-processing goals? I'd argue that it shouldn't even be possible. We should be waiting until the terminal state to trigger the clock to start its countdown until the handle / result are deleted. Of which, the default of 10s or what-have-you is totally appropriate and fully solves the problem.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

Your current question revolves around "the ticking clock", but I haven't fully caught your idea yet. To clarify, I use an example for the following discussion.

For "the ticking clock", while the timer is triggered (At 2 in above diagram), the next expire time need to be calculated.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

There are currently two time points where "the ticking clock" is checked. One is when a goal enters the termination state, and the other is when the ticking clock timer is triggered. When you mentioned "the goal is processing," in what situation does that occur?

If expire timer has been run, a goal enters terminal state to call _recalculate_expire_timer(). At this case, calculating and setting the timer is an unnecessary operation.
While the expire timer is triggered (At 2 in above diagram), I think calculating and setting the timer is a necessary operation.

Dec 11 '23 08:12 Barry-Xu-2018

When you mentioned "the goal is processing," in what situation does that occur?

That's a question you're most suited to answer than me. With the current behavior before this PR, where was the goal being expired resulting in the bug that this is meant to address (which was expiring still-processing goals causing issues)? That was happening which triggered this discussion. My suggestion is removing entirely where-ever that is happening if we're able to start the count down to removing a goal at a terminal state. You've shown the part of the code where that terminal state is found to be able to reset a timer. Instead, we could just start it here and remove the previous problematic countdown that previously caused us issues.

I ack that perhaps I'm missing something and that my suggestion isn't fully informed with the ins-and-outs of rcl. What I'm trying to suggest at its core is this, regardless of the specific details:

We previously had a problem where goals were expiring mid-execution and we weren't getting the results back because they were no longer in existence. That required us in Nav2 to take somewhat drastic measures to expose this not-well-documented timeout to users as a temporary work around.
You've identified a part of the code where we know the goal has just reached its terminal state
Thus: my suggestion is to completely rip out the existing logic that caused the bug in the first place with a timeout existing at all during normal goal processing -- and replace it with logic in the goal-terminal-state-reached section of the code to start the timeout clock before expiring the goal's information.

To my eye, this PR resets some timeouts, but still leaves the original timeout in place that caused the problem in the past - which could still cause a problem moving forward as well if you have a timeout of 10s but an action length of 10 minutes (which, we have).

Dec 11 '23 19:12 SteveMacenski

@SteveMacenski

Thank you for detailed explanation.
I understand the concerns you mentioned.
Before this patch, when a goal handle is created, this time point is already recorded as the starting time for expire calculation. Of course, the expire timer is not set at this time. When the goal enters the terminal state, this time point is used to calculate and set the time on the expire timer. In this scenario, if the time it takes to process the goal is longer than the set timeout, when the goal enters the terminal state, the result will not be retained for the timeout duration; instead, it will be immediately removed.

This PR ensures that the current time is recorded as the start time for expiration only when the goal enters the terminal state. The expire timer is set based on this time. Therefore, with the current modifications, it is impossible for the goal to expire while being processed. In rcl_action_expire_goals(), it will get time of goal termination by calling rcl_action_goal_handle_get_goal_terminal_timestamp() to check if expiration occurs.

Dec 12 '23 07:12 Barry-Xu-2018

OK, got it. That makes sense, you can ignore my comments then :-) This is a very important fix for us in Nav2 so I wanted to make sure beyond the shadow of my ignorance that this was going to fully resolve the problem :wink: I very much appreciate your time here!

Dec 12 '23 18:12 SteveMacenski

@Barry-Xu-2018 since you added the fix from last CI, i started the full CI.

Linux
Linux-aarch64
Windows

@SteveMacenski thank you for taking your time for the review!

Dec 12 '23 18:12 fujitatomoya

Linux-aarch64

Dec 14 '23 16:12 fujitatomoya

What's the good word?

Jan 10 '24 19:01 SteveMacenski

we would like to have 2nd review on this.

CC: @wjwwood @clalancette @alsora @mjcarroll

Jan 10 '24 22:01 fujitatomoya

Any hope for getting this merged soon? A fix for this is highly appreciated.

Apr 15 '24 12:04 Bjoernolav

@Barry-Xu-2018 can you rebase this?

Apr 30 '24 02:04 fujitatomoya

@fujitatomoya Rebase was done.

Apr 30 '24 04:04 Barry-Xu-2018

@alsora @wjwwood @mjcarroll we would like to request the review from either of you before CI.

Apr 30 '24 16:04 fujitatomoya

Rebase was done.

May 28 '24 05:05 Barry-Xu-2018

@alsora @wjwwood @clalancette @mjcarroll i think this is one of the major issue for action, sadly we missed this for jazzy release. could you take a look at this?

Jun 07 '24 05:06 fujitatomoya

CI: