rcutils
rcutils copied to clipboard
Simplify error handling in failure mode or during finalization
Rationale
Error handling support in rcutils
(and by extension, in every ROS 2 C library implementation or client code) falls short when it comes to error propagation when in failure mode or during finalization. As discussed in ros2/rmw_fastrtps#414 and ros2/rmw_cyclonedds#210, ensuring the initial error gets propagated, while no subsequent error goes unnoticed, clutters client code significantly.
Proposal
Introduce a mechanism for rcutils
to distinguish between an error overwrite and a nested error that should be handled differently (e.g. logging to stderr
directly). For that matter, have a thread local integer to track nested error handling scopes and a boolean in local storage to ensure a single error handling scope per function.
With an API like:
typedef struct rcutils_error_handling_scope_t {
bool active;
} rcutils_error_handling_scope_t;
void rcutils_error_handling_scope_init(rcutils_error_handling_scope_t * scope);
void rcutils_error_handling_scope_enter(rcutils_error_handling_scope_t * scope);
void rcutils_error_handling_scope_leave(rcutils_error_handling_scope_t * scope);
a function may request rcutils_set_error_state
to behave differently in between _enter
and _leave
calls, for its own code and the call chain that follows after it. In C++, RAII may be used to simplify API use even further.
FYI @clalancette @ivanpauno
My problem with a call chain is that you cannot know how deep it needs to be. If we have one then, in my opinion, it should be preallocated per thread and have a fixed depth, similar to how the error message itself has a fixed length and is preallocated per thread.
I'm not sure I follow. The purpose of rcutils_error_handling_scope_t
is not to store anything, but to aid changing error handling macros (e.g. RCUTILS_SET_ERROR_MSG
) behavior whenever errors occur while handling another error. I guess we could generate full backtraces within each error handling "scope", but we also can start small and just print to stderr
(like we do explicitly in quite a few places).
My problem with a call chain is that you cannot know how deep it needs to be
Wouldn't it be possible to make a reasonable guess for how much we need, at least inside the ROS API? We can take a look at our call graph and see how deep it goes.
The purpose of rcutils_error_handling_scope_t is not to store anything, but to aid changing error handling macros (e.g. RCUTILS_SET_ERROR_MSG) behavior whenever errors occur while handling another error.
I think the goal should be to avoid that in the first place.
The error state is trivially copied, so I'd just recommend storing the error state locally before starting cleanup that might set the error state again. For example:
rcutils_ret_t rcutils_ret = rcutils_failing_function(...);
if (RCUTILS_RET_OK != rcutils_ret) {
rcutils_error_state_t error_state = *rcutils_get_error_state();
rcutils_reset_error(); // cannot fail
rcl_ret_t ret = rcl_cleanup_function(...);
if (RCL_RET_OK != ret) {
RCUTILS_SAFE_FWRITE_TO_STDERR("[some context:42] failed to clean up foo while handling error: ");
new_rcutils_print_error_state_to_stderr(&error_state); // cannot fail, this is the first error state
} else {
rcutils_set_error_state(error_state.message, error_state.file, error_state.line_number);
}
}
Now, that's obviously awful to write, and I think we should try to improve that workflow, but I don't think we should be pushing behavior into the error handling code, to be honest.
Maybe I would be more convinced to see your proposed API additions in action, and see how it makes a use case before and after better.
The error state is trivially copied, so I'd just recommend storing the error state locally before starting cleanup that might set the error state again.
That's precisely the kind of code that I'd like to avoid. Code gets cluttered very quickly, particularly for procedures that have to keep going after an error while not losing track of it (e.g. finalization functions).
Now, that's obviously awful to write, and I think we should try to improve that workflow, but I don't think we should be pushing behavior into the error handling code, to be honest.
That's fair. Thinking out loud, push
/pop
/print
APIs for error states could be a more verbose alternative. Having an easy way to pop the current error state into a local variable OR to print it to stderr
if it has already been popped into that local variable before would keep LOC count low for _fini()
functions too.
Thinking out loud, push/pop/print APIs for error states could be a more verbose alternative.
Print makes sense to me, but push/pop implies a stack and a stack has to have a size, and that either has to be fixed or dynamically allocated. If it is fixed it can be exhausted, which may be ok, and if it's dynamic then that's a non-started in my opinion because these error handling functions need to be used in places where memory allocations are not allowed (or well desired, but not allowed if we want to use them in real-time situations).
Having an easy way to pop the current error state into a local variable OR to print it to stderr if it has already been popped into that local variable before would keep LOC count low for _fini() functions too.
I have a hard time imagining how this would be used.
Like I said, having a specific example where the code is clunky and showing how it could be improved with new API would be helpful for me to understand.