rcl Error handling often overwrites errors already set

Bug report

Required Info:

Operating System:
- All Tier 1 platforms
Installation type:
- From source
Version or commit hash:
- Current master

Steps to reproduce issue

Run rcl tests locally:

colcon test --event-handlers=console_direct+ --packages-select rcl

Expected behavior

All tests pass and no error handling error (an overwrite) is reported via stderr.

Actual behavior

All tests pass but plenty error handling errors (overwrites) are reported.

Additional information

These overwrites are the result of either ignoring errors set by rcutils or rmw APIs or attempting to propagate them, which isn't necessarily bad practice if it weren't for the fact that all our libraries share the same global error state.

Aug 11 '20 21:08 hidmic

IMO our main issue here is that there's no clear path forward. Even if we decide to rely on all libraries sharing the same error handling state, rcutils error handling API was not designed to ease error propagation, and suppressing portions of an error chain either decontextualizes an error message or obscures its origin e.g. failed to duplicate string and publisher failed to initialize are equally useless error messages.

Aug 11 '20 21:08 hidmic

FYI @Blast545 @brawner.

Aug 11 '20 21:08 hidmic

For future reference, discussion started here: https://github.com/ros2/rcl/pull/734#discussion_r468098092

I am not really sure what's the best way to tackle this. Thought a little on this, what if a "build" flag is set then the overwritten errors are print to stderr and if not set the error is overwritten without the warning?

Aug 12 '20 14:08 Blast545

Thought a little on this, what if a "build" flag is set then the overwritten errors are print to stderr and if not set the error is overwritten without the warning?

Hmm, a possible outcome of this discussion is that we start setting error messages in every C function in a call chain. If we do that, then putting a message on stderr saying that there was an error with error handling and that a previous message was overwritten will never be correct.

Aug 12 '20 17:08 hidmic

What is supposed to happen is that when you encounter an error with an rcutils function, like failed to duplicate string, from within an rcl function is that you collecte the rcutils error, add it to your new error, clear the error state, and return with a new error state set, something like failed to initialize publisher: failed to duplicate string.

Any proposal to make this easier needs to (in my mind) consider two things:

printing to stderr is often not desired (maybe the error is expected and will be ignored/suppressed)
memory allocation should be avoided (the current error handling code in rcutils is memory static after the thread local storage is initialized)

Aug 18 '20 17:08 wjwwood

That's one way forward, but I will point out that:

Propagating error states with the current API is clunky. Namely, chaining error states properly is hard, and we don't do it almost anywhere. It gets even harder when errors occur while handling errors (see https://github.com/ros2/rcutils/issues/269).
Error handling buffers usually span a 4k page. If we use them to build backtraces, we have to deal with the possibility of running out of space.

Aug 18 '20 19:08 hidmic

Propagating error states with the current API is clunky. Namely, chaining error states properly is hard, and we don't do it almost anywhere. It gets even harder when errors occur while handling errors

Chunkiness can be fixed with a better API. But what interests me about your comment is the errors-while-handling-errors thing. Is this actually different from propagating an error from below?

Aug 19 '20 00:08 gbiggs

Chunkiness can be fixed with a better API.

Sure. I was just stating that if we go that way we probably have to roll out new API.

Is this actually different from propagating an error from below?

If we want to build human-readable backtraces, errors that occur while dealing with errors cannot simply be chained. But perhaps that's a non-issue if we settle on always printing to stderr on https://github.com/ros2/rcutils/issues/269.

Aug 19 '20 15:08 hidmic

Error handling buffers usually span a 4k page. If we use them to build backtraces, we have to deal with the possibility of running out of space.

I don't get this point, my suggestion was to not build backtraces but to handle existing errors properly, only combining their message text, as needed, into a single error state at all times. I don't see how we could run out of space.

If you're worried about the stack you can allocate error state objects on the heap, but I doubt that will be an issue.

Aug 19 '20 17:08 wjwwood

I don't get this point, my suggestion was to not build backtraces but to handle existing errors properly, only combining their message text, as needed, into a single error state at all times.

Consider how would the error message look like if we start chaining errors using the "new message : old message" pattern (including file path and line number) in a 5+ levels-deep call stack. It seems to me that, to keep it readable, we have to format it as if it was a backtrace.

I don't see how we could run out of space.

The buffer can hold up to 4096 characters. Let's say that on average the file path is ~70 characters long and the log message is ~50 characters long. That's a roughly ~35 levels deep call stack. It's deep, but not unattainable.

Aug 19 '20 18:08 hidmic

I meant running out of space on the stack, not in the error message. Sorry if that wasn't clear.

If the error message gets deep it will look silly, but it will contain a lot of information, which might be useful. If the underlying causes are no longer important, you could choose to summarize the issue at any point in the chain.

Aug 19 '20 19:08 wjwwood

I can see the value in having a backtrace-style output going down to the real source of an error, but I wonder if that's better handled by using log files to trace back to error causes. A backtrace mechanism sounds like complexity that is not strictly necessary.

It would look cool, though.

If the underlying causes are no longer important, you could choose to summarize the issue at any point in the chain.

I think this is important. We should probably be going case-by-case and understanding what the user needs to see at each point where an error is returned rather than giving them all the gritty details. One point of view is that error messages are for quick diagnostics, with log files for when the cause is not obvious.

Aug 24 '20 02:08 gbiggs