libcyphal
libcyphal copied to clipboard
Add error handling in IRunnable::run
PROBLEM
The design doc omits error handling for IRunnable::run
. We want to return an expected
from there (the CETL rendition of it) to communicate errors; however:
-
In scenarios involving redundant network interfaces, we want to be able to keep going if a redundant entity fails to keep the other members of the redundant group operational.
-
IRunnable
is a very abstract interface and we don't want to tie it to the transport-specific error hierarchy because that creates very bad coupling between different components of the library. We seem to need a way to both retain the specific context-dependent error type (e.g.,libcyphal::transport::AnyError
if we're in the transport context) and erase the type to manage the coupling.
SOLUTION
1. Use error notifier to manage redundant entities in a way that is somewhat similar to algebraic effects. When an error occurs in the context of a redundant entity group (like multiple IMedia
or multiple transports), we don't return it immediately but inform the caller about this, allowing the caller to decide if we should go on or abort:
class IRunnable
{
using Error = /* definition omitted */;
virtual expected<void, Error> run(const TimePoint now, const function<cetl::optional<Error>(const Error&)> on_error) = 0;
};
Behavior:
- When an error occurs that does not involve redundant entities, it is returned immediately as it normally would be in such cases.
- When an error occurs in a redundant entity context, the decision is delegated upwards via
on_error
, which can return the same error (or perhaps any other error, we don't care) or nothing.- If
on_error
returns nothing, we move on. - If
on_error
returns an error (which may be the same or distinct), we cease further processing and return said error.
- If
Optionally, we can provide a type-erased unbounded_variant
containing a typed pointer or even an untyped const void*
pointing to the entity that triggered the error (e.g., an IMedia
instance) for additional context. This is immaterial at this stage.
2. Use the unbounded variant (formerly cetl::any
) to capture and erase the detailed error type. The error type is defined as cetl::unbounded_variant<sizeof(void*)*8>
. The caller will be able to query for relevant error types that it can handle at runtime, propagating unhandleable errors upward to the application. This allows us to retain the detailed type via the unbounded variant type, and at the same time hide it from the abstract IRunnable
interface:
class IRunnable
{
using Error = cetl::unbounded_variant<sizeof(void*)*8>;
virtual expected<void, Error> run(const TimePoint now, const function<cetl::optional<Error>(const Error&)> on_error) = 0;
};
@thirtytwobits please review
What is the guaranteed threading model of on_error
? Do we always call it back on the same thread run was entered on? Do we allow any thread but one-thread-at-a-time? ISRs?
I think the error type should be structured. Something like a polymorphic Error
class with a set of known sub-classes libcyphal uses but which can be extended by the implementation of the media layer et-al. (They would be returned by value)
The handler is meant to delegate control flow decisions up the call stack to the place where there is sufficient context available for the decisions to be made, similar to exception handlers except that we return back to the place where the "exception" was raised afterward. This happens strictly within the current call stack in one thread. The way I see it now, if one were to invoke multithreading then the proposed mechanism is not being used correctly. We will document this.
Currently we define the set of possible error types explicitly in a cetl::variant
, such that the failure modes are encoded in the function type. This approach provides type safety but the disadvantage is that it lacks flexibility, as it is not really possible to use polymorphic error types. The polymorphism-without-pointers approach that we discussed a while back allows us to use polymorphic types without heap using cetl::unbounded_variant
instead. This is kinda outside of the scope of this proposal actually.
error callback discussion
how will the implementer of this callback know the context of the error? When considering exception handling, context is everything. I'm unconvinced that a generalized solution is usable. It'd probably become a lot of documentation about unrelated parts of the middleware and how they interact with the runnable error callback. I'd omit this callback.
For errors where we need to consult a higher layer without aborting execution we should utilize a similar pattern but with callbacks injected on objects local to the errors being generated. So, if you had Menagerie : IRunnable
and the user was going to run teams of horses in this Menagerie then a "HorseTeam" object should have a way to set an error callback like on_horse_injury(IHorse& team_member, int row, int column)
to allow the application to decide if they should stop running. The Context is now implied by where the error handler was injected.
Error Return
Any IRunnable::run return has to start with a notion of how we surface errors from below the LibCyphal middleware layer (e.g. the media layer implementation/port)
A practical case is CanTransport : IRunnable
that implements IRunnable::run
which invokes pop
on several instances of IMedia
, which may fail. Similar issues occur in tx sessions where an outgoing transfer needs to be replicated into multiple underlying lizard objects, or also in the transport class when each of the redundant media or socket instances needs to be pushed to independently.
We have so far identified the following approaches:
-
Provide a shared error handler for the group of
IMedia
, which could be done in multiple ways:- Instead of passing each
IMedia&
to the transport, introduce a container type for media which is equipped with an error handler as suggested by Scott. - Extend
CanTransport
with a media error callback.
- Instead of passing each
-
Use the generalized handler with additional context, as suggested in the OP post. The full signature would be like
function<cetl::optional<Error>(const cetl::unbounded_variant<>, const Error&)>
, where the first argument contains a reference to the culprit (IMedia
in this case). -
Push error handling down such that entities that are used as part of a redundant group become infallible. This means that
pop
cannot return an error by design, signaling failures via a user-installed callback instead. In Scott's example, this would be like installing an individual error handler per horse.
The error type discussion is moved to #345
- I'm not sure I understand this proposal. Do you mean something like
ErrorType addMedia(ErrorHandlerCallbackType error_handler, IMediaGroup& media)
? - This isn't any better then the original proposal; we'd end up with a single, generic parameter that contained a bunch of unrelated documentation like "If you get called with IFoo and ErrorType bar it means the foo layer failed to ..." etc.
- How is this different then 1?
- I'm not sure I understand this proposal. Do you mean something like ErrorType addMedia(ErrorHandlerCallbackType error_handler, IMediaGroup& media)?
Currently, we pass media instances into the transport directly via a span or something. What I'm proposing as the first alternative is that we introduce a managing entity that holds the media instances and can handle their errors, where handling is to be understood as forwarding errors to an external handler:
class MediaGroup
{
public:
using ErrorHandler = cetl::function<void(std::size_t media_index, const Error& error)>;
void setErrorHandler(const ErrorHandler& eh);
...
private:
std::span<IMedia*> media_;
};
- This isn't any better then the original proposal; we'd end up with a single, generic parameter that contained a bunch of unrelated documentation like "If you get called with IFoo and ErrorType bar it means the foo layer failed to ..." etc.
The handler is only meant to be invoked on errors originating from redundant things so I don't think there will be much documentation involved. If you get called with an IMedia from the CAN transport then it's pretty clear what the scope of the error is. If we don't agree on this, we have the next best thing, the third proposal.
- How is this different then 1?
In this case we don't introduce additional classes but instead extend IMedia
with a new method:
class IMedia
{
public:
using ErrorHandler = cetl::function<void(std::size_t media_index, const Error& error)>;
virtual void setErrorHandler(const ErrorHandler& eh) noexcept = 0;
//fgsfds
};
I find this proposal superior compared to the first one.
I find this proposal superior compared to the first one.
As do I.
Okay. We are going to make IMedia
infallible from the standpoint of their normal push
/pop
methods; errors will be instead reported via the external error handler callback. The callback should be passed into the constructor rather than via a method post-initialization but that is not important right now.
The same approach will be used in Cyphal/UDP for ISocket
etc.
The IRunner::run
method still needs some error handling facility to which I propose unbounded_variant
to avoid coupling the abstract runner interface with the specifics of the transport layer.
Notes based on the last dev call. The design objective is to expose detailed error information at the media layer and at the same time allow high-level handling at the higher layers. To do this, we could (this is not a proposal yet but an idea) do the following:
-
Simply return errors from
IMedia
,ISocket
, and other entities beneath the transport layer as we normally do viaexpected
. For example the pop method oncan::IMedia
becomesexpected<optional<Rx>, variant<ArgumentError, PlatformError>> pop(const std::span<std::byte> payload_buffer)
. This will provide highly detailed error information via the type-erasedPlatformError
. See #345 -
Provide
ITransport
with a new method for handling non-fatal transient processing errors:struct TransientError { static constexpr std::size_t Footprint = sizeof(void*) * 8; unbounded_variant<Footprint> error; unbounded_variant<sizeof(void*)> culprit; // Pointer to the failed instance }; /// The reference to the TransientError loses validity after the return from the handler. using TransientErrorHandler = function<void(const TransientError& err)>; void setTransientErrorHandler(const TransientErrorHandler& teh);
If no error handler is installed, errors propagate upward as they normally would; if one media fails in a redundant group, further processing is terminated, and the call returns early without handling the remaining media instances. If the error handler is installed, processing becomes resilient, with errors being reported via the handler upward.
The RedundantTransport
will be able to override the error handlers of its inferiors to enable resiliency in them and propagate errors upward via its own error handlier. One issue here though is that the use of unbounded variants does not allow nesting: one TransientError
cannot nest another (originating from an inferior transport) because the footprint is fixed.
One issue here though is that the use of unbounded variants does not allow nesting: one
TransientError
cannot nest another (originating from an inferior transport) because the footprint is fixed.
We can address this by introducing a non-template polymorphic base for cetl::unbounded_variant
: https://github.com/OpenCyphal/CETL/issues/121. Then we'll have two ways to proceed.
First option
The second part of the solution is amended as follows:
struct TransientError
{
/// The reference to the error container loses validity after the return from the handler.
const unbounded_variant_base& error;
unbounded_variant<sizeof(void*)> culprit; ///< Pointer to the failed instance
};
/// The reference to the TransientError, including all nested references, loses validity after the return from the handler.
/// It is, therefore, necessary to handle the error directly in the handler.
using TransientErrorHandler = cetl::function<void(const TransientError& err)>;
void setTransientErrorHandler(const TransientErrorHandler& teh);
If the const reference causes problems with copyability, it can be replaced with a const pointer.
The RedundantTransport
would then be able to nest TransientError
s from its inferiors and forward them up the stack.
Second option
The second part of the solution is left as-is. The RedundantTransport
forwards its own TransientError
up the stack where the error
field contains a pointer to the TransientError
from its inferior. The nesting is done through pointer indirection here.
The second option is preferable because it doesn't put undue restrictions on copyability in the simple cases.
One non-critical limitation here is that neither solution allows copying the error object for postponed processing unless you know its exact type. We could fix this shortcoming by introducing support for fallible copyability to cetl::unbounded_variant
, such that you could copy the contents of one unbounded variant into another without knowing their exact footprints. One strong disadvantage is that this approach leaves the possibility of a runtime failure should the size of the destination variant be insufficient to contain the copied object.
class unbounded_variant_copyable_base : public unbounded_variant_base
{
//...
/// Returns false if the footprint is not large enough, in which case no copy is done.
CETL_NODISCARD virtual bool copy(const unbounded_variant_base&) = 0;
//...
};
As a piece of syntactic sugar, would it not be nice to define an unbounded variant instantiation specifically designed for holding pointers:
/// May contain any raw pointer.
using any_raw_ptr = unbounded_variant<sizeof(void*)>;
/// May contain any raw pointer, unique_ptr, shared_ptr, and most other smart pointers.
using any_ptr = unbounded_variant<sizeof(void*) * 3>;
@serges147 Here's the refined proposal that incorporates everything discussed above.
Amend the definition of IRunnable
as shown below. We use this returned error instance if a non-transient error occurs (i.e., an error not related to redundant entities such as IMedia
), or a transient error occurs when the transient error handler is not installed.
class IRunnable
{
public:
/// We may enable PMR for this error type later if necessary.
/// If/when that happens, the footprint should probably be set to zero to minimize overhead.
using Error = cetl::unbounded_variant<sizeof(void*) * 8>; // TODO: the footprint may need to be enlarged to fit nested variants.
virtual expected<void, Error> run(const TimePoint now) = 0;
};
Next, we add new entities to ITransport
as shown below.
class ITransport
{
public:
using TransientErrorHandler = cetl::function<void(const RichAnyError&)>;
void setTransientErrorHandler(const TransientErrorHandler& teh);
};
If the transient error handler is not installed, errors bubble up through IRunnable::run
and other methods as they normally would. If the transient error handler is installed, errors that are not related to redundant entities still propagate as before; errors that are related to these are instead reported via the transient error handler.
It should be documented that the user must not attempt to modify the configuration of ITransport
from within the transient error handler; if necessary, the user may record the fact that the error has occurred and perform the required actions after the return from IRunnable::run
or whatever other method has caused the TEH to be invoked. It should also be documented which specific methods may invoke the TEH.
Above we reference a nonexistent type named RichAnyError
. I propose to define it roughly as follows:
struct RichAnyError final
{
AnyError error;
/// Pointer to the entity that has caused this error for enhanced context; emoty if not applicable.
cetl::unbounded_variant<sizeof(void*)> culprit;
};
Alternatively, we could perhaps extend AnyError
with culprit
, but this is likely to be redundant in most scenarios. I leave it to Sergei to decide.
Then ITransport::run
will be returning RichAnyError
wrapped in IRunnable::Error
.
Done in #361