libcyphal Add error handling in IRunnable::run

PROBLEM

The design doc omits error handling for IRunnable::run. We want to return an expected from there (the CETL rendition of it) to communicate errors; however:

In scenarios involving redundant network interfaces, we want to be able to keep going if a redundant entity fails to keep the other members of the redundant group operational.
IRunnable is a very abstract interface and we don't want to tie it to the transport-specific error hierarchy because that creates very bad coupling between different components of the library. We seem to need a way to both retain the specific context-dependent error type (e.g., libcyphal::transport::AnyError if we're in the transport context) and erase the type to manage the coupling.

SOLUTION

1. Use error notifier to manage redundant entities in a way that is somewhat similar to algebraic effects. When an error occurs in the context of a redundant entity group (like multiple IMedia or multiple transports), we don't return it immediately but inform the caller about this, allowing the caller to decide if we should go on or abort:

class IRunnable
{
    using Error = /* definition omitted */;
    virtual expected<void, Error> run(const TimePoint now, const function<cetl::optional<Error>(const Error&)> on_error) = 0;
};

Behavior:

When an error occurs that does not involve redundant entities, it is returned immediately as it normally would be in such cases.
When an error occurs in a redundant entity context, the decision is delegated upwards via on_error, which can return the same error (or perhaps any other error, we don't care) or nothing.
- If on_error returns nothing, we move on.
- If on_error returns an error (which may be the same or distinct), we cease further processing and return said error.

Optionally, we can provide a type-erased unbounded_variant containing a typed pointer or even an untyped const void* pointing to the entity that triggered the error (e.g., an IMedia instance) for additional context. This is immaterial at this stage.

2. Use the unbounded variant (formerly cetl::any) to capture and erase the detailed error type. The error type is defined as cetl::unbounded_variant<sizeof(void*)*8>. The caller will be able to query for relevant error types that it can handle at runtime, propagating unhandleable errors upward to the application. This allows us to retain the detailed type via the unbounded variant type, and at the same time hide it from the abstract IRunnable interface:

class IRunnable
{
    using Error = cetl::unbounded_variant<sizeof(void*)*8>;
    virtual expected<void, Error> run(const TimePoint now, const function<cetl::optional<Error>(const Error&)> on_error) = 0;
};

@thirtytwobits please review

Apr 16 '24 12:04 pavel-kirienko

What is the guaranteed threading model of on_error? Do we always call it back on the same thread run was entered on? Do we allow any thread but one-thread-at-a-time? ISRs?

I think the error type should be structured. Something like a polymorphic Error class with a set of known sub-classes libcyphal uses but which can be extended by the implementation of the media layer et-al. (They would be returned by value)

Apr 16 '24 18:04 thirtytwobits

The handler is meant to delegate control flow decisions up the call stack to the place where there is sufficient context available for the decisions to be made, similar to exception handlers except that we return back to the place where the "exception" was raised afterward. This happens strictly within the current call stack in one thread. The way I see it now, if one were to invoke multithreading then the proposed mechanism is not being used correctly. We will document this.

Currently we define the set of possible error types explicitly in a cetl::variant, such that the failure modes are encoded in the function type. This approach provides type safety but the disadvantage is that it lacks flexibility, as it is not really possible to use polymorphic error types. The polymorphism-without-pointers approach that we discussed a while back allows us to use polymorphic types without heap using cetl::unbounded_variant instead. This is kinda outside of the scope of this proposal actually.

Apr 16 '24 22:04 pavel-kirienko

error callback discussion

how will the implementer of this callback know the context of the error? When considering exception handling, context is everything. I'm unconvinced that a generalized solution is usable. It'd probably become a lot of documentation about unrelated parts of the middleware and how they interact with the runnable error callback. I'd omit this callback.

For errors where we need to consult a higher layer without aborting execution we should utilize a similar pattern but with callbacks injected on objects local to the errors being generated. So, if you had Menagerie : IRunnable and the user was going to run teams of horses in this Menagerie then a "HorseTeam" object should have a way to set an error callback like on_horse_injury(IHorse& team_member, int row, int column) to allow the application to decide if they should stop running. The Context is now implied by where the error handler was injected.

Error Return

Any IRunnable::run return has to start with a notion of how we surface errors from below the LibCyphal middleware layer (e.g. the media layer implementation/port)

Apr 16 '24 22:04 thirtytwobits

A practical case is CanTransport : IRunnable that implements IRunnable::run which invokes pop on several instances of IMedia, which may fail. Similar issues occur in tx sessions where an outgoing transfer needs to be replicated into multiple underlying lizard objects, or also in the transport class when each of the redundant media or socket instances needs to be pushed to independently.

We have so far identified the following approaches:

Provide a shared error handler for the group of IMedia, which could be done in multiple ways:
- Instead of passing each IMedia& to the transport, introduce a container type for media which is equipped with an error handler as suggested by Scott.
- Extend CanTransport with a media error callback.
Use the generalized handler with additional context, as suggested in the OP post. The full signature would be like function<cetl::optional<Error>(const cetl::unbounded_variant<>, const Error&)>, where the first argument contains a reference to the culprit (IMedia in this case).
Push error handling down such that entities that are used as part of a redundant group become infallible. This means that pop cannot return an error by design, signaling failures via a user-installed callback instead. In Scott's example, this would be like installing an individual error handler per horse.

The error type discussion is moved to #345

Apr 17 '24 09:04 pavel-kirienko

I'm not sure I understand this proposal. Do you mean something like ErrorType addMedia(ErrorHandlerCallbackType error_handler, IMediaGroup& media)?
This isn't any better then the original proposal; we'd end up with a single, generic parameter that contained a bunch of unrelated documentation like "If you get called with IFoo and ErrorType bar it means the foo layer failed to ..." etc.
How is this different then 1?

Apr 17 '24 14:04 thirtytwobits

I'm not sure I understand this proposal. Do you mean something like ErrorType addMedia(ErrorHandlerCallbackType error_handler, IMediaGroup& media)?

Currently, we pass media instances into the transport directly via a span or something. What I'm proposing as the first alternative is that we introduce a managing entity that holds the media instances and can handle their errors, where handling is to be understood as forwarding errors to an external handler:

class MediaGroup
{
public:
    using ErrorHandler = cetl::function<void(std::size_t media_index, const Error& error)>;
    void setErrorHandler(const ErrorHandler& eh);
    ...
private:
    std::span<IMedia*> media_;
};

This isn't any better then the original proposal; we'd end up with a single, generic parameter that contained a bunch of unrelated documentation like "If you get called with IFoo and ErrorType bar it means the foo layer failed to ..." etc.

The handler is only meant to be invoked on errors originating from redundant things so I don't think there will be much documentation involved. If you get called with an IMedia from the CAN transport then it's pretty clear what the scope of the error is. If we don't agree on this, we have the next best thing, the third proposal.

How is this different then 1?

In this case we don't introduce additional classes but instead extend IMedia with a new method:

class IMedia
{
public:
    using ErrorHandler = cetl::function<void(std::size_t media_index, const Error& error)>;
    virtual void setErrorHandler(const ErrorHandler& eh) noexcept = 0;
    //fgsfds
};

I find this proposal superior compared to the first one.

Apr 17 '24 20:04 pavel-kirienko

I find this proposal superior compared to the first one.

As do I.

Apr 18 '24 16:04 thirtytwobits

Okay. We are going to make IMedia infallible from the standpoint of their normal push/pop methods; errors will be instead reported via the external error handler callback. The callback should be passed into the constructor rather than via a method post-initialization but that is not important right now.

The same approach will be used in Cyphal/UDP for ISocket etc.

The IRunner::run method still needs some error handling facility to which I propose unbounded_variant to avoid coupling the abstract runner interface with the specifics of the transport layer.

Apr 18 '24 21:04 pavel-kirienko

Notes based on the last dev call. The design objective is to expose detailed error information at the media layer and at the same time allow high-level handling at the higher layers. To do this, we could (this is not a proposal yet but an idea) do the following:

Simply return errors from IMedia, ISocket, and other entities beneath the transport layer as we normally do via expected. For example the pop method on can::IMedia becomes expected<optional<Rx>, variant<ArgumentError, PlatformError>> pop(const std::span<std::byte> payload_buffer). This will provide highly detailed error information via the type-erased PlatformError. See #345

Provide ITransport with a new method for handling non-fatal transient processing errors:

struct TransientError
{
   static constexpr std::size_t Footprint = sizeof(void*) * 8;
   unbounded_variant<Footprint>     error;
   unbounded_variant<sizeof(void*)> culprit;  // Pointer to the failed instance
};
/// The reference to the TransientError loses validity after the return from the handler.
using TransientErrorHandler = function<void(const TransientError& err)>;
void setTransientErrorHandler(const TransientErrorHandler& teh);

If no error handler is installed, errors propagate upward as they normally would; if one media fails in a redundant group, further processing is terminated, and the call returns early without handling the remaining media instances. If the error handler is installed, processing becomes resilient, with errors being reported via the handler upward.

The RedundantTransport will be able to override the error handlers of its inferiors to enable resiliency in them and propagate errors upward via its own error handlier. One issue here though is that the use of unbounded variants does not allow nesting: one TransientError cannot nest another (originating from an inferior transport) because the footprint is fixed.

Apr 20 '24 14:04 pavel-kirienko

One issue here though is that the use of unbounded variants does not allow nesting: one TransientError cannot nest another (originating from an inferior transport) because the footprint is fixed.

We can address this by introducing a non-template polymorphic base for cetl::unbounded_variant: https://github.com/OpenCyphal/CETL/issues/121. Then we'll have two ways to proceed.

First option

The second part of the solution is amended as follows:

struct TransientError
{
    /// The reference to the error container loses validity after the return from the handler.
    const unbounded_variant_base&    error;
    unbounded_variant<sizeof(void*)> culprit;  ///< Pointer to the failed instance
};
/// The reference to the TransientError, including all nested references, loses validity after the return from the handler.
/// It is, therefore, necessary to handle the error directly in the handler.
using TransientErrorHandler = cetl::function<void(const TransientError& err)>;
void setTransientErrorHandler(const TransientErrorHandler& teh);

If the const reference causes problems with copyability, it can be replaced with a const pointer.

The RedundantTransport would then be able to nest TransientErrors from its inferiors and forward them up the stack.

Second option

The second part of the solution is left as-is. The RedundantTransport forwards its own TransientError up the stack where the error field contains a pointer to the TransientError from its inferior. The nesting is done through pointer indirection here.

The second option is preferable because it doesn't put undue restrictions on copyability in the simple cases.

One non-critical limitation here is that neither solution allows copying the error object for postponed processing unless you know its exact type. We could fix this shortcoming by introducing support for fallible copyability to cetl::unbounded_variant, such that you could copy the contents of one unbounded variant into another without knowing their exact footprints. One strong disadvantage is that this approach leaves the possibility of a runtime failure should the size of the destination variant be insufficient to contain the copied object.

class unbounded_variant_copyable_base : public unbounded_variant_base
{
//...
    /// Returns false if the footprint is not large enough, in which case no copy is done.
    CETL_NODISCARD virtual bool copy(const unbounded_variant_base&) = 0;
//...
};

As a piece of syntactic sugar, would it not be nice to define an unbounded variant instantiation specifically designed for holding pointers:

/// May contain any raw pointer.
using any_raw_ptr = unbounded_variant<sizeof(void*)>;
/// May contain any raw pointer, unique_ptr, shared_ptr, and most other smart pointers.
using any_ptr = unbounded_variant<sizeof(void*) * 3>;

May 07 '24 11:05 pavel-kirienko

@serges147 Here's the refined proposal that incorporates everything discussed above.

Amend the definition of IRunnable as shown below. We use this returned error instance if a non-transient error occurs (i.e., an error not related to redundant entities such as IMedia), or a transient error occurs when the transient error handler is not installed.

class IRunnable
{
public:
    /// We may enable PMR for this error type later if necessary.
    /// If/when that happens, the footprint should probably be set to zero to minimize overhead.
    using Error = cetl::unbounded_variant<sizeof(void*) * 8>;  // TODO: the footprint may need to be enlarged to fit nested variants.

    virtual expected<void, Error> run(const TimePoint now) = 0;
};

Next, we add new entities to ITransport as shown below.

class ITransport
{
public:
    using TransientErrorHandler = cetl::function<void(const RichAnyError&)>;
    void setTransientErrorHandler(const TransientErrorHandler& teh);
};

If the transient error handler is not installed, errors bubble up through IRunnable::run and other methods as they normally would. If the transient error handler is installed, errors that are not related to redundant entities still propagate as before; errors that are related to these are instead reported via the transient error handler.

It should be documented that the user must not attempt to modify the configuration of ITransport from within the transient error handler; if necessary, the user may record the fact that the error has occurred and perform the required actions after the return from IRunnable::run or whatever other method has caused the TEH to be invoked. It should also be documented which specific methods may invoke the TEH.

Above we reference a nonexistent type named RichAnyError. I propose to define it roughly as follows:

struct RichAnyError final
{
    AnyError error;
    /// Pointer to the entity that has caused this error for enhanced context; emoty if not applicable.
    cetl::unbounded_variant<sizeof(void*)> culprit;
};

Alternatively, we could perhaps extend AnyError with culprit, but this is likely to be redundant in most scenarios. I leave it to Sergei to decide.

Then ITransport::run will be returning RichAnyError wrapped in IRunnable::Error.

Jun 03 '24 11:06 pavel-kirienko

Done in #361

Jun 24 '24 22:06 pavel-kirienko

libcyphal libcyphal copied to clipboard

Add error handling in IRunnable::run

PROBLEM

SOLUTION

error callback discussion

Error Return

First option

Second option

libcyphal
libcyphal copied to clipboard