ibc-rs Make core `Error` enums less specific

Follow-up to #255.

Our Error enums have too many variants. We should drastically condense them following the ideas described here.

### Tasks
- [ ] Remove all host error variants from ibc-rs error enums
- [ ] https://github.com/cosmos/ibc-rs/issues/1249
- [ ] https://github.com/cosmos/ibc-rs/issues/950

Nov 29 '22 18:11 plafer

A thought: ideally, internal functions that do use an Error enum would return an Error enum where the variants capture only the possible errors for that function. For example, verify_delay_passed()'s Error would only have the TimestampOverflow,NotEnoughTimeElapsed, and NotEnoughBlocksElapsed variants (or less if we don't care to be that specific). This provides better documentation of what can go wrong, and callers won't write code for errors that cannot happen.

Feb 08 '23 15:02 plafer

In #164, there was some push-back against using anyhow for our errors. However, looking at projects that use ibc-rs, none of them actually make use of our Error enum:

Nomic's errors are modeled as String
Namada make heavy use of the Other(String) error variant,
Penumbra use anyhow in their own implementation of IBC handlers

In other words, programmatic logic based on which error occurred is non-existent in all these projects. This tells me that the most important use case for our errors is aiding in debugging; either host chain implementations, or relayers such as Hermes.

Therefore, I feel like our error system should focus on producing great human-readable error diagnostics, which is exactly what libraries such as anyhow are designed to do. It seems to me like the best way forward is to ignore the conventional wisdom that libraries shouldn't use anyhow, and use anyhow::Error as our main Error type.

I would love the hear people's thoughts on this!

cc @Farhad-Shabani @romac @yito88 @keppel

Feb 27 '23 21:02 plafer

My take, Anyhow has a concise syntax for propagating errors, cleaning up the code look, and is easier to implement since it doesn't care what error types your functions return. Though, this approach (freeform as mentioned) comes at a cost of not having a finely tuned error-handling system. We would be limited in the places where we might need to perform specific actions based on the type of error. It can sometimes even make debugging harder by providing unnecessary info/backtrace and representing all kinds of errors making it tricky to identify the exact causes.

Maybe you have a way of implementing anyhow in your mind? Perhaps not a bad idea to demo a part of our codebase if we think it works.

And, It would be great if you elaborate more on the limitation we face with the current error enum. Couldn’t we refactor it for a better one?

I see the fact you mentioned that IBC-rs errors are not used well in projects, but I think the main reason is they are still in development (e.g. bunch of unwrap() in those codebases) and don't have a refined error-handling system yet. Though, that's true for us too - we're still figuring out some of the designs, so our error enum isn't perfect either, but I don't think it necessarily is the reason why our current errors are not widely used.

Actually, I am a bit concerned, as IIRC we just revamped our error system. If we switch to Anyhow, that would be the second change. For me, It sounds like a more detailed study/proposal is required to make sure! Maybe other folks have excellent insights and ideas to make it an easy choice.

Mar 01 '23 02:03 Farhad-Shabani

I agree with Fahrad, and believe he has a good point wrt codebases not handling errors properly when under heavy development.

Moreover, I too am not sure that I see how anyhow can improve the quality of the errors reported over custom error enums. Perhaps you could into more details about the problems you are seeing?

Mar 01 '23 07:03 romac

I believe I found a good strategy for a better error systems (discussion with @romac heavily influenced the design).

The new system recognizes 2 "error sources":

Host errors: the errors controlled by the host, coming from methods Validation/ExecutionContext
Protocol errors: the errors that come form IBC handlers' validation (e.g. channel not in the expected state)

Host errors

The host error type will be defined by the host. Specifically, we'd make it an associated type of ValidationContext:

trait ValidationContext {
  type Error;

  fn host_timestamp(&self) -> Result<Time, Self::Error>;
}

Protocol errors

The type for protocol errors would roughly be a cleaned up version of our current ContextError. The crucial differences for ContextError in this system compared to the current one are:

we no longer need error variants for host errors in ContextError
its purpose is solely to generate clear debugging error messages

that is, we don't expect users to handle these errors

Top-level error type

Our top-level error type, returned by dispatch(), would be:

enum Error<E> {
    Host(E),
    Protocol(CleanedUpContextError),
}

As an example, I would expect the user code to look like:

match dispatch(...) {
    Host(err) => /* optionally handle my errors */,
    Protocol(err) => /* log errors */
}

Internal logic based on errors from the host

This section borrows the great idea presented in this sled.rs's blog post.

In a few instances we need to perform some logic based on the specific error returned from the host. For example, when the packet commitment is not present during acknowledgement processing, we want to return early without an error. We currently return early for all errors, but ideally we'd distinguish the "packet commitment not present" error from all other errors, and only return early for the "packet commitment not present" error. Here's where we'd use sled's idea. get_packet_commitment would look like:

fn get_packet_commitment(
    &self,
    commitment_path: &CommitmentPath,
) -> Result<Result<PacketCommitment, CommitmentError>, Self::Error>;

enum CommitmentError {
    NotFound,
}

The benefit of this signature is that our internal code would look like:

// note how the outer `Self::Error` gets propagated out, and we 
// match on the inner error that we care about
if matches!(ctx_a.get_packet_commitment(&commitment_path_on_a)?, CommitmentError::NotFound) {
    return Ok(());
};

And that's it! I also expect cleaning up ContextError to solve #342.

Mar 15 '23 19:03 plafer

I liked the idea of separating the error from the sources. So, we only have to take care of ours without assuming anything about the hosts! That’s perfect in my view.

There are just a few questions it may help to clarify some of the details:

Given that hosts will be free to introduce their desired errors through the associated Error type under the ValidationContext, why is it needed to include the Host(E) variant, and basically having a top-level error type?
Regarding the "not present" state, I have a different perspective. In similar cases, unavailability shouldn't be treated as an error. Instead, it should be represented by None as a legitimate response from a storage call. Accordingly, users won’t have to deal with the additional overhead of introducing errors like the CommitmentError. Since there is a single acceptable error variant, we can automate this process in our boundary. I believe Result<Option<Output>, Error> is the correct signature, and I've expanded my view on this in #607

And a suggestion:

Some of the confusing error variants we already have are due to how we implemented certain logic. Prioritizing to study of error sources and the resolution of issues like 603, 607, 536 will streamline the error variants and reduce the burden of redesigning the system.

Apr 06 '23 21:04 Farhad-Shabani

Given that hosts will be free to introduce their desired errors through the associated Error type under the ValidationContext, why is it needed to include the Host(E) variant, and basically having a top-level error type?

The "top-level" type is just the type that validate(), execute() and dispatch() return. We have 2 variants because 2 separates types of errors can occur (ValidationContext::Error and CleanedUpContextError), and both can be bubbled up and returned from one of the 3 calls. If you don't have a Host(E) variant, then how do you return Host errors?

Accordingly, users won’t have to deal with the additional overhead of introducing errors like the CommitmentError. Since there is a single acceptable error variant, we can automate this process in our boundary.

Errors such as CommitmentError are defined by us, not the user. The user only defines ValidationContext::Error; we define all the others.

In similar cases, unavailability shouldn't be treated as an error. Instead, it should be represented by None as a legitimate response from a storage call.

Returning an error doesn't mean it's "illegitimate". Result<Option<Output>, Error> is mathematically equivalent to Result<Result<PacketCommitment, CommitmentError>, Self::Error> (where CommitmentError is defined with just one variant); they're isomorphic. Choosing one over the other is a subjective decision; a matter of taste. I prefer Result<Result<PacketCommitment, CommitmentError>, Self::Error>, since it effectively avoids form of boolean blindness (which we could call "Option blindness"). Basically, reading

if matches!(ctx_a.get_packet_commitment(&commitment_path_on_a)?, CommitmentError::NotFound) {
    return Ok(());
};

tells me a ton more about what's going on; I see that the commitment was not found directly at the call site. Compare to your suggested signature:

if ctx_a.get_packet_commitment(&commitment_path_on_a)?.is_none() {
    return Ok(());
};

is_none() doesn't tell me the semantics of None in this case; I need to go read the docstring of the method to remember what None means in this case.

So using a Result doesn't mean the error is "illegitimate"; it's just an equivalent but more readable way of encoding the scenario where the packet commitment is not found.

Apr 10 '23 18:04 plafer

ibc-rs ibc-rs copied to clipboard

Make core `Error` enums less specific

Host errors

Protocol errors

Top-level error type

Internal logic based on errors from the host

ibc-rs
ibc-rs copied to clipboard