
Client API error handling overhaul in a new submission call

Open dpc opened this issue 1 year ago • 7 comments

Introduce a new tx submission call

See #5238

To achieve that, some more changes were needed.

Overhaul error handling

Use jsonrpc error codes and ApiError as a carrier for API errors. This is a big change that requires some thinking through.

The JSON-RPC standard says that all non-server-defined error codes are for application use, which jsonrpsee calls "Call" errors.

It would generally make our errors much more idiomatic, which should make 3rd-party clients' and tools' jobs easier. Over time we should stop relying on error messages and use error codes, which now become first-class citizens.

The errors will now not be serialized, but a JsonRPC error also has a data field that can carry free-form extra data. If needed, even consensus-encoded fields can be passed in it (like I do for DynInputError etc. here).
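Roughly, the error carrier could look something like this (a sketch only; the struct and field names are illustrative and assume serde_json for the free-form payload, not the exact fedimint/jsonrpsee types):

// Sketch of a JSON-RPC "call" error as seen by the client (illustrative names).
struct CallError {
    // Application-defined code, outside the range reserved by the JSON-RPC spec;
    // this is what clients match on and form consensus over.
    code: i32,
    // Human-readable text, intended for debugging only.
    message: String,
    // Optional free-form payload; consensus-encoded fields (e.g. a DynInputError)
    // can be tunneled through here if needed.
    data: Option<serde_json::Value>,
}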

Adapt strategy error handling

The "call" errors need to get consensus on, in a similar way that normal values do. This requires some tweak to error strategy.

dpc avatar May 09 '24 17:05 dpc

Concept NACK: RPC errors and transaction errors are two completely different concepts and should not be squashed into one. For example, RPC errors can be retried, while transaction errors are final and usually imply a critical failure like a faulty federation. Squashing them into one requires us to always differentiate via the is_call_err method and apply different logic, like different thresholds in the query strategy. The additional complexity introduced in code as central as the query strategies is imo not worth it; I feel very strongly about this, as query strategies are generally very easy to mess up.

In fact it seems to me like this PR already introduces a race condition, since transaction errors will now not be retried by ThresholdConsensus anymore. ThresholdConsensus is designed for updating values, so it will retry to get new values if it cannot establish consensus; however, it will not retry RPC errors, so it still fails in case of an offline federation. Now consider the case where a transaction becomes valid - like claiming an incoming contract - in a 3/4 federation.

Assume guardian 0 is offline while the incoming contract is confirmed (hence 1, 2, 3 are online). The client will see the incoming contract as confirmed and create a transaction to claim it by submitting it to all peers via ThresholdConsensus. Now guardian 0 comes back online but is behind the federation's state, while guardian 3 goes offline. The client will now receive a transaction error on submission from guardian 0 since it is behind, Ok(txid) from guardians 1 and 2, and an RPC error from guardian 3. Currently, ThresholdConsensus would retry guardian 0 until it returns Ok(txid) as well and return Ok(txid) as the federation's consensus. However, with this PR guardian 0 is not retried, and the ErrorStrategy also cannot establish a threshold of 3 or 2 for either call or non-call errors, since it only has one of each.

In the current code this would lead to a panic with "Query strategy ran out of peers to query without returning a result"; however, with your PR the query strategy would at least fail and return an error.

As you can see the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable. Actually I would much prefer to further simplify the query strategies instead, like I did in https://github.com/fedimint/fedimint/pull/5308

joschisan avatar May 21 '24 13:05 joschisan

  • Sticking to standards is nice, but given that there likely won't be a second client impl it probably isn't something we should expend too much energy on. Do error codes actually help us long term?

We need some way to enumerate potential failure cases. Using a Decodable enum was kinda trying to do that, but it is just too rigid in every other respect.

Error codes are a robust way to enumerate and distinguish error/outcome cases.

And since the client side needs to get consensus on the error itself, getting consensus on the error code seems most natural. Otherwise, if consensus on the application error is to be established on the entirety of the error (like its encoding or serialization), it's impossible to even fix a typo in an error.

I'm not strongly set on jsonrpc error codes ... but they do have nice properties, like not requiring another level of wrapping an error inside a success on the wire.

  • Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.

There are two different aspects here:

  • How is the application-level (aka transaction) error represented on the wire.
  • How is it implemented in the client code.

Conceptually, one way or the other, the result of an API call to a peer has 3 outcomes:

enum ApiResult {
    TransportSuccess(CallOutcome),
    TransportError,
}

enum CallOutcome {
    ApplicationError(/* ... */),
    ApplicationSuccess,
}

The code in this PR is clumsy w.r.t. the query strategy, because it doesn't convert the underlying jsonrpc error to something like ApiResult. I think with a bit better refactoring, the query strategy logic could largely remain intact.
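Roughly something like this (all names below are made up for illustration; it builds on the ApiResult/CallOutcome enums above and assumes serde_json for the success payload):

// Illustrative stand-in for the raw RPC error returned by the transport layer.
struct RawRpcError {
    code: Option<i32>, // Some(code) for application-level "call" errors
    message: String,
}

fn classify(raw: Result<serde_json::Value, RawRpcError>) -> ApiResult {
    match raw {
        // Plain success: the peer accepted the call.
        Ok(_value) => ApiResult::TransportSuccess(CallOutcome::ApplicationSuccess),
        // A "call" error: the request reached the peer, which returned an
        // application-level error code - a valid outcome to form consensus on.
        Err(RawRpcError { code: Some(_code), .. }) => {
            ApiResult::TransportSuccess(CallOutcome::ApplicationError(/* code, details */))
        }
        // No application code: a transport-level failure (timeout, connection
        // refused, ...) that can be retried.
        Err(RawRpcError { code: None, .. }) => ApiResult::TransportError,
    }
}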

As you can see the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable.

Instead of being scared of it, maybe we should add some tests.

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

There are lots of places in the code that deal with mutable state, and have side-effects, etc. I don't find query strategies particularly scary.

dpc avatar May 22 '24 03:05 dpc

To sum up:

  • I agree that the changes to the error handling in the query strategies here are clumsy, but I think it can be done better. It didn't occur to me at the time, and I wanted to wrap it up so there's something to talk about.
  • When the client needs to establish some form of consensus on a non-success application-level outcome, what exactly should it do that on? Codes, error messages, some serialization of the whole thing? Whatever you do it on becomes near-unevolvable, as changing it would break client-side consensus forming. Because of that, it seems to me that some form of minimal "codes" is the way to go. "message" and "data", in my mind, should be used for debugging purposes only, or in very special circumstances. And I feel (much less strongly) that if it is going to be "codes", we might as well flatten it into the "jsonrpc error", as it has code, message and data.
  • If the client code would "unflatten" the "flattened" jsonrpc error codes early, so that the query strategy logic can remain the same, would using jsonrpc error codes be OK with you? If not - what else would you prefer and why?

@elsirion @joschisan

dpc avatar May 22 '24 03:05 dpc

First of all, I think @joschisan, @dpc and I should get on a call about this. You both have valid points and trying to get to a solution that everyone can agree on is very hard asynchronously.

Using a Decodable enum was kinda trying to do that, but it is just too rigid

Could we just add a default variant? We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).
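For illustration, such a default variant could look something like this (variant names made up, not the actual fedimint error enum), along with the caveat just mentioned:

// Hypothetical transaction error enum with a catch-all variant.
enum TransactionError {
    InvalidInput,
    InvalidOutput,
    // An older client decodes any future, unrecognized error into this variant
    // instead of failing to decode - but it still cannot tell whether that
    // unknown error is retryable.
    Unknown(u64),
}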

Otherwise, if consensus on the application error is to be established on the entirety of the error (like its encoding or serialization), it's impossible to even fix a typo in an error.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

Well, I remember us having some very nasty bugs resulting from subtle errors in query strategies that @joschisan debugged for days back then. These were leading to very hard-to-debug, timing- and onlineness-dependent payment failures and loss of funds. Not messing them up again is a high priority.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc. Fuzzing that would be great imo or possibly even exhaustive testing if the combinatorial complexity isn't too bad.

elsirion avatar May 22 '24 12:05 elsirion

Could we just add a default variant?

We could, but if the query strategy is just comparing the whole encoding, then any new variant will not be considered the same error, even if it is supposed to be.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Right, my bad. So let's say - adding a new field, so it's possible to print the extra information. Adding a new field requires a new variant, even if that new variant means exactly the same thing as the old variant and just carries more data.

The variants are an enumeration of possible ... encoded states. Error codes are an enumeration of possible "not OK logical outcomes". Coupling them is the issue.

We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).

Good point. We might not be able to add any new error codes to existing RPC calls at all. If you need more error codes, it needs to be a new API call, in a new API version.

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one? That could possibly spook some logic somewhere.

Also - right now we completely ignore the transaction error details. If enough peers return a transaction error, the details are just ignored. So possibly submit should just return a single application error code "transaction invalid" and that's about it. Everything else would be just "extra metadata" for debugging purposes.

I know that @joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another module-specific call? After the lnv2-client-module's tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about it. This way it's actually possible to evolve the whole thing.

Module specific errors being returned from "submit" seems just wrong anyway.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc.

The timing doesn't matter, only the order of responses. So nothing fancy is really needed:

let mut query_strategy = SomeQueryStrategy::new();
assert_eq!(query_strategy.process(PeerId(0), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(1), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(2), SomeFakeResponse { a: 1 }), QueryStep::Success);

Sure, one could write a property test (fuzzing seems too low-level; property testing fits better here) to test a lot of combinations, but in practice there are only a handful of boundary conditions worth testing to guard against dpc accidentally breaking them in a freak refactoring accident.
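For example, a property test could assert that a strategy's verdict never depends on the order in which peer responses arrive. A minimal sketch using the proptest crate; the tiny CountStrategy here is a stand-in, not a real fedimint query strategy:

use proptest::prelude::*;

#[derive(Debug, PartialEq)]
enum Step { Continue, Success }

// Stand-in strategy: succeed once `threshold` peers have responded.
struct CountStrategy { threshold: usize, hits: usize }

impl CountStrategy {
    fn process(&mut self, _peer: u16, _response: u64) -> Step {
        self.hits += 1;
        if self.hits >= self.threshold { Step::Success } else { Step::Continue }
    }
}

proptest! {
    #[test]
    fn verdict_is_order_independent(order in Just(vec![0u16, 1, 2, 3]).prop_shuffle()) {
        let mut strategy = CountStrategy { threshold: 3, hits: 0 };
        let mut verdict = Step::Continue;
        for peer in order {
            verdict = strategy.process(peer, 1);
            if verdict == Step::Success { break; }
        }
        // With 4 identical responses and a threshold of 3, every arrival
        // order must end in Success.
        prop_assert_eq!(verdict, Step::Success);
    }
}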

dpc avatar May 22 '24 21:05 dpc

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I know that joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another module-specific call? After the lnv2-client-module's tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about it. This way it's actually possible to evolve the whole thing.

Then we'd need to save information about rejected transactions, which iirc @joschisan specifically removed when migrating to AlephBFT (likely because of DoS risk).

property testing fits better here

I was thinking of (ab)using fuzzer input for choosing message ordering :sweat_smile: I need to read up on what property testing actually is :see_no_evil: I guess something similar, just less hacky.

elsirion avatar May 23 '24 11:05 elsirion

@elsirion

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I don't mean we should accept it. The federation's submit call should reject it, but why would we return only the details of the first error that was encountered? What if there are other errors? If client logic were ever to depend on the details of the rejection, there's no guarantee that those details will be returned, because there's always a possibility that some other, irrelevant error was encountered and returned first.

Then we'd need to save information about rejected transactions

No, we wouldn't. On a rejected transaction the client could, e.g., send that whole transaction (or the relevant parts) to e.g. the LNv2 module API, so that the module can validate the LNv2 parts and respond to the client with whatever details the client needs to do what it needs to do.

To sum up: What I'm saying is that submit should just return Ok or Rejected. Anything else is unworkable, AFAICT. If any client logic needs to react to rejection in certain cases, it can use the rejection as a trigger to call additional module-specific APIs to figure out the details it needs.
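As a sketch of that minimal contract (type and field names are illustrative, not an actual fedimint API):

// The submit call's only application-level outcomes (sketch).
enum SubmitOutcome {
    Accepted { txid: [u8; 32] },
    // No module-specific detail here; a client that needs more context follows
    // up with a module-specific endpoint instead.
    Rejected,
}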

dpc avatar May 24 '24 03:05 dpc

Stale, and I don't think it will get any action anytime soon.

dpc avatar Oct 16 '24 19:10 dpc