cosmos-sdk
cosmos-sdk copied to clipboard
SDK Errors UX
Context
Error registration in modules has proven to be pretty cumbersome, e.g. needing to avoid certain integers.
Proposal
There are two proposals floating around:
- Use gRPC error codes for all errors in modules.
- Allow modules to bypass registration completely and simply define a codespace and their own errors as needed.
Specifically, a module would define it's own codespace (this can even be simply inferred from the store key). The question is, does a module define it's own errors or does it just simply use the gRPC ones?
/cc @tac0turtle @aaronc
Error registration in modules has proven to be pretty cumbersome, e.g. needing to avoid certain integers.
This is kind of needed to preserve a unique mapping of (codespace, code) which was supposed be meaningful in the app. As long as we maintain this restriction, I don't see that much we can simplify. However...
The whole error code / code space idea was a design from tendermint that pre-dated me and the cosmos sdk. Tendermint wanted unique error codes for all failures. In practice, no one ever uses them and the produce a lot of burden on app developers. People heavily rely on the error messages, but the error codes can be broken into 3 groups:
- out of gas error
- other panic
- application error
The first two are handled in ante handlers. We could simply let the application return normal Go errors (errors or github.com/pkg/errors) and the ante handler assigns them the same error code (eg 501) and a fixed code space. We just use the string message in the logs to return to the users as we do now.
This would be a consensus breaking change but I challenge you to find one client that checks the error codes beyond code == 0. We could even do this as part of 0.47 and simplify a lot. Keep the sdk/errors package as is and people can use it still (non-breaking) but any new module can just use simpler errors and refactor existing ones there slowly.
Specifically, a module would define it's own codespace
I think this is how it is done now. Or at least wasmd does this since ages. Actually, staking does the same.
Yes, we have to skip 0 and 1 and then add a few lines to create a list of all error codes. It is a bit annoying but I don't see much simplification as long as we want meaningful code ids. As I said above, I think there was a lot of demand on providing these abc error codes and no usage. Happy to see them go.
This is kind of needed to preserve a unique mapping of
(codespace, code)which was supposed be meaningful in the app.
I'm unaware of any application that uses the codespace or inspects it for anything. I'd propose removing them altogether since users only care about the error message
I don't think we should remove the ability to register errors because that would break too much code, but we can point users away from it.
I think providing a good set of common errors in cosmossdk.io/errors and/or supporting out of the box the gRPC status codes are good approaches. Maybe we do both.
The gRPC status codes are already in common usage when people implement query handlers so if they're reusable elsewhere it's one API to learn. And they're also pretty well thought out IMHO (for instance RESOURCE_EXHAUSTED can be used for gas).
But having reusable SDK specific errors would also be good. I just think they should be more well thought out and minimal than all the ones currently registered in types/errors.
I insist on all new Osmosis code that we never register errors, because it is an enormous overhead / barrier to safe patch releases.
I advocate for removing the sdk error registration entirely for now, and only revisit it if someone is going to make a plan to ensure clients can actually use them (and gather what they really care about being merkelized)
I insist on all new Osmosis code that we never register errors, because it is an enormous overhead / barrier to safe patch releases.
This <3
To re-iterate, no one really uses codespaces or codes for that matter...maybe they use codespaces, I'm not sure. But codes are really meaningless. So that's why I proposed we remove registration and just rely on good messaging and clear codespaces as we do now. Removing the unnecessary registration process.
just rely on good messaging and clear codespaces as we do now. Removing the unnecessary registration process.
What do codespaces get us?
Knowing where the error came from. Like which module or package. Otherwise a invalid logic could mean 20 different things
I use codespaces! There are certain codes/spaces combos that denote no fee was taken. These are used in the earlier set of blocks, but still useful if theres a very specific scenario.
There are certain codes/spaces combos that denote no fee was taken.
I assume these are from sdk/auth module? Can you list the ones you care about?
I think if we have even a dozen cases people care about, we can keep those, but in general when writing new modules, they just return generic error that has codespace somehow (how to do that in least invasive way) but not a specific code per error.
What I'm getting from this discussion is we want code spaces, which could be inferred from the module name, if there is a need for more granularity the application developer can register another code space (staking/delegation, etc..).
Error codes are not used fully because there is not a design flow we are aware that uses them. We should remove error codes and simply return a default number.
What will be the API for just using the code space?
From a relayer perspective (i.e., a client for the SDK/app) we typically match on error messages. Examples:
- client state verification error (from IBC-go module): https://github.com/informalsystems/hermes/blob/5cdae2b1ae9e74c39627a645ab1c7523897cb3d6/crates/relayer/src/error.rs#L568-L580
- we used to match on
status.code() != tonic::Code::InvalidArgumentbut that changed across releases (see the comment in that code) and broke the relayer so we're only matching on the message since then
- we used to match on
- for account sequence mismatch: https://github.com/informalsystems/hermes/blob/5cdae2b1ae9e74c39627a645ab1c7523897cb3d6/crates/relayer/src/error.rs#L582-L603
- for out of order packets (error from IBC-go module): https://github.com/informalsystems/hermes/blob/5cdae2b1ae9e74c39627a645ab1c7523897cb3d6/crates/relayer/src/error.rs#L605-L618
- for detecting ignorable account sequence mismatch we parse the error message and look at its details inside: https://github.com/informalsystems/hermes/blob/5cdae2b1ae9e74c39627a645ab1c7523897cb3d6/crates/relayer/src/error.rs#L620-L639
In general, dealing with the above error messages is very brittle and feels ad-hoc. In an ideal world, messages would be typed and encoded in a .proto file so we can consume/match on them in some automated way instead of string matching. But I guess a tradeoff of that approach is code evolving slower.
- Typed errors!
- Proper codespaces (already used)
- Remove need for registration
- Remove the need for error codes?
@adizere Why didn't you use the existing error codes?
There are codes for fee issues and for sequence mismatch.
If you don't currently use codes to detect this, it is a bit of a proof that they don't help client.
@ethanfrey The ones I watch for are sdk/8 and sdk/32
@adizere Why didn't you use the existing error codes?
There are codes for fee issues and for sequence mismatch.
If you don't currently use codes to detect this, it is a bit of a proof that they don't help client.
heavily agree here, you should not be matching strings for errors.
@adizere Why didn't you use the existing error codes?
Two reasons:
-
We tried using error codes and had at least in one case bite us back, so we eliminated matching on the error code
- the specific case is documented here, see the comment:
Gaia v6.0.1 (SDK 0.44.5) returns code
InvalidArgument, whereas gaia v6.0.4 // (SDK 0.44.6, and potentially others) returns codeUnknown. // Workaround by matching strictly on the status message. -
The team has not had much SDK experience (this was especially the case 1-2 years ago, but still applies today), so using a piece of input from SDK (error codes) in a way in which we didn't fully understand implications and dependencies seemed unwise
- one way to think of this is we were applying the unix/robustness principle of being "liberal on what we accept" from SDK and make the least assumptions we can make
My proposal here is:
- remove the need to register errors to wrap them with a module name.
- error codes defaults to a integer that represents a module error if the error code is not already set by the module.
- modules can still register with custom error codes, but it's not needed.
- message server and query server wraps the returned error with the module name.
This simplifies and possibly removes most of the custom errors in modules
@tac0turtle Would the error codes be unique across the code? or only unique within the module?
error codes can only be guaranteed to be unique within the module
@adizere
I am trying to understand your comment here
You say there is a change of error code for a given action between sdk 0.44.5 and 0.44.6? Such a change is consensus breaking (code is committed to app hash, unlike the string). Seems like something funky was going on somewhere...
I quickly checked the commits between 0.44.5 and 0.44.6 and there was no new error defined nor updated. BTW, 0.44 is EOL (that being said, critical bugs should be pointed).
@tac0turtle Would the error codes be unique across the code? or only unique within the module?
I was thinking the error code is only the module, not specific to an error as this is unneeded information in consensus
we worked on errorsv2 that removed a lot of functionality in the errors pkg and made it zerodeps. users dont need to register errors anymore and i can see the sdk moving away from using sdk errors going forward