
Error Handling Rethink

Open mcdonnelldean opened this issue 8 years ago • 52 comments

Error handling needs a rethink, my thoughts here,

  • err is not always respected; many types of errors instead cause timeouts
  • Lots of options that do sort of the same thing: errhandler, .error, seneca.on('error') (see the sketch after this list)
  • Cryptic error messages, fed by logging
  • Greedy handling; testing with an exception-assertion based lib requires weird hacks
  • Exception trapping: are we being too cautious?
  • Cryptic gating: errors from the gate executor need to be clearer
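
To make the overlap concrete, here is a minimal sketch of the three hooks named in the second bullet, all registered on the same instance. Exact signatures and precedence vary by version, so treat this as illustrative of the overlap rather than a reference:

```js
var Seneca = require('seneca')

// 1. The errhandler option, passed at construction time
var seneca = Seneca({
  errhandler: function (err) {
    console.error('errhandler caught:', err && err.message)
  }
})

// 2. The .error method, registering another handler after construction
seneca.error(function (err) {
  console.error('.error caught:', err && err.message)
})

// 3. The EventEmitter-style hook
seneca.on('error', function (err) {
  console.error("on('error') caught:", err && err.message)
})
```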

A lot of why error handling is bad is due to cryptic messages. We need to rethink the sort of messages we are returning. Gitter is littered with people asking about gate executor timeouts because the messages are basically unreadable if you don't know the code.

We are also way too safe in my opinion. We shouldn't be trapping errors; the system ends up in very peculiar states. I'm a firm believer in fail fast and restart. From Gitter and issues it does seem people's expectation is fail fast; we really need to consider this.

Bear in mind, our own tests are in the state they are in due to weird error handling; if we had untrapped exceptions and proper err flow we could cut our test suite in half and end up with more coverage.

Contributors and users, please feel free to add pain points and constructive feedback.

mcdonnelldean avatar Apr 12 '16 13:04 mcdonnelldean

@rjrodger what release did you want to tackle error handling/logging for ?

AdrianRossouw avatar Apr 12 '16 13:04 AdrianRossouw

@AdrianRossouw It'll be major for sure.

mcdonnelldean avatar Apr 12 '16 13:04 mcdonnelldean

We're not on the latest version because of, well, time & priority, and some of my remarks may be related to how our stack is built, so they may not be relevant. From two months of dev, the ability to understand WHERE a call came from before it failed (when calls cross microservices) is crucial for debugging. The truncated messages are also not helpful for reproducing or understanding the call context, nor are timeout errors that get triggered when the real issue happened two minutes earlier but the whole call stack (async.each or equivalent) fails afterwards by "timing out". Gateway errors as you report them @mcdonnelldean are often a misimplementation of an act in my experience, but they could certainly be more descriptive. On the production side, our concern is not having to dig through the logs ourselves but having Logentries warn us, which raises another question: will Logentries be able to parse our logs? Or can we use vidi to visualise errors? ;)

Wardormeur avatar Apr 12 '16 13:04 Wardormeur

I would like a major bump to remove the timeout on messages, and possibly add it back as a "filter" or plugin for some very specific microservices. I think the timeout is awesome for entry-point microservices, and maybe for Chairo.

mcollina avatar Apr 12 '16 13:04 mcollina
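
For context, the hardcoded timeout under discussion is the per-message timeout set via the constructor's timeout option. A rough sketch of the "keep it at the edge, drop it internally" idea, assuming that option name and treating the values as placeholders:

```js
var Seneca = require('seneca')

// Entry-point service: keep a timeout so a stuck downstream call fails fast
var edge = Seneca({ timeout: 5000 })

// Internal service: effectively disable the timeout by setting it very high,
// approximating "remove it, opt back in where needed"
var internal = Seneca({ timeout: 10 * 60 * 1000 })
```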

@mcollina Do you want seneca to hang in the case of a block on the other side? I'm trying to work out how we could do this and want to be as clear as possible.

mcdonnelldean avatar Apr 12 '16 14:04 mcdonnelldean

Just a heads up @Wardormeur manages CoderDojo's Zen platform.

mcdonnelldean avatar Apr 12 '16 14:04 mcdonnelldean

Added related issues.

mcdonnelldean avatar Apr 12 '16 14:04 mcdonnelldean

The timeouts were responsible for cascading failures at higher loads on a system that I worked on before.

Poor design decisions around how entities were used meant that each entity load / list would fan out into multiple actions loading more entities, until eventually the node callback queue would get so long that the time it takes to respond to an action would surpass the hardcoded seneca message timeout.

So instead of things failing reasonably (as in, one can reason about it) in one place as Matteo suggests, it would fail all over the place in a way that had no useful debugging info.

AdrianRossouw avatar Apr 12 '16 15:04 AdrianRossouw

Please also note: http://senecajs.org/contribute/principles.html - Continuity

Error handling rethink != major breakage is acceptable.

The challenge here is not just to rethink the error handling, but also to ensure you have a backwards compatibility path. That means code that depends on the old way still needs to work, by using an option, or a plugin.

The rule for major releases is: one line of code (typically a new plugin) is all you need to make existing code work. It is also acceptable to have an option flag to revert to previous behaviour.
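
A hypothetical illustration of that rule; the option and plugin names below are invented for the example, not real packages:

```js
// Option A: a flag that reverts to the previous error behaviour
var seneca = require('seneca')({
  legacy_errors: true // hypothetical option name
})

// Option B: a compatibility plugin that restores the old behaviour
seneca.use('seneca-legacy-error-handling') // hypothetical plugin name
```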

It is an absolutely core value of the Seneca community that existing users are respected and that we do not break existing apps.

This does make the work of fixing error handling harder, I know :)

/cc @mcdonnelldean @AdrianRossouw @mcollina @davidmarkclements @pelger @geek

rjrodger avatar Apr 22 '16 11:04 rjrodger

@rjrodger - I wonder if, for each area, can we have a two step approach?

Step 1 - unencumbered reimagining

here thoughts can run free, protocols/implementations do not consider backwards compatibility

this would be "seneca-next-next" ... postmodern seneca

Step 2 - shimming

once we have nailed down "the best way" then we create an adapter that delivers backwards compatibility.

If necessary, tweaks may then be made to the pristine implementation, but only in support of the shim.

this would be "seneca-next"

davidmarkclements avatar Apr 22 '16 15:04 davidmarkclements

I tend to agree with @davidmarkclements, with a catch. I think we should try to implement the new features without breaking, but not be afraid to break the current API if we see no alternative. If such an event happens, we go and seek a solution (maybe one more module). So this becomes a two-step problem, which is easier to solve.

mcollina avatar Apr 26 '16 10:04 mcollina

I agree with the principle but we don't have the time or resources for the above. Realistically what is going to happen is we bludgeon our way through this until we have something smaller and more composable, and around version 6 we can start blue-sky thinking.

We need to be a little more careful here. For instance @Wardormeur has a fairly big system (CoderDojo) we will need to run some tests against to ensure we break as little as possible.

Basically it must be reasonably backwards compatible within a version. We cannot expect a user to have to spend two days changing code to make v.next work; that's just not going to fly with customers who have built big systems.

mcdonnelldean avatar Apr 26 '16 10:04 mcdonnelldean

To be clear, I'm saying we supply our shims wholesale - v.next is v.next.next + shim

The judgement call is on whether it will save more time and resources in the long run to follow the ideal implementation and work backwards to a shimmed version, or to deal with the backwards-compat overhead and strive towards the eventual rainbow of perfection.

davidmarkclements avatar Apr 26 '16 10:04 davidmarkclements

My concern is pretty much around the volume of work to get there given we do timed releases. Back compat to a single major version is our red line, which leaves our cycle of change tighter again. Generally backwards compat means you will only have to make minor changes, which roughly equate to adding a shim.

If you can change the underlying implementation but still 'add a plugin' that maintains our promise that it will all work the same if you add the shim then we are pretty much open to any change you want (as long as the change is in line with the spirit of the toolkit).

So yeah, my chief worry is back compat, if that's respected I'm happy :D

mcdonnelldean avatar Apr 26 '16 11:04 mcdonnelldean

I don't expect you guys to ensure backwards compat at 100%. Upgrading always involves some changes at some point; you must adapt your code. So don't be afraid to "break" if it's a search/replace/add-a-param kind of work, and not a whole redesign. If it's a 2-day process that's sort of fine; if it's a week, that's something that blocks our dev process for some time and may slow down adoption of your new version. That's my only concern ;)

Wardormeur avatar Apr 26 '16 11:04 Wardormeur

@Wardormeur yeah, that is pretty much the sort of break I am talking about. If I can just replace a bunch of plugins or do a simple find and replace then it's good. If it's any more than that we start losing projects to rot since the work gets pushed out. Breaking badly usually means folks get left on older versions and we end up with a support nightmare rather than people moving up versions.

With breaks we pay up front or in the future. I'd rather pay up front and bring our community along with us up the versions.

mcdonnelldean avatar Apr 26 '16 11:04 mcdonnelldean

@davidmarkclements Sorry to leave off on this. I'd like to organise a chat with you at some stage around this shim concept. We have a few other objectives that need a similar idea, and I'd like to get a practical overview of what it means.

mcdonnelldean avatar May 09 '16 11:05 mcdonnelldean

Just ran into error handling "issue", pasting my comment from gitter.

Context: I was running into an issue where my service would fail because seneca transports do not transport errors (even with fatal$ set to false). This was completely unexpected for me, so I lost quite a lot of time just figuring it out, and the logs did not help me.

Well, it was really unexpected for me. In this case these were validation errors (like: your username is taken, or your email is not valid), so I definitely wanted to propagate the error to my web server in order to let my API consumer know what they did wrong. Of course, I could just not use the err object (at least not as the first parameter of the reply), but it's counter-intuitive to me to say there were no errors when there were. I mean, if I myself created an error object, then I probably know how to handle it, and want to handle it myself.

I understand this is logical for programmer errors which you can't really recover from, but for operational ones I believe it should be possible. Maybe I'm wrong, but that doesn't change the fact that I expected this behaviour (and I was not the only one), so it definitely needs to be pointed out very prominently in the docs. As for whether the behaviour needs to change (by default or with some kind of flag), I'm up for that discussion.

OriginalEXE avatar May 09 '16 11:05 OriginalEXE

@OriginalEXE In my understanding there are two big error categories:

  • You make a request to a service that doesn't function properly (like a store that has the DB down). This is the case that results in an error: the service can't execute your request properly.
  • You make a request to a service that functions properly and executes your request properly, but the result of your request is an "error" result (like an invalid email). This is your case. This is not an error from the system's point of view (a sketch of both cases follows below).
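
To make the distinction concrete, a minimal sketch of both categories inside one action; db is a stand-in for whatever storage layer the action actually uses:

```js
var seneca = require('seneca')()

seneca.add({ role: 'user', cmd: 'register' }, function (msg, respond) {
  // `db` is a placeholder for the real storage layer
  db.save(msg.user, function (err, saved) {
    // Category 1: the service itself cannot do its job (e.g. the DB is down).
    // This is a genuine error, so it goes in the err slot of the callback.
    if (err) return respond(err)

    // Category 2: the service worked, but the outcome of the request is
    // negative (e.g. the email is already taken). Today this has to travel
    // as data in the result rather than as an err, which is the gap being
    // discussed in this thread.
    if (saved.duplicate) {
      return respond(null, { ok: false, why: 'email-taken' })
    }

    respond(null, { ok: true, user: saved })
  })
})
```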

mihaidma avatar May 09 '16 12:05 mihaidma

@OriginalEXE @mihaidma And this I think is the issue: we have no way to express business-logic errors as such. This is further exacerbated by the fact that there is no consensus on the standard way to do business-logic errors in Node.

mcdonnelldean avatar May 09 '16 12:05 mcdonnelldean

I agree. I like this writeup on errors in node: https://www.joyent.com/developers/node/design/errors

Also, the library they mention for aiding the error propagation seems interesting: https://github.com/davepacheco/node-verror

So yeah, there really needs to be a way to handle operational errors other than passing the error as part of a message object and having to check for some kind of "success" flag when you receive the response. This is even more annoying when you use seneca.act as a promise, as it's really logical imo to use .catch for handling (for example) an input validation failure.
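
For reference, the node-verror pattern from that writeup looks roughly like this; fetchFromStore is a stand-in for any lower-level async call:

```js
var VError = require('verror')

function loadProfile (userId, cb) {
  fetchFromStore(userId, function (err, profile) {
    if (err) {
      // Wrap rather than replace: the original error is kept as the cause,
      // and the message gains the context of this layer.
      return cb(new VError(err, 'loading profile for user %s failed', userId))
    }
    cb(null, profile)
  })
}
```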

OriginalEXE avatar May 09 '16 12:05 OriginalEXE

@mcollina Could you add your summary from #402

The end result of the spike, for /all: we can't just remove try-catch; it sets off a series of changes that ends with needing to make a decision on error handling and some large breaking changes. Before we continue, I want to be really clear about what we are changing and make sure that each thing we add or remove is accounted for. This means we need to capture the current and proposed error flow before we start messing with it.

I really want to make sure we capture @OriginalEXE's user story above. Removing error capture will get us cleanly failing systems and better errors, but it still won't fix the major community issue of trying to deal with system-level errors vs business-logic-level errors, which is the larger priority of this work.

@rjrodger @mcollina, et al., it may be a good idea to start looking at some design docs to iterate on. They can go in the /docs folder for now. We need to capture the new and existing flow in a PR-friendly, iterative way; a markdown doc in this repo seems the most sensible?

mcdonnelldean avatar May 10 '16 10:05 mcdonnelldean

Noting https://github.com/senecajs/seneca/issues/133 here as it's pretty much the same issue as @OriginalEXE describes, but the other way around. All roads lead to having two types of error scenarios but only one way to handle them.

mcdonnelldean avatar May 13 '16 10:05 mcdonnelldean

@mcdonnelldean just caught up with the thread, happy to talk about shims but.. just to clarify, are we at a point where everyone's okay with a clean break + shim, or is that whole discussion still being had? If so, we still need to nail down the prevailing strategy and context, which is a bigger conversation requiring more input than I should or could provide alone (i.e. cc @rjrodger @pelger @mcollina ....)

davidmarkclements avatar May 13 '16 22:05 davidmarkclements

@mcdonnelldean it would be good if @rjrodger (or someone else) writes down how things are supposed to work in the current version, not how they work from a code point of view. This will help clarify the scenarios Seneca is trying to cover, and provide some good documentation for v3. Then we can iterate and change that document.

mcollina avatar May 17 '16 07:05 mcollina

@mcollina good idea - let me get this on paper

rjrodger avatar May 17 '16 08:05 rjrodger

+1 to writing down how things are supposed to work in the current version for error handling.

dgonzalez avatar May 17 '16 23:05 dgonzalez

+1 @rjrodger did you start on this paper?

StarpTech avatar May 29 '16 10:05 StarpTech

@StarpTech This is starting post-3.0.0; we are focusing on getting that out first, and then we will move on to documenting this stuff in more detail before 4.0.0 so we can get the community in on the decision-making process. As soon as the draft becomes available I'll update this thread.

mcdonnelldean avatar May 29 '16 14:05 mcdonnelldean

@mcdonnelldean is it possible to track the progress on 3.0, or how is it possible to contribute?

Something like this https://developer.microsoft.com/en-us/microsoft-edge/platform/status/ or a realtime board with trello https://trello.com/b/EceUgtCL/ghost-roadmap

This is much easier to maintain e.g https://github.com/OpenIotOrg/openiot/wiki/Roadmap-Implementation-Status

StarpTech avatar Jun 07 '16 11:06 StarpTech