ECONNRESET Failures

Open gboysko opened this issue 7 years ago • 5 comments

I'm seeing periodic failures on the spawned worker processes due to this socket error. It seems to be causing the worker process to die (and be restarted).

I have the following code in my socket.io handler, but that doesn't seem to prevent the failure:

socket.on('error', err => {
    debug(`Received event: 'error': ${err}`);
});
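
For what it's worth, that handler sits inside the usual connection callback, roughly like this (simplified; io is the socket.io server instance and debug is my logger):

    io.on('connection', socket => {
        socket.on('error', err => {
            debug(`Received event: 'error': ${err}`);
        });

        // ...the rest of the per-connection handlers
    });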

Is there a way that we can ensure that ECONNRESET failures don't bring down the worker?

gboysko · Jun 23 '17 21:06

Any chance you could share a reproduction example? Even one that only fails sometimes.

uqee · Jun 23 '17 21:06

I can't reproduce it reliably. I have 20-30 active users hitting the server. I do have this collected trace, fwiw:

events.js:160
      throw er; // Unhandled 'error' event
      ^

Error: read ECONNRESET
    at exports._errnoException (util.js:1026:11)
    at TCP.onread (net.js:569:26)

I've searched Stack Overflow, and the issues I've read seem to suggest adding the handler at the socket level, but I'm wondering whether it needs to happen at the server level. Since this module is responsible for creating the worker processes, I thought you might have some ideas.
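
To be concrete about what I mean by the server level, something like this is what I've been considering (a sketch only, untested; server here is the HTTP server that socket.io is attached to, and debug is the same logger as above):

    // Errors on incoming HTTP client connections are reported here on the server.
    server.on('clientError', (err, socket) => {
        debug(`Server received 'clientError': ${err}`);
        socket.destroy();
    });

    // Errors on the raw TCP sockets can also be trapped per connection.
    server.on('connection', socket => {
        socket.on('error', err => {
            debug(`TCP socket 'error': ${err}`);
        });
    });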

gboysko · Jun 23 '17 22:06

FWIW, I think I have an idea of how to fix this issue. In my case, I added the following code to my worker process:

    // Handle uncaught exceptions...
    process.on('uncaughtException', err => {
        debug(`Process received event 'uncaughtException': ${err}`);
    });

I had tried a number of different things (including handling the error event on the socket and on the child_process), but this is the only one that worked.

If it was incorporated into the sticky-cluster module directly, that might save others some frustration.

gboysko · Jun 27 '17 21:06

The idea of swallowing unknown exceptions bothers me a little, actually. For example, what if it were a failed DB connection instead? The worker wouldn't be able to behave properly then. The best possible strategy when facing an unknown error is to die, imho.

Have you had any success investigating the real cause of your errors? Then we could fix the problem instead of the symptom.

uqee · Jul 08 '17 14:07

The root problem seems to be somewhere in the socket / WebSocket / socket.io layer. Somehow a client disconnects from the socket while data is being sent. This rather benign failure surfaces as ECONNRESET, and while it seems like it should be caught (or at least catchable) at the socket or HTTP server level, it is not. The result seems to be that the only place to trap it without taking down your worker process is at the process level.

I don't like it for the same reasons you mentioned. I just cannot find a more elegant way to handle it.
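
The least-bad compromise I can come up with is to make the process-level trap narrower, so it only swallows connection resets and still lets the worker die on anything unknown. Roughly (a sketch, untested):

    process.on('uncaughtException', err => {
        if (err.code === 'ECONNRESET') {
            // Benign client disconnect mid-read/write: log it and keep the worker alive.
            debug(`Ignoring 'uncaughtException' (ECONNRESET): ${err}`);
            return;
        }

        // Anything else is genuinely unknown: log it and let the worker die.
        debug(`Fatal 'uncaughtException': ${err}`);
        process.exit(1);
    });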

gboysko · Jul 08 '17 15:07