sticky-cluster
ECONNRESET Failures
I'm seeing periodic failures on the spawned worker processes due to this socket error. It seems to be causing the worker process to die (and be restarted).
I have the following code in my socket.io handler, but that doesn't seem to prevent the failure:
socket.on('error', err => {
  debug(`Received event: 'error': ${err}`);
});
Is there a way that we can ensure that ECONNRESET failures don't bring down the worker?
Any chance I could ask for a reproduction example? Even one that only fails sometimes would help.
I can't reproduce it reliably. I have 20-30 active users hitting the server. I do have this collected trace, fwiw:
events.js:160
throw er; // Unhandled 'error' event
^
Error: read ECONNRESET
at exports._errnoException (util.js:1026:11)
at TCP.onread (net.js:569:26)
I've searched Stack Overflow, and the issues I've read seem to suggest adding the handler at the socket level, but I'm wondering if it needs to happen at the server level. Since this module is responsible for creating the worker processes, I thought you might have some ideas.
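For reference, a server-level handler of the kind asked about above could look roughly like this. It is a minimal sketch assuming a plain Node `http.Server`; `attachServerErrorHandlers` and `log` are hypothetical names, not part of sticky-cluster or socket.io, and nothing in this thread confirms that it catches the error shown in the trace.

```javascript
const http = require('http');

// Hypothetical helper: log errors at the server level instead of
// relying only on the socket.io 'error' event.
function attachServerErrorHandlers(server, log) {
  // 'connection' fires once for every raw TCP socket the server accepts.
  server.on('connection', socket => {
    socket.on('error', err => log(`socket error: ${err.code || err.message}`));
  });

  // 'clientError' fires when a client connection emits an error
  // while the HTTP layer is handling it.
  server.on('clientError', (err, socket) => {
    log(`clientError: ${err.code || err.message}`);
    if (!socket.destroyed) socket.destroy();
  });

  return server;
}

// Assumed usage: wrap the HTTP server before handing it to the rest of the app.
const server = attachServerErrorHandlers(http.createServer(), console.error);
```

With listeners attached this way, an `'error'` event on an accepted socket is logged rather than thrown as an unhandled `'error'` event.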
FWIW, I think I have an idea how to fix this issue. For me, I added the following code into my Worker process:
// Handle uncaught exceptions...
process.on('uncaughtException', err => {
debug(`Process received event 'uncaughtException': ${err}`);
});
I had tried a number of different things (including handling the error event in socket and the child_process), but this is the only thing that worked.
If it was incorporated into the sticky-cluster module directly, that might save others some frustration.
The idea of swallowing unknown exceptions slightly bothers me, actually. For example, what if it was a failed DB connection instead? The worker wouldn't be able to behave properly then. The best possible strategy when facing an unknown error is to die, imho.
Have you had any success investigating the real cause of your errors? Then we could fix the problem instead of the symptom.
The root problem seems to be somewhere in the socket/web socket/socket.io layer. Somehow a client disconnects from the socket while data is being sent. This rather benign failure is ECONNRESET, and while it seems like it should be caught (or catchable) at the socket or HTTP Server level, it is not. The result seems to be that the only place to trap it without bringing the worker process down is at the process level.
I don't like it for the same reasons you mentioned. I just cannot find a more elegant way to handle it.
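One narrower variant of the process-level handler, as a sketch: ignore only ECONNRESET and still exit on anything else, so that something like a failed DB connection kills the worker and it gets restarted. `isConnectionReset`, `makeUncaughtHandler`, `log` and `exit` are hypothetical names made up for this example; none of them come from sticky-cluster.

```javascript
// Hypothetical helper: true only for the client-reset error discussed above.
function isConnectionReset(err) {
  return Boolean(err) && err.code === 'ECONNRESET';
}

// Hypothetical factory: `log` and `exit` are injected so the policy
// can be exercised without actually killing the process.
function makeUncaughtHandler(log, exit) {
  return function onUncaughtException(err) {
    if (isConnectionReset(err)) {
      // Client went away mid-transfer: log it and keep the worker alive.
      log(`Ignoring ECONNRESET: ${err.message}`);
      return;
    }
    // Anything else is unknown: log it and let the worker die.
    log(`Fatal uncaught exception: ${err && err.stack ? err.stack : err}`);
    exit(1);
  };
}

// Assumed usage inside the worker process.
process.on(
  'uncaughtException',
  makeUncaughtHandler(console.error, code => process.exit(code))
);
```

This keeps the "die on unknown errors" behaviour while treating the one known-benign case separately. It still handles the symptom at the process level rather than the underlying cause.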