nelson.cli icon indicating copy to clipboard operation
nelson.cli copied to clipboard

Nelson crashes after a few hours/days

Open rafaelbrizola opened this issue 7 years ago • 14 comments

I don't run nelson as a service. I run it using the command:

nohup sudo nelson --config /Users/iota/nelson/config.ini &

Just happened for the 3rd time. I go to check my node and found out that nelson is not running. This last time it was running for almost 24 hours before crashing. More details in the log below.

Expected behaviour

Nelson running continuously

Actual behaviour

After a while, can be hours or a few days, Nelson crashes.

Steps to reproduce

Not sure. It crashes randomly

Basic Info

  • Operating System: Macos High Sierra (10.13.2)
  • Node (npm) Version: v8.4.0
  • IRI Version: 1.4.1.6
  • Nelson version: 0.3.16

Nelson Info

  • Epoch: don't know
  • Cycle: don't know
  • Connected peers: my configuration is 6 in, 5 out

8:30:23 AM.190 16600::NODE new cycle 8:30:23 AM.195 16600::IRI Neighbors removed (if there were any): udp://31.48.190.3:14600 8:30:23 AM.195 16600::NODE connection closed 31.48.190.3:16600 (remotely dropped) 8:30:23 AM.195 16600::IRI Neighbors removed (if there were any): udp://139.226.182.140:14600 8:30:23 AM.195 16600::NODE connection closed 139.226.182.140:16600 (remotely dropped) 8:30:23 AM.196 16600::IRI Neighbors removed (if there were any): udp://31.48.190.3:14600 8:30:23 AM.196 16600::IRI Neighbors removed (if there were any): udp://139.226.182.140:14600 /usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/WebSocket.js:638 this._req.abort(); ^

TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/WebSocket.js:638:16) at Object.onceWrapper (events.js:314:30) at emitNone (events.js:105:13) at ClientRequest.emit (events.js:207:7) at Socket.emitTimeout (_http_client.js:722:34) at Object.onceWrapper (events.js:314:30) at emitNone (events.js:105:13) at Socket.emit (events.js:207:7) at Socket._onTimeout (net.js:398:8) at ontimeout (timers.js:469:11)

rafaelbrizola avatar Jan 19 '18 04:01 rafaelbrizola

In my case it crashes several times a day. It's probably a different crash, prints the following error on console:

Error: read ECONNRESET
    at _errnoException (util.js:1024:11)
    at TCP.onread (net.js:615:25)

OS Linux, Nelson v.0.3.16

GoodMirek avatar Jan 21 '18 21:01 GoodMirek

ECONNRESET is anther bug.

@rafaelbrizola have you closed your mac's lid in between? I think, there is a bug reported in https://github.com/websockets/ws/issues/1280 that is related to this issue.

I will check if this can be fixed/catched somehow...

romansemko avatar Jan 23 '18 06:01 romansemko

Hi Roman, Thanks for having a look at this. I configured my macbook to not sleep when closing the lid on power cord. Of course, something could have happened, like a power outage, and it could have lost that configuration momentarily. I would thought that's unlikely because the error happened consistently, but let's not disconsider that possibility.

rafaelbrizola avatar Jan 23 '18 06:01 rafaelbrizola

I will try catching it and, at least, cleaning up neighbors before crashing. Will need more investigation, though..

romansemko avatar Jan 23 '18 06:01 romansemko

Let's work together to get the bug fixed upstream. I have a feeling that something isn't working properly in Node.js core though as per this comment.

lpinca avatar Jan 23 '18 11:01 lpinca

@lpinca , I have upgraded to Node LTS 8.9.4 Let's see if this problem persists.

@rafaelbrizola , could you upgrade your node version to 8.9.4 and try the new Nelson version 0.3.21?

romansemko avatar Jan 24 '18 10:01 romansemko

@romansemko Sure. Just did it. Let see how it goes

rafaelbrizola avatar Jan 25 '18 04:01 rafaelbrizola

Hi, I left it running and around 36 hours later I got the error below. Nelson is still running, but doesn't update anymore.

7:03:19 AM.026 16600::NODE new cycle 7:03:19 AM.313 16600::IRI Neighbors removed (if there were any): udp://188.166.3.182:14600 7:03:19 AM.313 16600::NODE connection closed 188.166.3.182:16600 (remotely dropped) 7:03:19 AM.319 16600::IRI Neighbors removed (if there were any): udp://176.9.139.119:14600 7:03:19 AM.319 16600::NODE connection closed 176.9.139.119:16600 (remotely dropped) 7:03:24 AM.037 16600::IRI Neighbors removed (if there were any): tcp://81.169.215.74:15600 7:03:24 AM.037 16600::NODE connection closed 81.169.215.74:16600 (remotely dropped) 7:03:29 AM.045 16600::NODE connecting peer 46.5.17.47:16600 7:03:29 AM.046 16600::NODE connecting peer 104.172.104.192:16600 7:03:29 AM.046 16600::NODE connecting peer 83.135.133.111:16600 7:03:29 AM.344 16600::IRI Neighbors removed (if there were any): udp://83.135.133.111:14600 7:03:29 AM.345 16600::NODE connection closed 83.135.133.111:16600 (remotely dropped) 7:03:29 AM.365 16600::IRI Neighbors removed (if there were any): udp://46.5.17.47:14600 7:03:29 AM.365 16600::NODE connection closed 46.5.17.47:16600 (remotely dropped) 7:03:34 AM.058 16600::IRI Neighbors removed (if there were any): tcp://104.172.104.192:15600 7:03:34 AM.058 16600::NODE connection closed 104.172.104.192:16600 (remotely dropped) 7:03:39 AM.072 16600::NODE connecting peer 23.233.137.139:16600 7:03:39 AM.073 16600::NODE connecting peer 84.229.1.5:16600 7:03:39 AM.073 16600::NODE connecting peer 77.69.198.173:16600 7:03:44 AM.088 16600::IRI Neighbors removed (if there were any): udp://84.229.1.5:14600 7:03:44 AM.088 16600::NODE connection closed 84.229.1.5:16600 (remotely dropped) 7:03:44 AM.089 16600::IRI Neighbors removed (if there were any): udp://23.233.137.139:14256 7:03:44 AM.090 16600::NODE connection closed 23.233.137.139:16600 (remotely dropped) 7:03:44 AM.090 16600::IRI Neighbors removed (if there were any): udp://77.69.198.173:14777 7:03:44 AM.091 16600::NODE connection closed 77.69.198.173:16600 (remotely dropped) 7:03:49 AM.105 16600::NODE connecting peer 106.70.61.176:16600 7:03:49 AM.106 16600::NODE connecting peer 176.100.190.180:16600 7:03:49 AM.106 16600::NODE connecting peer 173.239.214.18:16600 7:03:49 AM.121 16600::IRI Neighbors removed (if there were any): udp://106.70.61.176:14600 7:03:49 AM.122 16600::NODE connection closed 106.70.61.176:16600 (remotely dropped) 7:03:49 AM.123 16600::IRI Neighbors removed (if there were any): udp://176.100.190.180:14600 7:03:49 AM.124 16600::NODE connection closed 176.100.190.180:16600 (remotely dropped) 7:03:49 AM.126 16600::IRI Neighbors removed (if there were any): udp://173.239.214.18:14600 7:03:49 AM.127 16600::NODE connection closed 173.239.214.18:16600 (remotely dropped) 7:03:54 AM.118 16600::NODE terminating... 7:03:54 AM.124 16600::NODE terminating... 7:03:54 AM.125 16600::NODE terminating... 7:03:54 AM.128 16600::IRI Neighbors removed (if there were any): udp://94.130.168.177:14600, udp://45.77.187.45:14600 7:03:54 AM.130 16600::IRI Neighbors removed (if there were any): udp://94.130.168.177:14600, udp://45.77.187.45:14600 Unhandled Rejection at: Promise Promise { TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11) } reason: TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11) 7:03:54 AM.159 16600::IRI Neighbors removed (if there were any): udp://94.130.168.177:14600, udp://45.77.187.45:14600 Unhandled Rejection at: Promise Promise { TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11) } reason: TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11) Unhandled Rejection at: Promise Promise { TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11) } reason: TypeError: Cannot read property 'abort' of null at ClientRequest._req.setTimeout (/usr/local/lib/node_modules/nelson.cli/node_modules/ws/lib/websocket.js:630:17) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at ClientRequest.emit (events.js:208:7) at Socket.emitTimeout (_http_client.js:708:34) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:407:8) at ontimeout (timers.js:475:11)

rafaelbrizola avatar Jan 28 '18 04:01 rafaelbrizola

@rafaelbrizola can you try to monkey patch ws and change this to this._req = undefined;?

This will help us understand if the timeout expires after an error or after the upgrade.

lpinca avatar Jan 28 '18 10:01 lpinca

Ok, done and restarted nelson. Let's see how it goes.

rafaelbrizola avatar Jan 28 '18 21:01 rafaelbrizola

nelson gui also errors out for me after about a day of running. In the top left quarter of the screen, I see the following error multiple times: Unhandled Rejection at: Promise Promise

Seems Nelson is still running though, ,because the bottom half is still updated with neighbor connects and disconnects.

hapklaar avatar Feb 06 '18 10:02 hapklaar

@rafaelbrizola any update?

lpinca avatar Apr 21 '18 11:04 lpinca

Hi @lpinca sorry for not getting back to you. After that day, I still run Nelson for a couple of days but didn't notice any other crashes. However, I stopped running the experiment, and for other reason, never got back to that. So, it's inconclusive it that change helped in my case. If nobody else is noticing the crash, you can close this issue. Thanks!

rafaelbrizola avatar Apr 22 '18 09:04 rafaelbrizola

Ok, thanks. I'm pretty sure the issue is not fixed and still persists on ws@4 and Node.js v8.x. Unfortunately I don't have a way to reproduce it, so it's hard to understand what happens and fix it.

This is the tracking issue in ws: https://github.com/websockets/ws/issues/1280

lpinca avatar Apr 22 '18 10:04 lpinca