Out Of Memory errors
There are a number of reasons the Hub can stop responding to requests (OOM errors, unhandled exceptions, hangs, etc.). The goal isn't to solve every one of these problems, since more will likely be introduced in the future through open source contributions and our limited automated testing.
We need to set up a system to improve the Hub's high availability (HA), ideally via pm2 and possibly other packages. This is a common problem for Node.js projects, and there are many examples and guides for handling it. We just need someone to set it up, test it, and then work with me on deployment.
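For reference, here's a minimal sketch of what a pm2 process file for this could look like. The file name, entry point, and memory threshold are all assumptions, not the Hub's actual configuration:

```js
// Hypothetical ecosystem.config.js -- app name, script path, and
// max_memory_restart threshold are assumptions, not the Hub's real config.
module.exports = {
  apps: [{
    name: 'hub',
    script: 'server/app.js',      // assumed entry point
    instances: 1,
    autorestart: true,            // respawn on crash, including OOM aborts
    max_memory_restart: '300M',   // restart proactively before V8 hits OOM
    env: { NODE_ENV: 'production' }
  }]
};
```

Started with `pm2 start ecosystem.config.js`; `pm2 startup` can additionally generate an init script so the process manager itself survives a reboot.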
On Oct 6th at 6am, the Hub started responding to all requests with a 502 error; the console just logged each request and then timed it out after 2 seconds. This appears to be different from the previous resource-leak issue, which left a clear exception.
I've restarted the Hub and it's back online.
We need to spin up another Hub node and connect it to the load balancer so that if one goes down, we don't lose service. Then we should also enable Stackdriver Monitoring and alerts so that we get emailed when the health checks fail for a node behind the load balancer. We currently get no such notification.
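For those health checks to be meaningful, each node needs an endpoint that returns 200 only while the process is actually healthy. A rough sketch, assuming the Hub exposes an Express app; the route path and memory threshold are made up for illustration:

```js
// Hypothetical /healthz route for the load balancer's health check.
// Assumes Express; the path and the heap threshold are illustrative only.
var express = require('express');
var app = express();

app.get('/healthz', function (req, res) {
  var heapUsed = process.memoryUsage().heapUsed;
  // Report unhealthy when the heap gets close to V8's limit, so the
  // balancer can drain this node before it hard-crashes with an OOM.
  if (heapUsed > 1.2 * 1024 * 1024 * 1024) {   // ~1.2 GB, assumed limit
    return res.status(503).send('low memory');
  }
  res.status(200).send('ok');
});

app.listen(8080);
```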
What you mentioned is a workaround, right? Would we really need more than one server if we didn't have these kinds of problems? I guess we don't have that many users.
Happened again tonight:
FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
@tasomaniac it's not so much a workaround as it is a proper HA configuration.
This seems related: http://stackoverflow.com/questions/31856829/memory-error-in-node-js-nodesmallocalloc
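Until the leak is found, it might help to log the process's memory usage periodically so we can see whether RSS or the heap climbs steadily before a crash. A quick sketch; the interval and output format are arbitrary:

```js
// Rough memory logger -- the one-minute interval and the MB formatting
// are arbitrary choices, not part of the Hub's code.
setInterval(function () {
  var m = process.memoryUsage();
  console.log('[mem] rss=%dMB heapTotal=%dMB heapUsed=%dMB',
    (m.rss / 1048576) | 0,
    (m.heapTotal / 1048576) | 0,
    (m.heapUsed / 1048576) | 0);
}, 60 * 1000);
```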
Happened again a few days ago, but I didn't have time to investigate or collect a stack trace.
:(
FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-28-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! code ELIFECYCLE
npm ERR! gdgx-hub@… startProd: `grunt serve:dist`
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the gdgx-hub@… startProd script 'grunt serve:dist'.
npm ERR! This is most likely a problem with the gdgx-hub package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! grunt serve:dist
npm ERR! You can get their info via:
npm ERR! npm owner ls gdgx-hub
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /opt/hub/npm-debug.log
Here's the latest status from when the Hub stopped responding and started returning 502 errors:
System information as of Mon Nov 30 10:41:11 UTC 2015
System load:  0.0               Processes:           4398
Usage of /:   24.7% of 9.69GB   Users logged in:     0
Memory usage: 49%               IP address for eth0: 10.111.216.151
Swap usage:   0%
=> There are 4318 zombie processes.
4318 zombies does not look good... but the resources don't seem to be bottlenecked otherwise (RAM and disk are fine).
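Zombies accumulate when a parent process keeps spawning children and never collects their exit status. If the Hub (or grunt) shells out anywhere, it's worth confirming that every spawn has exit/error listeners and that the event loop isn't wedged, since libuv can only reap children while the loop is running. A generic pattern, not taken from the Hub's code:

```js
// Generic spawn-with-accounting pattern; the command is a placeholder.
var spawn = require('child_process').spawn;

function run(cmd, args) {
  var child = spawn(cmd, args, { stdio: 'inherit' });
  child.on('error', function (err) {
    console.error('failed to spawn %s:', cmd, err);
  });
  child.on('exit', function (code, signal) {
    // By the time 'exit' fires, the child has been reaped -- no zombie.
    console.log('%s exited with code=%s signal=%s', cmd, code, signal);
  });
  return child;
}

run('echo', ['hello']);
```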
OK, I've spun up a second Hub node (a small instance; I tried a micro but ran into ENOMEM errors with grunt).
Now clustering and load balancing seem to be working:
hub:
[1510] worker-2317 just said hi. Replying.
[1510] was master: true, now master: true
hub-backup:
[2317] Risky is up. I'm worker-2317
[2317] Cancel masterResponder
[2317] was master: false, now master: false
Then kill hub:
[2317] worker-1510 has gone down...
[2317] was master: false, now master: true
And the handoff is seamless with no interruption to traffic. I tried a few iterations of this in both directions and it seemed to work great.
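For reference, the respawn-on-failure side of this is the same pattern Node's built-in cluster module uses on a single box: a master watches its workers and forks a replacement whenever one dies. A minimal sketch of that generic pattern (this is not the Hub's actual clustering code, which elects a master across two VMs):

```js
// Minimal respawn-on-death sketch with Node's cluster module.
// Generic pattern only -- not the Hub's cross-VM master election.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  cluster.fork();
  cluster.on('exit', function (worker, code, signal) {
    console.log('worker-%d went down (code=%s), forking a replacement',
      worker.id, code);
    cluster.fork();
  });
} else {
  http.createServer(function (req, res) {
    res.end('ok from worker-' + cluster.worker.id + '\n');
  }).listen(8080);
}
```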
This doesn't change the fact that the Hub instances sometimes run out of memory or otherwise stop responding, but it should reduce the impact. I've started to set up Stackdriver monitoring to alert us when one of them stops responding, but I haven't completed that process yet.
Still seeing OOM errors bringing the server down:
FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-39-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.9
npm ERR! npm v2.14.9
npm ERR! code ELIFECYCLE
npm ERR! gdgx-hub@… startProd: `grunt serve:dist`
npm ERR! Exit status 134
The hub-backup node also stopped responding to requests, but it left no stack trace, crash output, or logs of any kind. I really want to move this to a managed service, as it's far too much trouble at the moment.
Both VMs locked up last night, so even pm2 wouldn't have helped. We may need to go further and implement Kubernetes to orchestrate the containers and restart them when they fail health checks.
If we implement #100, then this should be much less of an issue. It's also been many months since these errors occurred, though I think that's due to the auto-restart fix in #88.