Out Of Memory errors

Open Splaktar opened this issue 10 years ago • 11 comments

There are a number of reasons the Hub stops responding to requests (OOM errors, exceptions, hangs, etc.). The goal isn't to solve every one of these problems individually, because more will likely be introduced in the future via open source contributions and our limited automated testing.

We need to set up a system to improve the Hub's high availability (HA), ideally via pm2 and possibly other packages. This is a common problem for Node.js projects and there are many examples and guides for handling it. We just need someone to set it up, test it, and then work with me on deployment.
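
For reference, a minimal pm2 sketch of what that could look like (not the Hub's actual config; the server/app.js entry point, instance count, and memory threshold are illustrative assumptions, since we currently start via `grunt serve:dist`):

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'hub',
    script: 'server/app.js',    // assumed entry point
    instances: 2,               // fixed worker count; 'max' would use one per core
    exec_mode: 'cluster',       // workers share the listening port
    max_memory_restart: '300M', // recycle a worker before it can OOM the box
    env: { NODE_ENV: 'production' }
  }]
};

Starting it with `pm2 start ecosystem.config.js` would then restart crashed workers automatically and recycle them at the memory threshold, which is the auto-restart behavior we're missing today.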


On Oct 6th at 6am, the Hub started responding to all requests with a 502 error; the console just logged each request and then timed out processing it after 2 seconds. This appears to be different from the previous resource leak issue, which left a clear exception.

I've restarted the Hub and it's back online.

We need to spin up another Hub node and connect it to the load balancer so that if one goes down, we don't lose service. Then we probably also need to enable Stackdriver Monitoring and alerts so that we get emailed when the health checks fail for a node behind the load balancer. We currently get no such notification.
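
For the load balancer's health checks, something like the handler below would give it an endpoint to poll. This is only a sketch assuming an Express app; the /healthz path and port 8080 are assumptions, not the Hub's actual routes.

// health.js: illustrative only
var express = require('express');
var app = express();

// The load balancer marks this instance unhealthy as soon as the endpoint
// stops answering 200 and shifts traffic to the other node.
app.get('/healthz', function (req, res) {
  res.status(200).send('ok');
});

app.listen(8080, function () {
  console.log('health endpoint listening on :8080');
});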

Splaktar (Oct 07 '15 04:10)

What you mentioned is a workaround, right? Would we really need more than one server if we didn't have these kinds of problems? I guess we don't have that many users.

tasomaniac (Oct 07 '15 06:10)

Happened again tonight:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)

@tasomaniac it's not so much a workaround as it is a proper HA configuration.

Splaktar (Oct 16 '15 03:10)

This seems related: http://stackoverflow.com/questions/31856829/memory-error-in-node-js-nodesmallocalloc
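
As I understand that thread, node::smalloc::Alloc backs Buffer and typed-array memory outside the V8 heap in node 0.12, so the process aborts with this exact error once those external allocations fail, even if the V8 heap itself still looks small. We haven't identified what the Hub is actually accumulating; the snippet below is only an illustration of that class of bug (unbounded Buffer retention), not the Hub's code.

// leak-demo.js: illustrative only
var held = [];

setInterval(function () {
  // Each tick pins another 10 MB that can never be collected, because the
  // array keeps a reference to it (think: a response cache with no eviction).
  held.push(new Buffer(10 * 1024 * 1024));
  console.log('holding ~' + (held.length * 10) + ' MB');
}, 100);

// Left running long enough, this aborts with the same fatal error above
// (or the kernel OOM killer steps in first).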

Splaktar (Oct 17 '15 22:10)

Happened again a few days ago, but I didn't have time to investigate or collect a stack trace.

Splaktar (Nov 01 '15 23:11)

:(


tasomaniac (Nov 01 '15 23:11)

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-28-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.7
npm ERR! npm  v2.11.3
npm ERR! code ELIFECYCLE
npm ERR! [email protected] startProd: `grunt serve:dist`
npm ERR! Exit status 134
npm ERR! 
npm ERR! Failed at the [email protected] startProd script 'grunt serve:dist'.
npm ERR! This is most likely a problem with the gdgx-hub package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     grunt serve:dist
npm ERR! You can get their info via:
npm ERR!     npm owner ls gdgx-hub
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR!     /opt/hub/npm-debug.log

Splaktar (Nov 14 '15 17:11)

Here's the latest system status from when the Hub stopped responding and started giving 502 errors:

  System information as of Mon Nov 30 10:41:11 UTC 2015
  System load:  0.0               Processes:           4398
  Usage of /:   24.7% of 9.69GB   Users logged in:     0
  Memory usage: 49%               IP address for eth0: 10.111.216.151
  Swap usage:   0%
  => There are 4318 zombie processes.

4318 zombie processes do not look good... but the resources don't seem to be otherwise bottlenecked (RAM and disk are fine).
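
Zombies pile up when the parent process never reaps its exited children, which points at the node/grunt parent being wedged rather than the box being out of resources. A tiny sketch (not the Hub's code) of how that happens when a Node parent's event loop is blocked:

// zombie-demo.js: illustrative only
var spawn = require('child_process').spawn;

for (var i = 0; i < 20; i++) {
  spawn('true'); // each child exits immediately
}

// Block the event loop; until control returns to libuv, the exited children
// stay in the process table as zombies (visible with `ps`).
var end = Date.now() + 30000;
while (Date.now() < end) { /* busy wait */ }

console.log('event loop free again; the children get reaped now');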

Splaktar (Nov 30 '15 10:11)

OK, I've spun up a second Hub node (small instance, tried micro but ran into ENOMEM errors with grunt).

Now clustering and load balancing seem to be working:

hub:

[1510] worker-2317 just said hi. Replying.
[1510] was master: true, now master: true

hub-backup:

[2317] Risky is up. I'm worker-2317
[2317] Cancel masterResponder
[2317] was master: false, now master: false

Then kill hub:

[2317] worker-1510 has gone down...
[2317] was master: false, now master: true

And the handoff is seamless with no interruption to traffic. I tried a few iterations of this in both directions and it seemed to work great.
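
For anyone curious, the pattern behind that handoff is roughly the sketch below (not the Hub's actual election code; the host, port, and /healthz path are assumptions): the standby keeps polling the primary and promotes itself to master when the primary stops answering.

// failover-sketch.js: illustrative only
var http = require('http');
var isMaster = false;

function promote() {
  if (!isMaster) console.log('primary unreachable, promoting this node to master');
  isMaster = true;
}

function checkPrimary() {
  var req = http.get({ host: 'hub', port: 8080, path: '/healthz' }, function (res) {
    res.resume();
    isMaster = false; // primary is healthy, stay on (or return to) standby
  });
  req.setTimeout(2000, function () { req.abort(); }); // a slow primary counts as down
  req.on('error', promote); // refused, reset, or aborted
}

setInterval(checkPrimary, 5000);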

This doesn't fix the underlying problem that the Hub instances sometimes run out of memory or otherwise stop responding, but it should reduce the impact. I've started to set up Stackdriver monitoring to alert us when one of them stops responding, but I haven't completed that process yet.

Splaktar (Dec 01 '15 02:12)

Still seeing OOM errors bringing the server down:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-39-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.9
npm ERR! npm  v2.14.9
npm ERR! code ELIFECYCLE
npm ERR! [email protected] startProd: `grunt serve:dist`
npm ERR! Exit status 134

The hub-backup also stopped responding to requests, but it did not leave any kind of stack trace, crash, or logs. I really want to move this to a managed service, as this is far too much trouble atm.

Splaktar (Dec 28 '15 00:12)

Both VMs locked up last night, so even pm2 wouldn't have helped. We may need to go further and implement Kubernetes to orchestrate the containers and restart them when they fail health checks.

Splaktar (Jan 19 '16 13:01)

If we implement #100, then this should be much less of an issue. It's also been many months since these errors last occurred, though I think that's due to the auto-restart fix in #88.

Splaktar (Mar 13 '17 01:03)