
Memory leak in PMX: the Meter class retains ExponentiallyMovingWeightedAverage objects and grows to a substantial size (heap dump screenshots included)

Open askhogan opened this issue 9 years ago • 48 comments

See the screenshots. This is a continuous, sharp rise in memory use over an 8-hour period. The box has a consistent processing load.

The leak appears to be from the global include and extension of var Probe = require('./Probe.js');

See the heap snapshots below; they cover an 8-hour period.

[Screenshots: heap snapshots, 2015-05-20 12:44 PM]

askhogan avatar May 20 '15 17:05 askhogan

Thanks for inspecting,

https://github.com/keymetrics/pmx/blob/master/lib/utils/EWMA.js does not look to retain any reference

How do you know that it comes from EWMA?
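
For context, the EWMA there is just a small numeric rate tracker, roughly along these lines (a paraphrased sketch for illustration, not the exact EWMA.js source):

// Paraphrased sketch of an exponentially weighted moving average rate tracker.
// It only holds numbers (no object references), which is why EWMA.js itself
// doesn't look like the thing retaining memory.
function EWMA(timePeriodMs, tickIntervalMs) {
  this._timePeriod   = timePeriodMs   || 60000;
  this._tickInterval = tickIntervalMs || 5000;
  this._alpha = 1 - Math.exp(-this._tickInterval / this._timePeriod);
  this._count = 0; // events seen since the last tick
  this._rate  = 0; // smoothed events per millisecond
}

EWMA.prototype.update = function(n) {
  this._count += (n || 1);
};

EWMA.prototype.tick = function() {
  var instantRate = this._count / this._tickInterval;
  this._count = 0;
  this._rate += this._alpha * (instantRate - this._rate);
};

EWMA.prototype.rate = function(timeUnitMs) {
  return this._rate * (timeUnitMs || 1000);
};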

Unitech avatar May 20 '15 21:05 Unitech

Thanks for the follow-up. I know it is PMX because Meter is the retaining object, and the retainer of Meter is the system / Context. So this is coming from the require statement and not from some custom probe I am applying.

Also, when I turn off pmx, the memory leak goes away.

[Screenshots: 2015-05-20 7:02 PM and 7:05 PM]

askhogan avatar May 21 '15 00:05 askhogan

I can provide the heap dump snapshots to you; just PM me. It's a production box.

[email protected]

askhogan avatar May 21 '15 00:05 askhogan

So, you aren't using any custom probes? This is just pmx running with the default configuration?

jshkurti avatar May 21 '15 13:05 jshkurti

Yes, no custom probes. I do have one custom action to take a heap dump. The other odd thing is that Keymetrics shows more memory usage than the box has in total; see the screenshot showing 36GB.

[Collaborator Edit] => removed screenshot link because it contains your keys
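
For reference, a heap dump action like the one mentioned above can be wired up roughly like this (a minimal sketch; the action name and file path are placeholders, and it assumes the separate heapdump module):

var pmx      = require('pmx');
var heapdump = require('heapdump');

// Custom action: triggers a V8 heap snapshot that can be pulled off the box
// and loaded into Chrome DevTools.
pmx.action('heapdump', function(reply) {
  var file = '/tmp/' + process.pid + '-' + Date.now() + '.heapsnapshot';
  heapdump.writeSnapshot(file, function(err, filename) {
    reply({ success: !err, file: filename, error: err && err.message });
  });
});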

askhogan avatar May 21 '15 17:05 askhogan

[Screenshot: 2015-05-21 11:52 AM]

askhogan avatar May 21 '15 17:05 askhogan

Thanks for the removal. I didn't know keys were in there. Let me know if I need to post it again or if you were able to grab it; I trashed mine, so I will probably have to recreate it. The gist was that Keymetrics added up to 36GB, but my box only has 30GB total.

I know you guys are swamped. I am more than happy to help or provide more testing. I love PM2. We are beginning to rely on it heavily. Just point me in the right direction.

askhogan avatar May 23 '15 06:05 askhogan

I did more testing on this using node-inspector and can reproduce the leak each time. I was able to resolve it by adding pmx to my client library and setting the options below.

require('pmx').init({
  http          : false, // (Default: true)
  errors        : false,
  custom_probes : false,
  network       : false, // Traffic usage monitoring
  ports         : false  // Shows which ports your app is listening on
});

I suspect the leak is in http; however, I disabled all of them. I will update once I have more time to narrow it down.

[Screenshots: 2015-05-25 1:53 AM]

askhogan avatar May 25 '15 17:05 askhogan

Hey, can you tell for sure whether it's http, custom_probes, or errors? (These are the only flags that are true by default; network and ports are false by default.) I also think it's http, but just to be sure.

And can you try with [email protected]? I changed a bit how http is overloaded, but I can't tell for sure whether it fixes the memory leak.

Thanks again for your contribution :)

jshkurti avatar May 26 '15 16:05 jshkurti

It's http for sure. I have some more node-inspector dumps. I think it doesn't back off or batch enough under high load. This box is a message processor: long-running, with a hundred thousand+ messages, all outbound connections.

I tried to see where to edit but didn't have enough time. New Relic has some good batching code in their Node SDKs for reference.
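
By batching I mean roughly the pattern below (a generic sketch of the idea with made-up names, not pmx or New Relic code): buffer per-request samples and flush them on a fixed interval, with a bounded buffer so the retained set can't grow without limit under heavy load.

// Generic interval-based metric batching with a bounded buffer (illustrative only).
function MetricBatcher(flushIntervalMs, maxBuffered, send) {
  var buffer = [];

  this.record = function(sample) {
    if (buffer.length >= maxBuffered) buffer.shift(); // drop oldest under load
    buffer.push(sample);
  };

  var timer = setInterval(function flush() {
    if (buffer.length === 0) return;
    var batch = buffer;
    buffer = [];
    send(batch); // one aggregated payload instead of one per request
  }, flushIntervalMs);
  timer.unref(); // don't keep the process alive just to flush metrics
}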

askhogan avatar May 26 '15 17:05 askhogan

Will give it a shot later today with new version.

askhogan avatar May 26 '15 17:05 askhogan

I've been testing a bit with this sample code:

var pmx = require('pmx').init({
  network: true,
  ports: true
});

var http = require('http');

setInterval(function keepAlive() {
  global.gc();
}, 500);

http.createServer(function(req, res) {
  res.end('hello world');
}).listen(8080);

(launch it with --node-args="--expose-gc")

And I flooded the server with requests via ab -c 200 -n 1000000 localhost:8080/

I wasn't able to reproduce the memory leak :/ Memory usage stabilized at around ~50MB and never went above that.

jshkurti avatar May 27 '15 15:05 jshkurti

Can you give me sample code that triggers the leak, please?

jshkurti avatar May 27 '15 16:05 jshkurti

Sure. It relies on some internal dependencies. Give me some time to clean it up.

Did you use Node 0.12.x? What version of pmx did you use? I was on 0.3.11 when I ran that.

Environment-wise, the first screenshots (7 days ago) that started this ticket were from production. I was running pm2 in cluster mode with no pmx inside my code. I was able to get rid of the memory growth by setting pmx.init({ http: false }) inside my production code. Can you explain a little more how pm2 and pmx interact? It looks to me from the code that pm2 has default options for communicating with Keymetrics if pmx is not explicitly initialized, and then, if pmx is included, it sends an RPC command to pm2 to modify those default options.

The second set of screenshots (2 days ago) was taken while running through node-inspector, so I wasn't able to run through pm2. I explicitly set options via pmx to imitate what pm2 would do.

I just want to make sure I am comparing apples to apples. Will code that is not using pm2, just pmx, act the same as code being run by pm2 (after Keymetrics initialization)?

askhogan avatar May 27 '15 19:05 askhogan

pmx is injected automatically, with default options, when an app is launched in cluster mode: https://github.com/keymetrics/pmx/blob/master/lib/index.js#L32-L42
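
In other words, roughly the following happens there (a paraphrased sketch only; the real check and defaults live in the lib/index.js lines linked above, and the environment flag below is just a placeholder):

var pmx = require('pmx');

// Placeholder: in reality pmx detects the environment pm2 sets up for
// cluster-mode apps; the flag name here is not the real one.
var launchedByPm2Cluster = !!process.env.SOME_PM2_CLUSTER_FLAG;

if (launchedByPm2Cluster) {
  // Same defaults as calling require('pmx').init() yourself with no options:
  // http, errors and custom_probes are true; network and ports are false.
  pmx.init({ http: true, errors: true, custom_probes: true });
}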

Did you use a framework like express, koa, or sails? I ran my tests with a basic http server and wasn't able to make it leak.

jshkurti avatar Jun 01 '15 09:06 jshkurti

I used this sample: https://github.com/jshkurti/benchmark

jshkurti avatar Jun 01 '15 13:06 jshkurti

I took 3 heap snapshots at 1-hour intervals; the Meter object stayed the same size. [Screenshot: meter]

jshkurti avatar Jun 01 '15 13:06 jshkurti

Can you give it a try with [email protected], please? :)

jshkurti avatar Jun 01 '15 13:06 jshkurti

And/or give us sample code that triggers the leak.

jshkurti avatar Jun 01 '15 13:06 jshkurti

@askhogan :)

jshkurti avatar Jun 02 '15 15:06 jshkurti

Sorry, man. I got swamped over the last few days. I will try it this week.

askhogan avatar Jun 02 '15 15:06 askhogan

Thanks :D

jshkurti avatar Jun 02 '15 15:06 jshkurti

Installed 0.3.16

New settings:

require('pmx').init({
  http          : true,
  errors        : true,
  custom_probes : false,
  network       : true,
  ports         : true
});

You can look at my metrics in Keymetrics to see the difference. With all of it off over the last few weeks, memory usage has been stable at around 2.4GB.

askhogan avatar Jun 11 '15 00:06 askhogan

I hope it levels out, but you can see that once it's on, memory just goes up.

[Screenshots: 2015-06-10 7:40 PM]

askhogan avatar Jun 11 '15 00:06 askhogan

Memory grew from a consistent 2.4GB with pmx off to 7GB now (< 200MB to around 500MB per worker) within a few hours of turning it on. The only code change was the additions I noted above. I have to shut this off for now and just use Keymetrics without the http tracking turned on (still super awesome without it, BTW!). This server processes lots and lots of messages; when the heap gets big, garbage collection starts eating a lot of CPU and I get weird latency on batches of messages.
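
A minimal way to watch that correlation locally (just a sketch, independent of pmx) is to log heap usage on an interval, e.g.:

// Log V8 heap and RSS every 30 seconds so heap growth can be lined up with
// the latency spikes described above.
setInterval(function logHeap() {
  var mem = process.memoryUsage();
  console.log('[heap]',
    Math.round(mem.heapUsed  / 1048576) + 'MB used /',
    Math.round(mem.heapTotal / 1048576) + 'MB total,',
    'rss ' + Math.round(mem.rss / 1048576) + 'MB');
}, 30000).unref();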

I know you are having trouble re-creating it. Let me know how I can help.

[Screenshot: 2015-06-10 9:17 PM]

askhogan avatar Jun 11 '15 02:06 askhogan

Pause on my comments for a bit. I just realized the npm module didn't update, which is really weird. Let me re-test; that was all done on 0.3.11.

askhogan avatar Jun 11 '15 02:06 askhogan

I had a local version at 0.3.11 and a global version inside pm2 at 0.3.16. I updated the global pm2 version earlier; I have now updated the local version as well. Re-testing...

askhogan avatar Jun 11 '15 02:06 askhogan

Same issue. I had to turn off http.

[Screenshot: 2015-06-10 11:26 PM]

askhogan avatar Jun 11 '15 04:06 askhogan

Thanks for reporting,

We have some questions:

  • Which Node.js version do you use?
  • Which Node.js modules do you use? (if we could get a copy of your dependencies it would be helpful)
  • Do you use New Relic at the same time?
  • Are you doing anything uncommon with HTTP?

If you find the time, the best thing would be a small subset of your application so we can reproduce the leak ourselves.

Thanks! Best, Alex

Unitech avatar Jun 11 '15 15:06 Unitech