
Potential Memory Leak

Open · tbadlov opened this issue 5 years ago • 55 comments

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Please run down the following list and make sure you've tried the usual "quick fixes":

  • Search the issues already opened: https://github.com/googleapis/cloud-debug-nodejs/issues
  • Search the issues on our "catch-all" repository: https://github.com/googleapis/google-cloud-node
  • Search StackOverflow: http://stackoverflow.com/questions/tagged/google-cloud-platform+node.js

If you are still having issues, please be sure to include as much information as possible:

Environment details

  • OS: Google App Engine
  • Node.js version: 12.14.x
  • npm version: 6.13
  • @google-cloud/debug-agent version: 4.2.1

Steps to reproduce

We found that the package causes a memory leak. The image below shows our application's memory usage with the package and without it.

[Image: memory usage with and without the debug agent]

We are running on Google App Engine with 4 nodes. Any help is appreciated.

Thanks!

tbadlov avatar Jan 10 '20 21:01 tbadlov

It looks like Node issue #28420 affects the version of Node you are using. I suspect it could be the root cause.

DominicKramer avatar Jan 23 '20 13:01 DominicKramer

@DominicKramer upgrading the Node version didn't resolve the issue.

We are observing a similar issue on the upgraded Node version. Memory usage gradually climbs until an OOM kill when the debug agent is enabled; after disabling it, memory usage is stable.

[Screenshot, Feb 19 2020: memory usage climbing with the debug agent enabled, stable after it was disabled]

Running in a GKE (Google Kubernetes Engine) pod

  • OS: official node Docker image
  • Node.js version: 12.16.0
  • npm version: 6.13.4
  • @google-cloud/debug-agent version: 4.2.2

legopin avatar Feb 19 '20 07:02 legopin

Thank you for the update. Have you seen similar problems on GCP services other than GKE?

DominicKramer avatar Feb 19 '20 16:02 DominicKramer

That's the only GCP service we use. We have all our services running under Kubernetes.

legopin avatar Feb 20 '20 02:02 legopin

Ours is on App Engine.

tbadlov avatar Feb 20 '20 19:02 tbadlov

@legopin and @tbadlov thanks for the update.

DominicKramer avatar Feb 20 '20 20:02 DominicKramer

If anyone has a repro gist or repository we can run that leaks this way, that would help us get to the bottom of this a little more quickly. It also helps confirm that we're solving the problem you're hitting.

soldair avatar Feb 21 '20 18:02 soldair

@soldair Here is a reproducible code snippet. It was run in GKE as a single pod, and we continued to observe the memory leak even when no requests were received.

[Screenshot, Feb 24 2020: pod memory usage climbing with no incoming requests]

index.js

require("@google-cloud/debug-agent").start({
  projectId: 'ID',
  allowExpressions: true,
  serviceContext: {
    service: 'debug-leak',
  },
});

const Koa = require("koa");
const app = new Koa();

// response
app.use(ctx => {
  ctx.body = "Hello Koa";
});

app.listen(3000, ()=>{console.log('server started')});

Dockerfile

FROM node:12.16.0

COPY package.json package-lock.json /tmp/

RUN mkdir /opt/app \
    && mv /tmp/package.json /tmp/package-lock.json /opt/app \
    && cd /opt/app && npm ci

WORKDIR /opt/app
COPY src /opt/app/src

EXPOSE 3000

ENV NODE_ENV production

CMD ["node", "src"]

legopin avatar Feb 24 '20 09:02 legopin

Hello, are there any updates on this issue? Were you able to reproduce the issue?

legopin avatar Mar 19 '20 09:03 legopin

We are experiencing this as well and had to shut down the debugger. The image below shows the same app deployed to two clusters, one with the debugger on and one with the debugger off.

The debugger is a big piece of our workflow so this is pretty dramatic :)

kilianc avatar Apr 11 '20 23:04 kilianc

I just tested the same code with 12.16 and got the same results. The library in its current state is not usable. How is this not a P0 on an LTS version of Node?!

kilianc avatar Apr 12 '20 05:04 kilianc

@kilianc If I understand correctly this issue isn't happening on 10.x?

bcoe avatar Apr 13 '20 22:04 bcoe

@bcoe I can't confirm; we're running on 12.x. It's very easy to repro with the example provided (just change the FROM to 10.x).

How can I help?

kilianc avatar Apr 13 '20 23:04 kilianc

Are there any updates on the issue? We are running on GCE:

  • node v14.1.0
  • @google-cloud/debug-agent: ^4.2.2

Memory leaks occur when we have any logpoints. Within 2 hours, our VM exceeds its memory with only one logpoint set and dies. Only a stop and start helps (SSH doesn't work either).

escapedoc avatar May 16 '20 00:05 escapedoc

I know that this is reported well enough. Just wanted to pitch in that we are experiencing the same in our GAE Flex environment, in all services that utilise this library. We've tested this by removing npm libraries one by one and stress testing our application to watch memory consumption. Only the versions with the debug agent are affected by this leak.

I think it is also important to note that the fewer log lines there are, the slower the memory leak. So this seems to be somehow related to logging.

I would even say that this is a critical issue as it makes it impossible to utilise this library.

ankero avatar Jul 07 '20 06:07 ankero

It seems our services on GKE that use this library are also affected by this issue. The problem appears on any Node version from 12.11.0 onwards with debug agent 5.1.0.

andrewg010 avatar Jul 22 '20 12:07 andrewg010

Deployed a change to one of our services today: dropping @google-cloud/debug-agent. We went from erroring with an OOM every ~10-20 requests to serving a million requests without any OOM. Definitely an issue with this package or one of its deps.

somewhatabstract avatar Aug 27 '20 16:08 somewhatabstract

I'm looking into this issue. I've been able to reproduce a memory leak on GKE using the snippet from above, though I'm not convinced that I've covered the problematic codepath as the leak is slow (~5MB/hour) and seems to plateau eventually.

Unfortunately, I am currently unable to track down the actual memory leak when running it locally with node 12.18.3 and using the memory allocation timeline. I'll continue to investigate.

mctavish avatar Sep 01 '20 01:09 mctavish

@mctavish Our particular app loads and executes JS dynamically on each request. I think that consumed memory very quickly. You might try recreating that dynamic load of additional JS to increase the memory burden and demonstrate the leak more quickly.
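
One way to recreate that kind of per-request dynamic JS execution in the Koa repro above is Node's built-in vm module; a rough sketch, where the handler contents are purely illustrative and not from the real app:

// Amplifies the repro by compiling and running fresh code on every request,
// mimicking an app that loads additional JS dynamically per request.
const vm = require("vm");
const Koa = require("koa");
const app = new Koa();

app.use(ctx => {
  // Each request builds a new script and runs it in a throwaway context.
  const script = new vm.Script(`${Math.random()} + ${ctx.url.length}`);
  ctx.body = `Hello Koa ${script.runInNewContext({})}`;
});

app.listen(3000, () => console.log("server started"));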

somewhatabstract avatar Sep 02 '20 16:09 somewhatabstract

@mctavish the charts that I posted above are from a staging service without traffic (<10rps)

kilianc avatar Sep 02 '20 17:09 kilianc

I can confirm that we are seeing this exact issue in production. We were able to turn off the cloud debugger and resolve the memory leak we detected. Based on our heapdumps we suspect the memory leak has something to do with how access_tokens are stored.
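
For taking heap dumps like that, a minimal sketch using Node's built-in v8 module (assumes Node 11.13+; the SIGUSR2 trigger is just one convenient choice):

// Writes a .heapsnapshot file on SIGUSR2; load it in Chrome DevTools'
// Memory tab and compare two snapshots to see what is being retained.
const v8 = require("v8");

process.on("SIGUSR2", () => {
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});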

klmz avatar Nov 30 '20 11:11 klmz

I can also confirm that this caused a memory leak in our production environment. We missed it in staging because of how often we deploy. This library is effectively unusable on Node 12.x (< 12.16).

samcfinan avatar Dec 07 '20 00:12 samcfinan

I was able to reproduce the memory leak on both GAE standard and a locally deployed application. The key to reproducing it is to have an active logpoint set at a place that is reached periodically. According to the local application's heap usage, every time the logpoint is triggered, memory usage increases by around 300 KB. That is, if each request triggers one logpoint, memory usage will reach 256 MB (the default memory limit of a GAE standard instance) before serving 1000 requests.
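
A quick way to watch that per-hit growth locally; this is just an observation helper dropped into the test app, not part of the agent, and the exact numbers will vary:

// Logs heap usage once a second while the logpoint location is being hit.
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  const mb = n => (n / 1024 / 1024).toFixed(1);
  console.log(`heapUsed=${mb(heapUsed)}MB rss=${mb(rss)}MB`);
}, 1000);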

Louis-Ye avatar May 07 '21 20:05 Louis-Ye

Further investigation shows that the "memory leak" is not caused by the code in the cloud debugger; rather, it is caused by the fact that the V8 inspector session stores the context of every breakpoint hit (the cloud debugger uses V8 inspector breakpoints to implement snapshots/logpoints).

I verified this locally: an application still suffers from the same "memory leak" when it does not use the cloud debugger at all but has the V8 inspector setting a breakpoint at a busy place.
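
A minimal sketch of that retention behaviour using only Node's built-in inspector module (the breakpoint location is a placeholder, not anything from the debug agent's code):

// Sets a breakpoint at a busy location via an in-process inspector session.
// Each pause retains its call-frame context in the session, so heap usage
// grows with every hit even though we resume immediately.
const inspector = require("inspector");
const session = new inspector.Session();
session.connect();

session.post("Debugger.enable", () => {
  session.post("Debugger.setBreakpointByUrl", {
    lineNumber: 10,                // placeholder line
    urlRegex: "busy-module\\.js$", // placeholder file
  });
});

session.on("Debugger.paused", () => {
  session.post("Debugger.resume");
});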

Louis-Ye avatar May 07 '21 21:05 Louis-Ye

Storing the context of each breakpoint hit might be a feature of the V8 inspector, but it is not what the cloud debugger needs: once it has caught the paused event and taken the snapshot or evaluated the logpoint, the context is no longer needed. So to fix the memory leak, the cloud debugger should clean up those contexts. However, according to the V8 inspector documentation, there seems to be no quick way to achieve this.

One possible way for the cloud debugger to fix this is to monitor the number of breakpoint hits and, when a certain threshold is reached, close the current V8 inspector session, reconnect, and re-set all the existing snapshots/logpoints. (I have verified locally that closing the session clears all the breakpoint-hit contexts and releases the memory, but it also clears all previously set breakpoints.) The only concern is the performance hit of such a reset, which needs to be measured.
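
A rough sketch of that reset idea, assuming Node's built-in inspector API (not the agent's actual implementation; the threshold and breakpoint location are made up):

// After a fixed number of breakpoint hits, drop the inspector session so the
// accumulated pause contexts can be collected, then reconnect and re-set the
// breakpoints that were previously active.
const inspector = require("inspector");

const MAX_HITS_BEFORE_RESET = 100; // threshold would need to be measured
let hits = 0;

function registerBreakpoints(session) {
  // Placeholder: re-apply whatever snapshot/logpoint locations are active.
  session.post("Debugger.setBreakpointByUrl", {
    lineNumber: 10,
    urlRegex: "busy-module\\.js$",
  });
}

function startSession() {
  const session = new inspector.Session();
  session.connect();
  session.post("Debugger.enable", () => registerBreakpoints(session));
  session.on("Debugger.paused", () => {
    session.post("Debugger.resume");
    if (++hits >= MAX_HITS_BEFORE_RESET) {
      session.disconnect(); // clears retained contexts and all breakpoints
      hits = 0;
      startSession();
    }
  });
}

startSession();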

Louis-Ye avatar May 07 '21 21:05 Louis-Ye

@Louis-Ye in my case I had no active logpoints nor breakpoints, unless they were not visible in the UI and cached in the system.

Thanks for looking into this.

kilianc avatar May 09 '21 01:05 kilianc

A bug has been opened against the V8 engine for the related issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1208482

Louis-Ye avatar May 13 '21 14:05 Louis-Ye

@Louis-Ye in my case I had no active logpoints nor breakpoints, unless they were not visible in the UI and cached in the system.

Thanks for looking into this.

Just to echo the above comment, we're also getting this issue when deployed to Cloud Run using Node 14, with no log/break points ever being set. After removing the debug agent, memory usage is constant.

rhodgkins avatar May 13 '21 15:05 rhodgkins

Hi @kilianc and @rhodgkins, I wasn't able to reproduce the memory leak without active breakpoints. The fix (for the memory leak with active breakpoints) was just merged; would you be able to try it and see if it solves your problem? If not, please re-open this issue. It would also be much appreciated if you could provide more environment information (e.g., Node.js/V8 version, debug agent version, deployment environment, how you measure memory usage and detect the leak, and the rate of the leak).

Louis-Ye avatar May 31 '21 17:05 Louis-Ye

Thanks @Louis-Ye! When this makes it into a release, I can also try it out on our service that was exhibiting this problem.

somewhatabstract avatar Jun 01 '21 18:06 somewhatabstract