Client leaks memory - repro provided with single getInstances call.

Open tomdee opened this issue 4 years ago • 18 comments

I have a specific use case where I need to be able to create multiple Spanner clients with different credentials inside a single long running process. This process is currently leaking memory and crashing.

Environment details

  • OS: Ubuntu 20.04
  • Node.js version: v14.15.1
  • npm version: 6.14.8
  • @google-cloud/spanner version: v5.4.0

Steps to reproduce

Run this sample code (after filling in some credentials)

index.js

import { Spanner } from "@google-cloud/spanner";
import { HeapDiff } from "@airbnb/node-memwatch";

// Call the Spanner fetch 10 times, outputting memory usage after each one
const projectId = "FILL_IN";
const credentials = { client_email: `FILL_IN`, private_key: `FILL_IN` };

const REPETITIONS = 10;
console.log(`Running ${REPETITIONS} times`);
for (let i = 0; i < REPETITIONS; i++) {
  await getSpannerInstances();
}

// Create a fresh client, make a single getInstances() call, and report the
// heap growth measured by memwatch for this iteration.
async function getSpannerInstances() {
  const hd = new HeapDiff();
  const spanner = new Spanner({
    projectId,
    credentials,
  });
  await spanner.getInstances();
  const diff = hd.end();
  console.dir(diff.change.size_bytes);
}

package.json

{
  "type": "module",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "@airbnb/node-memwatch": "^2.0.0",
    "@google-cloud/spanner": "^5.4.0"
  }
}

The output indicates that it's leaking around 700-800KB of RAM on each pass through the loop. There's no global state in this simple repro, which suggests that something is wrong in @google-cloud/spanner itself.

Running 10 times
2601376
914860
899300
729308
779676
797884
764027
790885
729727
759473

Thanks for creating this library and let me know if there are any more diagnostics that I can provide.

tomdee avatar Feb 10 '21 18:02 tomdee

Thanks for creating the issue @tomdee. Assigning to @olavloite to take a look.

skuruppu avatar Feb 11 '21 01:02 skuruppu

@tomdee Thanks for the detailed report!

It seems that this is not directly caused by the client library itself, but by the generated client that is used to communicate with Spanner (the gapic client). I've been running some different tests, and have slightly changed your reproduction case to the following:

// Requires --expose-gc so that global.gc() is available,
// e.g. node --expose-gc index.js
import { v1 } from "@google-cloud/spanner";
const { InstanceAdminClient } = v1;

const REPETITIONS = 10;
console.log(`Running ${REPETITIONS} times`);
for (let i = 0; i < REPETITIONS; i++) {
  await useGapic();
  global.gc();
  const used = process.memoryUsage().heapUsed / 1024 / 1024;
  console.log(`Current mem usage ${Math.round(used * 100) / 100} MB`);
}

async function useGapic() {
  const client = new InstanceAdminClient({
    projectId: 'my-project',
    keyFile: '/path/to/my-key.json',
  });
  await client.listInstanceConfigs({
    parent: 'projects/my-project',
  });
  await client.close();
}

The result of the above is similar to your observation:

Running 10 times
Current mem usage 26.63 MB
Current mem usage 26.9 MB
Current mem usage 27.08 MB
Current mem usage 27.28 MB
Current mem usage 27.44 MB
Current mem usage 27.82 MB
Current mem usage 27.77 MB
Current mem usage 27.94 MB
Current mem usage 28.07 MB
Current mem usage 28.23 MB

Running it for many more than 10 iterations shows the same pattern of ever-increasing memory usage until the process goes out of memory.

olavloite avatar Feb 11 '21 13:02 olavloite

@olavloite This is great, thanks for adding these additional insights.

Is there another project that I should raise a ticket on or are you following up on this?

tomdee avatar Feb 11 '21 16:02 tomdee

I'm currently investigating the problem to see if I can find the root cause. If needed, then I'll file an issue with the corresponding project and link it here.

olavloite avatar Feb 11 '21 17:02 olavloite

To appease the bots, this issue is being investigated in https://github.com/googleapis/gapic-generator-typescript/issues/823.

skuruppu avatar Feb 19 '21 11:02 skuruppu

@olavloite it seems that there may be a fix based on https://github.com/googleapis/gapic-generator-typescript/issues/823#issuecomment-829844782? Would you be able to try updating google-gax to that version?

skuruppu avatar May 05 '21 03:05 skuruppu

@skuruppu

I've tried it out and added my findings to https://github.com/googleapis/gapic-generator-typescript/issues/823. The TLDR version:

  1. Updating google-gax to v2.11.0 or later helps, but does not solve the entire problem.
  2. Adding a close() method to the operations client, as suggested by @Utsav2, further reduces the amount of memory that seems to be leaked on each iteration (a rough sketch of what that could look like from application code follows this list).
  3. Even after applying the above two fixes, there still seems to be a small amount of memory being leaked for each iteration of opening and closing a client.
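
A minimal sketch of what explicitly closing the operations helper alongside the main client could look like from application code. This is only an illustration under assumptions: the operationsClient property and its close() method are taken from the discussion above, not from a confirmed public API.

// Hypothetical sketch: close the operations helper (if it exposes close())
// before closing the main gapic client. `operationsClient` and its close()
// are assumptions based on the fix discussed in gapic-generator-typescript#823.
const { v1 } = require('@google-cloud/spanner');

async function useAndCloseClient() {
  const client = new v1.InstanceAdminClient({
    projectId: 'my-project',            // placeholder
    keyFile: '/path/to/my-key.json',    // placeholder
  });
  try {
    await client.listInstanceConfigs({ parent: 'projects/my-project' });
  } finally {
    // Close the operations helper first, if available, then the main client.
    if (client.operationsClient && typeof client.operationsClient.close === 'function') {
      await client.operationsClient.close();
    }
    await client.close();
  }
}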

olavloite avatar May 05 '21 08:05 olavloite

@olavloite, that's really great. I'm glad you were able to narrow it down further.

skuruppu avatar May 05 '21 09:05 skuruppu

@olavloite is there any other action we can take on this issue?

skuruppu avatar Jul 02 '21 00:07 skuruppu

Hi,

I'm not sure whether this is related to this issue (or deserves its own), but I have been observing a memory leak with the Spanner client, particularly when it is used together with the debug agent, but also, to a lesser extent, on its own.

The details are described in this issue about a memory leak with the debug agent.

pascaldelange avatar Jul 27 '21 13:07 pascaldelange

Hi @pascaldelange Thanks for the link to the other issue. I've read through the thread and your example application, and I don't think these two issues are directly related: the issue in this thread concerns a memory leak in the (generated) Spanner client when a client is opened and closed multiple times during the lifetime of the application. As far as I can tell from your sample application, you only open a single client and keep it open for the entire lifetime of the application.

It is, however, interesting that you see a 10MB/hour increase when you run the application without the debug agent. How exactly are you measuring that increase? That is, do you force a garbage collection before measuring, or could the increase simply be garbage that has not yet been collected? Will the application eventually run out of memory if you run it long enough in a memory-limited environment?

olavloite avatar Jul 27 '21 15:07 olavloite

Hi @olavloite, I was measuring it by looking at the memory usage stats in the GAE monitoring. The 10MB/hour figure comes from a test instance (running the code shared) where only the liveness calls are made; it might be different with real traffic (in production we only observed the leak with the debug agent on). In the test case I did not wait until the application ran out of memory (which at this rate would have taken a while), but in production with the debug agent on (same symptoms, but a faster increase) the instance would run out of memory every few hours and restart, causing transient errors on the service each time.

pascaldelange avatar Jul 28 '21 07:07 pascaldelange

@pascaldelange Thanks for the additional information. My questions are mainly aimed at determining whether there actually is a memory leak when you are only executing the liveness calls without the debug agent. The memory leak when using the debug agent seems quite clear, and also the (small) memory leak when opening and closing a Spanner client multiple times is well documented. I'm not sure about the case where only a SELECT 1 query is executed. An increase in memory usage over a limited time, without the application actually running out of memory, might be just an indication of normal usage:

  1. Using the Spanner client will automatically initialize a session pool. That session pool will consume some memory.
  2. Executing queries etc. will always require some memory. Not all of that memory will be reclaimed directly by the garbage collector, so a temporary increase in memory consumption is not necessarily an indication of a memory leak.

Would it be possible for you to run the same test script on your environment (the one without the debug agent), but with an explicit garbage collection after each query? See this Stack Overflow post for a good explanation of how to run the garbage collector from code; a rough sketch of what such a test could look like follows below.
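
A minimal sketch of such a test, assuming the script is started with node --expose-gc (so that global.gc() is available) and using placeholder project, instance and database names:

// Sketch: run a trivial query repeatedly, force a garbage collection after
// each one, and log the heap usage. Run with: node --expose-gc script.js
// 'my-project', 'my-instance' and 'my-database' are placeholders.
const { Spanner } = require('@google-cloud/spanner');

async function main() {
  const spanner = new Spanner({ projectId: 'my-project' });
  const database = spanner.instance('my-instance').database('my-database');

  for (let i = 0; i < 100; i++) {
    await database.run('SELECT 1'); // the liveness-style query
    global.gc();                    // collect garbage before measuring
    const used = process.memoryUsage().heapUsed / 1024 / 1024;
    console.log(`${i}: heap used ${Math.round(used * 100) / 100} MB`);
  }

  await database.close();
}

main().catch(console.error);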

olavloite avatar Jul 28 '21 10:07 olavloite

Hi, I can try that (it will have to wait for a couple of weeks due to vacations though). I'm not sure if that's sufficient to indicate a memory leak, but at least the linear increase in memory usage was only visible after changing to node 12+ (not present with node 10). I'll keep you posted.

pascaldelange avatar Jul 28 '21 11:07 pascaldelange

With the latest changes to the generated clients, the memory leak when repeatedly opening and closing a new client has been significantly reduced, but unfortunately not completely eliminated. A simple script that opens a generated client, executes one request with it, and then closes the client now leaks approximately 2MB after 100 iterations. That is roughly 1/10th of the original problem.

Test script

// Note: run with `node --expose-gc` so that global.gc() is available.
const {Spanner, v1} = require('@google-cloud/spanner');

const REPETITIONS = 100;
console.log(`Running ${REPETITIONS} times`);

function main() {
    async function test() {
        for (let i = 0; i < REPETITIONS; i++) {
            await useGapic();
            global.gc();
            const used = process.memoryUsage().heapUsed / 1024 / 1024;
            console.log(`${i}: Current mem usage ${Math.round(used * 100) / 100} MB`);
        }
    }

    async function useGapic() {
        const client = new v1.InstanceAdminClient({
            projectId: 'my-project',
            keyFile: '/path/to/my-key.json',
        });
        await client.listInstanceConfigs({
            parent: 'projects/my-project',
        });
        await client.close();
    }
    test();
}

process.on('unhandledRejection', err => {
    console.error(err.message);
    process.exitCode = 1;
});
main();

Results for 100 iterations

Running 100 times
0: Current mem usage 17.28 MB
1: Current mem usage 17.34 MB
2: Current mem usage 17.43 MB
3: Current mem usage 17.53 MB
4: Current mem usage 17.57 MB
5: Current mem usage 17.6 MB
6: Current mem usage 17.63 MB
7: Current mem usage 17.66 MB
8: Current mem usage 17.69 MB
9: Current mem usage 17.71 MB
...
90: Current mem usage 19.01 MB
91: Current mem usage 19.03 MB
92: Current mem usage 19.04 MB
93: Current mem usage 19.06 MB
94: Current mem usage 19.07 MB
95: Current mem usage 19.08 MB
96: Current mem usage 19.1 MB
97: Current mem usage 19.1 MB
98: Current mem usage 19.11 MB
99: Current mem usage 19.13 MB

I've been trying out a number of different things, but I've not been able to exactly pinpoint the problem, although the following is clear:

  1. Opening and closing a generated gapic client without executing any requests will not leak any memory.
  2. Opening a gapic client, executing one single simple request, and then closing the client will leak a bit of memory. It does not seem to make any real difference which request is executed (the two variants are sketched below).
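
For illustration, a rough sketch of the two variants being compared, using the same placeholder project and key file as the test script above:

// Variant A only opens and closes a generated client; variant B additionally
// executes one simple request before closing. Project ID and key file are
// placeholders.
const { v1 } = require('@google-cloud/spanner');

function newClient() {
  return new v1.InstanceAdminClient({
    projectId: 'my-project',
    keyFile: '/path/to/my-key.json',
  });
}

// Variant A: open and close only (no leak observed).
async function openAndClose() {
  const client = newClient();
  await client.close();
}

// Variant B: open, execute one request, then close (leaks a small amount).
async function openRequestAndClose() {
  const client = newClient();
  await client.listInstanceConfigs({ parent: 'projects/my-project' });
  await client.close();
}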

@alicejli If you have any additional ideas where the last bit might be coming from, then I'm all ears.

olavloite avatar Aug 25 '21 12:08 olavloite

Downgrading to P2 since most of the memory leak is now fixed but there's still some room for improvement.

skuruppu avatar Sep 08 '21 01:09 skuruppu

Hi @olavloite,

Sorry I missed this! I have a question out to the grpc team to see if they're able to pinpoint what's going on, as I am able to replicate your findings and the memory leakage seems to be consistent across all methods (so it doesn't seem to be a client library issue). Will update here with any findings.

Thanks,

Alice

alicejli avatar Sep 17 '21 23:09 alicejli

After some discussion and testing with the grpc team, it looks like grpc is likely not the cause of this memory leak: when running a script that made a "pure" grpc call a million times, we found that memory was regularly garbage collected and fluctuated rather than steadily growing.

The other potential lead is a possible issue in the google-cloud/common package (see https://github.com/googleapis/nodejs-common/issues/664).

Will comment back here with any updates on findings there.

alicejli avatar Sep 27 '21 17:09 alicejli