
Duplex stream leads to memory leak

Open sprotznock85 opened this issue 4 years ago • 10 comments

Hi,

we are experiencing a memory leak when using a duplex stream with gRPC and Node. The memory usage of the following simplified example keeps growing:

const protoLoader = require("@grpc/proto-loader");
const grpc = require("@grpc/grpc-js");
const protoFile = "test.proto";
const packageDefinition = protoLoader.loadSync(protoFile, {
    keepCase: true,
    longs: String,
    enums: String,
    defaults: true,
    oneofs: true
});
const definition = grpc.loadPackageDefinition(packageDefinition).test;

function getServer() {
    const server = new grpc.Server();
    server.addService(definition.TestService.service, {
        startDataStream:  call => {
            call.on('data', () => {
                const array = new Array(33000).fill(-6000);
                call.write({values: array});
            })
        }
    });
    return server;
}
const routeServer = getServer();
routeServer.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
    routeServer.start();
    const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
    const call = client.startDataStream();
    call.on('data', () => {
        call.write({});
    });
    call.write({});
});


And the corresponding proto file:

syntax = "proto3";
import "google/protobuf/empty.proto";
package test;

message Test {
	repeated double values = 1;
}

service TestService {
	rpc startDataStream(stream google.protobuf.Empty) returns(stream Test){}
}

I took the snapshots with Jetbrains Webstorm v8. That's my package.json:

{
  "name": "tests",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "dependencies": {
    "@grpc/grpc-js": "0.5.0",
    "@grpc/proto-loader": "0.5.1"
  }
}

I use node 10.13.0 and yarn 1.19.1

BR

sprotznock85 avatar Apr 07 '20 09:04 sprotznock85

There is no reason to be using the google-protobuf library here. The method definitions emitted by @grpc/proto-loader use regular objects to represent proto messages, so an Empty object can be represented as {}. Does changing that fix the problem?

You also don't need to depend on protobufjs for this code. @grpc/proto-loader already pulls in that dependency automatically.
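
For illustration, here is a minimal sketch of what that looks like with the test.proto above, using the call objects from the example (the literal values are made up):

// Messages loaded via @grpc/proto-loader are plain JavaScript objects,
// so no google-protobuf or protobufjs import is needed.
call.write({});                          // a google.protobuf.Empty message
call.write({ values: [1.5, 2.5, 3.5] }); // a Test message with its repeated double field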

murgatroid99 avatar Apr 07 '20 23:04 murgatroid99

I changed the code above but that has no effect at all unfortunately. There is still a memory leak problem. Can you reproduce this behavior on your machine?

sprotznock85 avatar Apr 08 '20 05:04 sprotznock85

I can't reproduce a memory leak. On my own machine, I let it run for several minutes and I took several memory snapshots, and I see the memory snapshot size fluctuate between 5.7 and 7.1 MB, which doesn't look like a leak. What does the growth pattern look like in your runs? In your memory snapshots, is there any type of object that has a consistently increasing count?

murgatroid99 avatar Apr 08 '20 17:04 murgatroid99

Hmm, I changed the code above to show the problem a bit better. I also added some pictures comparing my first snapshot with the last one (approx. 10 min apart), where you can see that it's growing. The second picture shows one retained function as an example.

[screenshots: Capture, Capture2]

sprotznock85 avatar Apr 09 '20 10:04 sprotznock85

It doesn't look like you're explicitly invoking the garbage collector before taking snapshots, so between any two individual snapshots you may see allocations of some objects increase or decrease.

I took your code and made a few changes to get this:

const protoLoader = require("@grpc/proto-loader");
const grpc = require("@grpc/grpc-js");
const protoFile = "test.proto";
const packageDefinition = protoLoader.loadSync(protoFile, {
    keepCase: true,
    longs: String,
    enums: String,
    defaults: true,
    oneofs: true
});
const definition = grpc.loadPackageDefinition(packageDefinition).test;

function getServer() {
    const server = new grpc.Server();
    server.addService(definition.TestService.service, {
        startDataStream:  call => {
            let iterations = 0;
            call.on('data', () => {
                iterations += 1;
                if (iterations % 1000 === 0) {
                    global.gc();
                    debugger;
                }
                const array = new Array(33000).fill(-6000);
                call.write({values: array});
            })
        }
    });
    return server;
}
const routeServer = getServer();
routeServer.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
    routeServer.start();
    const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
    const call = client.startDataStream();
    call.on('data', () => {
        call.write({});
    });
    call.write({});
});

This can be run with node --expose-gc to be able to call global.gc(). Those changes pause exactly every thousand iterations, and call the garbage collector first for more consistency. When I run that and take several snapshots, I don't see the memory consistently increasing. I don't even see the number of arrays consistently increasing.
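
One small note: global.gc is only defined when Node is started with --expose-gc, so a guard like this minimal sketch avoids a TypeError when the flag is missing:

// Run with: node --expose-gc index.js
if (typeof global.gc === 'function') {
    global.gc(); // force a collection before pausing at the debugger statement
}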

What problem did you experience that made you start investigating this in the first place?

murgatroid99 avatar Apr 09 '20 16:04 murgatroid99

Calling the GC manually isn't a real-world example, is it? I will try this next week to see if it makes any difference. I encountered this problem after keeping this continuously running app up for a few days; afterwards it crashed with a JavaScript heap exception, and the memory was at around 2 GB. Then I made this simplified example where I see the leak. I tried it with require("grpc"), and there I cannot reproduce the issue. BTW, why can't I change localhost to "127.0.0.1" without getting the error "try to connect to a http/1 server" if I use grpc instead of grpc-js?

sprotznock85 avatar Apr 09 '20 17:04 sprotznock85

Calling the gc manually cleans up memory that is allocated but not actually leaking, so it just helps narrow down where real leaks are. You shouldn't call global.gc() in production. It's just a debugging tool.

I'm not familiar with the error you're referring to in the grpc library.

murgatroid99 avatar Apr 09 '20 17:04 murgatroid99

@murgatroid99 I'm seeing something similar to this issue.

If I create a new connection repeatedly during periodic disconnections, the existing stream does not seem to disappear. Were there any further developments related to that?

The stream and client connection were removed in the code so that they could become GC targets, but they remain, as shown in the picture.

info

    "node v12.20.1"
    "@grpc/grpc-js": "1.3.6",
    "@grpc/proto-loader": "0.5.5",

Summary of the repeated flow (a rough code sketch follows the list):

1. Client connects to the server
2. Client receives stream data
3. Server drops the connection for the stream
4. Client calls stream.cancel() & destroy() + client.close()
5. Another client connection & stream is created
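
A rough sketch of that flow, reusing the TestService client from the example earlier in this thread (connectAndStream and handleData are illustrative names, not taken from the report above):

function connectAndStream() {
    // Step 1: open a new client and stream.
    const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
    const stream = client.startDataStream();
    // Step 2: receive stream data.
    stream.on('data', handleData);
    stream.on('error', () => {
        // Steps 3-4: the server dropped the connection; cancel and destroy the
        // stream and close the client so that nothing should keep them alive.
        stream.cancel();
        stream.destroy();
        client.close();
        // Step 5: make another client connection & stream.
        setTimeout(connectAndStream, 1000);
    });
}

function handleData(message) {
    // process incoming messages here
}

connectAndStream();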

Remaining objects in memory:

[screenshot: heap snapshot of remaining objects]

0Ams avatar Jul 21 '21 06:07 0Ams

Your memory dump shows a single active stream object, which seems correct for the client actions you described. But an "Object" can be literally anything, so there is no way to draw any useful conclusions about why you have 30 of them.

murgatroid99 avatar Jul 21 '21 16:07 murgatroid99

I recently "migrated" a fairly large and complex application that was using the native grpc Node.js package to @grpc/grpc-js (the "migration" actually amounts to replacing const grpc = require('grpc') with import * as grpc from '@grpc/grpc-js'; no other code was changed) and I can report that memory steadily increases and never comes down. There definitely is a memory leak in @grpc/grpc-js. I don't know where it is. I am planning to eventually capture heap dumps (using the heapdump package) in production to see what's happening, but yeah, this is a massive problem...

jlecomte avatar Mar 28 '22 22:03 jlecomte


I also experienced a memory leak, and I'm rolling back to the deprecated package as it is more stable and doesn't have this problem.

Farenheith avatar Jan 07 '23 19:01 Farenheith

If you provide a code snippet that demonstrates the leak, I would be happy to investigate.

murgatroid99 avatar Jan 10 '23 18:01 murgatroid99

Okay, I'll try to. But locally it's very hard to see any problem; I think it's something very subtle. The application receives about 20 million requests per day, and it takes 5 days to reach the memory limit; when the volume of requests increases a lot, the memory seems to keep increasing at the same rate. We're currently letting a scheduled job redeploy the application until the new version with the old package is out. I think in my next sprint I'll be assigned this issue again, and then I'll get some time to try to reproduce it with a code snippet.

I'll inform you as soon as I can.

Farenheith avatar Jan 11 '23 00:01 Farenheith

What version of @grpc/grpc-js are you using? And what is your memory limit? Or alternatively, what is the average memory growth rate you are observing?

murgatroid99 avatar Jan 11 '23 00:01 murgatroid99

We're using version 1.6.10, and our tasks have 1 GB of memory each. For the last seven days, our service has been running with a constant 20 tasks. You can see the average memory graph below.

[graph: average memory usage over the last seven days]

The moments where the memory drops are the auto-deploys we had scheduled. The growth rate is approximately 6 MB/h. We suspect @grpc/grpc-js because, previously, the memory leak was filling up the memory in 20 hours, and we had a bug where we were creating a new gRPC connection for each request. After we fixed it, the memory leak became manageable.

So we think that, somehow, the connections were not being dropped, even though we're not keeping a reference to the connection and the server always sends the closing signal. As you can see, I can't be 100% sure that the problem is @grpc/grpc-js, but we're thinking of removing this variable to see if the memory leak disappears. If so, we'll know where the problem is, and then I'll work on a code snippet to test it more deeply.

Farenheith avatar Jan 11 '23 12:01 Farenheith

You might want to try updating to the latest version (1.8.3). A significant amount of the relevant code has been reworked/refactored since 1.6.10.

murgatroid99 avatar Jan 11 '23 18:01 murgatroid99

Alright, I'll try that, thank you!

Farenheith avatar Jan 12 '23 12:01 Farenheith

@murgatroid99 the leak keeps happening. But I tried with the legacy package and it is still happening there too, so I was on the wrong track here; the problem is somewhere else.

Farenheith avatar Jan 17 '23 00:01 Farenheith

Retryable long-lived streamed calls leak forever. There is no code to periodically remove messages that have been successfully sent so the writeBuffer grows indefinitely unless you churn connections. The channelz implementation may also be leaking - I seem to have a significant number of ChannelzCallTracker objects lying around in memory.
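
One possible mitigation until that is fixed is to disable client retries so that outgoing messages are not buffered for possible replay. This is a hedged sketch, assuming the standard grpc.enable_retries channel option and reusing the client constructor from the earlier example:

// Pass the channel option to the generated client to disable retries
// (and therefore the retry write buffer).
const client = new definition.TestService(
    'localhost:50051',
    grpc.credentials.createInsecure(),
    { 'grpc.enable_retries': 0 }
);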

Matt-Esch avatar Feb 23 '23 20:02 Matt-Esch

@Matt-Esch Can you file a separate issue with the details of the problem you are seeing? Retries were introduced after this issue was filed so a retry-specific problem is a separate problem. Can you also file a separate issue with more details about the channelz leak you are seeing? I think it's probably not related.

murgatroid99 avatar Feb 23 '23 21:02 murgatroid99

@Matt-Esch Can you file a separate issue with the details of the problem you are seeing?

https://github.com/grpc/grpc-node/issues/2371

I'm not sure about the ChannelzCallTracker instances; I am not sure how to tally the expected number of these, and they seem to be retained in multiple places.

Matt-Esch avatar Feb 23 '23 22:02 Matt-Esch

If it's any help, my team did an upgrade from v1.7.3 to v1.8.7 and saw a massive memory leak.

I skimmed through the diff, but I'm completely shooting in the dark with the help I can offer.

https://github.com/grpc/grpc-node/compare/%40grpc/grpc-js%401.7.3...%40grpc/grpc-js%401.8.7


Grain of salt/thinking aloud from here out

  • One thing @Matt-Esch mentioned was the ChannelzCallTracker instances. The only reference to that in that diff (that I saw) was in server.ts (here). It looks like it was substantially changed
  • I saw there was a maxConnectionAgeMs added which defaults to forever... was that how it worked previously? Maybe there used to be a default timeout for everything? (link)

duncanfinney avatar Feb 24 '23 00:02 duncanfinney

@Matt-Esch The expectation with ChannelzCallTracker is that there should be one per channel, one per server, and a couple per open connection. In other words, quite a few, but proportional to other active resources.

@duncanfinney It would really help if you describe your usage pattern that triggers the memory leak. If you are using long-running streams, you may be encountering the same retry-related leak that @Matt-Esch described in #2371. I am working on a fix for that.

The maxConnectionAge that you see is part of a feature that allows servers to limit how long any individual connection can exist. It is controlled using the grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms options. The default behavior is the same as the previous behavior. This feature controls connection age and is unrelated to request timeouts.
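
For reference, a hedged sketch of setting those options when constructing a server (the values below are illustrative, not defaults):

// Limit connection lifetime on the server; the values are only examples.
const server = new grpc.Server({
    'grpc.max_connection_age_ms': 30 * 60 * 1000,      // close connections after ~30 minutes
    'grpc.max_connection_age_grace_ms': 5 * 60 * 1000  // grace period for in-flight calls to finish
});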

murgatroid99 avatar Feb 24 '23 01:02 murgatroid99


I can confirm we observed the same massive memory leak on our containers using grpc-js after the same version upgrade (v1.7.3 to v1.8.7). Reverting to 1.7.3 fixes the issue.

Klaitos avatar Feb 24 '23 17:02 Klaitos

I have published version 1.8.11 with a fix for the memory leak described in #2371. Please try that out to see if it fixes the problem. If it does not please file a separate issue with details of the memory leak you are experiencing.

murgatroid99 avatar Feb 24 '23 18:02 murgatroid99

We are also experiencing the memory leak since Google Pub/Sub started pulling the 1.8.x grpc-js as a transitive dependency of gax-nodejs. Once we override the version of google-gax to 3.5.4, which pulls in the 1.7.x version of grpc-js, the leak is gone.
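
A hedged sketch of one way to do that override with Yarn's resolutions field in package.json (npm 8 and later has an analogous overrides field); the version is the one mentioned above:

{
  "resolutions": {
    "google-gax": "3.5.4"
  }
}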

yianisn avatar Mar 10 '23 10:03 yianisn

I have published version 1.8.13 with a fix for a leak in channelz.

murgatroid99 avatar Mar 23 '23 18:03 murgatroid99