grpc-node
Duplex stream leads to memory leak
Hi,
we are experiencing a memory leak when using a duplex stream with gRPC and Node. The memory usage of the following simplified example keeps growing:
const protoLoader = require("@grpc/proto-loader");
const grpc = require("@grpc/grpc-js");
const protoFile = "test.proto";
const packageDefinition = protoLoader.loadSync(protoFile, {
keepCase: true,
longs: String,
enums: String,
defaults: true,
oneofs: true
});
const definition = grpc.loadPackageDefinition(packageDefinition).test;
function getServer() {
const server = new grpc.Server();
server.addService(definition.TestService.service, {
startDataStream: call => {
call.on('data', () => {
const array = new Array(33000).fill(-6000);
call.write({values: array});
})
}
});
return server;
}
const routeServer = getServer();
routeServer.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
routeServer.start();
const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
const call = client.startDataStream();
call.on('data', () => {
call.write({});
});
call.write({});
});
And the corresponding proto file:
syntax = "proto3";
import "google/protobuf/empty.proto";
package test;
message Test {
repeated double values = 1;
}
service TestService {
rpc startDataStream(stream google.protobuf.Empty) returns(stream Test){}
}
I took the V8 heap snapshots with JetBrains WebStorm. Here is my package.json:
{
"name": "tests",
"version": "1.0.0",
"main": "index.js",
"license": "MIT",
"dependencies": {
"@grpc/grpc-js": "0.5.0",
"@grpc/proto-loader": "0.5.1"
}
}
I'm using Node 10.13.0 and Yarn 1.19.1.
Best regards
There is no reason to be using the google-protobuf library here. The method definitions emitted by @grpc/proto-loader use regular objects to represent proto messages, so an Empty object can be represented as {}. Does changing that fix the problem?
You also don't need to depend on protobufjs for this code. @grpc/proto-loader already pulls in that dependency automatically.
I changed the code above, but unfortunately that has no effect at all. There is still a memory leak. Can you reproduce this behavior on your machine?
I can't reproduce a memory leak. On my own machine, I let it run for several minutes and I took several memory snapshots, and I see the memory snapshot size fluctuate between 5.7 and 7.1 MB, which doesn't look like a leak. What does the growth pattern look like in your runs? In your memory snapshots, is there any type of object that has a consistently increasing count?
Hmm, I changed the code above to show the problem a bit more clearly. I also added some pictures comparing my first snapshot with the last one (approx. 10 min. apart), where you can see that it's growing. The second picture shows one retained function as an example.
It doesn't look like you're explicitly invoking the garbage collector before taking snapshots, so between any two individual snapshots you may see allocations of some objects increase or decrease.
I took your code and made a few changes to get this:
const protoLoader = require("@grpc/proto-loader");
const grpc = require("@grpc/grpc-js");
const protoFile = "test.proto";
const packageDefinition = protoLoader.loadSync(protoFile, {
keepCase: true,
longs: String,
enums: String,
defaults: true,
oneofs: true
});
const definition = grpc.loadPackageDefinition(packageDefinition).test;
function getServer() {
const server = new grpc.Server();
server.addService(definition.TestService.service, {
startDataStream: call => {
let iterations = 0;
call.on('data', () => {
iterations += 1;
if (iterations % 1000 === 0) {
global.gc();
debugger;
}
const array = new Array(33000).fill(-6000);
call.write({values: array});
})
}
});
return server;
}
const routeServer = getServer();
routeServer.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
routeServer.start();
const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
const call = client.startDataStream();
call.on('data', () => {
call.write({});
});
call.write({});
});
This can be run with node --expose-gc to be able to call global.gc(). Those changes pause exactly every thousand iterations and call the garbage collector first for more consistency. When I run that and take several snapshots, I don't see the memory consistently increasing. I don't even see the number of arrays consistently increasing.
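For reference, assuming the script is saved as index.js (the entry point named in the package.json above), the command would be something like:
node --expose-gc index.js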
What problem did you experience that made you start investigating this in the first place?
Calling the GC manually isn't a real-world scenario, is it? I will try this next week to see whether it makes any difference. I encountered this problem after keeping this continuously running app up for a few days; afterwards it crashed with a JavaScript heap exception, and memory was at around 2 GB. Then I made this simplified example where I see the leak. I tried it with require("grpc"); there I cannot reproduce this issue. BTW, why can't I change localhost to "127.0.0.1" without getting the error "try to connect to a http/1 server" if I use grpc instead of grpc-js?
Calling the GC manually cleans up memory that is allocated but not actually leaking, so it just helps narrow down where real leaks are. You shouldn't call global.gc() in production. It's just a debugging tool.
I'm not familiar with the error you're referring to in the grpc library.
@murgatroid99 Something similar to this issue: if I repeatedly create a new connection during periodic disconnections, the existing stream does not seem to disappear. Were there any further developments related to that?
The stream and client connection were released in the code so that they could become GC targets, but they remain, as shown in the picture.
Info:
node v12.20.1
"@grpc/grpc-js": "1.3.6"
"@grpc/proto-loader": "0.5.5"
Summary of the repeated flow (a rough sketch of it follows below):
1. client connects to the server
2. client receives stream data
3. server does a "drop connection" for the stream
4. client calls stream.cancel() & destroy() + client.close()
5. another client connection & stream is created
The remaining object memory is shown in the attached snapshot.
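For illustration only, here is a rough sketch of that flow, reusing the TestService definition from the example at the top of this thread; the address, timing, and reconnect logic are assumptions, not the actual application code:
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const packageDefinition = protoLoader.loadSync('test.proto', { keepCase: true });
const definition = grpc.loadPackageDefinition(packageDefinition).test;

function connectOnce() {
  // 1. connect to the server
  const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());
  const stream = client.startDataStream();

  // 2. receive stream data
  stream.on('data', () => { /* consume messages */ });

  // 3./4. when the server drops the connection, cancel and clean up on the client side
  stream.on('error', () => {
    stream.cancel();
    stream.destroy();
    client.close();
    // 5. create another client connection & stream
    setTimeout(connectOnce, 1000);
  });

  stream.write({});
}

connectOnce();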
Your memory dump shows a single active stream object, which seems correct for the client actions you described. But an "Object" can be literally anything, so there is no way to draw any useful conclusions about why you have 30 of them.
I recently "migrated" a fairly large and complex application that was using the native grpc Node.js package to @grpc/grpc-js (the "migration" actually amounts to replacing const grpc = require('@grpc/grpc-js') with import * as grpc from '@grpc/grpc-js'; no other code was changed...) and I can report that memory steadily increases and never comes down. There definitely is a memory leak in @grpc/grpc-js. I don't know where it is. I am planning to eventually capture heap dumps (using the heapdump package) in production to see what's happening, but yeah, this is a massive problem...
I also experienced a memory leak, and I'm rolling back to the deprecated package, as it is more stable and doesn't have this problem.
If you provide a code snippet that demonstrates the leak, I would be happy to investigate.
Okay, I'll try to. But locally it's very hard to see any problem; I think it's something very subtle. The application receives about 20 million requests per day, and it takes 5 days to reach the memory limit; it looks like when the volume of requests increases a lot, the memory keeps increasing at the same rhythm. We're currently letting a scheduled job redeploy the application until the new version with the old package is out. I think in my next sprint I'll be assigned this issue again, and then I'll get some time to try to reproduce it with a code snippet.
I'll inform you as soon as I can.
What version of @grpc/grpc-js are you using? And what is your memory limit? Or alternatively, what is the average memory growth rate you are observing?
We're using version 1.6.10, and our tasks have 1 GB of memory each. For the last seven days, our service has been running with a constant 20 tasks. You can see the average memory graph below.
The moments where the memory drops are the auto-deploys we had scheduled. The increasing rate is approximately 6 MB/h. We're suspecting @grpc/grpc-js because, previously, the memory leak was filling up the memory in 20 hours, and we had a bug where we were creating a new grpc connection for each request. After we fixed it, the memory leak became manageable.
So we think that, somehow, the connections were not being dropped, although we're not keeping a reference to the connection and the server always sends the closing signal. As you can see, I can't be 100% sure that the problem is @grpc/grpc-js, but we're thinking of removing this variable to see if the memory leak disappears. If so, we'll know where the problem is, and then I'll work on a code snippet to test it more deeply.
You might want to try updating to the latest version (1.8.3). A significant amount of the relevant code has been reworked/refactored since 1.6.10.
Alright, I'll try that, thank you!
@murgatroid99 the leak keeps happening. But I tried with the legacy package and it is still happening there too, so my earlier report was misleading; the problem is somewhere else.
Retryable long-lived streamed calls leak forever. There is no code to periodically remove messages that have been successfully sent, so the writeBuffer grows indefinitely unless you churn connections. The channelz implementation may also be leaking: I seem to have a significant number of ChannelzCallTracker objects lying around in memory.
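For illustration, a minimal sketch (reusing the TestService definition from the top of this thread; address and interval are arbitrary) of the usage pattern being described: one long-lived streaming call that keeps writing and is never torn down, so, per this report and #2371, sent messages accumulate in the retry write buffer:
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const packageDefinition = protoLoader.loadSync('test.proto', { keepCase: true });
const definition = grpc.loadPackageDefinition(packageDefinition).test;

const client = new definition.TestService('localhost:50051', grpc.credentials.createInsecure());

// One long-lived stream that is never ended or re-created (no connection churn).
const call = client.startDataStream();
call.on('data', () => { /* consume responses */ });
call.on('error', () => { /* keep the example minimal; a real app would reconnect */ });

// Keep writing on the same call; per the report above, each successfully sent
// message stays referenced in the internal buffer kept for retries.
setInterval(() => call.write({}), 10);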
@Matt-Esch Can you file a separate issue with the details of the problem you are seeing? Retries were introduced after this issue was filed so a retry-specific problem is a separate problem. Can you also file a separate issue with more details about the channelz leak you are seeing? I think it's probably not related.
@Matt-Esch Can you file a separate issue with the details of the problem you are seeing?
https://github.com/grpc/grpc-node/issues/2371
I'm not sure about the ChannelzCallTracker instances; I am not sure how to tally the expected number of these, and they seem to be retained in multiple places.
If it's any help, my team did an upgrade from v1.7.3 to v1.8.7 and saw a massive memory leak. I skimmed through the diff, but I'm completely shooting in the dark with any help I can offer.
https://github.com/grpc/grpc-node/compare/%40grpc/grpc-js%401.7.3...%40grpc/grpc-js%401.8.7
Grain of salt / thinking aloud from here on out:
- One thing @Matt-Esch mentioned was the ChannelzCallTracker instances. The only reference to that in the diff (that I saw) was in server.ts (here). It looks like it was substantially changed; I saw there was a maxConnectionAgeMs added, which defaults to forever... was that how it worked previously? Maybe there used to be a default timeout for everything? (link)
@Matt-Esch The expectation with ChannelzCallTracker is that there should be one per channel, one per server, and a couple per open connection. In other words, quite a few, but proportional to other active resources.
@duncanfinney It would really help if you describe your usage pattern that triggers the memory leak. If you are using long-running streams, you may be encountering the same retry-related leak that @Matt-Esch described in #2371. I am working on a fix for that.
The maxConnectionAge that you see is part of a feature that allows servers to limit how long any individual connection can exist. It is controlled using the grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms options. The default behavior is the same as the previous behavior. This feature controls connection age and is unrelated to request timeouts.
I can confirm we observed the same massive memory leak on our containers using grpc-js on the same version upgrade (v1.7.3 to v1.8.7). Reverting to 1.7.3 fixes the issue.
I have published version 1.8.11 with a fix for the memory leak described in #2371. Please try that out to see if it fixes the problem. If it does not please file a separate issue with details of the memory leak you are experiencing.
We are also experiencing the memory leak since Google Pub/Sub started pulling in the 1.8.x grpc-js as a transitive dependency of gax-nodejs. Once we override the version of google-gax to 3.5.4, which pulls in the 1.7.x version of grpc-js, the leak is gone.
I have published version 1.8.13 with a fix for a leak in channelz.