
RetryingCall leaks entries in the writeBuffer if the call is long lived

Open · Matt-Esch opened this issue 2 years ago · 12 comments

I have a long-lived retryable bidirectional stream that leaks entries in the writeBuffer. There does not appear to be any code to recycle the writeBuffer, so it seems to grow indefinitely.

(screenshot attached)
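
For illustration, a minimal sketch of the kind of long-lived retryable bidirectional stream in question; echo.proto, the echo.Echo service, and bidiEcho are hypothetical placeholders, not our actual code:

  // Sketch only: 'echo.proto', the echo.Echo service, and bidiEcho are
  // hypothetical placeholders standing in for the real service.
  import * as grpc from '@grpc/grpc-js';
  import * as protoLoader from '@grpc/proto-loader';

  const packageDefinition = protoLoader.loadSync('echo.proto');
  const proto = grpc.loadPackageDefinition(packageDefinition) as any;

  // grpc-js 1.8 enables retries by default, so this call is handled by
  // RetryingCall even though no retry policy is configured explicitly.
  const client = new proto.echo.Echo(
    'localhost:50051',
    grpc.credentials.createInsecure(),
  );

  // A long-lived bidirectional stream: each write is tracked in the
  // RetryingCall writeBuffer, which is the structure that keeps growing.
  const stream = client.bidiEcho();
  stream.on('data', (msg: unknown) => console.log('received', msg));
  stream.on('error', (err: Error) => console.error('stream error', err));
  setInterval(() => stream.write({ message: 'ping' }), 1000);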

Matt-Esch · Feb 23 '23

I have published version 1.8.11 with a change to this behavior so it will no longer accumulate those "FREED" objects. Please try it out.

murgatroid99 · Feb 24 '23

> I have published version 1.8.11 with a change to this behavior so it will no longer accumulate those "FREED" objects. Please try it out.

I will give this a go as soon as I can and report back :)

Matt-Esch · Feb 25 '23

Hello @murgatroid99

We tried the new version 1.8.11 today and we observe the same memory leak as with 1.8.10. See the attached screenshot: before 4 pm it's 1.7.3, after 4 pm it's v1.8.11.

(Screenshot, 2023-03-06 at 17:19)

I don't know if it is exactly the same issue as the one mentioned here because I cannot easily take a memory dump. We do not use the grpc-js library directly, but we subscribe to a Pub/Sub server and we communicate with Cloud Tasks using the official Google Node.js libraries, which I assume use this grpc module.
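
One way to confirm that assumption is to check which @grpc/grpc-js version the dependency tree actually resolves to at runtime; a minimal check (assuming the package can be resolved from the project) is:

  // Print the @grpc/grpc-js version that the installed Google Cloud
  // libraries pull in (reads the resolved package's own package.json).
  const { version } = require('@grpc/grpc-js/package.json');
  console.log('resolved @grpc/grpc-js version:', version);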

Anyway we will stick to 1.7.3 for now.

Hope that helps

Klaitos · Mar 06 '23

Thank you for the information. If someone can provide more details, I can investigate.

murgatroid99 · Mar 06 '23

We're unfortunately seeing a similar problem on 1.8.11 too: memory usage slowly grows to the point where our services are OOM-killed:

(memory usage graph attached)

This is with very minimal traffic (< 10 RPS).

We're doing a simple unary call, without any explicit retry logic.

Here's how we instantiate the client:

  return new OurServiceClient(endpoint, credentials.createInsecure(), {
    'grpc.keepalive_time_ms': 2000,
    'grpc.keepalive_permit_without_calls': 1,
  });

And how we call it:

  return async (params: Params): Promise<Response | undefined> => {
    const metadata = new Metadata();
    metadata.set('requestId', requestId);
    metadata.set('authentication', await getAuthTokenFn());

    return new Promise((resolve, reject) => {
      client.ourRpcMethod(
        params,
        metadata,
        {
          deadline: Date.now() + timeout,
        },
        (err, value) => {
          if (err) {
            return reject(err);
          }

          return resolve(value);
        },
      );
    });
  };

The server is running with:

  const server = new Server({
    'grpc.keepalive_permit_without_calls': 1,
    'grpc.max_connection_age_ms': 5000,
  });

And here's the (redacted) protobuf:

syntax = "proto3";

service OurService {
  rpc GetServiceData (Request) returns (Response) {}
}

message Request {
  string id = 1;
  string name = 2;
}

message Response {
  bool enabled = 1;
  some.Custom.Data some_custom_data = 2;
  string name = 3;
  uint64 created_at_timestamp = 4;
}

The endpoint is an internal Kubernetes service DNS endpoint (e.g. our-service.namespace.svc.cluster.local).

We can reproduce this very consistently unfortunately and have rolled the package version back for now.

chinthakagodawita · Mar 07 '23

@chinthakagodawita if you're able to take a heap dump, this should be fairly straightforward to debug. It will obviously contain sensitive information from your application, so if you're not confident looking at the heap dump yourself, I could look at it privately for you if you were willing to share it (I need to be confident this doesn't leak, so I don't mind investing some time into this).

Also note that I was not using any explicit retry logic either, but retries seem to be enabled by default, which IMO is quite a dangerous change to apply by default, though it might be the expected default behaviour of gRPC. I disabled retries and the channelz feature (in both the client and the server) and I'm seeing better behaviour. I was suspicious of the channelz feature leaking.

I did this by setting client options:

  'grpc.enable_retries': 0,   // retries cause BiDi calls to buffer indefinitely (memory leak)
  'grpc.enable_channelz': 0   // the channelz implementation in grpc-js appears to leak

server options:

  'grpc.enable_channelz': 0
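
For reference, a rough sketch of where those options go, reusing the OurServiceClient setup shown earlier in the thread (the client class and endpoint are placeholders from that example):

  import { Server, credentials } from '@grpc/grpc-js';

  // Client side: pass the options as channel arguments.
  // OurServiceClient and endpoint come from the (redacted) example above.
  const client = new OurServiceClient(endpoint, credentials.createInsecure(), {
    'grpc.enable_retries': 0,   // avoid the RetryingCall write buffering
    'grpc.enable_channelz': 0,  // opt out of channelz bookkeeping on the client
  });

  // Server side: channelz can be disabled with the same option.
  const server = new Server({
    'grpc.enable_channelz': 0,
  });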

It would be great if you could help unblock v1.8

Matt-Esch · Mar 07 '23

@Matt-Esch Are you still seeing the buffering in BiDi calls after upgrading to v1.8.11? If so, can you provide more details? Can you also share more details about the channelz memory leak you are seeing?

murgatroid99 · Mar 07 '23

Sorry @Matt-Esch, we're quite under the pump at the moment and I haven't had a chance to stand up a test stack to capture a heapdump yet!

chinthakagodawita · Mar 24 '23

FYI, today I published version 1.8.13 with a fix for a memory leak. Please test with that version to see whether it fixes the leak you are experiencing.

murgatroid99 · Mar 24 '23

A little update: we upgraded to 1.8.17, which doesn't seem to exhibit the memory leak we were seeing on 1.8.11.

chinthakagodawita · Jul 16 '23

That's good to hear. Thanks for the update.

murgatroid99 · Jul 17 '23

@Matt-Esch can you confirm whether you are still experiencing the memory leak with the latest version of grpc-js?

murgatroid99 · Sep 13 '23