
RetryingCall leaks entries in the writeBuffer if the call is long lived

Open · Matt-Esch opened this issue 2 years ago · 12 comments

I have a long-lived retryable bidirectional stream that leaks entries in the writeBuffer. There does not appear to be any code to recycle the writeBuffer, so it seems to grow indefinitely.

(screenshot attached)
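
For illustration, a minimal sketch of the kind of long-lived retryable bidirectional stream in question; echo.proto, the echo.Echo service, and bidiEcho are hypothetical placeholders, not our actual code:

  // Sketch only: 'echo.proto', the echo.Echo service, and bidiEcho are
  // hypothetical placeholders standing in for the real service.
  import * as grpc from '@grpc/grpc-js';
  import * as protoLoader from '@grpc/proto-loader';

  const packageDefinition = protoLoader.loadSync('echo.proto');
  const proto = grpc.loadPackageDefinition(packageDefinition) as any;

  // grpc-js 1.8 enables retries by default, so this call is handled by
  // RetryingCall even though no retry policy is configured explicitly.
  const client = new proto.echo.Echo(
    'localhost:50051',
    grpc.credentials.createInsecure(),
  );

  // A long-lived bidirectional stream: each write is tracked in the
  // RetryingCall writeBuffer, which is the structure that keeps growing.
  const stream = client.bidiEcho();
  stream.on('data', (msg: unknown) => console.log('received', msg));
  stream.on('error', (err: Error) => console.error('stream error', err));
  setInterval(() => stream.write({ message: 'ping' }), 1000);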

Matt-Esch · Feb 23 '23

I have published version 1.8.11 with a change to this behavior so it will no longer accumulate those "FREED" objects. Please try it out.

murgatroid99 · Feb 24 '23

> I have published version 1.8.11 with a change to this behavior so it will no longer accumulate those "FREED" objects. Please try it out.

I will give this a go as soon as I can and report back :)

Matt-Esch · Feb 25 '23

Hello @murgatroid99

We tried the new version 1.8.11 today and we observe the same memory leak as with 1.8.10. See the attached screenshot: before 4 pm it's 1.7.3, after 4 pm it's v1.8.11.

(Screenshot, 2023-03-06 at 17:19)

I don't know if it is exactly the same issue as the one mentioned here because I cannot easily take a memory dump. We do not use the grpc-js library directly, but we subscribe to a Pub/Sub server and we communicate with Cloud Tasks using the official Google Node.js libraries, which I assume use this grpc module.
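
One way to confirm that assumption is to check which @grpc/grpc-js version the dependency tree actually resolves to at runtime; a minimal check (assuming the package can be resolved from the project) is:

  // Print the @grpc/grpc-js version that the installed Google Cloud
  // libraries pull in (reads the resolved package's own package.json).
  const { version } = require('@grpc/grpc-js/package.json');
  console.log('resolved @grpc/grpc-js version:', version);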

Anyway we will stick to 1.7.3 for now.

Hope that helps

Klaitos · Mar 06 '23

Thank you for the information. If someone can provide more details, I can investigate.

murgatroid99 · Mar 06 '23

We're unfortunately seeing a similar problem on 1.8.11 too: memory usage slowly grows to the point where our services are OOM-killed:

(memory usage graph attached)

This is with very minimal traffic (< 10 RPS).

We're doing a simple unary call, without any explicit retry logic.

Here's how we instantiate the client:

  return new OurServiceClient(endpoint, credentials.createInsecure(), {
    'grpc.keepalive_time_ms': 2000,
    'grpc.keepalive_permit_without_calls': 1,
  });

And how we call it:

  return async (params: Params): Promise<Response | undefined> => {
    const metadata = new Metadata();
    metadata.set('requestId', requestId);
    metadata.set('authentication', await getAuthTokenFn());

    return new Promise((resolve, reject) => {
      client.ourRpcMethod(
        params,
        metadata,
        {
          deadline: Date.now() + timeout,
        },
        (err, value) => {
          if (err) {
            return reject(err);
          }

          return resolve(value);
        },
      );
    });
  };

The server is running with:

  const server = new Server({
    'grpc.keepalive_permit_without_calls': 1,
    'grpc.max_connection_age_ms': 5000,
  });

And here's the (redacted) protobuf:

syntax = "proto3";

service OurService {
  rpc GetServiceData (Request) returns (Response) {}
}

message Request {
  string id = 1;
  string name = 2;
}

message Response {
  bool enabled = 1;
  some.Custom.Data some_custom_data = 2;
  string name = 3;
  uint64 created_at_timestamp = 4;
}

The endpoint is an internal Kubernetes service DNS endpoint (e.g. our-service.namespace.svc.cluster.local).

We can reproduce this very consistently unfortunately and have rolled the package version back for now.

chinthakagodawita · Mar 07 '23

@chinthakagodawita if you're able to take a heap dump, this should be fairly straightforward to debug. It will obviously contain sensitive information from your application, so if you're not confident looking at the heap dump yourself, I could look at it privately for you if you were willing to share it (I need to be confident this doesn't leak, so I don't mind investing some time into this).

Also note that I was not using any explicit retry logic either, but retries seem to be enabled by default, which IMO is quite a dangerous change to apply by default, though it might be the expected default behaviour of gRPC. I disabled retries and the channelz feature (in both the client and the server) and I'm seeing better behaviour. I was suspicious of the channelz feature leaking.

I did this by setting client options:

  'grpc.enable_retries': 0,   // retries cause BiDi calls to buffer indefinitely (memory leak)
  'grpc.enable_channelz': 0   // the channelz implementation in grpc-js appears to leak

server options:

  'grpc.enable_channelz': 0
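
For reference, a rough sketch of where those options go, reusing the OurServiceClient setup shown earlier in the thread (the client class and endpoint are placeholders from that example):

  import { Server, credentials } from '@grpc/grpc-js';

  // Client side: pass the options as channel arguments.
  // OurServiceClient and endpoint come from the (redacted) example above.
  const client = new OurServiceClient(endpoint, credentials.createInsecure(), {
    'grpc.enable_retries': 0,   // avoid the RetryingCall write buffering
    'grpc.enable_channelz': 0,  // opt out of channelz bookkeeping on the client
  });

  // Server side: channelz can be disabled with the same option.
  const server = new Server({
    'grpc.enable_channelz': 0,
  });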

It would be great if you could help unblock v1.8

Matt-Esch · Mar 07 '23

@Matt-Esch Are you still seeing the buffering in BiDi calls after upgrading to v1.8.11? If so, can you provide more details? Can you also share more details about the channelz memory leak you are seeing?

murgatroid99 · Mar 07 '23

Sorry @Matt-Esch, we're quite under the pump at the moment and I haven't had a chance to stand up a test stack to capture a heapdump yet!

chinthakagodawita · Mar 24 '23

FYI, today I published version 1.8.13 with a fix for a memory leak. Please test with that version to see whether it fixes the leak you are experiencing.

murgatroid99 · Mar 24 '23

A little update: we upgraded to 1.8.17, which doesn't seem to exhibit the memory leak we were seeing on 1.8.11.

chinthakagodawita · Jul 16 '23

That's good to hear. Thanks for the update.

murgatroid99 · Jul 17 '23

@Matt-Esch can you confirm whether you are still experiencing the memory leak with the latest version of grpc-js?

murgatroid99 · Sep 13 '23