
When and why should users shut down a `ManagedChannel`?


IIUC, users need to create a ManagedChannel first and then create client stub(s) from it. I think this design shows a clean separation of concerns, since the ManagedChannel represents the transport layer whereas stubs carry RPC-specific information (a minimal sketch of the pattern follows the questions below). But I hope to learn:

  1. How did you decide to allow users to shut down the ManagedChannel (i.e., to expose the shutdown APIs)?
  2. Why not let the ManagedChannel manage/recycle its underlying resources (e.g. TCP connections) automatically? For example, having some idle connection timeout, etc.
  3. Have you encountered any incident caused by misusing the ManagedChannel shutdown APIs? For example, a ManagedChannel was mistakenly closed and caused unexpected RPC failures? I can imagine some buggy code doing that.
  4. I guess users can achieve some advanced use cases by leveraging these ManagedChannel shutdown APIs and creating/shutting down channels on the fly. Could you share a few examples of such advanced use cases?
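
For concreteness, here's roughly the pattern I mean (host, port, and the generated `GreeterGrpc` stub are just placeholders):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Build the channel first: it owns the transport-layer resources.
ManagedChannel channel = ManagedChannelBuilder
    .forAddress("localhost", 50051) // placeholder target
    .usePlaintext()                 // plaintext only for local testing
    .build();

// Then derive stubs from it; stubs are lightweight and carry per-RPC config.
GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
```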

Thanks!

— Lincong, Mar 15 '24 17:03

> How did you decide to allow users to shut down the ManagedChannel

It holds resources, so it needs some way to clean them up. Especially when ClassLoaders come into play.
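
Roughly, the cleanup looks like this (the 5-second grace period is just illustrative):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;

// Orderly shutdown, then force-close if it doesn't finish in time.
void releaseChannel(ManagedChannel channel) throws InterruptedException {
  channel.shutdown(); // stop accepting new calls, let existing ones finish
  if (!channel.awaitTermination(5, TimeUnit.SECONDS)) {
    channel.shutdownNow(); // cancel outstanding RPCs and close connections
  }
}
```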

> For example, having some idle connection timeout, etc.

We do have that: idleTimeout. Note that it releases just some of the channel's resources. In particular, it does not shut down the threads (timers, I/O) used by the channel.
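
For example (the target and timeout values are illustrative):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Enter the idle state (dropping connections and load-balancer state) after
// 5 minutes with no outstanding RPCs; channel threads stay alive.
ManagedChannel channel = ManagedChannelBuilder
    .forAddress("example.com", 443) // placeholder target
    .idleTimeout(5, TimeUnit.MINUTES)
    .build();
```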

> For example, a ManagedChannel was mistakenly closed and caused unexpected RPC failures?

The RPC fails with a pretty clear error message, so when this happens users don't really have to talk to us about it. I agree it has probably happened, but I've not heard of it being a problem.

> I guess users can achieve some advanced use cases

High-throughput cases may make multiple channels to increase the number of connections. We would really prefer this be solved within the Channel's LB policy, but it is easy enough to make multiple Channels, so that's what's been done up to this point.
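
A rough sketch of that multi-channel workaround (pool size, host, and port are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Round-robin over a fixed set of channels to one target, to get more
// underlying connections than a single channel would use.
class ChannelPool {
  private final List<ManagedChannel> channels;
  private final AtomicLong next = new AtomicLong();

  ChannelPool(String host, int port, int size) {
    this.channels = IntStream.range(0, size)
        .mapToObj(i -> ManagedChannelBuilder.forAddress(host, port).build())
        .collect(Collectors.toList());
  }

  // Pick the next channel; create stubs from it per RPC (or batch of RPCs).
  ManagedChannel pick() {
    return channels.get((int) (next.getAndIncrement() % channels.size()));
  }

  // Explicit lifetime: shut everything down when the pool is retired.
  void shutdownAll() {
    channels.forEach(ManagedChannel::shutdown);
  }
}
```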

Some servers need to contact "random" IPs for 1-2 RPCs and then can drop the connection. Having the explicit lifetime allows those connections to be closed when no longer needed.
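
Something like this (the generated `GreeterGrpc`/`HelloRequest` types and the target are placeholders):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// One-shot channel to a "random" server; release it as soon as we're done.
void contactOnce(String ip, int port, HelloRequest request) {
  ManagedChannel ch = ManagedChannelBuilder.forAddress(ip, port).build();
  try {
    GreeterGrpc.newBlockingStub(ch).sayHello(request); // the 1-2 RPCs
  } finally {
    ch.shutdown(); // connection is dropped once the channel terminates
  }
}
```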

— ejona86, Mar 18 '24 19:03

Thanks Eric for this very helpful information! I have a few follow-up questions.

> It holds resources, so it needs some way to clean them up. Especially when ClassLoaders come into play.

For my learning, could you elaborate a bit more on this? How exactly do ClassLoaders relate to the channel holding resources?

In theory, any type of resource can be automatically cleaned up/released when it has been idle for a while (e.g. an idle connection timeout). But clearly grpc-java decided to go with explicit channel shutdown. Is that because the complexity of an idle-resource auto-cleanup implementation is not justified by its benefits (e.g. freeing users from worrying about when to shut the channel down)?

— Lincong, Mar 18 '24 20:03

@Lincong The problem @ejona86 is referring to has to do with ThreadLocals. Ideally, ThreadLocals would hold only types from the system classloader, to avoid memory leaks, particularly in servlet containers. However, gRPC (and Netty) needs to store non-system values in ThreadLocals, e.g. io.grpc.Context. We have some workarounds to mitigate this issue as much as possible, but explicit shutdown/cleanup is always a best practice.
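
To illustrate the mechanism in plain Java (nothing gRPC-specific; `MyAppType` is a hypothetical class loaded by the webapp's own ClassLoader):

```java
// Hypothetical application class, loaded by the webapp's ClassLoader
// rather than the system one.
class MyAppType {}

public class LeakExample {
  private static final ThreadLocal<MyAppType> CACHE = new ThreadLocal<>();

  void handleRequest() {
    CACHE.set(new MyAppType()); // value stays attached to the pooled thread
    // ... the request finishes and the webapp is later undeployed, but the
    // pooled container thread (and this value, and therefore the ClassLoader
    // that loaded MyAppType) lives on and can never be garbage collected.
  }
}
```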

I found a decent StackOverflow answer exploring a similar case; you can learn more about non-system types in ThreadLocals at https://stackoverflow.com/a/24862045/11697987

— sergiitk, Mar 19 '24 22:03

> How exactly do ClassLoaders relate to the channel holding resources?

If a thread sticks around holding objects of ours, then a ClassLoader can't be garbage collected. Collecting the old ClassLoader is a common need in servlet containers when you upgrade applications, so that the old code can be replaced with new code. That's a complicated discussion though (even before getting into the ThreadLocal problems).

> any type of resource can be automatically cleaned up/released when it has been idle for a while

Idleness is fine while the application is running, as long as not too many things accumulate. But if you might do a single RPC to each of 1000 random servers every second, you don't want to wait for an idle period to elapse before releasing those resources.

I know HTTP libraries use idleness to free resources, but in HTTP/1 the idle timeout is commonly "5 seconds" or similarly small, with an upper limit of around 60 seconds. That's quite different from gRPC, where applications commonly want gRPC to keep an active connection so that they get low latency the moment they send an RPC. And gRPC client-side load balancers can hold a good amount of state, so we want to keep that state around as long as the application remains interested in the server (because it can take a while to warm it up again). HTTP libraries generally have no load balancing.
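
For instance, a latency-sensitive client might keep its connection warm like this (assuming the server allows keepalives without active calls; values are illustrative):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Ping every 30 seconds even with no RPCs in flight, so the connection
// (and the warmed-up load-balancer state behind it) stays ready.
ManagedChannel channel = ManagedChannelBuilder
    .forAddress("example.com", 443) // placeholder target
    .keepAliveTime(30, TimeUnit.SECONDS)
    .keepAliveWithoutCalls(true)
    .build();
```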

— ejona86, Mar 20 '24 22:03

Thanks @sergiitk and @ejona86 for the input. I will close this issue and re-open to follow up if necessary :)

— Lincong, Mar 20 '24 23:03