When and why should users shut down a `ManagedChannel`?
IIUC users need to create a `ManagedChannel` first and then create client stub(s) from the `ManagedChannel`. I think this design shows a clean separation of concerns, since the `ManagedChannel` represents the transport layer whereas stubs carry RPC-specific information (a minimal sketch of this lifecycle follows the questions below). But I hope to learn:

- How did you decide to allow users to shut down the `ManagedChannel` (APIs)?
- Why not let the `ManagedChannel` manage/recycle its underlying resources (e.g. TCP connections) automatically? For example, having some idle connection timeout, etc.
- Have you encountered any incident caused by misusing the `ManagedChannel` shutdown APIs? For example, a `ManagedChannel` was mistakenly closed and caused unexpected RPC failures? I can imagine some buggy code doing that.
- I guess users can achieve some advanced use cases by leveraging these `ManagedChannel` shutdown APIs and creating/shutting down channels on the fly. Could you share a few examples of such advanced use cases?
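For concreteness, here is the minimal sketch of the lifecycle I'm asking about. `GreeterGrpc`, `HelloRequest`, and `HelloReply` are placeholders for any generated stub and message types; the builder and shutdown calls are grpc-java's actual API:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ChannelLifecycle {
  public static void main(String[] args) throws InterruptedException {
    // Build the transport-layer object once; it is relatively expensive.
    ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()  // no TLS, for local experimentation only
        .build();

    // Stubs are cheap views over the channel; create as many as needed.
    // (GreeterGrpc/HelloRequest/HelloReply are placeholder generated types.)
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    HelloReply reply = stub.sayHello(HelloRequest.newBuilder().setName("world").build());
    System.out.println(reply);

    // Explicit shutdown releases connections, timers, and I/O threads.
    channel.shutdown();
    if (!channel.awaitTermination(5, TimeUnit.SECONDS)) {
      channel.shutdownNow();  // force-close if graceful shutdown stalls
    }
  }
}
```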
Thanks!
> How did you decide to allow users to shut down the `ManagedChannel`
It holds resources, so it needs some way to clean them up. Especially when ClassLoaders come into play.
> For example, having some idle connection timeout, etc.
We do have that: `idleTimeout`. Note that it releases only some of the channel's resources. In particular, it does not shut down the threads (timers, I/O) used by the channel.
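For reference, a sketch of configuring it on the builder (the 30-minute value is arbitrary):

```java
ManagedChannel channel = ManagedChannelBuilder.forTarget("dns:///example.com:443")
    // Drop connections and load-balancer state after 30 idle minutes
    // (an arbitrary value); timer and I/O threads still require shutdown().
    .idleTimeout(30, TimeUnit.MINUTES)
    .build();
```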
> For example, a `ManagedChannel` was mistakenly closed and caused unexpected RPC failures?
The RPC fails with a pretty clear error message, so when this happens users don't really have to talk to us about it. I agree it has probably happened, but I've not heard of it being a problem.
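For illustration, what that failure looks like from the caller's side, reusing the channel and hypothetical stub from the sketch above; new RPCs on a shut-down channel fail fast with a `StatusRuntimeException` rather than hanging:

```java
channel.shutdown();
try {
  stub.sayHello(HelloRequest.newBuilder().setName("world").build());
} catch (io.grpc.StatusRuntimeException e) {
  // The status carries a clear description pointing at the shut-down channel.
  System.err.println(e.getStatus());
}
```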
> I guess users can achieve some advanced use cases
High-throughput cases may make multiple channels to increase the number of connections. We would really prefer this be solved within the Channel's LB policy, but making multiple Channels is easy enough that that's what's been done up to this point.
Some servers need to contact "random" IPs for 1-2 RPCs and then can drop the connection. Having the explicit lifetime allows those connections to be closed when no longer needed.
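A sketch of that second pattern, using a hypothetical `probe` helper and the same placeholder stub as above; each target gets a short-lived channel that is shut down as soon as its 1-2 RPCs complete:

```java
// Hypothetical helper: contact one ephemeral target, then release the connection.
static void probe(String host, int port) throws InterruptedException {
  ManagedChannel channel =
      ManagedChannelBuilder.forAddress(host, port).usePlaintext().build();
  try {
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    stub.sayHello(HelloRequest.getDefaultInstance());  // the 1-2 short RPCs
  } finally {
    // Explicit lifetime: close the connection now rather than letting idle
    // state pile up across potentially thousands of such channels.
    channel.shutdown();
    channel.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```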
Thanks Eric for this very helpful information! I have a few follow-up questions.
> It holds resources, so it needs some way to clean them up. Especially when ClassLoaders come into play.
For my learning - could you elaborate a bit more on this? How exactly do ClassLoaders relate to the channel holding resources?
In theory, any type of resource can be automatically cleaned/released when it has been idle for a while (e.g. idle connection timeout). But clearly grpc-java decided to go with explicit channel shutdown. Is that because the complexity of an idle-resource auto-cleanup implementation is not justified by its benefits (e.g. freeing users from worrying about when to shut the channel down)?
@Lincong The problem @ejona86 is referring to has to do with ThreadLocals. Ideally a ThreadLocal would only hold types from the system classloader, to avoid memory leaks, particularly in servlets. However, gRPC (and Netty) needs to store non-system values in ThreadLocals, e.g. `io.grpc.Context`. We have some workarounds to mitigate this issue as much as possible, but explicit shutdown/cleanup is always a best practice.
I found a decent StackOverflow answer exploring a similar case. You can learn more about non-system types in ThreadLocals at https://stackoverflow.com/a/24862045/11697987
> How exactly do ClassLoaders relate to the channel holding resources?
If a thread is still around holding objects of ours, then a ClassLoader couldn't be garbage collected. This is a common need in servlet containers when you upgrade applications, so that the old code is replaced with new code. That's a complicated discussion, though (even before talking about the ThreadLocal problems).
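A contrived sketch of the leak mechanism being described (plain Java, not gRPC code): any long-lived thread that still references application classes pins that application's ClassLoader, which is exactly what redeployment in a servlet container needs to avoid.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Imagine this class is loaded by a web application's own ClassLoader.
public class LeakySingleton {
  // The timer thread runs a lambda defined in this class, so it holds a
  // reference to this class and, through it, to the whole application
  // ClassLoader. After a redeploy, the old ClassLoader can never be
  // garbage collected while this thread lives.
  static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor();

  static {
    TIMER.scheduleAtFixedRate(() -> { /* periodic work */ }, 0, 1, TimeUnit.MINUTES);
  }

  // The fix mirrors ManagedChannel's design: an explicit close() that calls
  // TIMER.shutdown() when the application is undeployed.
}
```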
> any type of resource can be automatically cleaned/released when it has been idle for a while
Idleness-based cleanup is fine while the application is running, as long as not too many things accumulate. But if you might do a single RPC to 1000 random servers each second, you don't want to wait for an idle period to elapse before releasing those resources.
I know HTTP libraries use idleness to free resources, but in HTTP/1 the idle timeout is commonly "5 seconds" or a similarly small period; the upper limit is around 60 seconds. That's quite different from gRPC, where applications commonly want gRPC to keep an active connection so they get low latency the moment they send an RPC. And gRPC client-side load balancers can hold a good amount of state, so we want to keep that state around as long as the application remains interested in the server (because it can take a while to warm it up again). HTTP libraries generally have no load balancing.
Thanks @sergiitk and @ejona86 for the input. I will close this issue and re-open to follow up if necessary :)