Send GOAWAY after X requests on the connection
I need a solution similar to what the kubernetes apiserver does: https://github.com/kubernetes/kubernetes/pull/88567. Context: https://github.com/grpc/grpc-java/issues/10929#issuecomment-1960084653 @ejona86
Can you explain why you want it? (e.g., "using L4 load balancing and max connection age doesn't work because...")
The approach taken by the apiserver in kubernetes seems to be a per-request approximation of the nginx configuration keepalive_requests. If we count per-connection within gRPC like nginx does, there doesn't seem to be a need/benefit in it being probabilistic. Note that nginx claims to have it to handle memory leaks, which is not at all relevant to gRPC. But overall the approach is core to how nginx balances load across balancers.
This approach is simple, and it can be used to combat rapid reset. But you also need to have a good idea of the QPS of the clients in order to tune it. There is a risk with the approach, though: if there is a high-QPS client that causes excessive load, that client can bounce around between backends rapidly, causing hot-spots. And if the bouncing is too rapid, those hot-spots won't be visible to monitoring systems. This has led to issues in the past where a server operator used a setting like this and thought they had fixed their load imbalance problem, but tail latency didn't improve; the setting only made the problem invisible while it kept happening. You need connections to last twice the monitoring period in order to be visible.
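For illustration, here is a rough sketch of what deterministic per-connection counting could look like in grpc-java. This is not an existing feature: the transport attribute key, the threshold, and especially the GOAWAY trigger are placeholders, and the trigger itself is exactly the piece the library does not expose today.

import io.grpc.Attributes;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.ServerTransportFilter;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of nginx keepalive_requests-style counting: track RPCs per connection
// and decide deterministically when the connection has served enough requests.
public final class RequestCountingTransportFilter extends ServerTransportFilter {
  // Hypothetical per-connection counter stored in the transport attributes.
  static final Attributes.Key<AtomicLong> RPC_COUNT = Attributes.Key.create("rpcCount");

  @Override
  public Attributes transportReady(Attributes transportAttrs) {
    return transportAttrs.toBuilder().set(RPC_COUNT, new AtomicLong()).build();
  }

  // Increments the connection's counter on every RPC.
  public static final class CountingInterceptor implements ServerInterceptor {
    private final long maxRequestsPerConnection;

    public CountingInterceptor(long maxRequestsPerConnection) {
      this.maxRequestsPerConnection = maxRequestsPerConnection;
    }

    @Override
    public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
        ServerCall<ReqT, RespT> call, Metadata headers,
        ServerCallHandler<ReqT, RespT> next) {
      AtomicLong count = call.getAttributes().get(RPC_COUNT);
      if (count != null && count.incrementAndGet() >= maxRequestsPerConnection) {
        // A real implementation would need a hook here to send GOAWAY on this
        // connection; grpc-java has no public API for that today.
      }
      return next.startCall(call, headers);
    }
  }
}

Both pieces would be registered via serverBuilder.addTransportFilter(...) and serverBuilder.intercept(...); everything except the GOAWAY hook is ordinary public API.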
We could supplement this setting with a "min connection age." Such a setting would need to be disabled to use the feature to combat rapid reset.
We maintain a private API gateway. When we upgrade our service, the last instance upgraded will not receive traffic for a very long time, because our providers/consumers have already connected to other instances and will keep using those existing long-lived HTTP/2 TCP connections. New instances hit the same problem when we scale out the service. We need a rebalancing mechanism; the simplest way is to send a GOAWAY that gracefully disconnects existing connections and lets clients reconnect to other instances (round-robin controlled by the client SDK). Triggering it based on either request count or probability is fine.
We already support a time-based limit (maxConnectionAge). That allows you to choose how rapidly clients rebalance themselves, in terms that you can see in your metrics. You know how rapidly you roll out and can configure this appropriately.
Is there a reason the time-based approach is insufficient? It is graceful. See maxConnectionAgeGrace for how long connections are allowed to stay open for pre-existing RPCs. We do the appropriate double GOAWAY in HTTP/2.
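For reference, a minimal sketch of the time-based settings on NettyServerBuilder; the port, the durations, and MyServiceImpl are placeholder values, not recommendations.

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

public final class MaxAgeServer {
  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(8080)
        // Close and re-establish each connection roughly every 30 minutes so
        // clients rebalance across backends over time.
        .maxConnectionAge(30, TimeUnit.MINUTES)
        // Let pre-existing RPCs run for up to 5 more minutes before the
        // connection is forcibly closed.
        .maxConnectionAgeGrace(5, TimeUnit.MINUTES)
        .addService(new MyServiceImpl()) // placeholder service implementation
        .build()
        .start();
    server.awaitTermination();
  }
}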
We have encountered many HTTP/2 implementation issues when the server always sends GOAWAY, like https://github.com/golang/go/issues/39086 and https://github.com/kubernetes/kubernetes/issues/91131
So we prefer a setting that can be changed dynamically while the application is running over a static configuration that can only be set before the application starts.
We would enable the GOAWAY setting only before an upgrade/scale-out, and turn it off afterwards to minimize the impact.
Can maxConnectionAge-related settings be changed at runtime?
No, maxConnectionAge is fixed when the service starts.
Are those GOAWAY issues you linked to related to gRPC? They are also relatively old at this point.
One of the benefits of doing the GOAWAY continually is that you find bugs when the impact is less severe. You wouldn't want to wait until you are rolling out a new release to discover that rolling out a new release kills clients. If something is painful, the way to fix it is to do it more frequently.
I am not blaming gRPC for these issues, but my customers are not test subjects either. We can run tests more frequently, but we cannot make customers less stable in production.
Nothing I said made your customers test subjects. Unless you think they are test subjects when you upgrade your service, since you want to send GOAWAY then. You aren't "testing" your service. You are "testing" that your clients conform to proper operation.
The point is to make your users discover issues during development or worst-case just as they roll out to production. Not suddenly and without warning when your service does an upgrade.
It is really bad to find an issue during a service upgrade, because if you notice it causing an issue you'll revert to the older binary and cause even more issues. It also becomes very hard to adjust the server's behavior to work around broken clients without rolling out a new server version.
You are instead continually "testing" clients for latent bugs. The approach of "do it more frequently" is time-tested and proven. It is the same idea behind the response to "clients failed when we added a new field": the server includes fake, generated fields so that broken clients are noticed during development instead of suddenly, en masse, when the service changes.
Upgrading itself causes the issues, not the new changes being deployed. Let me explain:
Since nearly all Java applications cannot be hot-upgraded (keeping TCP connections alive), the normal upgrade process is: remove traffic, stop the application, start the new one, and restore traffic. Usually there is a load balancer in front of the application, like kube-proxy plus a Kubernetes Service. That load balancer is usually a layer-3 one, not layer 7; it only controls which new TCP connections are routed to the application. "Removing traffic" means stopping new TCP connections while still keeping the existing ones. So the main question is how to deal with these existing TCP connections (which are gRPC connections).
If there is no GOAWAY mechanism, we cannot deal with these connections except by stopping the application abruptly to close them, which makes all clients report errors for a while; there is no other way.
If we use maxConnectionAge, some buggy clients will constantly report errors after the upgrade, forcing us to roll back.
If there were a dynamic switch for maxConnectionAge, we could enable it with a short value (like 5 minutes) before the upgrade, wait for all connections to close gracefully, then disable it, stop the application, and start it again. For buggy clients that report errors during GOAWAY, this behaves like having no GOAWAY mechanism at all: they only report errors for a short while. Other, normal clients work fine without any error reports, and the service upgrade is graceful. We can decouple our service upgrades from client upgrades.
So a dynamic switch for maxConnectionAge is very important.
If there is no GOAWAY mechanism
We have multiple ways to trigger GOAWAY. Calling server.shutdown() triggers GOAWAY on all existing connections. That would be the normal way to deal with replacing a server.
// Assumes an existing io.grpc.Server named 'server' and an import of java.util.concurrent.TimeUnit.
// Closes listen socket and sends GOAWAY on existing connections
server.shutdown();
// Wait however long you are willing for RPCs to complete.
// Returns quickly if RPCs complete quickly.
server.awaitTermination(5, TimeUnit.MINUTES);
// Kill any remaining RPCs
server.shutdownNow();
// Some final waiting for clients to receive cancellations. Unlikely to take
// longer than a second.
server.awaitTermination(5, TimeUnit.MINUTES);
This sounds like exactly what you are looking for.
If we use maxConnectionAge, some buggy clients will constantly report errors after the upgrade, forcing us to roll back.
Why would that find buggy clients but keepalive_requests wouldn't? And how does rolling back help? That just causes the clients to break a second time.
Thanks; knowing that server.shutdown() will send GOAWAY helps to solve the first problem I mentioned.
What I am asking for is:
Can maxConnectionAge-related settings be changed at runtime? Buggy clients will always be affected during an upgrade, but I could disable maxConnectionAge so that GOAWAY is no longer sent after the upgrade, once all traffic is balanced among all instances.
Rolling back to disable maxConnectionAge breaks clients a second time, but only once, rather than breaking them constantly as keeping maxConnectionAge enabled would. We need maxConnectionAge to balance traffic, as mentioned above, like the k8s apiserver does.
Have you seen any actual buggy clients? There are an infinite number of ways a client can be buggy. And it is much better to exercise clients continually. If you only trigger this when upgrading the server then users have a tendency to say the bugs are your fault because you broke the client. If it is continual, then it is clearly the client's fault.
"Buggy client" here is "buggy grpc implementation." It isn't a buggy client application. GOAWAY is handled by the grpc implementation and should not cause application-visible errors. The most it can do is cause a latency blip.
The clients are not under my team's control. Our team can tell the client team "we will have an upgrade that will cause some error reports for a while, please ignore these alerts", but we cannot tell them "we will do something that will make you receive alerts forever". Buggy clients are usually caused by buggy HTTP/2 implementations (I enumerated some above), which return errors when sending requests.
Is there a broken client, or is this theoretical? The ones you enumerated aren't ever involved with the grpc-java server. I don't see how this conversation matters unless that one client team has a busted client.
The ones I listed above are from a real case I hit a while back: a grpc-go client using the buggy HTTP/2 implementation in the Go runtime connected to our grpc-java server (which is an API gateway) and hit the strange error "request declared a Content-Length of N but only wrote 0 bytes" after we implemented a GOAWAY mechanism ourselves and deployed it. At that time, maxConnectionAge was not offered.
Is that an actual client your server is supporting right now? That issue was 3.5 years ago. Based on the fix, that bug did not impact grpc-go, because grpc-go doesn't use RoundTrip(). (I see there are statements of servers failing with it, but I think that was just being propagated from the generic HTTP client error.)
If you implemented GOAWAY yourself, then there are mistakes you could have made in the server-side implementation. I'd ask more about that, but it probably doesn't matter much; maxConnectionAge has been available in grpc-java since 2017.
Our stance is clients need to handle GOAWAY. That's the Internet's stance. If they don't then we will work with the broken implementations to get them fixed, and we understand it can take a long time for the old clients to be replaced and we're willing to work through that. But the goal is to get clients working with GOAWAY, not to paper over the problem.
At that time, it was my server supporting it. Nearly all Go clients using HTTP/2 have to use RoundTrip, explicitly or implicitly.
Yes, all clients need to support GOAWAY, but until they support it correctly, we could have a mechanism to reduce the breakage.
We did not implement GOAWAY ourselves. The Go client team mentioned above, which was affected by GOAWAY, advised us to follow the k8s apiserver's implementation with GOAWAY probability, so we added it with a dynamic switch.
Recently a friend told me they have faced the same issue and to open an issue here for upstream help, hence this issue.
Nearly all Go clients using HTTP/2 have to use RoundTrip, explicitly or implicitly.
grpc-go does not use RoundTrip. It has its own HTTP/2 implementation. It uses x/net/http2 for encoding/decoding HTTP/2 frames, but the actual processing of those frames is grpc-specific code.
I agree that regular HTTP clients in Go use RoundTrip.
We did not implement GOAWAY ourselves. The Go client team mentioned above, which was affected by GOAWAY, advised us to follow the k8s apiserver's implementation with GOAWAY probability, so we added it with a dynamic switch.
I don't follow. You said "At that time, it was my server supporting it" and that you "hit the strange error 'request declared a Content-Length of N but only wrote 0 bytes' after we implemented a GOAWAY mechanism ourselves and deployed it." That means you already used this functionality. But that is a server-side feature, so that means you are using Go? How is grpc-java involved? How can you have seen those error messages using grpc-java? Did you enable maxConnectionAge, see the failures, and then turn it off?
Do you perhaps have a Go proxy between the client and server: client → Go proxy → grpc-java? The Go proxy would then use the Go networking stack (nothing grpc-specific) and could trigger that Go GOAWAY bug. You said your grpc-java server is an API gateway, so I'd normally take that to mean client → grpc-java → backends.
Yes, all clients need to support GOAWAY, but until they support it correctly, we could have a mechanism to reduce the breakage.
You mentioned there was another team you could warn about increased error rates when rolling out. Are they not able to rebuild with a newer golang to pull in the years-old fix?
I feel like we may be close to understanding each other.
My server is a grpc-java application that provides gateway functionality to cross-language clients, including Go; we forked grpc-java years ago and maintain that fork.
The cause on the Go side was relayed by my colleague at the time; since years have passed and he has left, I cannot confirm the real cause, but I can confirm the problem went away after we disabled GOAWAY.
The problem did not happen recently; it happened in 2020. That team usually keeps one version behind the latest Go major version (e.g., upgrading to 1.20.x when 1.21 is released), and the problem happened after we did a deploy, so they were not willing to fix it on their side.
The Go fix was in 1.15. Go 1.17 was released 2021-08-16, so those broken clients may have been replaced two years ago.
I suggest you try out maxConnectionAge again. I don't see the point of us adding a feature for a problem that likely no longer exists.
It sounds like it will take some time for you to do that (like longer than a week), so I'll go ahead and close this. If you discover broken clients are contacting your service, you can comment on this issue for us to reopen it or you can just open a new issue.
Interesting; we are talking about a solution to avoid unexpected breakage, not the known issue you keep coming back to. Anyway, close it as you like.
If you have GOAWAY on all the time, then you don't need to worry about an unexpected break, because broken clients couldn't be rolled out to production to begin with. You prevent them from existing, instead of allowing them to remain hidden and only break at important times.
Sorry, but those clients were born much earlier than my application.
I don't understand how that is relevant. There are two cases.
Case 1: Existing broken clients contact your application
1. You check user-agent and find no broken Go clients from 2020. There are no known broken clients, and there is not an increased error rate when you deploy new server versions (one way to check user-agents is sketched below).
2. You tell the client team you are rolling out a new version.
3. You roll out a new version of your service which uses maxConnectionAge.
4. The client team tells you there is an increased error rate.
5. You roll back and tell me what clients broke so we can work to get them fixed. We coordinate on the approach.
Case 2: A "new" broken client contacts your application
1. You roll out maxConnectionAge for your app, and there are not currently any broken clients, so it is on all the time. Let's assume you use maxConnectionAge = 15 minutes.
2. A broken client starts rolling out and contacts your application. It doesn't matter if this is a new bug or a bug from 2020.
3. In 15 minutes the broken client starts failing. They stop rolling out the client and investigate.
I honestly am confused as to which case you are talking about. It sounds like Case 1 is not a problem, and you are concerned about Case 2. But Case 2 is "everything is working properly, and keeps working properly." The approach you recommend here breaks case 2 and opens your service to failures.
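As a rough sketch of the user-agent check in Case 1, step 1 (the class name and plain logging here are illustrative, not a prescribed method), a server interceptor can read the user-agent request header that gRPC clients send:

import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import java.util.logging.Logger;

// Logs each RPC's user-agent so old or known-broken client versions can be
// spotted before maxConnectionAge is enabled.
public final class UserAgentLoggingInterceptor implements ServerInterceptor {
  private static final Logger logger =
      Logger.getLogger(UserAgentLoggingInterceptor.class.getName());
  private static final Metadata.Key<String> USER_AGENT =
      Metadata.Key.of("user-agent", Metadata.ASCII_STRING_MARSHALLER);

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers,
      ServerCallHandler<ReqT, RespT> next) {
    logger.info("user-agent: " + headers.get(USER_AGENT));
    return next.startCall(call, headers);
  }
}

Register it with serverBuilder.intercept(new UserAgentLoggingInterceptor()); in practice you would aggregate the values rather than log every call.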
I have always been talking about Case 1, which we fixed by maintaining a fork, but that is not friendly to a new grpc-java user.
Step 5 does not always work. We need GOAWAY to solve the traffic rebalancing issue, so rolling back the whole application is not the best solution. And the client team rarely coordinates on an issue caused by our deployment; we must solve it ourselves. The best way we found is to announce a deployment, pick a time (usually midnight), turn GOAWAY on during the deployment, and turn it off after traffic is balanced. Not perfect, but enough to decouple our deployments from the client team's deployments.
Case 1 is a one-time process if it succeeds, as steps 4 and 5 don't necessarily happen. Right now it seems there's no reason to believe step 4 will happen. If step 4 doesn't happen you are done and you never have to worry about it again. There's no "not always working" aspect.
A tweak to the approach in Case 1 is to start with a conservative max connection age, like 1 hour, and then decrease it with later deployments to, say, 30 minutes then 15 minutes. This is the approach commonly done if you have poor visibility into clients.
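A minimal sketch of that staged approach, assuming the age is read from an external configuration property at startup; the property name and values are hypothetical:

import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Hypothetical system property; each later deployment lowers the default
// (e.g. 60 -> 30 -> 15 minutes) once the previous value caused no breakage.
long maxAgeMinutes = Long.getLong("maxConnectionAgeMinutes", 60);
NettyServerBuilder builder = NettyServerBuilder.forPort(8080)
    .maxConnectionAge(maxAgeMinutes, TimeUnit.MINUTES)
    .maxConnectionAgeGrace(5, TimeUnit.MINUTES);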
This would have been a more interesting discussion if y'all didn't fork grpc years ago. At this point it seems there's not much for us to do until you find broken clients.
Right now it seems there's no reason to believe step 4 will happen.
A rather ridiculous point, since there is a case I reported and fixed on my side. If any issue that has been fixed can be said to never happen, then fine. I'm talking about precaution while you talk about a workaround.