grpc Need more intelligent re-resolution of names

For starters, let me state that this is not a bug. I'm trying to reach out for help because I can't simply piece together enough information to figure out how to address client-side load balancing properly within the node.js world of gRPC.

Here is my example code base: https://gist.github.com/carldanley/39d5a0d7f9b1ea865af94481da1e0cac. I deploy that to a kubernetes environment and use a load balancer to attempt to split the traffic (with no luck)... What am I doing wrong?

Aug 25 '17 06:08 carldanley

For future reference, for items like this that are not bugs, instead of filing issues, please send email to the grpc.io mailing list. That's a much better forum for this kind of discussion.

I don't know anything about node or speak javascript, so @murgatroid99 will have to help you with that side of things. But I can help answer your questions about how client-side load balancing is supposed to work.

I think a similar question came up a while back; see discussion in #11406. Glancing at your code, it looks like you're using the wrong name for the channel arg to select the LB policy. Try changing loadBalancingPolicy to grpc.lb_policy_name.
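
For illustration, here is a minimal sketch of how that channel arg is expressed as a key/value pair at channel creation. It is written in Python rather than Node; the helper name `round_robin_channel_options` and the port in the commented usage are assumptions, and only the `grpc.lb_policy_name` key and the round-robin policy come from this thread.

```python
def round_robin_channel_options():
    """Channel args that select the round_robin LB policy (hypothetical helper)."""
    return [("grpc.lb_policy_name", "round_robin")]


# With grpcio installed, the options would be passed when the channel is created:
#   import grpc
#   channel = grpc.insecure_channel(
#       "dns:///server.grpc.svc.cluster.local:50051",  # port is an assumption
#       options=round_robin_channel_options(),
#   )
```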

If that doesn't fix the problem, please let us know what you're trying to do, what you expected to see, and what you're actually seeing.

Aug 30 '17 16:08 markdroth

Hello @markdroth! First off, thank you!

Secondly, I tried changing the load balancing policy to grpc.lb_policy_name but had no luck with client-side load balancing occurring. Let me explain what I'm using for a setup:

I run kubernetes. I have a service for our gRPC server. I have a client that is told to use the internal DNS address for the service: server.grpc.svc.cluster.local which resolves to the 2 (for testing purposes) pods that run the gRPC server. When the client is instructed to connect to that service (via https://gist.github.com/carldanley/39d5a0d7f9b1ea865af94481da1e0cac#file-index-js-L50), it connects to the first one that the DNS resolves to and keeps creating gRPC requests to that single server (instead of round robin to the two servers). I'm not sure what I can do differently to make this work... I expect the code in that gist to have a single client connect to both hosts on start and then, per request, round-robin through each of the gRPC servers.

Aug 31 '17 00:08 carldanley

Ok, so this was a kubernetes-related issue. I had to turn off kubernetes load balancing for this specific server's service entry in k8s. However, the next thing I tested was destroying a server randomly. It appears that the client does not periodically re-evaluate the entries for the DNS name provided; as such it continued to send traffic to the remaining instance it knew of but never detected the new instance (until I restarted the client). How can I have my clients automatically start using a new server instance that came online?

Aug 31 '17 05:08 carldanley

Currently, we only re-resolve DNS names when all subchannels become disconnected. So if you restart both of the servers, the client will re-resolve and then connect to the new addresses. But if you move just one of the servers, we will just stop being able to use that one.

@dgquintas and I have talked about possible ways to address this problem. It's fairly tricky, because we can't really know when it's useful to re-resolve, and we need to avoid slamming the DNS server with constant queries. For example, if a server is crashing every 10 seconds, we don't want every single client to try to re-resolve the name every 10 seconds. And if this is an environment where servers have static addresses, then there's no point in re-resolving in the first place.

One possible solution would be to make the DNS resolver aware of DNS TTLs, so that we can automatically re-resolve after the previous results expire; this would essentially allow the DNS data to determine how often the clients re-resolve. However, while we could probably do this in the C-core gRPC implementation, it's not clear that we have reasonable ways to access DNS TTL information in Java or Go, which would make our clients' behavior inconsistent.

Another possibility is to provide the ability to configure the threshold for what percentage of subchannels need to become disconnected before we re-resolve. The default would be 100%, which would match the current behavior, but it would allow people to reduce the threshold to something more appropriate for their environment. We might also want to provide a way to set the minimum interval between re-resolutions, just to provide some additional safety against slamming the DNS server.
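
To make that idea concrete, here is a rough sketch in Python of a disconnect threshold combined with a minimum interval between re-resolutions. Nothing here is an existing gRPC API: the class `ReresolutionPolicy`, its parameter names, and the 30-second default are all assumptions chosen for illustration.

```python
import time


class ReresolutionPolicy:
    """Decides when a client may re-resolve its target name (illustrative only)."""

    def __init__(self, disconnect_threshold=1.0, min_interval_s=30.0,
                 clock=time.monotonic):
        # A threshold of 1.0 matches the current behavior: re-resolve only
        # when every subchannel is disconnected.
        self.disconnect_threshold = disconnect_threshold
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._last_resolution = None

    def should_reresolve(self, disconnected, total):
        """Return True if enough subchannels are down and the cooldown has elapsed."""
        if total == 0:
            return False
        if disconnected / total < self.disconnect_threshold:
            return False
        now = self._clock()
        if (self._last_resolution is not None
                and now - self._last_resolution < self.min_interval_s):
            return False  # protects the DNS server when backends are flapping
        self._last_resolution = now
        return True
```

With `disconnect_threshold=0.5`, losing one of two backends would trigger a lookup, but a second failure within `min_interval_s` would not.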

Anyway, we've had a lot of discussions about this but have not yet decided on any particular behavior or scheduled any work on this. But if this is something you'd like, let us know, and we can start figuring out how to prioritize it.

Aug 31 '17 14:08 markdroth

Okay, I understand. I did some testing and splitting traffic reliably in a CI/CD environment is hit or miss at the moment. Consider the following:

You have 2 instances (iA and iB) of service 1 (s1) running. You have 2 instances (iC and iD) of service 2 (s2) running. If s1 has a rolling update and, shortly after, s2 has a rolling update, the events could pan out like this:

s1 rolling update starts
iA goes down
iA comes up
s2 rolling update starts
iB goes down
iC goes down
iB comes up
iC comes up
s1 rolling update stops
iD goes down
iD comes up
s2 rolling update stops

In the scenario above, the s1 instances (iA and iB) will only see 1/2 of the s2 instances (iC). This is bad: none of my traffic is round-robined and one instance is getting slammed.
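
The outcome can be reproduced with a small simulation. This is only a sketch in Python under two assumptions: replacement pods come up with new addresses (named iC2 and iD2 here, which are made-up labels), and the client re-resolves only once every address it knows about has gone away, as described earlier in the thread.

```python
def simulate(events, initial_dns):
    """Replay up/down events against a client that re-resolves only when
    every address it knows about has gone away."""
    dns = set(initial_dns)    # what the name currently resolves to
    known = set(initial_dns)  # addresses the client is connected to
    for action, addr in events:
        if action == "up":
            dns.add(addr)
            if not known:     # a client with no addresses keeps retrying
                known = set(dns)
        elif action == "down":
            dns.discard(addr)
            known.discard(addr)
            if not known:     # all subchannels disconnected: re-resolve
                known = set(dns)
    return known, dns


# s2 rolling update as seen from an s1 client (iC2/iD2 are hypothetical names).
events = [("down", "iC"), ("up", "iC2"), ("down", "iD"), ("up", "iD2")]
known, dns = simulate(events, {"iC", "iD"})
print(sorted(known), "of", sorted(dns))  # ['iC2'] of ['iC2', 'iD2']
```

The client only re-resolves at the moment iD disappears, when DNS holds just the replacement for iC, so it never learns about the replacement for iD.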

After reading the solutions you proposed, it kind of feels like all of those things should at least be available. Let the developer decide whether or not they want to take extra DNS traffic. Let the developer decide if they're using a client (in a language) which can access DNS TTL and so on.

Just thinking out loud here:

I think there are really 2 scenarios we care about:

1. Losing an established subchannel

When a subchannel disconnects (2/2 instances becomes 1/2 instances), we could attempt to re-resolve the DNS entry for some number (attempt-based or time-based) of times. This gives us a way to start listening until we're 2/2 again OR we were unsuccessful in getting back 2/2 (so we stay 1/2).

2. Discovery of new instances

Consider a healthy, happy service that is correctly load balancing across 5/5 instances. What if we had autoscaling enabled and a sudden surge of traffic hit our servers? Now we're running 7 instances, but because we only scaled the servers and never the clients, we have no way to serve the surge of traffic (the clients will never refresh the DNS pool), so we stay at 5/7...

One possible solution is giving us a function to call that can refresh the pool of hosts via DNS resolution. This would give us a way to decide when we want to trigger it ourselves. For example, imagine that we performed a rolling update on a service that had 100 instances; we could publish an event onto our messaging queue (when 100 of 100 is up) that could tell clients to refresh their DNS hosts. Really, we could write whatever logic we want with this and perform the resolution whenever we saw fit; it gives us full control of something that works for us.

Anyways... </2cents>

Aug 31 '17 18:08 carldanley

I definitely agree that we should do better in the first scenario you mentioned, and some combination of the ideas we've been discussing could address that.

With regard to the second scenario you mention, I think it's worth noting that DNS is fundamentally unsuited to the kind of dynamic environment you're describing, because DNS is a polling-based mechanism, whereas what you really want is a push-based mechanism where the clients are proactively notified when addresses change. While we might be able to find a way to work around this with DNS with the DNS TTL solution I mentioned above, I think it will never really scale the way it needs to, because it really wasn't designed for this kind of usage. A better approach would be to write a new resolver mechanism that subscribes to notification from some centralized system as the servers move around. For example, I'm not sure what mechanism kubernetes uses to update DNS, but you could presumably have it also notify some other name service that would allow clients to subscribe to particular names and would proactively send them updates when kubernetes notifies them of changes to those names. Then your clients would be getting a constant stream of updates and would always have an up-to-date set of addresses.

Given that, I think that any changes we make here will likely be focused on the first scenario, not the second. But we'll have to talk further to decide exactly how we're going to handle this.

Aug 31 '17 20:08 markdroth

@markdroth DNS may not be the perfect solution, but it is ubiquitous and easy to integrate with. I would prefer to set up a DNS poll every 10-20 seconds for my microservices to at least get going with load balancing my gRPC services. When that produces too much load on the DNS servers, then I will start looking at a lookaside balancer.

Right now the cost of getting the simple load balancing that we are used to with HTTP 1.1 is very high. The solutions are, as I see them:

1. Create your own lookaside load balancer, modify ALL of your clients' code to use the lookaside balancer, and maintain the code and integrations with your service discovery platform of choice (K8s, Consul, ZooKeeper, etc.).
2. Create a service mesh using Istio or Linkerd, both of which have their own limitations, drawbacks, and advantages.
3. Use the built-in DNS resolver, which means you can't scale your servers up without first scaling all of your servers down.

A DNS-based resolver with a refresh interval would be a very low-cost, low barrier-to-entry solution that lots of developers would be comfortable with and not require a huge investment in either infrastructure or coding.
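
As a rough idea of what I mean, here is a sketch in Python of a poller that re-resolves a name at a fixed interval and reports changes. It is not a gRPC API: `PollingResolver`, `resolve_addresses`, and the `on_update` callback are names I made up, and how the new address list would be fed into a channel is left out.

```python
import socket
import threading


def resolve_addresses(host, port):
    """Resolve host to a sorted list of unique IP address strings."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class PollingResolver:
    """Re-resolves a name every interval_s seconds and reports changes."""

    def __init__(self, host, port, interval_s, on_update, resolve=resolve_addresses):
        self._host = host
        self._port = port
        self._interval_s = interval_s
        self._on_update = on_update
        self._resolve = resolve
        self._last = None
        self._stop = threading.Event()

    def poll_once(self):
        """Resolve once; call on_update only if the address list changed."""
        addresses = self._resolve(self._host, self._port)
        if addresses != self._last:
            self._last = addresses
            self._on_update(addresses)

    def run(self):
        """Poll until stop() is called; intended to run in a background thread."""
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval_s)

    def stop(self):
        self._stop.set()
```

Running `PollingResolver("server.grpc.svc.cluster.local", 50051, 15, print).run()` in a daemon thread would print the address list whenever it changes; the port and the 15-second interval are placeholders.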

Jan 17 '18 21:01 hollinwilkins

For anyone encountering issues and looking for a simple solution: https://github.com/grpc/proposal/pull/23/files

Using server-side connection options can cause load to redistribute in a fairly easy manner! Wish I had seen this document 2 weeks ago.

Jan 17 '18 23:01 hollinwilkins

We've recently done some work to make this somewhat better. The round_robin code now re-resolves whenever any individual backend connection fails, and the DNS resolver enforces a minimum time between lookups to ensure that we don't hammer the DNS server when backend connections are flapping.

This doesn't address the discovery case, but it does improve the scenario where only a subset of backends fail.

Mar 21 '18 21:03 markdroth

CC @jtattermusch

Apr 19 '18 11:04 jtattermusch

@hollinwilkins can you describe what changes to your setup you've made (in reference to https://github.com/grpc/grpc/issues/12295#issuecomment-358483266) and confirm that the "discovering of new endpoints" problem went away? I am currently facing the exact issue you were facing (losing an established instance is handled correctly, but new instances are not being discovered) while trying to make a simple RoundRobin LB scenario work out of the box on kubernetes.

Are there any other possible workarounds (like forcing re-resolution of backends)?

Apr 19 '18 11:04 jtattermusch

Regarding the workaround based on https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md:

I tried setting grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms channel arguments on the server that I'm trying to access with RoundRobin load balancing policy and it seems that it is helping: closing the connections occasionally leads to re-resolving the domain name and newly added service replicas are being picked up by the round-robin loadbalancer in a relatively short time.
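
For reference, a minimal sketch of the two server-side arguments, shown in Python. The helper name `connection_age_options` and the concrete values are assumptions for illustration; only the two argument keys come from the setup described above.

```python
def connection_age_options(max_age_ms, grace_ms):
    """Server channel args that cap connection lifetime (hypothetical helper)."""
    return [
        ("grpc.max_connection_age_ms", max_age_ms),
        ("grpc.max_connection_age_grace_ms", grace_ms),
    ]


# With grpcio installed, the options would be passed when the server is created:
#   import grpc
#   from concurrent import futures
#   server = grpc.server(
#       futures.ThreadPoolExecutor(max_workers=10),
#       options=connection_age_options(30_000, 10_000),  # values are examples
#   )
```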

Apr 19 '18 16:04 jtattermusch

@jtattermusch This is the approach I took. Not ideal, but works for now.

Apr 19 '18 17:04 hollinwilkins

The grpc-go client re-resolves DNS every 30 minutes. Could the C++ client do the same, with a configurable interval?

https://github.com/grpc/grpc-go/blob/master/resolver/dns/dns_resolver.go#L46

May 31 '18 20:05 wjywbs

After having read this thread and some of the linked issues I'm still not sure I understand why observing the DNS TTL for refresh would be a bad thing. From what I can tell it would just work. Be it scaling up or down, k8s or outside of it. I think its properties would fit the principle of least surprise. I cannot imagine many selecting a DNS based round-robin load balancing approach would be surprised by clients having to poll DNS in TTL interval and that producing load. However many will be surprised to learn it won't react to changes in DNS.

Load seems to be the most commonly stated reason why observing DNS TTL would be bad but I just don't see it. If my DNS service cannot handle the polling load I can easily trade off with a higher TTL, scale my DNS, and ultimately, once that no longer makes sense, transition to another, more scalable LB approach. It is not like DNS RR LB in gRPC allows arbitrary scale to begin with, so why pretend it has to? It is the simple solution for the simple cases. It should work as best as it can inside of those constraints.

Having to use MaxConnectionAge, which just happens to be coupled to re-resolution, to emulate a polling behaviour seems like a bad workaround to me. I don't see how making a DNS query to some DNS cache every X seconds would be seen as problematic but having to do a magnitudes more expensive re-connect plus (encryption-)handshake with each of the backends plus having to regularly refresh DNS anyway is an acceptable workaround to that.

Currently all the alternatives I can see are vastly more complex to run and expensive to implement. Why force users to use a service mesh or some custom look-aside load-balancing scheme when there's a way to make what is already supported just work for a lot of cases?

Oct 20 '18 13:10 hacst

Personally, I tend to agree that the max-connection-age approach is a fairly ugly solution to the problem of forcing re-resolution. However, I'm not sure that everyone on the team agrees with that.

I think the main argument against using TTLs is that we want consistent client behavior across languages, but while we would be able to access the TTL information in C-core, we have no reasonable mechanism for doing so in Java or Go. So it's not really a portable solution.

I do think we should consider providing a way for the client to be configured to periodically re-resolve at some fixed interval.

I'd like to get feedback from @ejona86, @zhangkun83, and @dfawley on this.

Oct 22 '18 14:10 markdroth

I see. Technically DNS TTL could be retrieved in any language by using a custom resolver (e.g. using something like miekg/dns in go or netty DNSResolver in Java. The latter would also get rid of the broken built-in DNS caching behaviour of the JVM...). That's basically what using c-ares in C-core amounts to. Whether that's a "reasonable" thing to do everywhere is of course debatable.

In any case I definitely would prefer a configurable polling interval for DNS to the current MaxConnectionAge approach. Maybe there could even be an opt-in flag that makes it use the DNS TTL when supported and fall back to the polling interval otherwise? I'm not sure whether such "extensions" are something that's done across the gRPC clients in different languages, but I would be surprised if they are totally equal now. But as I said, just having the configurable DNS polling interval would be a considerable improvement.

Oct 22 '18 22:10 hacst

> Load seems to be the most commonly stated reason why observing DNS TTL would be bad but I just don't see it.

It's not quite that simple. When caching DNS resolvers are in place a single response from the authoritative DNS server can be sent to 1000s of clients. All those clients will have the TTL expire at the same time (independent of when they originally queried) so they form a "stampeding herd." Every time the TTL expires the entire "herd" will re-request DNS at the same time. Increasing the TTL would decrease average load but wouldn't reduce peak load.
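
A toy model of that effect, sketched in Python. The numbers (1000 clients, a 30-second record lifetime, times in milliseconds) and both function names are assumptions made for illustration; the only point is that a shared cache hands out the remaining TTL, so every client's next lookup lands on the same instant.

```python
import random


def ttl_refresh_times(query_times_ms, record_expiry_ms):
    """Each client asks a shared cache, receives the remaining TTL, and
    re-queries when that TTL runs out."""
    return [t + (record_expiry_ms - t) for t in query_times_ms]


def fixed_interval_refresh_times(query_times_ms, interval_ms):
    """Each client re-queries a fixed interval after its own first lookup."""
    return [t + interval_ms for t in query_times_ms]


rng = random.Random(0)
# 1000 clients make their first lookup at random points within 30 seconds.
first_lookups = [rng.randrange(0, 30_000) for _ in range(1000)]

ttl_based = ttl_refresh_times(first_lookups, record_expiry_ms=30_000)
polled = fixed_interval_refresh_times(first_lookups, interval_ms=30_000)

print("distinct refresh instants, TTL-based:", len(set(ttl_based)))
print("distinct refresh instants, fixed interval:", len(set(polled)))
```

In the TTL-based case every client refreshes at the 30-second mark regardless of when it first asked, while the fixed-interval clients stay as spread out as their first lookups were.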

With a limited number of clients, that can be fine. But the DNS resolver would do this in all cases, including in large-scale pick-first cases like googleapis.com. Using a consistent polling frequency doesn't cause herds, but configuration becomes a problem.

> Having to use MaxConnectionAge, which just happens to be coupled to re-resolution, to emulate a polling behaviour seems like a bad workaround to me.

I would call it a "functional but non-optimal solution." We have to have MaxConnectionAge for other reasons, so the question is whether the deficiencies are bad enough to warrant another solution for this specific case. Note that one great property of the current solution is that the configuration is under the service's control, and we'd want to avoid losing that property with a new solution.

Note that I don't really consider the solution to be a "workaround" or "hack," in that most of the web relies on the behavior of re-issuing DNS when reconnecting. The problem for round-robin is that it can refresh too frequently.

> I don't see how making a DNS query to some DNS cache every X seconds would be seen as problematic but having to do a magnitudes more expensive re-connect plus (encryption-)handshake with each of the backends plus having to regularly refresh DNS anyway is an acceptable workaround to that.

TLS Session Resumption should reduce the cost of the re-handshake to something fairly low. That said, I've not verified that our clients are using resumption and I think I saw that Java is not. But that's a clearly-defined problem that could be resolved.

Yes, reconnecting is more expensive than a DNS query, but it is small when amortized over the lifetime of the connection. We're not trying to fully 100% optimize this one use-case at any expense, eking out every last CPU cycle in code that runs once every O(minutes). We support many use cases and we want them to work reasonably well at reasonable cost.

So to me, the discussion shouldn't be narrowly focusing on whether some alternative is more efficient than what we have now. Instead, it should focus on the problems caused by the existing approach.

This issue was started in the days when C core had very poor re-resolution behavior (which changed sometime around March, based on markdroth's comment). The problem then was "older clients virtually never connect to new servers," such that load was woefully underdistributed in normal use-cases. That has been resolved.

Oct 24 '18 21:10 ejona86

@ejona86 I see your point about the stampeding herd. I hadn't considered that each cache will return the current remaining TTL of its cached value (which is kinda obvious in hindsight) so any client talking to that cache instead of the authoritative source will sync up when polling. That definitely isn't a great behaviour if you want to have thousands of clients.

While it isn't as bad, having these thousands of clients reconnect to every single backend server once per MaxConnectionAge period doesn't sound great to me either.

Besides pure re-connection cost, tearing down perfectly fine connections that could otherwise stay long-running can also have other side-effects at a higher level in the stack. E.g. assume a service offers very long-running bidir streaming calls with an expensive to re-create context on the server related to the running call. In that case using MaxConnectionAge will force the client to regularly end the call and disconnect. The next call will hit some other random backend in which the context has to be re-created.

Could you elaborate on the configuration issues you see with a configurable DNS polling interval disconnected from TTL? I would've thought it would just be a value like MaxConnectionAge that does nothing if not set.

Oct 25 '18 00:10 hacst

> assume a service offers very long-running bidir streaming calls with an expensive to re-create context on the server related to the running call. In that case using MaxConnectionAge will force the client to regularly end the call and disconnect. The next call will hit some other random backend in which the context has to be re-created.

So two parts to this:

1. MaxConnectionAge itself doesn't require the stream to be torn down. MaxConnectionAgeGrace will control how long old streams can live. Old streams can stay on the old connection indefinitely, but will also keep the old connection alive as long as they do so. In the worst case, if streams have infinite lifetime, each old stream will over time basically get its own connection. But we tend to expect few long-lived streams per backend.
2. We've seen that service owners frequently need to put a lifetime on streams, otherwise they aren't load balanced. If you bring up a new server, none of the existing clients will create the long-lived stream to the new server because they are happily connected to an old server.

It is possible to develop a client-side LB policy that uses affinity to consistently route long-lived streams back to the same warm backends, but because of (2) it puts you in a bit of a bind for distributing load when applied to this use-case. (This affinity-based system actually exists in gRPC Java today but the design went into a weird limbo state as we resolve some larger LB discussions. It was implemented in Go as well, but was reverted because the design went into limbo state. It is powered off service config, which isn't ready for prime-time, though.)

> Could you elaborate on the configuration issues you see with a configurable DNS polling interval disconnected from TTL? I would've thought it would just be a value like MaxConnectionAge that does nothing if not set.

MaxConnectionAge is configured on server-side. So if the service owner needs to change the value, they can change it fairly rapidly. Most obvious forms of configuring DNS polling interval would place it hard-coded on client-side, which means it can take O(years) for clients to pick up any change. Yes, some service owners control their clients and so it wouldn't be a problem, but many don't and so the solution would have more limited applicability.

While it could be possible to provide live configuration to the client via the service config, that necessitates increasing the complexity of the solution.

Oct 25 '18 15:10 ejona86

In theory, the timing of connection aging can line up in such a way that all backends drop connections almost at the same time causing increased latency spikes on clients even in a steady-state system. This would not be the case with periodic DNS refresh.

Both solutions (periodic DNS refresh and max connection age) have pros and cons. Are there any users who have run into practical issues with the max connection age solution?

Feb 05 '19 21:02 srini100

bumping this - couldn't MaxConnectionAge approach also lead to a "stampeding herd"? Also curious if anyone has experienced issues with the connection-based approach.

We would like to make use of out-of-the-box autoscaling features, whereby DNS records are added when new boxes come online. It's preferable that we don't need to bounce a service or set MaxConnectionAge (as this reduces the responsiveness of overall autoscaling approach).

Mar 15 '19 20:03 schmohlio

> couldn't MaxConnectionAge approach also lead to a "stampeding herd"?

No, because it is based on the original connection creation. It can perpetuate a stampeding herd (by making it reoccur), but wouldn't be the cause. This is also why it uses a 10% randomization factor.
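
As a sketch of what such randomization looks like (in Python, and assuming the jitter is applied symmetrically as plus or minus 10% of the configured age; the function name `jittered_max_age_ms` is made up for this example):

```python
import random


def jittered_max_age_ms(max_age_ms, jitter=0.10, rng=random):
    """Return max_age_ms scaled by a random factor in [1 - jitter, 1 + jitter]."""
    return max_age_ms * rng.uniform(1.0 - jitter, 1.0 + jitter)


# Connections created at the same moment end up with different lifetimes,
# so they do not all close (and reconnect) at the same instant.
rng = random.Random(42)
ages = [jittered_max_age_ms(30_000, rng=rng) for _ in range(5)]
print([round(a) for a in ages])
```

Each value falls between roughly 27 and 33 seconds for a configured age of 30 seconds, so an initial burst of connections drifts apart over successive reconnects.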

> It's preferable that we don't need to bounce a service or set MaxConnectionAge (as this reduces the responsiveness of overall autoscaling approach).

Services using DNS today create new connections very frequently. If you want a MaxConnectionAge of 30 seconds, that's okay (although low ages may be more problematic for Go clients). The connection has been utilized much better than it would have with HTTP/1 and the cost has likely been amortized over many RPCs. HTTP servers frequently shut down connections much younger.

(Go clients currently reconnect eagerly. So with a MaxConnectionAge of 30 seconds the clients will reconnect every 30 seconds, even if they only do an RPC once an hour. This is a TODO for Go, and doesn't impact many users.)

Apr 03 '19 00:04 ejona86

I tried to enable MaxConnectionAge on a server and call it from a Go client. When the connection age was reached, the RPCs in the Go client failed with the Unavailable error and the "transport is closing" message. I then gave up on the MaxConnectionAge approach to avoid adding extra retry logic to the code.

Apr 03 '19 00:04 wjywbs

@wjywbs, is there a bug open for that? That should not happen. @dfawley, do you know what could tickle the Go client to do that?

Apr 03 '19 00:04 ejona86

The only thing I can think is if you have long-running RPCs, max connection age will eventually time out (after MaxConnectionAgeGrace) and hard-close the connection. Otherwise the RPCs are expected to complete successfully, and I'd be pretty surprised if we don't have tests for this. If the above doesn't explain the errors you're getting, please file a bug in the grpc-go repo about that.

Apr 03 '19 00:04 dfawley

Thanks for your help. I didn't set the grace time last time. However, when I tested again with both age and grace time set to one minute, the Go client reported lots of errors as well. Each rpc takes a few seconds to complete, within the one minute grace period.

grpc-go v1.18.0/v1.19.1
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
rpc error: code = Unavailable desc = the connection is draining
rpc error: code = Unavailable desc = transport is closing

Apr 03 '19 02:04 wjywbs

@wjywbs, please file that as an issue on grpc-go's repo. That does not appear to be expected behavior.

Apr 05 '19 20:04 ejona86

Hi, I wanted to chime in with a different suggestion to resolve similar problems.
My suggestion is discussed here https://github.com/grpc/grpc/issues/18743 (closed by @markdroth as dup of this issue) and here https://github.com/grpc/grpc-go/issues/2751 but let me reiterate the highlights for brevity.

I suggest configuring the DNS resolver using query parameters in the connection URL. This is similar to how JDBC, for example, works.
Example:

dns://127.0.0.1:8600/endpoint.service.consul.:8080?lookupSRV=0&lookupTXT=0&refreshRate=30

Where:

127.0.0.1 is the IP of the authority DNS server (this is not new)
8600 is the DNS server port (default is 53, this is not new)
endpoint.service.consul.:8080 is the endpoint:port to lookup (this is not new)
lookupSRV=0 disables SRV records lookups (this is new)
lookupTXT=0 disables TXT records lookup (this is new)
refreshRate=30 configures the refresh frequency to 30 seconds (this is new)

My suggestion is to make it possible to declaratively configure the DNS refresh rate (also called minFreq, as in minimal lookup frequency), whether to look up SRV records (defaults to true, but it is not always necessary, and frequent lookups put load on the DNS server), and whether to look up TXT records (likewise defaults to true, but it is not always necessary and also adds load).
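
To show how such a target could be read, here is a sketch in Python using only the standard library. No gRPC resolver accepts these parameters today; `parse_dns_target` and the returned key names are hypothetical, and the defaults follow the ones proposed above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_dns_target(target):
    """Split a dns:// target into authority, endpoint, and the proposed options."""
    parts = urlsplit(target)
    query = parse_qs(parts.query)
    refresh = query.get("refreshRate")
    return {
        "authority": parts.netloc,           # DNS server host:port
        "endpoint": parts.path.lstrip("/"),  # name:port to look up
        "lookup_srv": query.get("lookupSRV", ["1"])[0] != "0",
        "lookup_txt": query.get("lookupTXT", ["1"])[0] != "0",
        "refresh_rate_s": int(refresh[0]) if refresh else None,
    }


config = parse_dns_target(
    "dns://127.0.0.1:8600/endpoint.service.consul.:8080"
    "?lookupSRV=0&lookupTXT=0&refreshRate=30"
)
print(config)
```

A target without a query string keeps SRV and TXT lookups enabled and leaves the refresh rate unset.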

Apr 18 '19 06:04 rantav

We have discussed query parameters before. I'm not against them in general. The main issue with query parameters for DNS refresh rate is that the service owner has no control of the setting. The main issue of doing DNS refresh rate is that we don't want a client-side option, and making it service-controlled has complexity.

Let's not bring SRV and TXT into this discussion. Although I will note there is a channel option to disable service config lookup (TXT), independent of the name resolver. In C it is GRPC_ARG_SERVICE_CONFIG_DISABLE_RESOLUTION, but it will exist in every language (Java got it in v1.20).

Apr 18 '19 17:04 ejona86