Akka Http Client pool connections are not reestablished after DNS positive-ttl
We have found that under some circumstances, Akka's Http client does not honor the positive-ttl expiry value and therefore does not pick up new DNS entries.
It looks as if, under load and using the default http connection pool, the client never resolves the DNS entry again as long as the connection stays open, regardless of the positive-ttl value.
Steps to reproduce (using akka-http 10.0.6 and akka-core 2.5.2):
- DNS resolves test.com to server_A with ip_A.
- Run an akka-http application with the following settings (a config sketch follows this list): dns.inet-address { positive-ttl = 30s negative-ttl = 30s }
- Run load continuously against this akka-http app, which makes requests to test.com.
- Change the DNS entry (in /etc/hosts, for instance) to point to ip_B. NOTE: server_A with ip_A is still running.
- Wait for the positive-ttl DNS cache expiry (30 seconds, in this example).
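For reference, a minimal sketch (not part of the original report; the object and system names are illustrative) of applying those settings programmatically, assuming the block above lives under its full path akka.io.dns.inet-address:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object DnsRepro extends App {
  // Cache positive and negative DNS lookups for 30 seconds, as in the reproduction.
  val config = ConfigFactory.parseString(
    """
      |akka.io.dns.inet-address {
      |  positive-ttl = 30s
      |  negative-ttl = 30s
      |}
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  val system = ActorSystem("dns-repro", config)
}
```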
Expected behaviour:
- The DNS cache expires, and every new request should be sent to ip_B.
Current behaviour:
- New requests after the DNS expiry time are still going to ip_A.
- The only way to make the akka-http application pick up the new DNS entry is to restart it.
That seems to be the case because Akka's DNS resolver is based on the JVM's InetAddress.getAllByName, which introduces another layer of caching.
You can already observe the behavior by just using java.net.InetAddress.getAllByName("...") and changing /etc/hosts entries in between.
It seems the JVM DNS caching layer is configured via java.security.Security properties, which are defined in a security file if a SecurityManager is installed; otherwise it can be overridden (or turned off, in this case) by setting the JVM property -Dsun.net.inetaddr.ttl=0. Can you see if that works for you?
See also https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html#nct and http://www.myhowto.org/java/42-understanding-host-name-resolution-and-dns-behavior-in-java/ and https://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl#1256609
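To illustrate the JVM-level caching described above, a small sketch (mine, not from the comment); networkaddress.cache.ttl is the java.security property corresponding to the -Dsun.net.inetaddr.ttl system property:

```scala
import java.net.InetAddress
import java.security.Security

object JvmDnsCacheCheck extends App {
  // Must be set before the first lookup. "0" disables the JVM's positive DNS cache,
  // equivalent to starting the JVM with -Dsun.net.inetaddr.ttl=0.
  Security.setProperty("networkaddress.cache.ttl", "0")

  // Resolve the same name twice with a pause in between; change the /etc/hosts
  // entry while this sleeps and compare the two results.
  println(InetAddress.getAllByName("test.com").toList)
  Thread.sleep(60000)
  println(InetAddress.getAllByName("test.com").toList)
}
```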
Thank you @jrudolph for the quick response. Unfortunately, the problem is not related to the JVM's DNS resolver; that would be easy to fix. I've written a test that fails and shows the issue; you can find it here: https://github.com/wojda/AkkaHttpDnsInvestigation. I wanted to make sure it's not a problem with the JVM or other system config, so the test starts three docker containers: one with the akka-http client and two with the same server (but different IPs). You can build and run the test with one command, please check readme.md. I hope the test will be useful.
I've done a quick investigation too. According to logs from Akka, a hostname is only resolved when a new connection is created. Example:
[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Resolving server.com before connecting
[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Attempting connection to [server.com/172.17.0.3:8080]
[DEBUG] [akka://client-system/system/IO-TCP/selectors/$a/7] Connection established to [server.com:8080]
In my case, because of high TPS (no idle connections) and the fact that the deprecated server_A is still running and responding, a new connection is never created. After the DNS entry changes, the Akka Http client keeps using the existing connection pool. Please correct me if I'm wrong, but in that case positive-ttl has no effect, because the Akka Http client does not create new connections and does not re-resolve the hostname.
the Akka Http client does not create new connections and does not re-resolve the hostname
Yes, that's correct. I guess if you need connections to be refreshed we would need to add another feature to restrict the life-time of persistent connections (which could be a reasonable thing to do).
I suspect this is causing issues on our end, where the underlying host isn't getting updated because the DNS timeout is not being honoured. Is there a workaround for this?
@mdedetrich I think so far it isn't clear what could or should be done on the Akka HTTP level.
So far, the only confirmed "issue" in Akka Http is that it keeps active persistent connections open for as long as possible. I'd say that's pretty reasonable behavior. Why make a new connection (potentially to a new IP address) when the old one is still alive and serving requests? Or are you seeing something different? Can the server be changed to close connections after a while?
Are there any other HTTP clients that actually couple DNS lifetimes with lifetimes of pool connections?
That said, we might want to add an API to give users more control over the pools. This could e.g. be a method that requests closing all connections to a given host without shutting down that pool completely.
https://github.com/square/okhttp/issues/3374 also suggests solving this on the server side / load balancer.
Why make a new connection (potentially to a new IP address) when the old one is still alive and serving requests?
This is the reason I got interested in the thread. Regardless of the mechanism you use to convert a "host reference" into a pool of servers (DNS, LB), you end up with a pool of persistent connections.
So let's say you have two servers A, B. Your client establishes a connection pool with roughly the same number of connections to each of the two, because balancing load is what your "host reference" is for. Now B goes down, maybe you just need to restart it. All the connections to B are broken, and the client establishes replacement connections to A to get the pool back up to the desired size. Now B comes back up, but there's no way (unless I'm missing something) to instruct the client to rebalance the persistent connections. All the traffic is now going to A until A closes its connections.
A connection lifetime (either in duration or request count) would help solve this by gradually rebalancing the connection pool.
A connection lifetime (either in duration or request count) would help solve this by gradually rebalancing the connection pool.
I agree that this would probably help. But also note that you are pushing a backend issue to the client here. I think this issue can be seen as evidence that this is a brittle solution that requires full control over all sides of the connections.
@jrudolph My issue was actually unrelated, so you can ignore my earlier comment
Hi there
I guess it would be nice and meaningful to have behaviour similar to what Finagle did for their client:
Watermark connection pool with lower and higher marks
@ImLiar could you explain how this is related to this ticket? I tried to understand the documentation, but at a glance I didn't understand what it is about. Maybe it's because finagle is about services while akka-http is only concerned with http?
Graceful rotation of connections by TTL. Let's say every few minutes (or any other configurable value) a new connection is pushed into the pool while an old one is popped, respecting the low mark.
I guess you mean .withSession.maxLifeTime(20.seconds), which seems to be similar to the suggestion above.
@jrudolph Yes, but no
A watermark connection pool is just one pooling mechanism: you define a minimum and a maximum number of connections in the pool, and as soon as there is more load than the minimum number of connections can serve, the pool grows up to the high mark.
Connection shutdown could be achieved with any pooling strategy, but with watermarks the number of connections never drops to zero (unless that is specifically configured), so you avoid a cold connection pool. That could be a different topic, of course.
Sounds like our min-connections / max-connections settings.
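For context, a minimal sketch of those pool settings (the values below are illustrative, not the defaults):

```scala
import com.typesafe.config.ConfigFactory

object PoolWatermarks extends App {
  // Keep at least `min-connections` connections alive (avoiding a cold pool)
  // and grow on demand up to `max-connections`.
  val poolConfig = ConfigFactory.parseString(
    """
      |akka.http.host-connection-pool {
      |  min-connections = 2
      |  max-connections = 32
      |}
      |""".stripMargin
  )
}
```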
But also note that you are pushing a backend issue to the client here. I think this issue can be seen as evidence that this is a brittle solution that requires full control over all sides of the connections.
@jrudolph an example that doesn't involve any back-ends failing is a gradual traffic switch. If that is achieved by having two load balancers and weighted DNS resolution (e.g. how AWS Route 53 does it), then the issue cannot be solved at the load balancer level (as suggested in the okhttp reference), since the traffic is actually being diverted from one LB to another.
In this case the only solution I'm aware of is to forcefully kill the "old" LB, thus throwing 5xx, which kills connections on the client's side and forces akka-http to establish new connections, which in turn resolves DNS again. Doing so at high load results in a significant amount of errors and most likely opens a circuit breaker. And neither client nor server is happy about that.
If it's gradual, the load balancer can start to close idle persistent connections and send Connection: close headers; there's no need to send out 5xx when it's not urgent.
If it's gradual, the load balancer can start to close idle persistent connections
@jrudolph I feel like there is a chicken-and-egg problem here. Connections will never become idle because the client will never move traffic away from the old LB / stack. This is actually what we are trying to achieve: force the client to start sending requests to the new stack.
@jrudolph are you considering any solution for this issue? Caching DNS entries forever is not a good idea in the cloud: servers can be added or removed dynamically, so that caching will not work.
Do you have any idea at least for a workaround?
@jrudolph I agree with @agorski and @alivanni and don't see a way to work around this outside of akka-http. I cannot make additional arguments, but please consider it a real issue; it's critical to my team, it's causing trouble in production, and if there is no solution we will unfortunately have to migrate away from the akka-http client (and I know another team in Zalando that already did, also because of this).
There is a PR in progress which adds max-connection-keep-alive-time.
Related to #1768. We are aware of this and agree it's an area we plan to improve. We're currently working on improving our DNS infrastructure, and one of the next steps is to also take the TTL into account.
My team is facing a related problem and we need some guidance. We are using the scredis library to connect to AWS Redis. When the IP of the Redis node changes, re-resolution of the host is not attempted.
The relevant application.conf section is:
"negative-ttl" : "never",
"positive-ttl" : "never",
"provider-object" : "akka.io.dns.internal.AsyncDnsProvider",
"resolve-timeout" : "5s",
"search-domains" : "default"
},
"dispatcher" : "akka.actor.internal-dispatcher",
"resolver" : "async-dns"
}
The relevant versions we are using are:
akkaHttp = "10.2.6"
scredis = "2.4.3"
akkaActor = "2.6.16"
From the logs we can see that when the old IP address ceases to respond, the reconnection attempt goes to the old IP rather than re-resolving the hostname:
[INFO] [akka://scredis/user/<$hostname>-6379-listener-actor] Connection has been shutdown abruptly
[INFO] [scredis-scredis.io.akka.io-dispatcher-20] [akka://scredis/user/<$hostname>-6379-listener-actor/<$hostname>-6379-io-actor-2] Connecting to <$hostname>/<old_ip>:6379
[ERROR][scredis-scredis.io.akka.io-dispatcher-20] [akka://scredis/user/<$hostname>-6379-listener-actor/<$hostname>-6379-io-actor-2] Could not connect to <$hostname>/<old_ip>:6379: Command failed
So in spite of the connection being shut down, the new connection does not re-resolve the hostname. Could you point us to what the reason could be?
You can enable debug logging to see the AsyncDnsResolver in action. I think by now this issue is largely resolved by using the AsyncDnsResolver with appropriate TTL settings for DNS and akka.http.host-connection-pool.max-connection-lifetime for the pool. If you need more control than that, you can also change ClientConnectionSettings and set a custom ClientTransport to implement whatever resolution logic is right.
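A sketch of that combination (my example, not from the thread; the host name and the 5-minute lifetime are placeholders):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import com.typesafe.config.ConfigFactory

object PoolWithDnsTtl extends App {
  // Use Akka's async DNS resolver and cap how long a pooled connection may live,
  // so that connections are periodically re-established and the host is re-resolved.
  val config = ConfigFactory.parseString(
    """
      |akka.io.dns.resolver = async-dns
      |akka.http.host-connection-pool.max-connection-lifetime = 5 minutes
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  implicit val system: ActorSystem = ActorSystem("client-system", config)
  import system.dispatcher

  // Requests go through the default pool, which now rotates its connections.
  Http().singleRequest(HttpRequest(uri = "http://test.com/"))
    .onComplete(response => println(response))
}
```

If even more control is needed, the custom ClientTransport route mentioned above (set on ClientConnectionSettings) would be the place to plug in bespoke resolution logic.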
A more automatic solution (implementing what the title of this issue says) could be to add some logic to do manual DNS resolution directly in the pool using the AsyncDnsResolver and use the returned TTLs to automatically apply a max-connection-lifetime setting for each connection.