dns-java
`SERVFAIL` when trying to resolve a service address
What happened?
After upgrading com.spotify:dns from version 3.1.5 to 3.2.2, some services started getting SERVFAIL when resolving a service address, even though the service exists.
What was expected?
Since the public API of com.spotify:dns does not appear to have any breaking changes, we expected the upgrade not to affect functionality.
How to reproduce
We have not found a reliable way to reproduce this, and we have not managed to pin down the cause. It seems to be related to concurrency, as the problem does not always appear. I am more than glad to show the issue happening in a service.
Context
We need to upgrade dnsjava:dnsjava from version 2.x to 3.x. We checked that com.spotify:dns made this change in version 3.2.0. We tested it in a few services and it seemed to work fine, so we decided to roll out the change to all of our users. Some of their services then started getting SERVFAIL intermittently; from what we can see, these are the ones using gRPC.
Here is an anonymised stack trace:
Jul 15, 2021 4:29:20 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<38>: (${PROTOCOL}://${SERVICE})] Failed to resolve name. status=Status{code=UNAVAILABLE, description=null, cause=java.util.concurrent.CompletionException: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:683)
at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:658)
at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2094)
at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$4(DnsSrvNameResolver.java:160)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL
at com.spotify.dns.XBillDnsSrvResolver.resolve(XBillDnsSrvResolver.java:60)
at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$0(DnsSrvNameResolver.java:162)
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:680)
... 6 more
}
We tried bumping dnsjava:dnsjava from 3.0.2 to 3.4.0 and the problem seemed to go away, but after the service had been running for about 10 minutes it started again. I am not sure if this was a local problem.
Running dig srv ${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS} returns hosts as expected. Changing the versions back to com.spotify:dns:3.1.5 and dnsjava:dnsjava:2.x makes the problem go away.
Java version used during the test:
$ java -version
> openjdk version "11.0.10" 2021-01-19 LTS
> OpenJDK Runtime Environment Corretto-11.0.10.9.1 (build 11.0.10+9-LTS)
> OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (build 11.0.10+9-LTS, mixed mode)
I wonder if there is a bit of confirmation bias in noticing the SERVFAIL responses with the new library. There may have been sporadic or intermittent SERVFAIL responses with the old library as well, but no one was paying attention until the library was upgraded and the owners were told to keep an eye out for any weirdness.
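One way to check this would be to measure the failure rate directly on both library versions rather than relying on what people happen to notice in the logs. Below is a minimal sketch of such a probe, using only the JDK. The class name `LookupFailureProbe` and its `run` method are hypothetical and not part of com.spotify:dns or dnsjava. The lookup is passed in as a `Callable`, so the same probe could wrap whichever resolver call a service actually makes (for example the `resolve` call seen in the stack trace above). It runs the lookup concurrently, since the problem seems to be concurrency-related.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs a lookup many times from several threads
// and counts how often it throws, grouped by exception message.
final class LookupFailureProbe {

  static final class Result {
    final int successes;
    final int failures;
    final Map<String, Integer> failuresByMessage;

    Result(int successes, int failures, Map<String, Integer> failuresByMessage) {
      this.successes = successes;
      this.failures = failures;
      this.failuresByMessage = failuresByMessage;
    }
  }

  static Result run(Callable<?> lookup, int attempts, int threads)
      throws InterruptedException {
    AtomicInteger successes = new AtomicInteger();
    AtomicInteger failures = new AtomicInteger();
    Map<String, Integer> byMessage = new ConcurrentHashMap<>();

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Void>> tasks = new ArrayList<>();
      for (int i = 0; i < attempts; i++) {
        tasks.add(() -> {
          try {
            lookup.call();
            successes.incrementAndGet();
          } catch (Exception e) {
            failures.incrementAndGet();
            // String.valueOf guards against a null exception message.
            byMessage.merge(String.valueOf(e.getMessage()), 1, Integer::sum);
          }
          return null;
        });
      }
      // invokeAll blocks until every task has finished.
      pool.invokeAll(tasks);
    } finally {
      pool.shutdown();
    }
    return new Result(successes.get(), failures.get(), byMessage);
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in lookup that fails on every fifth call. In a real test this
    // would be replaced by the actual SRV resolution call.
    AtomicInteger calls = new AtomicInteger();
    Callable<String> fakeLookup = () -> {
      if (calls.incrementAndGet() % 5 == 0) {
        throw new IllegalStateException("failed with code: 2 - SERVFAIL");
      }
      return "ok";
    };

    Result r = run(fakeLookup, 100, 8);
    System.out.println("successes=" + r.successes + " failures=" + r.failures);
    System.out.println(r.failuresByMessage);
  }
}
```

Running the same probe against the same SRV name with com.spotify:dns 3.1.5 and then 3.2.2, ideally for longer than the ~10 minutes mentioned above, would show whether the old version also returns occasional SERVFAILs or whether the rate really changed with the upgrade.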