Druid 33.0.0: coordinators not reachable when using druid-kubernetes-extensions
Affected Version
Apache Druid 33.0.0
Description
When upgrading or deploying a new Druid cluster with the druid-kubernetes-extensions, the broker, historical, and router nodes can no longer talk to the coordinator. The coordinator itself does not log any errors or exceptions.
Exception from broker:
2025-04-29T19:05:51,152 WARN [FilteredHttpServerInventoryView-2] org.jboss.netty.channel.SimpleChannelUpstreamHandler - EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handli
java.nio.channels.UnresolvedAddressException: null
at java.base/sun.nio.ch.Net.checkAddress(Net.java:149) ~[?:?]
at java.base/sun.nio.ch.Net.checkAddress(Net.java:157) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.checkRemote(SocketChannelImpl.java:816) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:839) ~[?:?]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.handler.codec.http.HttpClientCodec.handleDownstream(HttpClientCodec.java:97) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.Channels.connect(Channels.java:634) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.AbstractChannel.connect(AbstractChannel.java:215) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182) ~[netty-3.10.6.Final.jar:?]
at org.apache.druid.java.util.http.client.pool.ChannelResourceFactory.generate(ChannelResourceFactory.java:198) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ChannelResourceFactory.generate(ChannelResourceFactory.java:59) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ResourcePool$ResourceHolderPerKey.get(ResourcePool.java:285) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ResourcePool.take(ResourcePool.java:109) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:127) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.server.coordination.ChangeRequestHttpSyncer.sendSyncRequest(ChangeRequestHttpSyncer.java:247) ~[druid-server-33.0.0.jar:33.0.0]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Exception from historical:
2025-04-29T19:05:15,778 ERROR [main] org.apache.druid.query.lookup.LookupReferencesManager - Error while trying to get lookup list from coordinator for tier[__default]
org.apache.druid.java.util.common.IOE: Retries exhausted, couldn't fulfill request to [http://druid-druid-coordinators-7889b9b98d-jcrxs:8088/druid/coordinator/v1/lookups/config/__default?detailed=true].
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:219) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:133) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.fetchLookupsForTier(LookupReferencesManager.java:626) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:474) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$5(LookupReferencesManager.java:451) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:129) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:441) [druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.getLookupsList(LookupReferencesManager.java:418) [druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.loadLookupsAndInitStateRef(LookupReferencesManager.java:394) [druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.query.lookup.LookupReferencesManager.start(LookupReferencesManager.java:171) [druid-server-33.0.0.jar:33.0.0]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) [druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) [druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:152) [druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:136) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:94) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:70) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.Main.main(Main.java:112) [druid-services-33.0.0.jar:33.0.0]
Exception from router:
2025-04-29T20:02:09,326 WARN [CoordinatorRuleManager-Exec--0] org.apache.druid.discovery.DruidLeaderClient - Request[http://druid-druid-coordinators-7889b9b98d-jcrxs:8088/druid/coordinator/v1/rules] failed.
org.jboss.netty.channel.ChannelException: Faulty channel in resource pool
at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:134) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.AbstractHttpClient.go(AbstractHttpClient.java:33) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:158) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:133) ~[druid-server-33.0.0.jar:33.0.0]
at org.apache.druid.server.router.CoordinatorRuleManager.poll(CoordinatorRuleManager.java:135) ~[druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:87) [druid-processing-33.0.0.jar:33.0.0]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.nio.channels.UnresolvedAddressException
at java.base/sun.nio.ch.Net.checkAddress(Net.java:149) ~[?:?]
at java.base/sun.nio.ch.Net.checkAddress(Net.java:157) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.checkRemote(SocketChannelImpl.java:816) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:839) ~[?:?]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.handler.codec.http.HttpClientCodec.handleDownstream(HttpClientCodec.java:97) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.Channels.connect(Channels.java:634) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.channel.AbstractChannel.connect(AbstractChannel.java:215) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229) ~[netty-3.10.6.Final.jar:?]
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182) ~[netty-3.10.6.Final.jar:?]
at org.apache.druid.java.util.http.client.pool.ChannelResourceFactory.generate(ChannelResourceFactory.java:198) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ChannelResourceFactory.generate(ChannelResourceFactory.java:59) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ResourcePool$ResourceHolderPerKey.get(ResourcePool.java:285) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.pool.ResourcePool.take(ResourcePool.java:109) ~[druid-processing-33.0.0.jar:33.0.0]
at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:127) ~[druid-processing-33.0.0.jar:33.0.0]
... 13 more
For reproduction, please use my Helm chart. It works fine with Apache Druid 32.0.1, but breaks with version 33.0.0:
helm repo add druid-charts https://bsure-analytics.github.io/druid-charts
helm repo update
helm upgrade druid druid-charts/druid-dev --create-namespace --install --namespace druid --set druid.spec.image.tag=33.0.0
The error messages seem to indicate that the host druid-druid-coordinators-7889b9b98d-jcrxs cannot be resolved. I'm not a Kubernetes expert, so I am not sure what exactly could be causing that, but I wonder what might have changed since Druid 32.0.1. In the older version, was the coordinator advertising itself with a different hostname? If so, you could customize that with the druid.host runtime property.
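For example, a minimal sketch of such an override (the Service DNS name below is an assumption; substitute whatever Service your chart creates in front of the coordinator):

# jvm.config / extra JVM flags for the coordinator: advertise a resolvable
# Service name instead of the per-pod hostname
-Ddruid.host=druid-druid-coordinators.druid.svc.cluster.local
-Ddruid.plaintextPort=8088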
Druid 33 changed the default behaviour to use the host name instead of the IP address for internal communication.
But the coordinator runs as a Deployment rather than a StatefulSet, so the pod hostname druid-druid-coordinators-7889b9b98d-jcrxs is not an FQDN that can be resolved.
I recommend setting the DRUID_SET_HOST_IP environment variable to 1 to restore the previous behaviour.
See the release notes: https://github.com/apache/druid/releases/tag/druid-33.0.0#33.0.0-upgrade-notes-and-incompatible-changes
https://github.com/apache/druid/pull/17680
[Kubernetes deployments](https://github.com/apache/druid/releases/tag/druid-33.0.0#33.0.0-upgrade-notes-and-incompatible-changes-upgrade-notes-kubernetes-deployments)
By default, the Docker image now uses the canonical hostname to register services in ZooKeeper for internal communication if you're running Druid in Kubernetes. Otherwise, it uses the IP address. https://github.com/apache/druid/pull/17697.
You can set the environment variable DRUID_SET_HOST_IP to 1 to restore old behavior.
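A quick way to try this without touching the chart (the Deployment name is inferred from the pod name in the logs above, so treat it as an assumption):

# Sketch: restore the pre-33 behaviour on the coordinator Deployment
kubectl set env deployment/druid-druid-coordinators --namespace druid DRUID_SET_HOST_IP=1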
So this change breaks communication with nodes that are deployed as the Deployment kind in Kubernetes. To recover the old behavior, I should set DRUID_SET_HOST_IP=1. However, my Helm chart is not using environment variables but generating Java system properties instead, and I would like to keep it that way for consistency. Can I use -Ddruid.set.host.ip=1 instead?
Looking at the source code, it seems like the answer is "no". Here's the relevant code:
if [ -z "${KUBERNETES_SERVICE_HOST}" ]
then
    # Running outside kubernetes, use IP addresses
    DRUID_SET_HOST_IP=${DRUID_SET_HOST_IP:-1}
else
    # Running in kubernetes, so use canonical names
    DRUID_SET_HOST_IP=${DRUID_SET_HOST_IP:-0}
fi

if [ "${DRUID_SET_HOST_IP}" = "1" ]
then
    setKey $SERVICE druid.host $(ip r get 1 | awk '{print $7;exit}')
fi
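For reference, the pipeline in the last line asks the kernel which source address it would use to reach 1.0.0.0 and prints the field after src, i.e. the pod's primary IP (the addresses below are illustrative):

$ ip r get 1
1.0.0.0 via 10.244.0.1 dev eth0 src 10.244.0.23 uid 0
$ ip r get 1 | awk '{print $7;exit}'
10.244.0.23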
Well, the canonical-hostname default clearly doesn't work for Deployment resource kinds, and the script only consults the environment variable, so a Java system property won't help.
Next issue:
08:18:37.017 [main] ERROR org.apache.druid.cli.PullDependencies - Unable to resolve artifacts for [org.apache.druid.extensions.contrib:druid-kubernetes-overlord-extensions:jar:33.0.0 (runtime) -> [] < [central (https://repo1.maven.org/maven2/
org.eclipse.aether.resolution.DependencyResolutionException: Could not find artifact org.apache.druid.extensions.contrib:druid-kubernetes-overlord-extensions:jar:33.0.0 in central (https://repo1.maven.org/maven2/)
at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:342) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.apache.druid.cli.PullDependencies.downloadExtension(PullDependencies.java:392) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.PullDependencies.downloadExtension(PullDependencies.java:346) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.PullDependencies.run(PullDependencies.java:292) [druid-services-33.0.0.jar:33.0.0]
at org.apache.druid.cli.Main.main(Main.java:112) [druid-services-33.0.0.jar:33.0.0]
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not find artifact org.apache.druid.extensions.contrib:druid-kubernetes-overlord-extensions:jar:33.0.0 in central (https://repo1.maven.org/maven2/)
at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:413) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:215) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:325) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
... 4 more
Caused by: org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.apache.druid.extensions.contrib:druid-kubernetes-overlord-extensions:jar:33.0.0 in central (https://repo1.maven.org/maven2/)
at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:48) ~[maven-resolver-connector-basic-1.3.1.jar:1.3.1]
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:368) ~[maven-resolver-connector-basic-1.3.1.jar:1.3.1]
at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run(RunnableErrorForwarder.java:75) ~[maven-resolver-util-1.3.1.jar:1.3.1]
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$DirectExecutor.execute(BasicRepositoryConnector.java:642) ~[maven-resolver-connector-basic-1.3.1.jar:1.3.1]
at org.eclipse.aether.connector.basic.BasicRepositoryConnector.get(BasicRepositoryConnector.java:262) ~[maven-resolver-connector-basic-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:489) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:390) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:215) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:325) ~[maven-resolver-impl-1.3.1.jar:1.3.1]
... 4 more
Looks like the druid-kubernetes-overlord-extensions haven't been bumped to version 33.0.0 yet.
No, it's been moved to core, so I don't have to use pull-deps anymore. That's a welcome change!
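If I read the change correctly, it should now be enough to list the extension in the load list, with no pull-deps step (a runtime.properties sketch):

# Core extensions ship inside the distribution, so loading them is sufficient
druid.extensions.loadList=["druid-kubernetes-extensions", "druid-kubernetes-overlord-extensions"]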
Put DRUID_SET_HOST_IP in the Helm values files.
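For example (a sketch only; the key names are an assumption about this chart's values schema, not something the chart is known to support):

# values.yaml fragment passing the variable through to the Druid containers
druid:
  spec:
    env:
      - name: DRUID_SET_HOST_IP
        value: "1"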
Is there something we can or should change in the bundled script to improve this case? Otherwise I suppose we should close the issue, since the cause has been figured out.
Well, honestly, configuring Druid is no smooth ride because of its wild mix of Java system properties, environment variables, XML (for logging), JSON (for metrics), and so on. Unfortunately, this is architectural, so it can't easily be fixed at the root. However, it can be abstracted over, and that's why I've created the Druid charts, where you can configure everything using YAML, even the Java system properties: https://github.com/bsure-analytics/druid-charts
I apologize for the shameless self-plug.