druid
druid copied to clipboard
Coordinator does not handle ZK restart gracefully.
Affected Version
0.23.0
Description
I was running a simple test on how robust Coordinator deals with Zookeeper restart. Unfortunately Coordinator crashed immediately.
This is the logs that shows the crash:
2022-10-02T00:41:08,427 INFO [main-EventThread] org.apache.curator.framework.imps.EnsembleTracker - New config event received: {server.2=bitnami-zk-zookeeper-1.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.1=bitnami-zk-zookeeper-0.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.3=bitnami-zk-zookeeper-2.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0}
2022-10-02T00:41:08,434 ERROR [main-EventThread] org.apache.curator.framework.imps.CuratorFrameworkImpl - Background exception was not retry-able or retry gave up
java.lang.NullPointerException: Cannot invoke "java.net.InetAddress.getHostAddress()" because the return value of "java.net.InetSocketAddress.getAddress()" is null
at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598) [zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.9.jar:3.5.9]
2022-10-02T00:41:08,434 ERROR [main-EventThread] org.apache.druid.curator.CuratorModule - Unhandled error in Curator, stopping server.
java.lang.NullPointerException: Cannot invoke "java.net.InetAddress.getHostAddress()" because the return value of "java.net.InetSocketAddress.getAddress()" is null
at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150) ~[curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222) [curator-framework-4.3.0.jar:4.3.0]
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598) [zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.9.jar:3.5.9]
2022-10-02T00:41:08,435 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-10-02T00:41:08,436 INFO [main-EventThread] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/COORDINATOR/druid-coordinator-0.druid-coordinator-headless.druid.svc.cluster.local:8081]
2022-10-02T00:41:08,443 INFO [main-EventThread] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/coordinator","host":"druid-coordinator-0.druid-coordinator-headless.druid.svc.cluster.local","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2022-10-02T00:41:08,444 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-10-02T00:41:08,448 INFO [main-EventThread] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@3fe3ebb3{HTTP/1.1, (http/1.1)}{0.0.0.0:8081}
2022-10-02T00:41:08,448 INFO [main-EventThread] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-10-02T00:41:08,449 INFO [main-EventThread] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@54cce500{/,jar:file:/opt/apache-druid-0.23.0/lib/druid-console-0.23.0.jar!/org/apache/druid/console,STOPPED}
2022-10-02T00:41:08,458 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authorization.db.updater.CoordinatorBasicAuthorizerMetadataStorageUpdater - CoordinatorBasicAuthorizerMetadataStorageUpdater is stopping.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authorization.db.updater.CoordinatorBasicAuthorizerMetadataStorageUpdater - CoordinatorBasicAuthorizerMetadataStorageUpdater is stopped.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopping.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopped.
2022-10-02T00:41:08,468 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,469 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,469 INFO [CoordinatorBasicAuthenticatorCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthenticatorCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,474 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing Composing Emitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing Composing Emitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2022-10-02T00:41:08,476 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-10-02T00:41:08,581 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Session: 0x10061d8cd080000 closed
2022-10-02T00:41:08,581 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
I saw a recent Curator patch being merged related to this issue: https://github.com/apache/curator/pull/433.
Maybe it is time to upgrade Curator?
This problem is affecting us quite badly. Since every coordinator goes through this same error almost together, there is no leader in the cluster until they complete the restart, which then causes the other services to fail.
The coordinators spin up quite fast, but the other services are already lost by that time. Some of them (like the historicals) can take a long time to be available in the cluster again.
Small cluster deployments (3 nodes for ZK and Coordinators each) can frequently see this issue when deploying inside K8s.
Merged #13295 to upgrade curator. The error should go away now.
Great!
We've been using a custom image built from that PR and it's working without any issues. Looking forward to a release so we can get back to the official images.
The next release of Druid (25.0) is scheduled for early December.
We've been running Druid 25.0 for about a month and we didn't have this problem anymore. It was happening about twice a day on 24.0 and 0.23 on the same cluster, so I think this can be closed.
Is it still a problem for you @didip?
Yeah, this seemed to be fixed. Closing.