druid icon indicating copy to clipboard operation
druid copied to clipboard

Coordinator does not handle ZK restart gracefully.

Open didip opened this issue 2 years ago • 2 comments

Affected Version

0.23.0

Description

I was running a simple test on how robust Coordinator deals with Zookeeper restart. Unfortunately Coordinator crashed immediately.

This is the logs that shows the crash:

2022-10-02T00:41:08,427 INFO [main-EventThread] org.apache.curator.framework.imps.EnsembleTracker - New config event received: {server.2=bitnami-zk-zookeeper-1.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.1=bitnami-zk-zookeeper-0.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.3=bitnami-zk-zookeeper-2.bitnami-zk-zookeeper-headless.druid.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0}
2022-10-02T00:41:08,434 ERROR [main-EventThread] org.apache.curator.framework.imps.CuratorFrameworkImpl - Background exception was not retry-able or retry gave up
java.lang.NullPointerException: Cannot invoke "java.net.InetAddress.getHostAddress()" because the return value of "java.net.InetSocketAddress.getAddress()" is null
	at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598) [zookeeper-3.5.9.jar:3.5.9]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.9.jar:3.5.9]
2022-10-02T00:41:08,434 ERROR [main-EventThread] org.apache.druid.curator.CuratorModule - Unhandled error in Curator, stopping server.
java.lang.NullPointerException: Cannot invoke "java.net.InetAddress.getHostAddress()" because the return value of "java.net.InetSocketAddress.getAddress()" is null
	at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150) ~[curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222) [curator-framework-4.3.0.jar:4.3.0]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598) [zookeeper-3.5.9.jar:3.5.9]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.9.jar:3.5.9]
2022-10-02T00:41:08,435 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-10-02T00:41:08,436 INFO [main-EventThread] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/COORDINATOR/druid-coordinator-0.druid-coordinator-headless.druid.svc.cluster.local:8081]
2022-10-02T00:41:08,443 INFO [main-EventThread] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/coordinator","host":"druid-coordinator-0.druid-coordinator-headless.druid.svc.cluster.local","bindOnHost":false,"plaintextPort":8081,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"coordinator","services":{}}].
2022-10-02T00:41:08,444 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-10-02T00:41:08,448 INFO [main-EventThread] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@3fe3ebb3{HTTP/1.1, (http/1.1)}{0.0.0.0:8081}
2022-10-02T00:41:08,448 INFO [main-EventThread] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-10-02T00:41:08,449 INFO [main-EventThread] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@54cce500{/,jar:file:/opt/apache-druid-0.23.0/lib/druid-console-0.23.0.jar!/org/apache/druid/console,STOPPED}
2022-10-02T00:41:08,458 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authorization.db.updater.CoordinatorBasicAuthorizerMetadataStorageUpdater - CoordinatorBasicAuthorizerMetadataStorageUpdater is stopping.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authorization.db.updater.CoordinatorBasicAuthorizerMetadataStorageUpdater - CoordinatorBasicAuthorizerMetadataStorageUpdater is stopped.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopping.
2022-10-02T00:41:08,468 INFO [main-EventThread] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopped.
2022-10-02T00:41:08,468 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,469 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,469 INFO [CoordinatorBasicAuthenticatorCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthenticatorCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)
2022-10-02T00:41:08,474 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing Composing Emitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing Composing Emitter.
2022-10-02T00:41:08,475 INFO [main-EventThread] org.apache.druid.java.util.emitter.core.ComposingEmitter - Closing emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2022-10-02T00:41:08,476 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-10-02T00:41:08,581 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Session: 0x10061d8cd080000 closed
2022-10-02T00:41:08,581 INFO [main-EventThread] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

didip avatar Oct 02 '22 00:10 didip

I saw a recent Curator patch being merged related to this issue: https://github.com/apache/curator/pull/433.

Maybe it is time to upgrade Curator?

didip avatar Oct 02 '22 01:10 didip

This problem is affecting us quite badly. Since every coordinator goes through this same error almost together, there is no leader in the cluster until they complete the restart, which then causes the other services to fail.

The coordinators spin up quite fast, but the other services are already lost by that time. Some of them (like the historicals) can take a long time to be available in the cluster again.

Small cluster deployments (3 nodes for ZK and Coordinators each) can frequently see this issue when deploying inside K8s.

Fryuni avatar Oct 07 '22 16:10 Fryuni

Merged #13295 to upgrade curator. The error should go away now.

kfaraz avatar Nov 21 '22 07:11 kfaraz

Great!

We've been using a custom image built from that PR and it's working without any issues. Looking forward to a release so we can get back to the official images.

Fryuni avatar Nov 21 '22 11:11 Fryuni

The next release of Druid (25.0) is scheduled for early December.

kfaraz avatar Nov 23 '22 03:11 kfaraz

We've been running Druid 25.0 for about a month and we didn't have this problem anymore. It was happening about twice a day on 24.0 and 0.23 on the same cluster, so I think this can be closed.

Is it still a problem for you @didip?

Fryuni avatar Mar 07 '23 12:03 Fryuni

Yeah, this seemed to be fixed. Closing.

didip avatar Mar 07 '23 18:03 didip