Middlemanager Task failing on apparently graceful shutdown
Affected Version
0.20.0
Description
I've seen the issue happen twice, in two tasks created in the same minute. The tasks are reported as failed, but the logs say the tasks were successful (?).
Cluster size: 27 Dataservers (MiddleManager + Historical), 2 Brokers, 2 Routers, 2 Overlords, 2 Coordinators, 3 ZooKeeper nodes (on dedicated hosts). Task status:
"statusCode": "FAILED",
"status": "FAILED",
"runnerStatusCode": "WAITING",
Log:
2021-02-08T22:35:36,275 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2021-02-08T22:35:36,277 INFO [Thread-60] org.apache.druid.cli.CliPeon - Running shutdown hook
2021-02-08T22:35:36,278 INFO [Thread-60] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2021-02-08T22:35:36,281 INFO [Thread-60] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/druid-dataserver-9:8100]
2021-02-08T22:35:36,302 INFO [Thread-60] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/segments/druid-dataserver-9:8100/druid-dataserver-9:8100_indexer-executor__default_tier_2021-02-08T22:29:32.209Z_4e114ad4878f4dec80462dea24d108fc0]
2021-02-08T22:35:36,307 INFO [Thread-60] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/PEON/druid-dataserver-9:8100]
2021-02-08T22:35:36,327 INFO [Thread-60] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2021-02-08T22:35:45,829 INFO [[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139] spill[6] to disk in [5,245] ms (592,218 rows).
2021-02-08T22:35:46,969 INFO [[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data with commit metadata [AppenderatorDriverMetadata{segments={index_kafka_ssp-events-hourly_2a3e52186bc4cdd_0=[SegmentWithState{segmentIdentifier=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139, state=APPENDING}]}, lastSegmentIds={index_kafka_ssp-events-hourly_2a3e52186bc4cdd_0=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139}, callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='ssp_report', partitionSequenceNumberMap={12=92381673345}}}}] for segments: ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139
2021-02-08T22:36:06,347 INFO [Thread-60] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@3f45dfec{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2021-02-08T22:36:06,349 INFO [Thread-60] org.eclipse.jetty.server.session - node0 Stopped scavenging
2021-02-08T22:36:06,354 INFO [Thread-60] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@6b7c4734{/,null,UNAVAILABLE}
2021-02-08T22:36:06,367 ERROR [Thread-60] org.apache.druid.server.initialization.jetty.JettyServerModule - Jetty lifecycle event failed [class org.eclipse.jetty.server.Server]
java.util.concurrent.TimeoutException: null
at org.eclipse.jetty.util.FutureCallback.get(FutureCallback.java:130) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.util.FutureCallback.get(FutureCallback.java:30) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.server.handler.AbstractHandlerContainer.doShutdown(AbstractHandlerContainer.java:175) ~[jetty-server-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.server.Server.doStop(Server.java:432) ~[jetty-server-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:93) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.apache.druid.server.initialization.jetty.JettyServerModule$2.stop(JettyServerModule.java:454) ~[druid-server-0.20.0.jar:0.20.0]
at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) ~[druid-core-0.20.0.jar:0.20.0]
at org.apache.druid.cli.CliPeon.lambda$run$0(CliPeon.java:291) ~[druid-services-0.20.0.jar:0.20.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-02-08T22:36:06,373 WARN [Thread-60] org.apache.druid.server.initialization.jetty.JettyServerModule - Unable to stop Jetty server.
java.util.concurrent.TimeoutException: null
at org.eclipse.jetty.util.FutureCallback.get(FutureCallback.java:130) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.util.FutureCallback.get(FutureCallback.java:30) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.server.handler.AbstractHandlerContainer.doShutdown(AbstractHandlerContainer.java:175) ~[jetty-server-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.server.Server.doStop(Server.java:432) ~[jetty-server-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:93) ~[jetty-util-9.4.30.v20200611.jar:9.4.30.v20200611]
at org.apache.druid.server.initialization.jetty.JettyServerModule$2.stop(JettyServerModule.java:454) ~[druid-server-0.20.0.jar:0.20.0]
at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) ~[druid-core-0.20.0.jar:0.20.0]
at org.apache.druid.cli.CliPeon.lambda$run$0(CliPeon.java:291) ~[druid-services-0.20.0.jar:0.20.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-02-08T22:36:06,375 INFO [Thread-60] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2021-02-08T22:36:06,376 INFO [Thread-60] org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for [DruidServerMetadata{name='druid-dataserver-9:8100', hostAndPort='druid-dataserver-9:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-02-08T22:36:06,376 INFO [Thread-60] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping...
2021-02-08T22:36:06,376 INFO [Thread-60] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
2021-02-08T22:36:06,388 INFO [Thread-60] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg].
2021-02-08T22:36:06,389 INFO [Thread-60] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping gracefully (status: [READING])
2021-02-08T22:36:09,308 INFO [[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139] spill[7] to disk in [2,714] ms (305,020 rows).
2021-02-08T22:36:09,895 INFO [[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data with commit metadata [AppenderatorDriverMetadata{segments={index_kafka_ssp-events-hourly_2a3e52186bc4cdd_0=[SegmentWithState{segmentIdentifier=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139, state=APPENDING}]}, lastSegmentIds={index_kafka_ssp-events-hourly_2a3e52186bc4cdd_0=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139}, callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='ssp_report', partitionSequenceNumberMap={12=92381740546}}}}] for segments: ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139
2021-02-08T22:36:09,912 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_139] at path[/druid/segments/druid-dataserver-9:8100/druid-dataserver-9:8100_indexer-executor__default_tier_2021-02-08T22:29:32.209Z_4e114ad4878f4dec80462dea24d108fc0]
2021-02-08T22:36:09,915 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/middleManager","host":"druid-dataserver-9","bindOnHost":false,"plaintextPort":8100,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2021-02-08T22:36:09,926 INFO [Thread-60] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Graceful shutdown of task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg] finished in 3,538ms.
2021-02-08T22:36:09,931 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg",
"status" : "SUCCESS",
"duration" : 401261,
"errorMsg" : null,
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
2021-02-08T22:36:09,940 INFO [main] org.apache.druid.cli.CliPeon - Thread [Thread[Thread-60,5,main]] is non daemon.
2021-02-08T22:36:09,940 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] already stopped and stop was called. Silently skipping
Cannot remove shutdown hook, already shutting down!
Finished peon task
2021-02-08T22:36:09,953 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.
Hi @tanisdlj, could you give more details? Did the data ingest successfully?
Apparently, from what I can see in the middlemanager and overlord, nope, they didn't. The middlemanager logs for this task:
user@druid-dataserver-5$ sudo journalctl -eu druid-middlemanager --since "2021-02-08 22:28:00" --until "2021-02-08 22:38:00" | grep gnnoiele
Feb 08 22:29:20 druid-dataserver-5 java[25971]: 2021-02-08T22:29:20,129 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] started.
Feb 08 22:29:20 druid-dataserver-5 java[25971]: tup.logging.logProperties=true -Ddruid.indexer.fork.property.druid.lookup.enableLookupSyncOnStartup=true -Ddruid.zk.service.host=druid-zookeeper-1:2181,druid-zookeeper-2:2181,druid-zookeeper-3:2181 -Ddruid.indexer.logs.bucket=seedtag_analytics_druid_logs -Ddruid.zk.paths.base=/druid -Dfile.encoding=UTF-8 -Ddruid.storage.type=google -Ddruid.indexer.fork.property.druid.processing.numMergeBuffers=2 -Ddruid.indexer.logs.type=google -Ddruid.processing.numThreads=2 -Ddruid.processing.buffer.sizeBytes=104857600 -Ddruid.lookup.enableLookupSyncOnStartup=true -Ddruid.processing.numMergeBuffers=2 -Ddruid.metrics.emitter.dimension.dataSource=ssp-events-hourly -Ddruid.metrics.emitter.dimension.taskId=index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele -Ddruid.metrics.emitter.dimension.taskType=index_kafka -Ddruid.host=druid-dataserver-5 -Ddruid.plaintextPort=8100 -Ddruid.tlsPort=-1 -Ddruid.task.executor.service=druid/middleManager -Ddruid.task.executor.host=druid-dataserver-5 -Ddruid.task.executor.plaintextPort=8091 -Ddruid.task.executor.enablePlaintextPort=true -Ddruid.task.executor.tlsPort=-1 -Ddruid.task.executor.enableTlsPort=false org.apache.druid.cli.Main internal peon /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/task.json /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/bfdfd9f8-94d1-4c07-8627-455025cd6f6d/status.json /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/bfdfd9f8-94d1-4c07-8627-455025cd6f6d/report.json --loadBroadcastSegments true
Feb 08 22:29:20 druid-dataserver-5 java[25971]: 2021-02-08T22:29:20,136 INFO [forking-task-runner-8] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele output to: /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/log
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,303 INFO [qtp40910225-120] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [shut down request via HTTP endpoint]
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,303 INFO [qtp40910225-120] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele].
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,332 INFO [qtp40910225-118] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [shut down request via HTTP endpoint]
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,332 INFO [qtp40910225-118] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele].
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,559 INFO [qtp40910225-132] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [shut down request via HTTP endpoint]
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,559 INFO [qtp40910225-132] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele].
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,586 INFO [qtp40910225-86] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [shut down request via HTTP endpoint]
Feb 08 22:35:36 druid-dataserver-5 java[25971]: 2021-02-08T22:35:36,587 INFO [qtp40910225-86] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele].
Feb 08 22:36:09 druid-dataserver-5 java[25971]: 2021-02-08T22:36:09,776 INFO [forking-task-runner-8-[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele]] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele
Feb 08 22:36:09 druid-dataserver-5 java[25971]: 2021-02-08T22:36:09,778 INFO [forking-task-runner-8] org.apache.druid.storage.google.GoogleTaskLogs - Pushing task log /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/log to: indexing-logs/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele
Feb 08 22:36:10 druid-dataserver-5 java[25971]: 2021-02-08T22:36:10,019 INFO [forking-task-runner-8] org.apache.druid.storage.google.GoogleTaskLogs - Pushing task reports /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele/bfdfd9f8-94d1-4c07-8627-455025cd6f6d/report.json to: indexing-logs/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele.report.json
Feb 08 22:36:10 druid-dataserver-5 java[25971]: 2021-02-08T22:36:10,152 INFO [forking-task-runner-8] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: /druid/task/index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele
Feb 08 22:36:10 druid-dataserver-5 java[25971]: 2021-02-08T22:36:10,172 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] completed with status [FAILED].
From the overlord:
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,262 WARN [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [12] failed to publish, killing all tasks for these partitions
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,262 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg] because: [No task in the corresponding pending completion taskGroup[12] succeeded before completion timeout elapsed]
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,277 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: druid-dataserver-9:8091, status 200 OK, response: {"task":"index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg"}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,277 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg] from activeTasks
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,277 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg] from TaskLock[SegmentLock{lockType=EXCLUSIVE, groupId='index_kafka_ssp-events-hourly', dataSource='ssp-events-hourly', interval=2021-02-08T20:00:00.000Z/2021-02-08T21:00:00.000Z, version='2021-02-08T21:14:50.337Z', partitionId=139, priority=75, revoked=false}]
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,279 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[2032501]: SegmentLock{lockType=EXCLUSIVE, groupId='index_kafka_ssp-events-hourly', dataSource='ssp-events-hourly', interval=2021-02-08T20:00:00.000Z/2021-02-08T21:00:00.000Z, version='2021-02-08T21:14:50.337Z', partitionId=139, priority=75, revoked=false}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,289 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg to status: TaskStatus{id=index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg, status=FAILED, duration=-1, errorMsg=null}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,297 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskQueue - Task done: AbstractTask{id='index_kafka_ssp-events-hourly_2a3e52186bc4cdd_ddcbmbfg', groupId='index_kafka_ssp-events-hourly', taskResource=TaskResource{availabilityGroup='index_kafka_ssp-events-hourly_2a3e52186bc4cdd', requiredCapacity=1}, dataSource='ssp-events-hourly', context={checkpoints={"0":{"12":92380626703}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=false}}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,298 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [No task in the corresponding pending completion taskGroup[12] succeeded before completion timeout elapsed]
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,308 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: druid-dataserver-5:8091, status 200 OK, response: {"task":"index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele"}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,309 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] from activeTasks
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,309 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] from TaskLock[SegmentLock{lockType=EXCLUSIVE, groupId='index_kafka_ssp-events-hourly', dataSource='ssp-events-hourly', interval=2021-02-08T20:00:00.000Z/2021-02-08T21:00:00.000Z, version='2021-02-08T21:14:50.337Z', partitionId=139, priority=75, revoked=false}]
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,309 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: SegmentLock{lockType=EXCLUSIVE, groupId='index_kafka_ssp-events-hourly', dataSource='ssp-events-hourly', interval=2021-02-08T20:00:00.000Z/2021-02-08T21:00:00.000Z, version='2021-02-08T21:14:50.337Z', partitionId=139, priority=75, revoked=false}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,310 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[2032502]: SegmentLock{lockType=EXCLUSIVE, groupId='index_kafka_ssp-events-hourly', dataSource='ssp-events-hourly', interval=2021-02-08T20:00:00.000Z/2021-02-08T21:00:00.000Z, version='2021-02-08T21:14:50.337Z', partitionId=139, priority=75, revoked=false}
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,319 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele to status: TaskStatus{id=index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele, status=FAILED, duration=-1, errorMsg=null}
Feb 8 22:35:36 druid-master-2 java[23035]: 2021-02-08T22:35:36,327 INFO [NodeRoleWatcher[PEON]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://druid-dataserver-9:8100] of role[peon] went offline.
From the overlord, I would say the important ones are:
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,262 WARN [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [12] failed to publish, killing all tasks for these partitions
Feb 8 22:35:36 druid-master-2 java[15379]: 2021-02-08T22:35:36,298 INFO [KafkaSupervisor-ssp-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_ssp-events-hourly_2a3e52186bc4cdd_gnnoiele] because: [No task in the corresponding pending completion taskGroup[12] succeeded before completion timeout elapsed]
A bit more info that might be useful: another task killed the same way, from the overlord:
Mar 22 19:07:00 druid-master-1 java[9686]: 2021-03-22T19:07:00,487 INFO [KafkaSupervisor-events-hourly] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_events-hourly_6b40993d6374451_mfkjmnbg] because: [No task in the corresponding pending completion taskGroup[6] succeeded before completion timeout elapsed]
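For context on those shutdown messages: completionTimeout is the supervisor ioConfig setting that bounds how long the supervisor waits for tasks that have stopped reading and are publishing their segments. When it elapses without any task in the pending-completion group succeeding, the supervisor marks those tasks failed and also kills the still-reading tasks for the same partitions, which matches the "All tasks in group [12] failed to publish" warning above. A minimal, illustrative fragment showing where the setting lives in a Kafka supervisor ioConfig (values are examples only, not recommendations):
"ioConfig": {
  "topic": "ssp_report",
  "taskDuration": "PT1H",
  "completionTimeout": "PT30M"
}
completionTimeout defaults to PT30M; if publishing regularly takes longer than that under heavy load or severe Kafka lag, raising it can keep the supervisor from killing tasks that would otherwise finish.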
@tanisdlj Same for me.
Saw the same issue and a similar error message: task in the corresponding pending completion taskGroup[1] succeeded before completion timeout ela... Any updates? @Gabriel39
Same issue in some of our druid clusters @Gabriel39
Same happened to me. The datasource has a heavy load and every task is failing.
Having the same issue on Druid 26.0.0: no apparent error in the task log, and the task seems to have finished successfully according to it:
2023-10-31T10:46:32,380 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_kafka_data_source_name_dcfc6edf6a7ae39_klpgcjle",
"status" : "SUCCESS",
"duration" : 3780597,
"errorMsg" : null,
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
Yet, the task status is FAILED:
"statusCode": "FAILED",
"status": "FAILED",
"runnerStatusCode": "WAITING",
"errorMsg": "No task in the corresponding pending completion taskGroup[0] succeeded before completion timeout ela..."
Indexers have the same issue. Under severe lag, they fail to complete segments...
We have seen a number of these recently.
In our particular success-task-misreported-as-failure cases, the task shutdowns (triggered from registered shutdown hooks or @LifecycleStop annotations) are sometimes not clean and result in non-zero exit codes, even though the tasks have reported their status as SUCCESS in both their logs and their task_folder/status.json file. These non-zero exit codes, together with the MM's task-reaping logic, lead to the misreportings.
IMHO, the task-reaping logic is wrong to prioritize exit codes over the content of the task_folder/status.json file. A task should be reported as a SUCCESS as long as it has written a task_folder/status.json file with "status" : "SUCCESS" in it. Any number of exceptions can be thrown during the shutdown sequence, but that shouldn't affect the contract already delivered by the task.
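To make that concrete, here is a minimal sketch of the proposed precedence. This is not Druid's actual ForkingTaskRunner code; the class and method names are hypothetical. The idea is simply to trust the status.json the task wrote and fall back to the process exit code only when no readable status file exists:
import java.io.File;
import java.nio.file.Files;

class TaskStatusResolver
{
  // Returns true if the task should be reported as SUCCESS.
  static boolean isSuccess(File statusFile, int exitCode)
  {
    if (statusFile.exists()) {
      try {
        String json = new String(Files.readAllBytes(statusFile.toPath()));
        // A real implementation would deserialize TaskStatus with Jackson;
        // a substring check is enough to illustrate the precedence.
        return json.contains("\"status\" : \"SUCCESS\"") || json.contains("\"status\":\"SUCCESS\"");
      }
      catch (Exception e) {
        // Unreadable status file: fall through to the exit code.
      }
    }
    // No usable status file: only then does the exit code decide.
    return exitCode == 0;
  }
}
Under a rule like this, the "Process exited with status[2]" seen in the MiddleManager log above would not override the SUCCESS already written to status.json.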