shardingsphere-elasticjob icon indicating copy to clipboard operation
shardingsphere-elasticjob copied to clipboard

Job may not be scheduled because Quartz logic may stuck into infinity loop

Open TeslaCN opened this issue 1 year ago • 2 comments

Bug Report

This may be similar with https://github.com/apache/shardingsphere-elasticjob/issues/2038.

Which version of ElasticJob did you use?

3.0.1 & 8b64a8c94830aaa96b9b0ddcba80bc31b146eecd

Which project did you use? ElasticJob-Lite or ElasticJob-Cloud?

Lite

Expected behavior

Job keep running.

Actual behavior

Job could not be triggered after a while.

2022-10-15 18:38:19
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0.2+8-LTS-86 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x000060000344c8c0, length=21, elements={
0x000000011f018200, 0x000000010e8c6600, 0x000000010e8c8e00, 0x000000011f04e200,
0x000000011f04e800, 0x000000011e813000, 0x000000011f04ee00, 0x000000011f04f400,
0x000000011f04fa00, 0x0000000105014600, 0x000000011e81be00, 0x000000011c80aa00,
0x000000010ea63c00, 0x000000011f054a00, 0x000000011f0fee00, 0x000000011f14b400,
0x000000011f14ac00, 0x000000011f38c000, 0x000000010eab0a00, 0x000000011f38f800,
0x000000013f02ac00
}

"main" #1 prio=5 os_prio=31 cpu=26553.26ms elapsed=1297.94s tid=0x000000011f018200 nid=0x1903 in Object.wait()  [0x000000016b806000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at java.lang.Object.wait([email protected]/Object.java:338)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1561)
	- locked <0x000000071ad02728> (a org.apache.zookeeper.ClientCnxn$Packet)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1533)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2734)
	at org.apache.curator.utils.ZKPaths.deleteChildren(ZKPaths.java:379)
	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:287)
	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
	at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
	at org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperRegistryCenter.remove(ZookeeperRegistryCenter.java:272)
	at org.apache.shardingsphere.elasticjob.lite.internal.storage.JobNodeStorage.removeJobNodeIfExisted(JobNodeStorage.java:129)
	at org.apache.shardingsphere.elasticjob.lite.internal.instance.InstanceService.triggerAllInstances(InstanceService.java:88)
	at org.apache.shardingsphere.elasticjob.lite.api.bootstrap.impl.OneOffJobBootstrap.execute(OneOffJobBootstrap.java:54)
	at icu.wwj.hello.world.elasticjob.HelloOneOffJob.main(HelloOneOffJob.java:21)

"Reference Handler" #2 daemon prio=10 os_prio=31 cpu=0.98ms elapsed=1297.93s tid=0x000000010e8c6600 nid=0x4403 waiting on condition  [0x000000016c65a000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ref.Reference.waitForReferencePendingList([email protected]/Native Method)
	at java.lang.ref.Reference.processPendingReferences([email protected]/Reference.java:253)
	at java.lang.ref.Reference$ReferenceHandler.run([email protected]/Reference.java:215)

"Finalizer" #3 daemon prio=8 os_prio=31 cpu=0.56ms elapsed=1297.93s tid=0x000000010e8c8e00 nid=0x3f03 in Object.wait()  [0x000000016c866000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
	- locked <0x00000007002b68e8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:176)
	at java.lang.ref.Finalizer$FinalizerThread.run([email protected]/Finalizer.java:172)

"Signal Dispatcher" #4 daemon prio=9 os_prio=31 cpu=0.41ms elapsed=1297.92s tid=0x000000011f04e200 nid=0xa703 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" #5 daemon prio=9 os_prio=31 cpu=5.36ms elapsed=1297.92s tid=0x000000011f04e800 nid=0x5b03 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Monitor Deflation Thread" #6 daemon prio=9 os_prio=31 cpu=320.30ms elapsed=1297.92s tid=0x000000011e813000 nid=0xa403 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #7 daemon prio=9 os_prio=31 cpu=7765.56ms elapsed=1297.92s tid=0x000000011f04ee00 nid=0xa303 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"C1 CompilerThread0" #10 daemon prio=9 os_prio=31 cpu=910.64ms elapsed=1297.92s tid=0x000000011f04f400 nid=0xa103 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"Sweeper thread" #11 daemon prio=9 os_prio=31 cpu=21.76ms elapsed=1297.92s tid=0x000000011f04fa00 nid=0x5e03 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Common-Cleaner" #12 daemon prio=8 os_prio=31 cpu=2.96ms elapsed=1297.91s tid=0x0000000105014600 nid=0x9f03 in Object.wait()  [0x000000016d7d2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
	- locked <0x00000007002b7da8> (a java.lang.ref.ReferenceQueue$Lock)
	at jdk.internal.ref.CleanerImpl.run([email protected]/CleanerImpl.java:140)
	at java.lang.Thread.run([email protected]/Thread.java:833)
	at jdk.internal.misc.InnocuousThread.run([email protected]/InnocuousThread.java:162)

"Monitor Ctrl-Break" #13 daemon prio=5 os_prio=31 cpu=11.83ms elapsed=1297.85s tid=0x000000011e81be00 nid=0x9d03 runnable  [0x000000016d9de000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.SocketDispatcher.read0([email protected]/Native Method)
	at sun.nio.ch.SocketDispatcher.read([email protected]/SocketDispatcher.java:47)
	at sun.nio.ch.NioSocketImpl.tryRead([email protected]/NioSocketImpl.java:261)
	at sun.nio.ch.NioSocketImpl.implRead([email protected]/NioSocketImpl.java:312)
	at sun.nio.ch.NioSocketImpl.read([email protected]/NioSocketImpl.java:350)
	at sun.nio.ch.NioSocketImpl$1.read([email protected]/NioSocketImpl.java:803)
	at java.net.Socket$SocketInputStream.read([email protected]/Socket.java:966)
	at sun.nio.cs.StreamDecoder.readBytes([email protected]/StreamDecoder.java:270)
	at sun.nio.cs.StreamDecoder.implRead([email protected]/StreamDecoder.java:313)
	at sun.nio.cs.StreamDecoder.read([email protected]/StreamDecoder.java:188)
	- locked <0x00000007002b97b8> (a java.io.InputStreamReader)
	at java.io.InputStreamReader.read([email protected]/InputStreamReader.java:177)
	at java.io.BufferedReader.fill([email protected]/BufferedReader.java:162)
	at java.io.BufferedReader.readLine([email protected]/BufferedReader.java:329)
	- locked <0x00000007002b97b8> (a java.io.InputStreamReader)
	at java.io.BufferedReader.readLine([email protected]/BufferedReader.java:396)
	at com.intellij.rt.execution.application.AppMainV2$1.run(AppMainV2.java:55)

"Notification Thread" #14 daemon prio=9 os_prio=31 cpu=0.04ms elapsed=1297.85s tid=0x000000011c80aa00 nid=0x9b03 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Curator-ConnectionStateManager-0" #18 daemon prio=5 os_prio=31 cpu=39.60ms elapsed=1297.52s tid=0x000000010ea63c00 nid=0x9407 waiting on condition  [0x000000016ea3e000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000007006b4fc0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:1672)
	at java.util.concurrent.ArrayBlockingQueue.poll([email protected]/ArrayBlockingQueue.java:435)
	at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:271)
	at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
	at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

"main-SendThread(127.0.0.1:2181)" #19 daemon prio=5 os_prio=31 cpu=46532.31ms elapsed=1297.49s tid=0x000000011f054a00 nid=0x9203 runnable  [0x000000016ec4a000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.KQueue.poll([email protected]/Native Method)
	at sun.nio.ch.KQueueSelectorImpl.doSelect([email protected]/KQueueSelectorImpl.java:122)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:129)
	- locked <0x00000007006b3ea8> (a sun.nio.ch.Util$2)
	- locked <0x00000007006b3e50> (a sun.nio.ch.KQueueSelectorImpl)
	at sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:141)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:330)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)

"main-EventThread" #20 daemon prio=5 os_prio=31 cpu=13926.12ms elapsed=1297.49s tid=0x000000011f0fee00 nid=0x9003 waiting on condition  [0x000000016ee56000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000007006b47e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:506)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3434)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1623)
	at java.util.concurrent.LinkedBlockingQueue.take([email protected]/LinkedBlockingQueue.java:435)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:547)

"Curator-Framework-0" #21 daemon prio=5 os_prio=31 cpu=5.64ms elapsed=1297.49s tid=0x000000011f14b400 nid=0x8f03 waiting on condition  [0x000000016f062000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000007006b41f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:506)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3434)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1623)
	at java.util.concurrent.DelayQueue.take([email protected]/DelayQueue.java:217)
	at java.util.concurrent.DelayQueue.take([email protected]/DelayQueue.java:77)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:947)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:342)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run([email protected]/ScheduledThreadPoolExecutor.java:304)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

"oneoff_Worker-1" #24 prio=5 os_prio=31 cpu=7651.79ms elapsed=1297.38s tid=0x000000011f14ac00 nid=0x6703 in Object.wait()  [0x000000016f26e000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)
	- locked <0x000000070096c8a8> (a java.lang.Object)

"oneoff_QuartzSchedulerThread" #25 prio=5 os_prio=31 cpu=951581.34ms elapsed=1297.38s tid=0x000000011f38c000 nid=0x8b03 runnable  [0x000000016f47a000]
   java.lang.Thread.State: RUNNABLE
	at java.util.TreeMap.put([email protected]/TreeMap.java:795)
	at java.util.TreeMap.put([email protected]/TreeMap.java:534)
	at java.util.TreeSet.add([email protected]/TreeSet.java:255)
	at org.quartz.simpl.RAMJobStore.acquireNextTriggers(RAMJobStore.java:1476)
	- locked <0x00000007009ae4e0> (a java.lang.Object)
	at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:287)

"Curator-SafeNotifyService-0" #27 daemon prio=5 os_prio=31 cpu=20479.43ms elapsed=1297.37s tid=0x000000010eab0a00 nid=0x6803 waiting for monitor entry  [0x000000016f686000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.quartz.simpl.RAMJobStore.getTriggerState(RAMJobStore.java:641)
	- waiting to lock <0x00000007009ae4e0> (a java.lang.Object)
	at org.quartz.core.QuartzScheduler.getTriggerState(QuartzScheduler.java:1562)
	at org.quartz.impl.StdScheduler.getTriggerState(StdScheduler.java:518)
	at org.apache.shardingsphere.elasticjob.lite.internal.schedule.JobScheduleController.isPaused(JobScheduleController.java:124)
	- locked <0x00000007009fa8c0> (a org.apache.shardingsphere.elasticjob.lite.internal.schedule.JobScheduleController)
	at org.apache.shardingsphere.elasticjob.lite.internal.instance.ShutdownListenerManager$InstanceShutdownStatusJobListener.dataChanged(ShutdownListenerManager.java:56)
	at org.apache.shardingsphere.elasticjob.lite.internal.listener.AbstractJobListener.event(AbstractJobListener.java:40)
	at org.apache.curator.framework.recipes.cache.CuratorCacheImpl.lambda$null$8(CuratorCacheImpl.java:286)
	at org.apache.curator.framework.recipes.cache.CuratorCacheImpl$$Lambda$182/0x0000000800e025d8.accept(Unknown Source)
	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
	at org.apache.curator.framework.listen.MappingListenerManager$$Lambda$114/0x0000000800d864b0.run(Unknown Source)
	at org.apache.curator.framework.listen.MappingListenerManager$$Lambda$108/0x0000000800d768f8.execute(Unknown Source)
	at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
	at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
	at org.apache.curator.framework.recipes.cache.CuratorCacheImpl.lambda$callListeners$10(CuratorCacheImpl.java:293)
	at org.apache.curator.framework.recipes.cache.CuratorCacheImpl$$Lambda$168/0x0000000800de0b28.run(Unknown Source)
	at java.util.concurrent.CompletableFuture$AsyncRun.run([email protected]/CompletableFuture.java:1804)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

"ReconcileService" #29 prio=5 os_prio=31 cpu=28.02ms elapsed=1297.24s tid=0x000000011f38f800 nid=0x6907 waiting on condition  [0x000000016f892000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x0000000700957298> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:1672)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:1182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:899)
	at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1062)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1122)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

"Attach Listener" #65 daemon prio=9 os_prio=31 cpu=1.55ms elapsed=990.49s tid=0x000000013f02ac00 nid=0x1514f waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"VM Thread" os_prio=31 cpu=847.32ms elapsed=1297.93s tid=0x0000000104e0a400 nid=0x3c03 runnable  

"GC Thread#0" os_prio=31 cpu=969.57ms elapsed=1297.94s tid=0x0000000120107530 nid=0x4d03 runnable  

"GC Thread#1" os_prio=31 cpu=997.27ms elapsed=1297.75s tid=0x0000000120113520 nid=0x9903 runnable  

"GC Thread#2" os_prio=31 cpu=1003.96ms elapsed=1297.75s tid=0x00000001201139a0 nid=0x9703 runnable  

"GC Thread#3" os_prio=31 cpu=1006.63ms elapsed=1297.75s tid=0x0000000120114210 nid=0x6203 runnable  

"GC Thread#4" os_prio=31 cpu=989.71ms elapsed=1297.75s tid=0x0000000120114a80 nid=0x6303 runnable  

"GC Thread#5" os_prio=31 cpu=1008.97ms elapsed=1297.75s tid=0x00000001201152f0 nid=0x6403 runnable  

"GC Thread#6" os_prio=31 cpu=976.70ms elapsed=1290.64s tid=0x0000000104b09170 nid=0x5707 runnable  

"GC Thread#7" os_prio=31 cpu=988.94ms elapsed=1290.64s tid=0x0000000104b095f0 nid=0x15203 runnable  

"G1 Main Marker" os_prio=31 cpu=0.04ms elapsed=1297.94s tid=0x0000000120107be0 nid=0x3103 runnable  

"G1 Conc#0" os_prio=31 cpu=0.03ms elapsed=1297.94s tid=0x0000000104e041e0 nid=0x4a03 runnable  

"G1 Refine#0" os_prio=31 cpu=0.03ms elapsed=1297.94s tid=0x0000000104e05f10 nid=0x3303 runnable  

"G1 Service" os_prio=31 cpu=166.60ms elapsed=1297.94s tid=0x0000000104e067b0 nid=0x4703 runnable  

"VM Periodic Task Thread" os_prio=31 cpu=373.61ms elapsed=1297.85s tid=0x0000000105a046f0 nid=0x9a03 waiting on condition  

JNI global refs: 19, weak refs: 0

Example codes for reproduce this issue (such as a github link).

Start multiple processes of the following codes.

JobConfiguration jobConfiguration = JobConfiguration.newBuilder("oneoff", 32)
        .overwrite(true)
        .failover(true)
        .build();
OneOffJobBootstrap oneOffJobBootstrap = new OneOffJobBootstrap(REGISTRY_CENTER, (SimpleJob) shardingContext -> log.info("{}", shardingContext), jobConfiguration);
for (;;) {
    oneOffJobBootstrap.execute();
}

TeslaCN avatar Oct 15 '22 10:10 TeslaCN

One thread blocked for a while, maybe not deadlock.

TeslaCN avatar Oct 16 '22 10:10 TeslaCN

I found this was not caused by deadlock. Maybe infinity loop occurred in Quartz.

There are many tasks were not consumed. image

The mod count in TreeSet instance was keep changing.

image image image

The Quartz thread was keep consuming CPU.

image

TeslaCN avatar Oct 16 '22 10:10 TeslaCN