[Bug]: TaskRuntime.stateLock and TableOptimizingProcess.lock can cause deadlocks.
What happened?
TaskRuntime.stateLock and TableOptimizingProcess.lock can deadlock against each other. When this happens, tables in the affected resource group malfunction, and OptimizerKeeper also fails to function properly.
Thread1: Holding: 0x000000059a732858 Waiting: 0x000000059a739970
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:48)
at org.apache.amoro.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:148)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2143/1314391570.accept(Unknown Source)
at java.util.HashMap$Values.forEach(HashMap.java:982)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.cancelTasks(OptimizingQueue.java:768)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistAndSetCompleted$13(OptimizingQueue.java:748)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2139/44375492.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.optimizing.OptimizingQueue.access$500(OptimizingQueue.java:74)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistAndSetCompleted(OptimizingQueue.java:745)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:539)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059a79be88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059bbf71e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
Thread2: Holding: 0x000000059a739970 Waiting: 0x000000059a732858
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:487)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a1e0228> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
The root cause of the deadlock
Nested lock acquisition leads to a circular wait:
- Thread 1 holds TableOptimizingProcess.lock (0x000000059a732858) and is waiting for the stateLock of another TaskRuntime (0x000000059a739970).
- Thread 2 holds that TaskRuntime.stateLock (0x000000059a739970) and is waiting for TableOptimizingProcess.lock (0x000000059a732858).
Specific problem
In acceptResult(), while TableOptimizingProcess.lock is held, persistAndSetCompleted(false) is called. That method invokes cancelTasks() inside doAsTransaction(); cancelTasks() iterates over all tasks and calls TaskRuntime.tryCanceling(), which must acquire each TaskRuntime's stateLock.
If another thread (Thread 2 above) already holds the stateLock of one of those TaskRuntimes and is waiting for TableOptimizingProcess.lock, the two threads deadlock, as sketched below.
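For illustration only, here is a minimal Java sketch of the two inverted acquisition orders. TaskRuntimeSketch and ProcessSketch are hypothetical, simplified stand-ins for TaskRuntime and TableOptimizingProcess; they only mirror the lock ordering visible in the stack traces, with the persistence/transaction layers elided.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-in for TaskRuntime (hypothetical class, not the real Amoro code).
class TaskRuntimeSketch {
  final ReentrantLock stateLock = new ReentrantLock(); // plays the role of TaskRuntime.stateLock
  ProcessSketch process;

  // Thread 2's path: stateLock is taken first (via StatedPersistentBase.invokeConsistency),
  // then the process lock is requested inside acceptResult().
  void complete() {
    stateLock.lock();
    try {
      process.acceptResult(this);
    } finally {
      stateLock.unlock();
    }
  }

  // Called by cancelTasks() while the caller already holds the process lock,
  // i.e. a "process lock -> stateLock" acquisition.
  void tryCanceling() {
    stateLock.lock();
    try {
      // mark this task as canceled (persistence elided in this sketch)
    } finally {
      stateLock.unlock();
    }
  }
}

// Simplified stand-in for TableOptimizingProcess (hypothetical class).
class ProcessSketch {
  final ReentrantLock lock = new ReentrantLock(); // plays the role of TableOptimizingProcess.lock
  List<TaskRuntimeSketch> tasks;

  // Thread 1's path: the process lock is taken first, then the failure branch
  // (persistAndSetCompleted(false) -> cancelTasks()) tries to take every task's stateLock.
  void acceptResult(TaskRuntimeSketch completedTask) {
    lock.lock();
    try {
      tasks.forEach(TaskRuntimeSketch::tryCanceling); // needs each task's stateLock
    } finally {
      lock.unlock();
    }
  }
}
```

With these shapes, a thread in complete() acquires stateLock then the process lock, while a thread already inside acceptResult() that falls into the cancel branch acquires the process lock then other tasks' stateLocks; the two opposite orders produce the circular wait shown in the thread dump.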
Affects Versions
0.7.1
What table formats are you seeing the problem on?
Iceberg
What engines are you seeing the problem on?
AMS
How to reproduce
No response
Relevant log output
Anything else
No response
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct