
[Bug]: TaskRuntime.stateLock and TableOptimizingProcess.lock can cause deadlocks.

Open • wardlican opened this issue 3 weeks ago • 0 comments

What happened?

TaskRuntime.stateLock and TableOptimizingProcess.lock can cause deadlocks. When this happens, optimizing for tables in the affected resource group stalls, and OptimizerKeeper also fails to function properly.

Thread1: Holding: 0x000000059a732858 Waiting: 0x000000059a739970

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:48)
at org.apache.amoro.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:148)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2143/1314391570.accept(Unknown Source)
at java.util.HashMap$Values.forEach(HashMap.java:982)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.cancelTasks(OptimizingQueue.java:768)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistAndSetCompleted$13(OptimizingQueue.java:748)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2139/44375492.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.optimizing.OptimizingQueue.access$500(OptimizingQueue.java:74)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistAndSetCompleted(OptimizingQueue.java:745)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:539)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059a79be88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059bbf71e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Thread2: Holding: 0x000000059a739970 Waiting: 0x000000059a732858

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:487)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a1e0228> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

The root cause of the deadlock

Nested lock acquisition leading to a circular wait

  • Thread 1 holds TableOptimizingProcess.lock (and its own task's TaskRuntime.stateLock) and is waiting for the stateLock of another TaskRuntime.
  • Thread 2 holds that TaskRuntime.stateLock and is waiting for TableOptimizingProcess.lock; a minimal sketch of this inversion follows the list.
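
To make the circular wait concrete, the following is a minimal, self-contained Java sketch. The class, thread, and field names are hypothetical stand-ins, not Amoro code; the two ReentrantLocks play the roles of TableOptimizingProcess.lock and TaskRuntime.stateLock, and the two threads take them in opposite order, ending up parked exactly as in the thread dumps above.

import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionDemo {
  // Hypothetical stand-ins for TableOptimizingProcess.lock and TaskRuntime.stateLock.
  static final ReentrantLock processLock = new ReentrantLock();
  static final ReentrantLock stateLock = new ReentrantLock();

  public static void main(String[] args) {
    // Thread A mirrors persistAndSetCompleted() -> cancelTasks() -> tryCanceling():
    // it takes the process lock first, then needs a task's state lock.
    Thread a = new Thread(() -> {
      processLock.lock();
      try {
        sleep(100);        // give thread B time to grab stateLock
        stateLock.lock();  // parks here forever once B holds stateLock
        stateLock.unlock();
      } finally {
        processLock.unlock();
      }
    }, "thread-a");

    // Thread B mirrors TaskRuntime.complete() -> acceptResult():
    // it takes its state lock first, then needs the process lock.
    Thread b = new Thread(() -> {
      stateLock.lock();
      try {
        sleep(100);
        processLock.lock();  // parks here forever once A holds processLock
        processLock.unlock();
      } finally {
        stateLock.unlock();
      }
    }, "thread-b");

    a.start();
    b.start();
    // Both threads end up in WAITING (parking) on the other's ReentrantLock,
    // which is the same picture as the two stack traces in this issue.
  }

  static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}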

Specific problem

In acceptResult(), persistAndSetCompleted(false) is called while TableOptimizingProcess.lock is held. That method runs cancelTasks() inside doAsTransaction(); cancelTasks() iterates over all tasks and calls TaskRuntime.tryCanceling() on each one, which must acquire that task's stateLock.

If another thread (Thread 2 in the dumps above) already holds the stateLock of one of those tasks and is itself waiting for TableOptimizingProcess.lock, the two threads deadlock.
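
A common way to break this kind of inversion is to enforce a single acquisition order for the two locks, or to avoid holding TableOptimizingProcess.lock while acquiring any TaskRuntime.stateLock. Purely as an illustration of the latter idea, under hypothetical class names and without claiming this is the actual Amoro code or a proposed patch:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: snapshot the tasks while holding the process lock,
// release it, and only then call into the per-task state locks, so the two
// locks are never nested on this path.
class ProcessSketch {
  private final ReentrantLock lock = new ReentrantLock();    // plays the role of TableOptimizingProcess.lock
  private final List<TaskSketch> tasks = new ArrayList<>();  // hypothetical task registry

  void persistAndSetCompleted(boolean succeed) {
    List<TaskSketch> toCancel;
    lock.lock();
    try {
      // persist the process state here (omitted in this sketch)
      toCancel = new ArrayList<>(tasks);
    } finally {
      lock.unlock();  // released before any per-task lock is touched
    }
    for (TaskSketch task : toCancel) {
      task.tryCanceling();  // now only the task's own stateLock is needed
    }
  }

  static class TaskSketch {
    private final ReentrantLock stateLock = new ReentrantLock();  // plays the role of TaskRuntime.stateLock

    void tryCanceling() {
      stateLock.lock();
      try {
        // mark the task as canceled (omitted)
      } finally {
        stateLock.unlock();
      }
    }
  }
}

Whether releasing the process lock earlier is safe for the surrounding transaction semantics would need to be checked against doAsTransaction().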

Affects Versions

0.7.1

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

AMS

How to reproduce

No response

Relevant log output


Anything else

No response

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

wardlican • Dec 09 '25 07:12