herddb
herddb copied to clipboard
BRIN page unload deadlock
BUG REPORT
- Please describe the issue you observed: I found tablespaces stuck on checkpoint procedure. Deep looking on logs and stacks I found that HerdDB was stuck in a deadlock case between a BRIN index value deletion and another BRIN index checkpoint: Checkpointing thread
"hdb-pecora-activator" #38 daemon prio=5 os_prio=0 cpu=6524841.27ms elapsed=88200.79s tid=0x00007f4b66096840 nid=0x1a1065 waiting on condition [0x00007f2be12fe000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00007f2e5b4fb468> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:700)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:730)
at herddb.index.brin.BlockRangeIndex.merge(BlockRangeIndex.java:795)
at herddb.index.brin.BlockRangeIndex.checkpoint(BlockRangeIndex.java:914)
at herddb.index.brin.BRINIndexManager.checkpoint(BRINIndexManager.java:317)
at herddb.core.TableManager.checkpoint(TableManager.java:3253)
at herddb.core.TableManager.checkpoint(TableManager.java:2670)
at herddb.core.TableSpaceManager.checkpoint(TableSpaceManager.java:2076)
at herddb.core.DBManager.checkpoint(DBManager.java:964)
at herddb.core.DBManager.executeActivator(DBManager.java:1276)
at herddb.core.DBManager.access$500(DBManager.java:124)
at herddb.core.DBManager$Activator.run(DBManager.java:1208)
at java.lang.Thread.run([email protected]/Thread.java:840)
Locked ownable synchronizers:
- <0x00007f2d51f48b20> (a java.util.concurrent.locks.ReentrantLock$FairSync)
- <0x00007f2d51f48b88> (a java.util.concurrent.locks.ReentrantLock$FairSync)
- <0x00007f2d51f48bf0> (a java.util.concurrent.locks.ReentrantLock$FairSync)
Another index value deletion thread:
"db-dmlcall-localhost:7000-3" #265 prio=5 os_prio=0 cpu=768760.12ms elapsed=88704.87s tid=0x00007f28f0052030 nid=0x1a1226 waiting on condition [0x00007f27881f8000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00007f2d51f48b88> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:700)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:730)
at herddb.index.brin.BlockRangeIndex$Block.ensureBlockLoaded(BlockRangeIndex.java:499)
at herddb.index.brin.BlockRangeIndex$Block.delete(BlockRangeIndex.java:380)
at herddb.index.brin.BlockRangeIndex.delete(BlockRangeIndex.java:1055)
at herddb.index.brin.BRINIndexManager.recordDeleted(BRINIndexManager.java:431)
at herddb.core.TableManager.applyDelete(TableManager.java:2159)
at herddb.core.TableManager.onTransactionCommit(TableManager.java:1948)
at herddb.core.TableSpaceManager.apply(TableSpaceManager.java:428)
at herddb.core.TableSpaceManager.lambda$commitTransaction$22(TableSpaceManager.java:2188)
at herddb.core.TableSpaceManager$$Lambda$1815/0x00007f2befd5f7e8.apply(Unknown Source)
at java.util.concurrent.CompletableFuture.uniHandle([email protected]/CompletableFuture.java:934)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire([email protected]/CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$Completion.run([email protected]/CompletableFuture.java:482)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run([email protected]/Thread.java:840)
Locked ownable synchronizers:
- <0x00007f2d56d06508> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x00007f2e5b4fb468> (a java.util.concurrent.locks.ReentrantLock$FairSync)
The deadlock is caused by chained unload/load on locked blocks.
I'm working on a patch to solve this, essentially we don't want to load/unload other nodes while keeping current node lock and during merge/checkpoint we don't need to fully load merging blocks in page replacement policy memory, the will be discarded at merge end