herddb icon indicating copy to clipboard operation
herddb copied to clipboard

BRIN page unload deadlock

Open diegosalvi opened this issue 5 months ago • 1 comments

BUG REPORT

  1. Please describe the issue you observed: I found tablespaces stuck on checkpoint procedure. Deep looking on logs and stacks I found that HerdDB was stuck in a deadlock case between a BRIN index value deletion and another BRIN index checkpoint: Checkpointing thread
"hdb-pecora-activator" #38 daemon prio=5 os_prio=0 cpu=6524841.27ms elapsed=88200.79s tid=0x00007f4b66096840 nid=0x1a1065 waiting on condition  [0x00007f2be12fe000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00007f2e5b4fb468> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
        at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:700)
        at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:730)
        at herddb.index.brin.BlockRangeIndex.merge(BlockRangeIndex.java:795)
        at herddb.index.brin.BlockRangeIndex.checkpoint(BlockRangeIndex.java:914)
        at herddb.index.brin.BRINIndexManager.checkpoint(BRINIndexManager.java:317)
        at herddb.core.TableManager.checkpoint(TableManager.java:3253)
        at herddb.core.TableManager.checkpoint(TableManager.java:2670)
        at herddb.core.TableSpaceManager.checkpoint(TableSpaceManager.java:2076)
        at herddb.core.DBManager.checkpoint(DBManager.java:964)
        at herddb.core.DBManager.executeActivator(DBManager.java:1276)
        at herddb.core.DBManager.access$500(DBManager.java:124)
        at herddb.core.DBManager$Activator.run(DBManager.java:1208)
        at java.lang.Thread.run([email protected]/Thread.java:840)

   Locked ownable synchronizers:
        - <0x00007f2d51f48b20> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        - <0x00007f2d51f48b88> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        - <0x00007f2d51f48bf0> (a java.util.concurrent.locks.ReentrantLock$FairSync)

Another index value deletion thread:

"db-dmlcall-localhost:7000-3" #265 prio=5 os_prio=0 cpu=768760.12ms elapsed=88704.87s tid=0x00007f28f0052030 nid=0x1a1226 waiting on condition  [0x00007f27881f8000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00007f2d51f48b88> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
        at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:700)
        at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:730)
        at herddb.index.brin.BlockRangeIndex$Block.ensureBlockLoaded(BlockRangeIndex.java:499)
        at herddb.index.brin.BlockRangeIndex$Block.delete(BlockRangeIndex.java:380)
        at herddb.index.brin.BlockRangeIndex.delete(BlockRangeIndex.java:1055)
        at herddb.index.brin.BRINIndexManager.recordDeleted(BRINIndexManager.java:431)
        at herddb.core.TableManager.applyDelete(TableManager.java:2159)
        at herddb.core.TableManager.onTransactionCommit(TableManager.java:1948)
        at herddb.core.TableSpaceManager.apply(TableSpaceManager.java:428)
        at herddb.core.TableSpaceManager.lambda$commitTransaction$22(TableSpaceManager.java:2188)
        at herddb.core.TableSpaceManager$$Lambda$1815/0x00007f2befd5f7e8.apply(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniHandle([email protected]/CompletableFuture.java:934)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire([email protected]/CompletableFuture.java:911)
        at java.util.concurrent.CompletableFuture$Completion.run([email protected]/CompletableFuture.java:482)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run([email protected]/Thread.java:840)

   Locked ownable synchronizers:
        - <0x00007f2d56d06508> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x00007f2e5b4fb468> (a java.util.concurrent.locks.ReentrantLock$FairSync)

diegosalvi avatar Jan 31 '24 17:01 diegosalvi

The deadlock is caused by chained unload/load on locked blocks.

I'm working on a patch to solve this, essentially we don't want to load/unload other nodes while keeping current node lock and during merge/checkpoint we don't need to fully load merging blocks in page replacement policy memory, the will be discarded at merge end

diegosalvi avatar Jan 31 '24 17:01 diegosalvi