
Deadlock occurred when entryLogPerLedgerEnabled is true

Status: Open. ytong01 opened this issue 2 years ago (4 comments).

BUG REPORT

Recently I was benchmarking a bookie with entryLogPerLedgerEnabled = true and found that the process deadlocked and could no longer serve requests. Here is the thread dump captured when the error occurred:

[Thread dump screenshots]

BookKeeper version: 4.15.0

Here is our customized configuration:

  ledgerStorageClass=org.apache.bookkeeper.bookie.SortedLedgerStorage
  entryLogPerLedgerEnabled=true
  maximumNumberOfActiveEntryLogs=10000

It seems to be caused by lock acquisition while EntryMemTableParallelFlusher is running. The deadlock can happen as follows:

  1. The memtable reaches its size limit.
  2. Ledgers A and B start flushing asynchronously, and each flush thread successfully acquires its ledger's lock.
  3. When a thread enters getCurrentLogWithDirInfoForLedger, it calls ledgerIdEntryLogMap#get(ledgerId). If ledgerIdEntryLogMap is over capacity, this triggers cleanup of ledgers A and B, which invokes onCacheEntryRemoval; that method tries to acquire the evicted ledger's lock. If thread A (the thread flushing ledger A) happens to clean up ledger B while thread B cleans up ledger A, each thread waits for the lock the other one holds, and they deadlock.
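The cycle in steps 2 and 3 is a classic lock-order inversion. Below is a minimal, self-contained model of it, not BookKeeper code: two `ReentrantLock`s stand in for the per-ledger locks held during flush, and the hypothetical `flushThenEvictInline` method plays a flush thread whose inline eviction callback (standing in for `onCacheEntryRemoval`) needs the other ledger's lock. Latches force the bad interleaving, and `tryLock` with a timeout replaces `lock()` so the program reports the deadlock instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class InlineEvictionDeadlockDemo {

    /**
     * Models steps 2 and 3: each flush thread holds its own ledger lock, then the
     * (hypothetical) inline eviction callback needs the OTHER ledger's lock.
     * Returns how many threads could not get the second lock; 2 means a deadlock
     * would have occurred had lock() been used instead of tryLock().
     */
    static int runInlineEviction() throws InterruptedException {
        ReentrantLock lockA = new ReentrantLock(); // stands in for ledger A's lock
        ReentrantLock lockB = new ReentrantLock(); // stands in for ledger B's lock
        CountDownLatch allHoldOwn = new CountDownLatch(2);
        CountDownLatch allTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        Thread a = new Thread(() -> flushThenEvictInline(lockA, lockB, allHoldOwn, allTried, blocked));
        Thread b = new Thread(() -> flushThenEvictInline(lockB, lockA, allHoldOwn, allTried, blocked));
        a.start();
        b.start();
        a.join();
        b.join();
        return blocked.get();
    }

    static void flushThenEvictInline(ReentrantLock own, ReentrantLock other,
                                     CountDownLatch allHoldOwn, CountDownLatch allTried,
                                     AtomicInteger blocked) {
        own.lock(); // step 2: the flush of this ledger holds its lock
        try {
            allHoldOwn.countDown();
            allHoldOwn.await(); // force the bad interleaving: both ledger locks are now held
            // step 3: eviction of the other ledger runs inline on this same thread
            if (other.tryLock(200, TimeUnit.MILLISECONDS)) {
                other.unlock();
            } else {
                blocked.incrementAndGet(); // with lock() this thread would wait forever
            }
            allTried.countDown();
            allTried.await(); // keep our lock until the peer has also tried
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            own.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads stuck on the other ledger's lock: " + runInlineEviction());
    }
}
```

Because each thread keeps its own lock until both have attempted the other one, neither `tryLock` can succeed, so `main` prints `threads stuck on the other ledger's lock: 2`.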

ytong01, Feb 14 '23 10:02

I don't know whether many people use entry log per ledger, so your best bet is to fix this and contribute the fix. @Ghatage @jvrao @reddycharan might be aware of issues in ELPL (entry log per ledger) or may have fixes in their private repos.

dlg99, Feb 15 '23 19:02

@dlg99 Yes, the issue is that the onRemoval listener runs in the addEntry thread's context, and it deadlocks while trying to acquire the lock for another ledger during the cleanup process.

The fix we introduced (very recently) is to offload the onRemoval task to a dedicated thread pool.
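The offloading fix described above breaks the cycle because a flush thread no longer waits for another ledger's lock while holding its own. A minimal sketch under the same simplified model (not the actual BookKeeper patch; `flush` and `cleanupEvictedLedger` are hypothetical names): the eviction callback only submits a cleanup task to a dedicated single-thread executor, and the executor thread takes the evicted ledger's lock while holding no other ledger lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class OffloadedEvictionDemo {

    /** Hypothetical cleanup for an evicted ledger; it needs that ledger's lock. */
    static void cleanupEvictedLedger(ReentrantLock evictedLedgerLock, AtomicInteger cleaned) {
        evictedLedgerLock.lock();
        try {
            cleaned.incrementAndGet(); // placeholder for closing/rotating the ledger's entry log
        } finally {
            evictedLedgerLock.unlock();
        }
    }

    /** Same worst-case interleaving, but eviction cleanup runs on a dedicated executor. */
    static int runOffloadedEviction() throws InterruptedException {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        ExecutorService cleanupExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch allHoldOwn = new CountDownLatch(2);
        AtomicInteger cleaned = new AtomicInteger();

        Thread a = new Thread(() -> flush(lockA, lockB, allHoldOwn, cleanupExecutor, cleaned));
        Thread b = new Thread(() -> flush(lockB, lockA, allHoldOwn, cleanupExecutor, cleaned));
        a.start();
        b.start();
        a.join(); // returns promptly: flush threads never wait on each other's lock
        b.join();

        cleanupExecutor.shutdown();
        cleanupExecutor.awaitTermination(5, TimeUnit.SECONDS);
        return cleaned.get();
    }

    static void flush(ReentrantLock own, ReentrantLock other, CountDownLatch allHoldOwn,
                      ExecutorService cleanupExecutor, AtomicInteger cleaned) {
        own.lock();
        try {
            allHoldOwn.countDown();
            allHoldOwn.await(); // both flush threads now hold their own ledger lock
            // Eviction of the other ledger: hand the cleanup off instead of locking inline.
            cleanupExecutor.execute(() -> cleanupEvictedLedger(other, cleaned));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            own.unlock(); // released right away, so the queued cleanup can proceed
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("evicted ledgers cleaned up: " + runOffloadedEviction());
    }
}
```

With the cleanup offloaded, `main` prints `evicted ledgers cleaned up: 2`. The trade-off is that cleanup becomes asynchronous, so the cleanup task has to tolerate running after the flush thread has moved on. If `ledgerIdEntryLogMap` is a Guava cache (an assumption here), Guava's `RemovalListeners.asynchronous(listener, executor)` wraps an existing removal listener so that it runs on a given executor. The thread does not say whether the fix used it.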

We can contribute the fix to the community.

reddycharan, Feb 15 '23 19:02

@reddycharan Thanks. We fixed the problem the same way, by submitting the onRemoval task to a dedicated thread pool. I'll open a PR to the community soon.

ytong01, Feb 16 '23 03:02

@ytong01 Are you still willing to work on this?

hezhangjian, Jun 09 '24 03:06