firebird icon indicating copy to clipboard operation
firebird copied to clipboard

Сrash when deleting a statement from the cache

Open dmitry-starodubov opened this issue 1 year ago • 8 comments

We have several crashes of 5.0 with similar symptoms (and stacks). The stacks of the two threads are attached. In the first one, DsqlStatementCache::shrink() is called during statement release. It removes the statement from the inactiveStatementList. In the second thread transaction is committed (from the same attachment). This calls DsqlStatementCache::purgeAllAttachments(), which via AST calls DsqlStatementCache::purge(), which clears the inactiveStatementList too. As a result, both threads have DsqlStatementCache::StatementEntry::~StatementEntry() in their stacks at the same time. There is a call of MET_index_pagespace (part of the TableSpaces feature), which is not present in Firebird. I suspect it works here as a catalyst, because it increases the execution time of DsqlStatementCache::purge and increases the chances of a race in the list.

dmitry-starodubov avatar Sep 06 '24 11:09 dmitry-starodubov

@asfernandes , any chance to look at this issue sooner rather than later?

dyemanov avatar Oct 14 '24 06:10 dyemanov

A test case would make easy to identify the problem and would be fundamental to validate a fix.

asfernandes avatar Oct 14 '24 10:10 asfernandes

How can both threads of the same attachment be active in the engine at the same time? Shouldn't one be blocked in the attachment lock of EngineContextHolder?

asfernandes avatar Oct 14 '24 10:10 asfernandes

The 2nd thread handle AST and uses other attachment's context, then it calls MET_index_pagespace() that call JRD_reschedule() that checks out from engine, releasing attachment's lock and allowing 1st thread to access attachment which internal structure (inactiveStatementList) is in "interesting", not consistent state.

The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().

hvlad avatar Oct 14 '24 11:10 hvlad

The most suspicious as for me is the call of MET_index_pagespace() (that run some query) from Statement::release().

And AFAIK, it's not valid to do it from AST handler.

asfernandes avatar Oct 15 '24 00:10 asfernandes

Even without query execution, AST may call LCK_release/LCK_convert that also checks out from the engine and thus may potentially cause races. We simply cannot guarantee that AST cannot interfere with the main attachment, it's just about higher or lower probability. If the current code is fragile about that, I'd say this deserves fixing anyway.

dyemanov avatar Oct 21 '24 08:10 dyemanov

Lock manager checks out from the engine in wait_for_request, blocking_action and shutdownOwner. The only concern here is wait_for_request - as blocking_action is called by wait_for_request and shutdownOwner is called when attachment is released.

LCK_release doesn't call wait_for_request thus its never checks out from the engine.

LCK_convert usually should not wait when converts to the lower lock level, also it not wait when called with LCK_NO_WAIT. It is easy to control, IMO. Am I wrong ?

And I still sure that AST routine should avoid calling expensive routines, such as query execution. Didn't check if Firebird already have such code, though ;)

hvlad avatar Oct 21 '24 08:10 hvlad

Seems you're right, I was more pessimistic about the LM calls than necessary ;-)

And I still sure that AST routine should avoid calling expensive routines, such as query execution.

I'm not arguing with that, just thought it was not the only possible reason for a checkout.

dyemanov avatar Oct 21 '24 09:10 dyemanov

Solved on our side, sorry for bothering.

dyemanov avatar Nov 29 '24 06:11 dyemanov