DSP-24330 QueryProcessor evictPrepared deadlock
Concurrent computation of a value and invalidation of the same key can result in 'undefined' behavior in caffeine. We have observed occasional deadlocks.
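To make the race concrete, here is a minimal sketch of the pattern described above (class and method names are illustrative, not from the patch): one thread computes a value for a key through caffeine's loading get() while another thread invalidates the same key.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class ComputeInvalidateRace
{
    public static void main(String[] args) throws InterruptedException
    {
        Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1024).build();

        // Thread A: computes the value for "key"; the loader stands in for a slow
        // computation such as a Jamm size measurement.
        Thread computer = new Thread(() -> cache.get("key", ComputeInvalidateRace::slowLoad));

        // Thread B: concurrently invalidates the same key while the computation runs.
        Thread invalidator = new Thread(() -> cache.invalidate("key"));

        computer.start();
        invalidator.start();
        computer.join();
        invalidator.join();
    }

    private static String slowLoad(String key)
    {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return key + "-value";
    }
}
```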
Dtests seem broken on Jenkins atm. CI failures seem to align with the last successful run used as a baseline.
The extra Map operations should be a drop in the ocean given that statements are being removed from/added to C* tables. I am running some local perf tests regardless.
A side note: won't we exhibit the same kind of problem with invalidateAll()?
I left this one alone as:
A. invalidateAll() lies in an unused code path atm
B. We'd have to iterate over each element and invalidate it individually, where that is 'undefined' as well.
C. Or lock the whole caffeine cache, maybe with a poison pill or similar, then call invalidateAll() and release the lock. That effectively blocks all threads for the duration of the operation (a rough sketch follows below).
I didn't do it as it's unused, but I can add it if we think that is better. We have the same problem in clearPreparedStatementsCache() and a similar one in preloadPreparedStatements() with a computed get().
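For reference, a purely illustrative sketch of what option C could look like (none of these names are from the actual code): regular cache operations take a shared lock and invalidateAll() takes the exclusive one, so everything is blocked for the duration of the wipe.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Hypothetical sketch of option C: serialise invalidateAll() against every other cache
// operation with a read/write lock. Not part of the patch; it blocks all readers and
// writers while the wipe runs.
public class GuardedCacheSketch
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1024).build();

    public String get(String key)
    {
        lock.readLock().lock();
        try { return cache.getIfPresent(key); }
        finally { lock.readLock().unlock(); }
    }

    public void put(String key, String value)
    {
        lock.readLock().lock();
        try { cache.put(key, value); }
        finally { lock.readLock().unlock(); }
    }

    public void invalidateAll()
    {
        lock.writeLock().lock(); // nothing else can touch the cache until the wipe finishes
        try { cache.invalidateAll(); }
        finally { lock.writeLock().unlock(); }
    }
}
```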
@pkolaczk a quick review for a second opinion here would be appreciated if you have a gap.
^I'll ping you when we're both online to see how you connected the dots, as I didn't see that!
So we talked about this and cleared the smoke a bit:
- Blocked threads I noticed, such as 0x000000058b679aa8, are blocked waiting on nothing. We are going to assume that is some weird transient JVM scheduler state while the thread is transitioning to some other state, and assume caffeine is good and not blocking.
- The Jamm computations, such as the one holding lock 0x00000005801932d8 and the like, hold back a number of threads waiting on the computation to finish. There are multiple of those, probably blocking on the HashMap striped locking. The idea is to move the computation outside caffeine: maybe precompute the size and make the caffeine key a Pair<MD5, size> to reduce collisions from stripe to hash/size pairs only (a rough sketch of such a key follows below).
- Talk with Ekaterina, as she's familiar with Jamm and J11, to see if she can spot something obvious here regarding Jamm.
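Purely as an illustration of the Pair<MD5, size> idea above (the class and field names are made up and are not the actual QueryProcessor types): a key that carries the precomputed size alongside the digest, so both feed into equals()/hashCode().

```java
import java.util.Arrays;

// Illustrative only: not the real QueryProcessor key type.
public final class SizedStatementId
{
    private final byte[] md5;           // the statement's MD5 digest
    private final long precomputedSize; // measured once, outside caffeine

    public SizedStatementId(byte[] md5, long precomputedSize)
    {
        this.md5 = md5;
        this.precomputedSize = precomputedSize;
    }

    public long precomputedSize()
    {
        return precomputedSize;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o)
            return true;
        if (!(o instanceof SizedStatementId))
            return false;
        SizedStatementId other = (SizedStatementId) o;
        return precomputedSize == other.precomputedSize && Arrays.equals(md5, other.md5);
    }

    @Override
    public int hashCode()
    {
        // both the digest and the precomputed size feed the hash used for stripe selection
        return 31 * Arrays.hashCode(md5) + Long.hashCode(precomputedSize);
    }
}
```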
The last commit rolls back the deadlock fix I had introduced and adds precomputed prepared statement sizes. If we like it we can send it to see if it passes the deadlock repro.
CI for run #3 is aligned +/- with run #2, so lgtm.
we shifted it from doing N units of work every Nth insertion to doing a unit of work (computation) on every insertion
Previously we were doing the same computation twice: in storePreparedStatement() to validate input, and then again in caffeine upon insertion.
Now we're only doing it once. We've shed all the extra computations inside caffeine, which should be a net benefit and reduce contention on caffeine.
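A hedged sketch of what "doing it once" can look like with caffeine (the type and method names are placeholders, not the actual QueryProcessor code): the size is measured before the put, and the weigher only reads the stored field instead of invoking Jamm inside caffeine.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class SingleMeasurementSketch
{
    // Value type carrying the size measured once, before the cache is touched.
    static final class Prepared
    {
        final String query;
        final long size;

        Prepared(String query, long size)
        {
            this.query = query;
            this.size = size;
        }
    }

    static final Cache<String, Prepared> CACHE =
        Caffeine.newBuilder()
                .maximumWeight(64L * 1024 * 1024)
                // No measurement here: the weigher only reads the precomputed field.
                .weigher((String id, Prepared p) -> (int) Math.min(p.size, Integer.MAX_VALUE))
                .build();

    static void storePreparedStatement(String id, String query)
    {
        long size = measureDeep(query); // single measurement, outside caffeine's locks
        CACHE.put(id, new Prepared(query, size));
    }

    // Stand-in for jamm's MemoryMeter.measureDeep(); the real value would come from Jamm.
    static long measureDeep(Object o)
    {
        return 128;
    }
}
```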
If so, it could explain why a long-running Jamm computation results in blocking other threads on the same key.
Yes, there were 90-ish threads waiting for Jamm to complete.
Finally, based on the current hypothesis, the issue doesn't seem to be a deadlock but more likely a very slow computation due to high contention, am I right?
There are 2 issues: 70-ish threads weirdly blocked, and a different 90-ish threads waiting on Jamm to complete. The hypothesis is that a slow Jamm caused so much thread contention that the JVM's scheduler just left those first 70-ish BLOCKED. The hope is that reducing the contention and the extra computations should streamline everything.
Previously we were doing the same computation twice: in storePreparedStatement() to validate input, and then again in caffeine upon insertion.
Right, I had forgotten about the validation call; it's indeed less work then, which is good.
Rebased and squashed. @pkolaczk would you be so kind as to complete your review, please? I intend to merge this today/tomorrow. Thx.
Quality Gate passed
Issues
- 2 New issues
- 0 Accepted issues
Measures
- 0 Security Hotspots
- 86.0% Coverage on New Code
- 0.0% Duplication on New Code