cockroach
cockroach copied to clipboard
concurrency: mark some unreplicated locks as ineligible for export
Once a lock has been reported as missing via QueryIntent, it is important that it stays missing. As described in #145458, if a lock that has been previously reported as missing is somehow then found via a subsequent QueryIntent request, this can result in a transaction being erroneously committed by the transaction recovery process.
When unreplicated lock reliability is enabled, Lease and Merge use ExportUnreplicatedLocks to move unreplicated locks from the lock table into durable storage. However, these locks may include unreplicated locks that "cover" a replicated lock that was previously reported as missing, exactly what we must avoid.
Here, we solve that by extending QueryIntent's evaluation to mark any locks it reports as missing as ineligible for export.
Fixes #145458
Release note: None
@miraradeva @arulajmani I think we are leaning away from this towards a more fundamental fix. But, I figured I'd put this up in a slightly cleaner form just so we could consider it since, I think something like this might still be required along some of the implementation paths we've discussed.
Fixed up the race condition in the test here.
But I wonder if it's possible for the marking-as-ineligible mechanism to race with the lock-export mechanism. E.g. QueryIntent finds a missing intent and is just about to mark the lock in the lock table, while a lease transfer kicks off a lock export. If it's not possible, we should mention it somewhere.
Good question. What I think makes this safe is that both a lease transfer and subsume acquire non-MVCC write latches over the entire range's span, so when they run they really are running alone.
I wonder if we could even assert that this is the case somewhere?
But I wonder if it's possible for the marking-as-ineligible mechanism to race with the lock-export mechanism. E.g. QueryIntent finds a missing intent and is just about to mark the lock in the lock table, while a lease transfer kicks off a lock export. If it's not possible, we should mention it somewhere.
Good question. What I think makes this safe is that both a lease transfer and subsume acquire non-MVCC write latches over the entire range's span, so when they run they really are running alone.
I wonder if we could even assert that this is the case somewhere?
Yeah, I see. And QueryIntent, which kicks off the marking-as-ineligible mechanism, declares non-MVCC keys on the intents it's querying. It seems safe.
@arulajmani Not sure if you wanted a second look at this. But I'll probably merge this after a rebase tomorrow. If you do spot any follow-ups post merge, let me know.
bors r=miraradeva
The extended CI failure is: https://github.com/cockroachdb/cockroach/issues/148582 The roachtest failure is: https://github.com/cockroachdb/cockroach/issues/148586
bors retry
The universe doesn't want this fix. And in some sense, the universe is correct.
Internal build system failure:
ERROR: The Build Event Protocol upload failed: All 8 retry attempts failed. INTERNAL: INTERNAL: Invocation [tenant_name=default invocation_id=751d3de7-b36f-4ac8-a48e-b5bd813ea02d]: failed to write batch 1
bors retry