kv: Handle invalid lease transfer checks
Previously, it was treated as a programming error if, when checking whether a leaseholder move was valid, the leaseholder was not one of the existing replicas. This check was overly strict.
Release note: None
The change here seems good, but I'd still like to understand why we're suddenly seeing this log message. Someone mentioned that 3fe8ffc revealed the log message, but this was a log.Errorf even before that commit.
Do we know where this function is being called from? Is it replicateQueue.process -> replicateQueue.shedLease? If so, could this have been caused by 79886bb? We now call replicateQueue.shedLease even when processOneChange returns false /* requeue */. Does that mean that we call shedLease even after we've removed the outgoing leaseholder from the range?
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go
line 1644 at r1 (raw file):
// lease since we don't know who the leaseholder is. This normally doesn't
// happen, but can occasionally since the loading of the leaseholder and of
// the existing replicas aren't always under a consistent lock.
Do you mind spelling out an example of how/where this is happening? (not necessarily as an addition to this comment, just for my understanding)
Code quote:
existing replicas aren't always under a consistent lock.
@nvanbenschoten You are correct that we call shedLease more than we used to after https://github.com/cockroachdb/cockroach/commit/79886bb56a388b41c3dca327ba0d470d3b79b850.
There is a check that occurs before we call it: checkLeaseRespectsPreferences, which verifies whether the lease respects the leaseholder preferences. That check always returns false, err if we are not the leaseholder. However, the calling method canTransferLeaseFrom returns true anyway unless the lease has recently moved. That said, if we are calling this method at all, it means we were previously both a replica and the leaseholder; if we are no longer a replica, we can't be the leaseholder either, which should (?) mean the lease recently moved.
This is a somewhat convoluted way of verifying this though, and I might be wrong about some of it. It would be better to change canTransferLeaseFrom to return false if checkLeaseRespectsPreferences returns an error.
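Roughly what I have in mind, as a minimal sketch with simplified, hypothetical signatures rather than the real replicateQueue/allocator code: bail out when the preference check can't even identify us as the leaseholder, instead of falling through to the "has the lease moved recently?" heuristic.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeaseholder stands in for the error returned when the local
// replica is no longer the leaseholder (hypothetical).
var errNotLeaseholder = errors.New("local store is not the leaseholder")

// checkLeaseRespectsPreferences is a stand-in for the real method; it
// errors out when the caller is not the leaseholder.
func checkLeaseRespectsPreferences(isLeaseholder bool) (bool, error) {
	if !isLeaseholder {
		return false, errNotLeaseholder
	}
	return true, nil
}

// canTransferLeaseFrom sketches the suggested behavior: return false on
// error instead of only consulting the lease-transfer cooldown.
func canTransferLeaseFrom(isLeaseholder, leaseRecentlyMoved bool) bool {
	if _, err := checkLeaseRespectsPreferences(isLeaseholder); err != nil {
		return false // previously we would still have returned true here
	}
	return !leaseRecentlyMoved
}

func main() {
	fmt.Println(canTransferLeaseFrom(false /* isLeaseholder */, false /* leaseRecentlyMoved */))
}
```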
@aayushshah15 what I meant by the lack of locking is that inside shedLease we have a replica and a pointer to a rangeDescriptor (see this comment). My understanding is that, because we hold both of these structures, leaseholderShouldMoveDueToPreferences ends up checking the current repl against the "cached" rangeDescriptor/ReplicaDescriptor; if "something else" moves leases or replicas under us, that check can fail. As an example, the storeRebalancer could decide to move our lease in between the calls here and here. I'm not 100% sure this is what happens, but it seems to only occur when both the allocator and the store rebalancer are running at the same time.
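To illustrate the shape of the race, here is a hedged sketch with hypothetical, simplified types (not the real kvserver ones, and with the actual preference checks elided): the leaseholder and the descriptor are loaded at different times, so by the time they are compared, the descriptor may no longer contain the presumed leaseholder.

```go
package main

import "fmt"

type replicaDescriptor struct{ storeID int }

type rangeDescriptor struct{ replicas []replicaDescriptor }

// leaseholderShouldMoveDueToPreferences sketches the defensive handling:
// if the presumed leaseholder is no longer in the (possibly stale)
// descriptor, skip the check rather than treating it as a programming error.
func leaseholderShouldMoveDueToPreferences(leaseholderStoreID int, desc rangeDescriptor) bool {
	found := false
	for _, r := range desc.replicas {
		if r.storeID == leaseholderStoreID {
			found = true
			break
		}
	}
	if !found {
		// The leaseholder and the descriptor were not read under one lock,
		// so another actor may have moved things under us.
		fmt.Println("leaseholder not among existing replicas; skipping transfer check")
		return false
	}
	// ... the real preference checks would go here ...
	return false
}

func main() {
	desc := rangeDescriptor{replicas: []replicaDescriptor{{storeID: 2}, {storeID: 3}}}
	// Store 1 thinks it holds the lease, but a concurrent change already
	// removed it from the descriptor.
	fmt.Println(leaseholderShouldMoveDueToPreferences(1, desc))
}
```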
Do we know where this function is being called from? Is it replicateQueue.process -> replicateQueue.shedLease? If so, could this have been caused by 79886bb? We now call replicateQueue.shedLease even when processOneChange returns false /* requeue */. Does that mean that we call shedLease even after we've removed the outgoing leaseholder from the range?
This is also showing up originating from the store rebalancer in the transfer-leases phase, although I have only seen it when a store is pushed to the max. In that case it occurs following a failed lease transfer, due to not actually having the lease to transfer in the first place.
I think @andrewbaptist's description of another actor moving the lease, with undefined ordering relative to this call, makes sense, but it seems odd for it to be so frequent all of a sudden.
@andrewbaptist could you explain why we lifted the call to shedLease from considerRebalance to process? Something feels off to me about not taking the requeue = false status into account and continuing with the lease transfer attempt even when we know that the replica has just been removed from the range.
We could add a replica-in-descriptor check to canTransferLeaseFrom or shedLease, but even that feels wrong. The local replica may have removed itself from the range without having yet applied the removal to its local state.
Instead, I don't think we should be attempting to do any more work if requeue = false.
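Something along these lines, as a sketch only; the names and signatures are simplified stand-ins for the real replicateQueue methods, not the actual code:

```go
package main

import "fmt"

// processOneChange and shedLease are hypothetical stand-ins; only the
// control flow matters for this sketch.
func processOneChange() (requeue bool, err error) {
	// Pretend we just removed the local (outgoing) leaseholder replica.
	return false, nil
}

func shedLease() {
	fmt.Println("attempting lease transfer")
}

// process sketches the suggestion: only consider shedding the lease when
// processOneChange asked for a requeue, i.e. when the local replica was
// not just removed from the range.
func process() error {
	requeue, err := processOneChange()
	if err != nil {
		return err
	}
	if !requeue {
		// The local replica may have been removed; don't attempt any more
		// work (in particular, no lease transfer) on this pass.
		return nil
	}
	shedLease()
	return nil
}

func main() {
	_ = process()
}
```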
@nvanbenschoten The call needed to be lifted because we don't always call considerRebalance anymore when processing a replica. What we want to do is run the "important" action and, once that is done, queue up a considerRebalance at its default, low priority. Without lifting that call, leases can become stranded for long periods in the wrong place, since there is a lot of work to do after a node failure, node addition, node decommission, etc. The leaseholder-move check should be performed "between" process runs since we don't want to wait that long. There was a specific test related to a field defect that verified this behavior, which is why this was needed (I unfortunately can't easily find it).
However, I realize now that I didn't understand well enough what requeue meant. I had wrongly assumed that there could be operations that returned requeue=false but where the store is still the leaseholder. Looking more carefully through the cases, I don't think that is true anywhere. So it is probably better to change the check to verify against requeue. It might also make sense to change the name of this field to lhBeingRemoved, but I don't know if I should do that as part of this change. Thoughts?
So it is probably better to change the check to verify against requeue. It might make sense to also change the name of this field to lhBeingRemoved but I don't know if I should do that as part of this change. Thoughts?
It looks like there are cases where requeue=false but the replica has not been removed from the range, so I'm not sure about the rename:
https://github.com/cockroachdb/cockroach/blob/bf678225946b9fe6682d68a95a67ee70e5103da6/pkg/kv/kvserver/replicate_queue.go#L871
But changing the check to verify against requeue and not calling canTransferLeaseFrom if it is false SGTM.
Should we mark this as resolving https://github.com/cockroachdb/cockroach/issues/87146?
bors r=nvanbenschoten
Canceled.
bors r+
@andrewbaptist do we want to backport this to release-22.2?
Yes it would be good to backport it - is it just a matter of adding the backport label to this? I'm not exactly sure of the process.
blathers backport 22.2
Encountered an error creating backports. Some common things that can go wrong:
- The backport branch might have already existed.
- There was a merge conflict.
- The backport branch contained merge commits.
You might need to create your backport manually using the backport tool.
error creating merge commit from bfbf3c7fe4c13b0808afd0738f0ea494c1060d60 to blathers/backport-release-22.2-87652: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2 failed. See errors above.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
It looks like there was a merge conflict with the auto-backport, so we'll want to use the backport tool. Here's additional documentation that describes how to deal with conflicts: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/900005932/Backporting+a+change+to+a+release+branch.