kv: Handle invalid lease transfer checks

Open andrewbaptist opened this issue 2 years ago • 11 comments

Previously, it was treated as a programming error if, when checking whether a leaseholder move was valid, the leaseholder wasn't one of the existing replicas. This check was overly strict.
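
A minimal sketch of the shape of this change, using hypothetical stand-in types rather than the real allocator code:

```go
package main

import (
	"fmt"
	"log"
)

// ReplicaDescriptor is a hypothetical stand-in for the real kvserver type.
type ReplicaDescriptor struct {
	StoreID int
}

// validLeaseTransfer sketches the relaxed check: a leaseholder that is
// missing from the replica set now makes the check return false instead of
// being treated as a fatal programming error.
func validLeaseTransfer(leaseholder ReplicaDescriptor, existing []ReplicaDescriptor) bool {
	for _, repl := range existing {
		if repl.StoreID == leaseholder.StoreID {
			return true
		}
	}
	// Before this change: an assertion failure / programming error.
	// After: log and report the transfer as invalid.
	log.Printf("leaseholder s%d not among existing replicas; treating transfer as invalid",
		leaseholder.StoreID)
	return false
}

func main() {
	existing := []ReplicaDescriptor{{StoreID: 1}, {StoreID: 2}}
	fmt.Println(validLeaseTransfer(ReplicaDescriptor{StoreID: 3}, existing)) // false
}
```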

Release note: None

andrewbaptist avatar Sep 08 '22 22:09 andrewbaptist

This change is Reviewable

cockroach-teamcity avatar Sep 08 '22 22:09 cockroach-teamcity

The change here seems good, but I'd still like to understand why we're suddenly seeing this log message. Someone mentioned that 3fe8ffc revealed the log message, but this was a log.Errorf even before that commit.

Do we know where this function is being called from? Is it replicateQueue.process -> replicateQueue.shedLease? If so, could this have been caused by 79886bb? We now call replicateQueue.shedLease even when processOneChange returns false /* requeue */. Does that mean that we call shedLease even after we've removed the outgoing leaseholder from the range?

nvanbenschoten avatar Sep 09 '22 17:09 nvanbenschoten

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1644 at r1 (raw file):

	// lease since we don't know who the leaseholder is. This normally doesn't
	// happen, but can occasionally since the loading of the leaseholder and of
	// the existing replicas aren't always under a consistent lock.

Do you mind spelling out an example of how/where this is happening? (not necessarily as an addition to this comment, just for my understanding)

Code quote:

existing replicas aren't always under a consistent lock.

aayushshah15 avatar Sep 09 '22 17:09 aayushshah15

@nvanbenschoten You are correct that we call shedLease more than we used to after https://github.com/cockroachdb/cockroach/commit/79886bb56a388b41c3dca327ba0d470d3b79b850.

There is a check that occurs before we call it: checkLeaseRespectsPreferences, which asks whether the lease respects the leaseholder preferences. This is always going to return (false, err) if we are not the leaseholder. However, the calling method canTransferLeaseFrom returns true anyway unless the lease has recently moved. If we are calling this method, it means we previously were both a replica and the leaseholder; if we are no longer a replica, we can't be the leaseholder either, which should (?) mean the lease recently moved.

This is a somewhat convoluted way of verifying this though, and I might be wrong about some of this. It would be better to change canTransferLeaseFrom to return false if checkLeaseRespectsPreferences returns an error.
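
A minimal sketch of that suggested change, again with hypothetical stand-ins rather than the real signatures:

```go
package kvsketch

import (
	"context"
	"errors"
)

// replica is a hypothetical stand-in; only the control flow is the point here.
type replica struct{ isLeaseholder bool }

var errNotLeaseholder = errors.New("not the leaseholder")

// checkLeaseRespectsPreferences returns (false, err) when this store is not
// the leaseholder, mirroring the behavior described above.
func checkLeaseRespectsPreferences(ctx context.Context, r *replica) (bool, error) {
	if !r.isLeaseholder {
		return false, errNotLeaseholder
	}
	// ... preference checks elided ...
	return true, nil
}

// canTransferLeaseFrom with the suggested fix: bail out when the preference
// check errors, instead of falling through and returning true anyway.
func canTransferLeaseFrom(ctx context.Context, r *replica) bool {
	if _, err := checkLeaseRespectsPreferences(ctx, r); err != nil {
		return false
	}
	// ... remaining checks (e.g. whether the lease recently moved) elided ...
	return true
}
```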

@aayushshah15 what I meant by the lack of locking is that inside shedLease we have a replica and a pointer to a rangeDescriptor (see this comment). My understanding is that leaseholderShouldMoveDueToPreferences checks the current repl against the "cached" rangeDescriptor/ReplicaDescriptor, and if "something else" moves leases or replicas under us, this check can fail. As an example, the storeRebalancer could decide to move our lease in between the calls here and here. I'm not 100% sure this is what happens, but it seems to only occur when both the allocator and the store rebalancer are running at the same time.
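
To make the hazard concrete, here is a minimal sketch (all names hypothetical) of the two-snapshot pattern, where the reads happen under separate lock acquisitions:

```go
package kvsketch

import "sync"

// rangeState is a hypothetical stand-in: the descriptor and the lease are
// read under separate lock acquisitions, so another actor (here, the
// storeRebalancer) can move the lease between the two reads.
type rangeState struct {
	mu          sync.Mutex
	replicas    []int // store IDs in the range descriptor
	leaseholder int   // store ID holding the lease
}

func (rs *rangeState) descriptorSnapshot() []int {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	return append([]int(nil), rs.replicas...)
}

func (rs *rangeState) leaseholderSnapshot() int {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	return rs.leaseholder
}

// leaseholderInDescriptor can observe an inconsistent pair: a concurrent
// lease transfer or replica removal can land between the two snapshots,
// producing the "leaseholder not in replica set" condition this PR tolerates.
func leaseholderInDescriptor(rs *rangeState) bool {
	desc := rs.descriptorSnapshot()
	// <-- a storeRebalancer lease transfer can occur right here -->
	lh := rs.leaseholderSnapshot()
	for _, storeID := range desc {
		if storeID == lh {
			return true
		}
	}
	return false
}
```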

andrewbaptist avatar Sep 09 '22 18:09 andrewbaptist

Do we know where this function is being called from? Is it replicateQueue.process -> replicateQueue.shedLease? If so, could this have been caused by 79886bb? We now call replicateQueue.shedLease even when processOneChange returns false /* requeue */. Does that mean that we call shedLease even after we've removed the outgoing leaseholder from the range?

This is also showing up originating from the store rebalancer in the lease-transfer phase, although I have only seen it when a store is pushed to the max. In that case it occurs following a failed lease transfer, due to not actually having the lease to transfer to begin with.

I think @andrewbaptist's description of another actor moving the lease, with undefined ordering relative to this call, makes sense, but it seems odd for it to be so frequent all of a sudden.

kvoli avatar Sep 09 '22 20:09 kvoli

@andrewbaptist could you explain why we lifted the call to shedLease from considerRebalance to process? Something feels off to me about not taking the requeue = false status into account and continuing with the lease transfer attempt even when we know that the replica has just been removed from the range.

We could add a replica-in-descriptor check to canTransferLeaseFrom or shedLease, but even that feels wrong. The local replica may have removed itself from the range without having yet applied the removal to its local state.

Instead, I don't think we should be attempting to do any more work if requeue = false.

nvanbenschoten avatar Sep 12 '22 15:09 nvanbenschoten

@nvanbenschoten The call needed to be lifted because we don't always call considerRebalance anymore when processing a replica. What we want to do is run the "important" action and, once that is done, queue up a considerRebalance at its default, low priority. Without lifting that call, leases can become stranded in the wrong place for long periods, since there is a lot of work to do after a node failure, node addition, node decommission, etc. The leaseholder move check should be performed "between" process runs since we don't want to wait that long. There was a specific test, related to a field defect, that verified this behavior, which is why this was needed (I unfortunately can't easily find it).

However, I realize now that I didn't understand well enough what requeue meant. I had wrongly assumed that there could be operations that returned requeue=false but where the store is still the leaseholder. Looking more carefully through the cases, I don't think that is true anywhere. So it is probably better to change the check to verify against requeue. It might make sense to also change the name of this field to lhBeingRemoved but I don't know if I should do that as part of this change. Thoughts?

andrewbaptist avatar Sep 12 '22 16:09 andrewbaptist

So it is probably better to change the check to verify against requeue. It might make sense to also change the name of this field to lhBeingRemoved but I don't know if I should do that as part of this change. Thoughts?

It looks like there are cases where requeue=false but the replica has not been removed from the range, so I'm not sure about the rename: https://github.com/cockroachdb/cockroach/blob/bf678225946b9fe6682d68a95a67ee70e5103da6/pkg/kv/kvserver/replicate_queue.go#L871

But changing the check to verify against requeue and not calling canTransferLeaseFrom if it is false SGTM.
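
A minimal sketch of that agreed-upon control flow, with hypothetical stand-ins for the replicateQueue plumbing:

```go
package kvsketch

import "context"

// replicateQueue and queuedReplica are hypothetical stand-ins.
type replicateQueue struct{}

type queuedReplica struct{}

func (rq *replicateQueue) processOneChange(ctx context.Context, r *queuedReplica) (requeue bool, err error) {
	// ... apply the highest-priority allocator action ...
	return true, nil
}

func (rq *replicateQueue) shedLease(ctx context.Context, r *queuedReplica) {
	// ... canTransferLeaseFrom + lease transfer elided ...
}

// process sketches the agreed-upon flow: skip the lease-transfer attempt
// entirely when processOneChange reports requeue=false, since in that case
// this store may have just removed itself from the range.
func (rq *replicateQueue) process(ctx context.Context, r *queuedReplica) error {
	requeue, err := rq.processOneChange(ctx, r)
	if err != nil {
		return err
	}
	if !requeue {
		// Don't call canTransferLeaseFrom/shedLease at all.
		return nil
	}
	rq.shedLease(ctx, r)
	return nil
}
```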

nvanbenschoten avatar Sep 12 '22 18:09 nvanbenschoten

Should we mark this as resolving https://github.com/cockroachdb/cockroach/issues/87146?

nvanbenschoten avatar Sep 19 '22 15:09 nvanbenschoten

bors r=nvanbenschoten

andrewbaptist avatar Sep 22 '22 20:09 andrewbaptist

Canceled.

craig[bot] avatar Sep 22 '22 20:09 craig[bot]

bors r+

andrewbaptist avatar Sep 23 '22 14:09 andrewbaptist

Build succeeded:

craig[bot] avatar Sep 23 '22 16:09 craig[bot]

@andrewbaptist do we want to backport this to release-22.2?

nvanbenschoten avatar Sep 26 '22 20:09 nvanbenschoten

Yes, it would be good to backport it. Is it just a matter of adding the backport label to this? I'm not exactly sure of the process.

andrewbaptist avatar Sep 26 '22 20:09 andrewbaptist

blathers backport 22.2

nvanbenschoten avatar Sep 27 '22 04:09 nvanbenschoten

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from bfbf3c7fe4c13b0808afd0738f0ea494c1060d60 to blathers/backport-release-22.2-87652: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl[bot] avatar Sep 27 '22 04:09 blathers-crl[bot]

It looks like there was a merge conflict with the auto-backport, so we'll want to use the backport tool. Here's additional documentation that describes how to deal with conflicts: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/900005932/Backporting+a+change+to+a+release+branch.

nvanbenschoten avatar Sep 27 '22 04:09 nvanbenschoten