cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

kvserver: proxy requests that return a NotLeaseholderError

Open andrewbaptist opened this issue 1 year ago • 2 comments

This handles Partial partitions from the DistSender layer.

Here is an example of how this test is set up

n3 /
n1---n2

n1 is the client and n2 is the leaseholder.

The connection between n1 and n2 is broken n3 /
n1 n2

For the following request this sequence happens:

n1 -> n2 -> fails as expected - (network cut) n1 -> n1 -> fails with not leaseholder don't proxy since self-client n1 -> n3 -> fails as expected, but proxy to its local dist_sender n3 -> n1 -> fail as expected. leaseholder not known and can't proxy n3 -> n2 -> success

We could optimize out the first leaseholder lookup on n3 by pre-populating its distsender cache before the proxy request, but that requires a bit of a layering violation and isn't necessary for correctness.

Epic: none

Release note: None

andrewbaptist avatar Jan 04 '24 21:01 andrewbaptist

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jan 04 '24 21:01 blathers-crl[bot]

This change is Reviewable

cockroach-teamcity avatar Jan 04 '24 21:01 cockroach-teamcity

This is mostly working now. There are a few issues still open related to it:

  1. Observability. We need metrics that show when proxying is happening. (In progress)
  2. Changefeeds: These will currently block during a partition and are not easily observable.
  3. DistSQL: Its unclear if this is a problem. There isn't great testing related to this right now so at a minimum we should add that.

andrewbaptist avatar Mar 07 '24 20:03 andrewbaptist

bors r=erikgrinaker

TFTR!

andrewbaptist avatar Mar 23 '24 12:03 andrewbaptist