kvcoord: secondary tenants do not take network latency into account when routing batch requests
Describe the problem
The DistSender sends batch requests that interact with a range to that range's replicas. In general, it does so by ordering the replicas based on the latency between the requesting node and the node on which each replica lives. If the request must be sent to the leaseholder (and the leaseholder is known), the leaseholder is moved to the front of the queue. See:
https://github.com/cockroachdb/cockroach/blob/4d1f40b992b5f2406bf37b22f59a6f4bc5491d12/pkg/kv/kvclient/kvcoord/dist_sender.go#L1951-L1975
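For illustration, here is a minimal, self-contained sketch of that ordering policy, using stub types rather than the real kvcoord code: replicas sort ascending by latency to the requesting node, and a known leaseholder is hoisted to the front when the request must go to it.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// replicaInfo is a stand-in for kvcoord's replica metadata; only the
// fields needed to illustrate the ordering are included.
type replicaInfo struct {
	nodeID      int
	latency     time.Duration
	leaseholder bool
}

// orderReplicas sorts replicas by latency from the requesting node and,
// if the request must go to the leaseholder and it is known, moves the
// leaseholder to the front while preserving the latency order behind it.
func orderReplicas(replicas []replicaInfo, needLeaseholder bool) {
	sort.Slice(replicas, func(i, j int) bool {
		return replicas[i].latency < replicas[j].latency
	})
	if !needLeaseholder {
		return
	}
	for i, r := range replicas {
		if r.leaseholder {
			copy(replicas[1:i+1], replicas[:i]) // shift closer replicas down one slot
			replicas[0] = r
			break
		}
	}
}

func main() {
	rs := []replicaInfo{
		{nodeID: 1, latency: 40 * time.Millisecond},
		{nodeID: 2, latency: 5 * time.Millisecond},
		{nodeID: 3, latency: 90 * time.Millisecond, leaseholder: true},
	}
	orderReplicas(rs, true)
	fmt.Println(rs) // leaseholder (node 3) first, then node 2, then node 1
}
```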
For follower read requests, this has the effect that they are always routed to the nearest replica. Unfortunately, this doesn't work as intended for secondary tenants: a secondary tenant's DistSender is unaware of which node it is running on, so it ends up randomly ordering the replica slice because of:
https://github.com/cockroachdb/cockroach/blob/4d1f40b992b5f2406bf37b22f59a6f4bc5491d12/pkg/kv/kvclient/kvcoord/replica_slice.go#L205-L208
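That fallback can be illustrated with a small sketch (the names here are stand-ins, not kvcoord's): with no local node descriptor there is no proximity reference point, so the slice is simply shuffled.

```go
package main

import (
	"fmt"
	"math/rand"
)

// optimizeOrder sketches the fallback described above: without a local
// node descriptor, the DistSender has nothing to measure proximity
// against, so it randomizes the replica slice.
func optimizeOrder(replicas []int, haveLocalNodeDesc bool) {
	if !haveLocalNodeDesc {
		rand.Shuffle(len(replicas), func(i, j int) {
			replicas[i], replicas[j] = replicas[j], replicas[i]
		})
		return
	}
	// ...latency-based ordering (see the sketch above) would run here...
}

func main() {
	rs := []int{1, 2, 3}
	optimizeOrder(rs, false) // the path every secondary tenant takes today
	fmt.Println(rs)          // arbitrary order, regardless of proximity
}
```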
The reason a secondary tenant's DistSender is unaware of which node it is running on is that this cast fails: https://github.com/cockroachdb/cockroach/blob/cc155a1efe057be7d9432d160744186059767140/pkg/kv/kvclient/kvcoord/dist_sender.go#L564-L568 @nvanbenschoten was this the hazard you had in mind in the TODO above, or is there more here?
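To make that failure mode concrete, here is a toy reproduction of its shape (gossipStub and tenantConnectorStub are invented stand-ins for gossip.Gossip and the tenant's KV connector): only a gossip-backed DistSender passes the type assertion, and a SQL-only tenant's never does.

```go
package main

import "fmt"

// Stand-ins for gossip.Gossip and the tenant's KV connector; the real
// DistSender stores either one behind a common node-descriptor interface.
type gossipStub struct{}
type tenantConnectorStub struct{}

type distSender struct {
	nodeDescs interface{}
}

// localNodeKnown paraphrases the linked cast: the DistSender can only
// discover its own node when nodeDescs is actually gossip-backed.
func (ds *distSender) localNodeKnown() bool {
	_, ok := ds.nodeDescs.(*gossipStub)
	return ok
}

func main() {
	kvNode := &distSender{nodeDescs: &gossipStub{}}
	tenant := &distSender{nodeDescs: &tenantConnectorStub{}}
	fmt.Println(kvNode.localNodeKnown(), tenant.localNodeKnown()) // true false
}
```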
To Reproduce
I discovered this when trying to convert our follower_reads roachtest to run as secondary tenants. TODO(arul): link a WIP PR.
Expected behavior
Secondary tenants should take network latency into account when routing requests. Consequently, follower reads issued by secondary tenants should be served from the nearest replica (instead of a random one).
Additional context
We'd want to solve this to realize many of the benefits listed in https://github.com/cockroachdb/cockroach/issues/72593 for secondary tenants.
cc @cockroachdb/kv
Jira issue: CRDB-15402
Epic: CRDB-14202
A request when investigating this: be careful to highlight that the problem exists with secondary tenants running as separate SQL-only servers.
For dedicated/SH we will have secondary tenants running in-process with KV, with a 1:1 relationship with the KV node. In that case, we will be able to use the base algorithm.
By this do you mean that these secondary tenants will be aware of a local NodeDescriptor?
We don't need the entire node descriptor though, just the ID and locality attributes? We're already planning to include those in the sqlliveness table.
We'll definitely need this fixed before we can make MR Serverless a reality. It's a "ship-stopper" issue.
This looks straightforward. The fix will involve a bit of plumbing and a generalization of ReplicaSlice.OptimizeReplicaOrder. My suggestion is to change the signature of that method from:

```go
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeDesc *roachpb.NodeDescriptor, latencyFn LatencyFunc)
```

to

```go
// nodeID can be 0, in which case it is ignored.
// latencyFn can be nil, in which case it will not be used.
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID roachpb.NodeID, latencyFn LatencyFunc, locality roachpb.Locality)
```
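As a thought experiment, here is a self-contained sketch of how the generalized method could behave under that signature; the types and the SharedPrefix helper are stand-ins, and a zero nodeID or nil latencyFn degrades gracefully to locality matching.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Stub types standing in for roachpb.NodeID, roachpb.Locality, and
// kvcoord's ReplicaSlice; just enough structure for the ordering logic.
type NodeID int

type Tier struct{ Key, Value string }

type Locality struct{ Tiers []Tier }

// SharedPrefix counts leading tiers that match between two localities;
// more shared tiers means presumed-closer nodes. (Stand-in helper.)
func (l Locality) SharedPrefix(other Locality) int {
	n := 0
	for i := range l.Tiers {
		if i >= len(other.Tiers) || l.Tiers[i] != other.Tiers[i] {
			break
		}
		n++
	}
	return n
}

type ReplicaInfo struct {
	NodeID   NodeID
	Locality Locality
}

type ReplicaSlice []ReplicaInfo

type LatencyFunc func(NodeID) (time.Duration, bool)

// OptimizeReplicaOrder mirrors the proposed signature: nodeID may be 0
// (unknown) and latencyFn may be nil (no samples), in which case the
// caller-supplied locality alone decides the order.
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID NodeID, latencyFn LatencyFunc, locality Locality) {
	sort.Slice(rs, func(i, j int) bool {
		// A replica on the client's own node always sorts first.
		if nodeID != 0 && (rs[i].NodeID == nodeID) != (rs[j].NodeID == nodeID) {
			return rs[i].NodeID == nodeID
		}
		// Prefer measured latency when both samples exist.
		if latencyFn != nil {
			li, oki := latencyFn(rs[i].NodeID)
			lj, okj := latencyFn(rs[j].NodeID)
			if oki && okj {
				return li < lj
			}
		}
		// Otherwise fall back to locality matching.
		return locality.SharedPrefix(rs[i].Locality) > locality.SharedPrefix(rs[j].Locality)
	})
}

func main() {
	region := func(r string) Locality { return Locality{Tiers: []Tier{{Key: "region", Value: r}}} }
	rs := ReplicaSlice{
		{NodeID: 1, Locality: region("us-west1")},
		{NodeID: 2, Locality: region("europe-west1")},
		{NodeID: 3, Locality: region("us-east1")},
	}
	// A SQL-only tenant in us-east1: no NodeID, no latency samples, yet
	// the configured locality still produces a sensible order.
	rs.OptimizeReplicaOrder(0, nil, region("us-east1"))
	fmt.Println(rs) // the replica on node 3 (matching region) sorts first
}
```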
Once we have that:
- use Gossip, if available, to grab the client's NodeID (DistSender.getNodeID)
- get rid of DistSender.getNodeDescriptor
- finally, plumb the local process's Locality into the DistSender (cfg.Locality in server.NewServer and server.makeTenantSQLServerArgs); see the sketch after this list
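Here is a minimal sketch of that last plumbing step, with stub types standing in for the real server and kvcoord structs; the Locality field on DistSenderConfig is the proposed addition, not an existing field.

```go
package main

import "fmt"

// Stub types; the real plumbing would touch kvcoord.DistSenderConfig
// and the server packages instead.
type Tier struct{ Key, Value string }
type Locality struct{ Tiers []Tier }

type DistSenderConfig struct {
	Locality Locality // proposed: the local process's locality
}

type DistSender struct{ locality Locality }

func NewDistSender(cfg DistSenderConfig) *DistSender {
	return &DistSender{locality: cfg.Locality}
}

// makeTenantSQLServerArgs echoes the real function's role: a SQL-only
// tenant server forwards its configured locality into the DistSender,
// just as server.NewServer would pass cfg.Locality for a KV node.
func makeTenantSQLServerArgs(procLocality Locality) *DistSender {
	return NewDistSender(DistSenderConfig{Locality: procLocality})
}

func main() {
	ds := makeTenantSQLServerArgs(Locality{Tiers: []Tier{{Key: "region", Value: "us-east1"}}})
	fmt.Println(ds.locality) // usable by OptimizeReplicaOrder even with no NodeID
}
```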