kvcoord: secondary tenants do not take network latency into account when routing batch requests
Describe the problem
The DistSender sends batch requests that interact with a range to that range's replicas. In general, it does so by ordering the replicas based on the latency between the requesting node and the node on which each replica lives. If the request must be sent to the leaseholder (and the leaseholder is known), the leaseholder is moved to the front of the queue. See:
https://github.com/cockroachdb/cockroach/blob/4d1f40b992b5f2406bf37b22f59a6f4bc5491d12/pkg/kv/kvclient/kvcoord/dist_sender.go#L1951-L1975
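For illustration, here is a minimal, self-contained sketch of that ordering policy, using stub types rather than the real kvcoord code: replicas sort ascending by latency to the requesting node, and a known leaseholder is hoisted to the front when the request must go to it.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// replicaInfo is a stand-in for kvcoord's replica metadata; only the
// fields needed to illustrate the ordering are included.
type replicaInfo struct {
	nodeID      int
	latency     time.Duration
	leaseholder bool
}

// orderReplicas sorts replicas by latency from the requesting node and,
// if the request must go to the leaseholder and it is known, moves the
// leaseholder to the front while preserving the latency order behind it.
func orderReplicas(replicas []replicaInfo, needLeaseholder bool) {
	sort.Slice(replicas, func(i, j int) bool {
		return replicas[i].latency < replicas[j].latency
	})
	if !needLeaseholder {
		return
	}
	for i, r := range replicas {
		if r.leaseholder {
			copy(replicas[1:i+1], replicas[:i]) // shift closer replicas down one slot
			replicas[0] = r
			break
		}
	}
}

func main() {
	rs := []replicaInfo{
		{nodeID: 1, latency: 40 * time.Millisecond},
		{nodeID: 2, latency: 5 * time.Millisecond},
		{nodeID: 3, latency: 90 * time.Millisecond, leaseholder: true},
	}
	orderReplicas(rs, true)
	fmt.Println(rs) // leaseholder (node 3) first, then node 2, then node 1
}
```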
For follower read requests, this has the effect that they are always routed to the nearest replica. Unfortunately, this doesn't work as intended for secondary tenants: a secondary tenant's DistSender is unaware of which node it is running on, so it ends up randomly ordering the replica slice because of:
https://github.com/cockroachdb/cockroach/blob/4d1f40b992b5f2406bf37b22f59a6f4bc5491d12/pkg/kv/kvclient/kvcoord/replica_slice.go#L205-L208
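That fallback can be illustrated with a small sketch (the names here are stand-ins, not kvcoord's): with no local node descriptor there is no proximity reference point, so the slice is simply shuffled.

```go
package main

import (
	"fmt"
	"math/rand"
)

// optimizeOrder sketches the fallback described above: without a local
// node descriptor, the DistSender has nothing to measure proximity
// against, so it randomizes the replica slice.
func optimizeOrder(replicas []int, haveLocalNodeDesc bool) {
	if !haveLocalNodeDesc {
		rand.Shuffle(len(replicas), func(i, j int) {
			replicas[i], replicas[j] = replicas[j], replicas[i]
		})
		return
	}
	// ...latency-based ordering (see the sketch above) would run here...
}

func main() {
	rs := []int{1, 2, 3}
	optimizeOrder(rs, false) // the path every secondary tenant takes today
	fmt.Println(rs)          // arbitrary order, regardless of proximity
}
```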
The reason a secondary tenant's DistSender is unaware of which node it is running on is that this cast fails: https://github.com/cockroachdb/cockroach/blob/cc155a1efe057be7d9432d160744186059767140/pkg/kv/kvclient/kvcoord/dist_sender.go#L564-L568 @nvanbenschoten was this the hazard you had in mind in the TODO above, or is there more here?
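To make that failure mode concrete, here is a toy reproduction of its shape (gossipStub and tenantConnectorStub are invented stand-ins for gossip.Gossip and the tenant's KV connector): only a gossip-backed DistSender passes the type assertion, and a SQL-only tenant's never does.

```go
package main

import "fmt"

// Stand-ins for gossip.Gossip and the tenant's KV connector; the real
// DistSender stores either one behind a common node-descriptor interface.
type gossipStub struct{}
type tenantConnectorStub struct{}

type distSender struct {
	nodeDescs interface{}
}

// localNodeKnown paraphrases the linked cast: the DistSender can only
// discover its own node when nodeDescs is actually gossip-backed.
func (ds *distSender) localNodeKnown() bool {
	_, ok := ds.nodeDescs.(*gossipStub)
	return ok
}

func main() {
	kvNode := &distSender{nodeDescs: &gossipStub{}}
	tenant := &distSender{nodeDescs: &tenantConnectorStub{}}
	fmt.Println(kvNode.localNodeKnown(), tenant.localNodeKnown()) // true false
}
```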
To Reproduce
I discovered this when trying to convert our follower_reads roachtest to run as secondary tenants. TODO(arul): link a WIP PR.
Expected behavior
Secondary tenants should take network latency into account when routing requests. Consequently, follower reads issued by secondary tenants should be served from the nearest replica (instead of a random one).
Additional context
We'd want to solve this to realize many of the benefits listed in https://github.com/cockroachdb/cockroach/issues/72593 for secondary tenants.
cc @cockroachdb/kv
Jira issue: CRDB-15402
Epic: CRDB-14202
A request when investigating this: be careful to highlight that the problem exists with secondary tenants running as separate SQL-only servers.
For dedicated/SH we will have secondary tenants running in-process with KV, with a 1:1 relationship with the KV node. In that case, we will be able to use the base algorithm.
By this do you mean that these secondary tenants will be aware of a local NodeDescriptor?
We don't need the entire node descriptor though, just the ID and locality attributes? We're already planning to include those in the sqlliveness table.
We'll definitely need this fixed before we can make MR Serverless a reality. It's a "ship-stopper" issue.
This looks straightforward. The fix will involve a bit of plumbing and a generalization of ReplicaSlice.OptimizeReplicaOrder. My suggestion is to change the signature of that method from:

```go
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeDesc *roachpb.NodeDescriptor, latencyFn LatencyFunc)
```

to

```go
// nodeID can be 0, in which case it is ignored.
// latencyFn can be nil, in which case it will not be used.
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID roachpb.NodeID, latencyFn LatencyFunc, locality roachpb.Locality)
```
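As a thought experiment, here is a self-contained sketch of how the generalized method could behave under that signature; the types and the SharedPrefix helper are stand-ins, and a zero nodeID or nil latencyFn degrades gracefully to locality matching.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Stub types standing in for roachpb.NodeID, roachpb.Locality, and
// kvcoord's ReplicaSlice; just enough structure for the ordering logic.
type NodeID int

type Tier struct{ Key, Value string }

type Locality struct{ Tiers []Tier }

// SharedPrefix counts leading tiers that match between two localities;
// more shared tiers means presumed-closer nodes. (Stand-in helper.)
func (l Locality) SharedPrefix(other Locality) int {
	n := 0
	for i := range l.Tiers {
		if i >= len(other.Tiers) || l.Tiers[i] != other.Tiers[i] {
			break
		}
		n++
	}
	return n
}

type ReplicaInfo struct {
	NodeID   NodeID
	Locality Locality
}

type ReplicaSlice []ReplicaInfo

type LatencyFunc func(NodeID) (time.Duration, bool)

// OptimizeReplicaOrder mirrors the proposed signature: nodeID may be 0
// (unknown) and latencyFn may be nil (no samples), in which case the
// caller-supplied locality alone decides the order.
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID NodeID, latencyFn LatencyFunc, locality Locality) {
	sort.Slice(rs, func(i, j int) bool {
		// A replica on the client's own node always sorts first.
		if nodeID != 0 && (rs[i].NodeID == nodeID) != (rs[j].NodeID == nodeID) {
			return rs[i].NodeID == nodeID
		}
		// Prefer measured latency when both samples exist.
		if latencyFn != nil {
			li, oki := latencyFn(rs[i].NodeID)
			lj, okj := latencyFn(rs[j].NodeID)
			if oki && okj {
				return li < lj
			}
		}
		// Otherwise fall back to locality matching.
		return locality.SharedPrefix(rs[i].Locality) > locality.SharedPrefix(rs[j].Locality)
	})
}

func main() {
	region := func(r string) Locality { return Locality{Tiers: []Tier{{Key: "region", Value: r}}} }
	rs := ReplicaSlice{
		{NodeID: 1, Locality: region("us-west1")},
		{NodeID: 2, Locality: region("europe-west1")},
		{NodeID: 3, Locality: region("us-east1")},
	}
	// A SQL-only tenant in us-east1: no NodeID, no latency samples, yet
	// the configured locality still produces a sensible order.
	rs.OptimizeReplicaOrder(0, nil, region("us-east1"))
	fmt.Println(rs) // the replica on node 3 (matching region) sorts first
}
```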
Once we have that:
- use Gossip, if available, to grab the client's NodeID (DistSender.getNodeID)
- get rid of DistSender.getNodeDescriptor
- finally, plumb the local process's Locality into the DistSender (cfg.Locality in server.NewServer and server.makeTenantSQLServerArgs); see the sketch after this list
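Here is a minimal sketch of that last plumbing step, with stub types standing in for the real server and kvcoord structs; the Locality field on DistSenderConfig is the proposed addition, not an existing field.

```go
package main

import "fmt"

// Stub types; the real plumbing would touch kvcoord.DistSenderConfig
// and the server packages instead.
type Tier struct{ Key, Value string }
type Locality struct{ Tiers []Tier }

type DistSenderConfig struct {
	Locality Locality // proposed: the local process's locality
}

type DistSender struct{ locality Locality }

func NewDistSender(cfg DistSenderConfig) *DistSender {
	return &DistSender{locality: cfg.Locality}
}

// makeTenantSQLServerArgs echoes the real function's role: a SQL-only
// tenant server forwards its configured locality into the DistSender,
// just as server.NewServer would pass cfg.Locality for a KV node.
func makeTenantSQLServerArgs(procLocality Locality) *DistSender {
	return NewDistSender(DistSenderConfig{Locality: procLocality})
}

func main() {
	ds := makeTenantSQLServerArgs(Locality{Tiers: []Tier{{Key: "region", Value: "us-east1"}}})
	fmt.Println(ds.locality) // usable by OptimizeReplicaOrder even with no NodeID
}
```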