ccl/multiregionccl: TestEnsureLocalReadsOnGlobalTables failed [ClosedTimestampPolicy is -1; not a valid enum value]
ccl/multiregionccl.TestEnsureLocalReadsOnGlobalTables failed on master @ 7512f81be1591cde8e3b5d34bb0b9b1101f22d24:
=== RUN TestEnsureLocalReadsOnGlobalTables
test_log_scope.go:170: test logs captured to: outputs.zip/logTestEnsureLocalReadsOnGlobalTables1878020789
test_log_scope.go:81: use -show-logs to present logs inline
roundtrips_test.go:119: condition failed to evaluate within 45s: from roundtrips_test.go:131: expected closedts policy LEAD_FOR_GLOBAL_READS, got -1
panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestEnsureLocalReadsOnGlobalTables1878020789
--- FAIL: TestEnsureLocalReadsOnGlobalTables (57.62s)
Parameters:
attempt=1 run=7 shard=5
Same failure on other branches
- #129545 ccl/multiregionccl: TestEnsureLocalReadsOnGlobalTables failed [C-test-failure O-robot P-2 T-sql-foundations branch-release-23.1]
Jira issue: CRDB-41707
Checking if KV has any ideas on this. The assertion that failed is https://github.com/cockroachdb/cockroach/blob/3962ffca4feb71db0f66e8f71495d66ba986c1d8/pkg/ccl/multiregionccl/roundtrips_test.go#L130-L132
Somehow entry.ClosedTimestampPolicy is -1, which is not even a valid enum value. I'm not sure how that's possible.
https://github.com/cockroachdb/cockroach/blob/41c2a01fc5485b7ced208b630f5a0bce42ea7acc/pkg/roachpb/data.proto#L830-L851
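For reference, proto enums are varint-encoded after sign extension to 64 bits, so an encoded 1 and an encoded -1 look nothing alike. A small standalone sketch (not from the repo) that prints both encodings:
package main

import (
    "fmt"

    "google.golang.org/protobuf/encoding/protowire"
)

func main() {
    // Enum values are varint-encoded after sign extension to 64 bits, so a
    // negative value blows up to ten bytes while 1 stays a single byte.
    one := protowire.AppendVarint(nil, uint64(int64(1)))       // 01
    minusOne := protowire.AppendVarint(nil, uint64(int64(-1))) // ff ff ff ff ff ff ff ff ff 01
    fmt.Printf(" 1 -> % x\n", one)
    fmt.Printf("-1 -> % x\n", minusOne)
}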
After ruling out a bit flip (as the sketch above shows, 1 and -1 don't look at all similar when varint-encoded), I looked through the writes to that field. There is
/pkg/kv/kvclient/rangecache/range_cache.go#L1324-L1330
func (e *cacheEntry) toRangeInfo() roachpb.RangeInfo {
    return roachpb.RangeInfo{
        Desc:                  e.desc,
        Lease:                 e.lease,
        ClosedTimestampPolicy: e.closedts,
    }
}
and looking at the write accesses to (cacheEntry).closedts, I found
/pkg/kv/kvclient/rangecache/range_cache.go#L910-L916
newEntry := cacheEntry{
    desc: rs[0],
    // We don't have any lease information.
    lease: roachpb.Lease{},
    // We don't know the closed timestamp policy.
    closedts: UnknownClosedTimestampPolicy,
}
/pkg/kv/kvclient/rangecache/range_cache.go#L97-L100
// UnknownClosedTimestampPolicy is used to mark on a cacheEntry that the closed
// timestamp policy is not known. This value is never serialized into
// RangeInfo or any other message which uses the type.
const UnknownClosedTimestampPolicy roachpb.RangeClosedTimestampPolicy = -1
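That also answers the "how is -1 even possible" question above: proto-generated enums are just named integer types in Go, so nothing stops code from keeping an out-of-range sentinel in memory. A minimal illustration with a stand-in type (not the generated one):
// Stand-in for the generated type; the real RangeClosedTimestampPolicy is an
// int32-based named type, so any int32 value - including -1 - is representable.
type RangeClosedTimestampPolicy int32

// An in-memory "unknown" sentinel like the one above; it is only "invalid" in
// the sense that no enum constant is declared for it.
const UnknownClosedTimestampPolicy RangeClosedTimestampPolicy = -1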
So, for some reason, on at least one of the servers, running SELECT ... repeatedly doesn't populate the cache. I've looked at the logs, but nothing jumps out. It looks like it doesn't fail easily when stressed (waiting for 1000 iters locally right now).
I poked at this a little more. One thing that stuck out was that the "changing the spanconfig for span" log lines never print a config with GlobalReads:true. This is true in passing runs as well, though, so I think it's a wart with the logging - it doesn't seem to mention the spanconfigs for this global table at all, perhaps because they're never "updated" but are installed as global ones from the start. Either way, it's not going to help explain this test failure.
So it comes back to this: for some reason we're not adding to the range cache as a result of running a SQL query. It seems difficult to understand that without a reproduction under an extended verbosity level or with bespoke logging.
It's interesting that this also failed (but non-identically) on 23.1 last week: https://github.com/cockroachdb/cockroach/issues/129545, but I assume that's just bad luck rather than an indication that we backported something all the way there recently.
On second thought, it's even stranger. The test asserts that there is a cache hit:
/pkg/ccl/multiregionccl/roundtrips_test.go#L126-L127
entry, err := cache.TestingGetCached(context.Background(), tablePrefix, false /* inverted */)
require.NoError(t, err)
because TestingGetCached returns an error on a miss:
/pkg/kv/kvclient/rangecache/range_cache.go#L1038-L1041
entry, _ := rc.getCachedRLocked(ctx, key, inverted)
if entry == nil {
    return roachpb.RangeInfo{}, errors.Newf("no entry found for %s in cache", key)
}
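For context, the shape of the check inside the test's SucceedsSoon loop is roughly this (paraphrased, not an exact quote of roundtrips_test.go; cache and tablePrefix come from the test's setup):
testutils.SucceedsSoon(t, func() error {
    // The test issues a read against the global table and then inspects what
    // the gateway's range cache holds for the table's key prefix.
    entry, err := cache.TestingGetCached(context.Background(), tablePrefix, false /* inverted */)
    require.NoError(t, err) // a cache miss fails the test outright rather than retrying
    if entry.ClosedTimestampPolicy != roachpb.LEAD_FOR_GLOBAL_READS {
        return errors.Newf("expected closedts policy LEAD_FOR_GLOBAL_READS, got %s",
            entry.ClosedTimestampPolicy)
    }
    return nil
})
That matches the failure mode: the require.NoError passes (there is an entry), but the policy comparison keeps returning an error with -1 until the 45s timeout.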
So there is a cached entry, but it never reflects a closedts policy. This is strange: on passing runs, the SucceedsSoon spins for a while on the regular policy and then gets satisfied. Here, it sees the unknown policy throughout (which I have never seen logged in passing runs). Perhaps there is a caching issue, where an early entry makes it into the cache and is never replaced by the more valuable entries offered later.
tryLookup is complicated enough for something like that to be plausible. We use the -1 policy when inserting a descriptor from meta2 (even if it was a prefetched one).
The way cache entries are updated is complex. Most code paths deal chiefly with the descriptor and the lease; the closed timestamp policy seems more of an afterthought. I could imagine a sequence of events that leaves the range descriptor and lease up to date while never updating the closed timestamp policy from the default. That said, I haven't found a smoking gun yet - it is difficult to even trace the regular lifecycle of the cache entry relevant to this test failure.
What should happen is that the kvserver returns a new RangeInfo containing the closedts policy:
/pkg/kv/kvserver/replica_send.go#L319-L330
// Compare the client's info with the replica's info to detect if the client
// has stale knowledge. Note that the client can have more recent knowledge
// than the replica in case this is a follower.
cri := &ba.ClientRangeInfo
ri := r.GetRangeInfo(ctx)
needInfo := cri.ExplicitlyRequested ||
    (cri.DescriptorGeneration < ri.Desc.Generation) ||
    (cri.LeaseSequence < ri.Lease.Sequence) ||
    (cri.ClosedTimestampPolicy != ri.ClosedTimestampPolicy)
if !needInfo {
    return
}
Now here's something unexpected. The DistSender populates the ClientRangeInfo that gets compared above here:
/pkg/kv/kvclient/kvcoord/dist_sender.go#L2671-L2688
// Communicate to the server the information our cache has about the
// range. If it's stale, the server will return an update.
ba.ClientRangeInfo = roachpb.ClientRangeInfo{
    // Note that DescriptorGeneration will be 0 if the cached descriptor
    // is "speculative" (see DescSpeculative()). Even if the speculation
    // is correct, we want the server to return an update, at which point
    // the cached entry will no longer be "speculative".
    DescriptorGeneration: desc.Generation,
    // The LeaseSequence will be 0 if the cache doesn't have lease info,
    // or has a speculative lease. Like above, this asks the server to
    // return an update.
    LeaseSequence: routing.LeaseSeq(),
    // The ClosedTimestampPolicy will be the default if the cache
    // doesn't have info. Like above, this asks the server to return an
    // update.
    ClosedTimestampPolicy: routing.ClosedTimestampPolicy(
        defaultSendClosedTimestampPolicy,
    ),
}
Note that it doesn't attach the cached value verbatim. This makes sense - after all, the cached value might be -1, which doesn't even make sense on the wire - but for -1 we fall back to the default... and the default is (drum roll):
/pkg/kv/kvclient/kvcoord/dist_sender.go#L2483
const defaultSendClosedTimestampPolicy = roachpb.LEAD_FOR_GLOBAL_READS
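I haven't quoted routing.ClosedTimestampPolicy above, but it presumably boils down to something like this fallback (a paraphrase, not the actual implementation):
// Paraphrased sketch: return the cached policy unless the cache only has the
// -1 sentinel, in which case hand back the caller-provided default.
func (et EvictionToken) ClosedTimestampPolicy(
    _default roachpb.RangeClosedTimestampPolicy,
) roachpb.RangeClosedTimestampPolicy {
    if et.closedts == UnknownClosedTimestampPolicy {
        return _default
    }
    return et.closedts
}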
So this adds up: we create the initial cache entry with the -1 policy, but then we (sort of accidentally, I guess?) tell the kvserver that we're already aware of the LEAD_FOR_GLOBAL_READS policy. The kvserver then never sends a RangeInfo update back, so we never replace the -1.
Note that this is not what the comment in the snippet above claims, but based on my reading of the code there is no way for kvserver to distinguish between a client that is aware of the global reads policy and one that has filled it in as a default.
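To spell it out, here's a toy walkthrough of the needInfo check with illustrative values for the failing scenario (the numbers are made up; only the comparisons matter, and ExplicitlyRequested is assumed false on this path):
package main

import "fmt"

func main() {
    const leadForGlobalReads = 1 // stand-in for roachpb.LEAD_FOR_GLOBAL_READS

    // What the client sends: the real descriptor generation and lease sequence
    // from its cache, plus the *default* policy, because internally it only
    // has the -1 sentinel.
    criDescGen, criLeaseSeq, criPolicy := 5, 3, leadForGlobalReads

    // What the replica of a global-reads range knows about itself.
    riDescGen, riLeaseSeq, riPolicy := 5, 3, leadForGlobalReads

    needInfo := criDescGen < riDescGen ||
        criLeaseSeq < riLeaseSeq ||
        criPolicy != riPolicy
    fmt.Println("needInfo:", needInfo) // false: no RangeInfo update, the -1 stays cached
}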
"Pretending" to know about global reads is new in https://github.com/cockroachdb/cockroach/commit/b5681cd8689079fbcafb5819da6ae45c23dba6d3 but it looks like the same problem existed before, except it affected the much more common LAG_BY_CLUSTER_SETTINGS. The real problem is that we don't have a default value that corresponds to "unknown" (like a lease sequence of zero). Introducing one now would require a migration which is not terrible but also unfortunate.
What we can do, though, is "taint" another part of the ClientRangeInfo we send out. That will prompt an update back and fix us up - for example, zeroing the lease sequence when we don't know the closed timestamp policy should solve the problem.
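For concreteness, one possible shape of that idea in the DistSender (a sketch of the idea only, not the actual PR):
// Sketch only: if the cache doesn't know the closed timestamp policy, zero out
// the lease sequence we report so the server sees "stale" client info and
// responds with a RangeInfo update, which also fills in the policy.
// (Passing the -1 sentinel as the default is just one way to detect "unknown".)
leaseSeq := routing.LeaseSeq()
if routing.ClosedTimestampPolicy(rangecache.UnknownClosedTimestampPolicy) ==
    rangecache.UnknownClosedTimestampPolicy {
    leaseSeq = 0
}
ba.ClientRangeInfo = roachpb.ClientRangeInfo{
    DescriptorGeneration: desc.Generation,
    LeaseSequence:        leaseSeq,
    ClosedTimestampPolicy: routing.ClosedTimestampPolicy(
        defaultSendClosedTimestampPolicy,
    ),
}
A dedicated accessor on the eviction token that says whether the policy is known would be cleaner; the point is only to make the server's needInfo check trip.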
I'll send a PR to try out that idea.
Removing release-blocker because this is not a new bug.