Austen
> @kvoli Can the distsender change that is causing this be reverted, or is there some setting we can temporarily disable to avoid these failures? This is failing logic tests...
I've tried reproducing this on a couple different tests: No success on `release-23.1` with `TestPauseMigration` #119696: ``` dev test pkg/upgrade/upgrademanager -f TestPauseMigration -v --stress --cpus=30 ... INFO: Elapsed time: 3191.840s,...
In the repros I've gotten, the upgrade fails because of a heartbeat timeout. This doesn't appear related to recent changes. ``` W240417 18:06:16.459469 2646 upgrade/upgrademanager/manager.go:308 ⋮ [T1,Vsystem,n1,peer=‹127.0.0.1:55794›,client=127.0.0.1:55794,hostnossl,user=root,migration-mgr] 384 error encountered...
> @kvoli just a heads up that we keep on seeing this problem elsewhere (especially in the mixed-version logic tests - see latest linked issues). I'm out till Monday and...
> One interesting data point is that we tend to see a cluster of these failures in separate nightlies at about the same time. Sometimes they occur at the same time...
@andrewbaptist I implemented a (less than desirable) retry mechanism in the upgrade's `UntilClusterStable` function, which is where these tests are failing, to handle nodes being unavailable during the upgrade. WDYT about this approach? https://github.com/cockroachdb/cockroach/pull/124288
> So it looks like the reactive transfers from https://github.com/cockroachdb/cockroach/commit/ba13d19481da623d96708a86a9459b5e64e65494 aren't working for these 3 leases. This is similar to what we saw in https://github.com/cockroachdb/cockroach/issues/123866, though I closed that with...
This is likely a race between the expiration to epoch lease upgrade here: https://github.com/cockroachdb/cockroach/blob/a3cde79a723a0e908cbf08737f4e5a648feef58f/pkg/kv/kvserver/replica_proposal.go#L495-L495 and `replicaCanBeProcessed`, called by the queue before processing a replica: https://github.com/cockroachdb/cockroach/blob/a3cde79a723a0e908cbf08737f4e5a648feef58f/pkg/kv/kvserver/queue.go#L1074-L1074 This lines up with the...
Removing `A-testing`, this is a bug.
This passes 10/10 tests for `lease-preferences/manual-violating-transfer`. Previously, it would fail 1/3 of the time.