ccl/schemachangerccl: TestBackupMixedVersionElements failed [node unavailable in KV NodeLiveness even though it started up]
ccl/schemachangerccl.TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid failed with artifacts on release-23.1 @ 689c53764c44c609313c00f7317f1e747e13a152:
I240314 21:19:02.138565 1303481 3@pebble/event.go:701 [n2,s2,pebble] 41 [JOB 14] sstable deleted 000008
I240314 21:19:02.155747 1300521 13@kv/kvserver/replica_raft.go:381 [T1,n1,s1,r36/1:/Table/3{5-6}] 42 proposing SIMPLE(v2) [(n2,s2):2]: after=[(n1,s1):1 (n2,s2):2] next=3
I240314 21:19:02.156702 1300521 13@kv/kvserver/replicate_queue.go:1283 [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 43 adding voter n3,s3: [1*:12]
I240314 21:19:02.157280 1300521 13@kv/kvserver/replica_command.go:2376 [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 44 change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r50:/Table/5{0-1} [(n1,s1):1, next=2, gen=0]
I240314 21:19:02.160757 1300521 13@kv/kvserver/replica_raft.go:381 [T1,n1,s1,r50/1:/Table/5{0-1}] 45 proposing SIMPLE(l2) [(n3,s3):2LEARNER]: after=[(n1,s1):1 (n3,s3):2LEARNER] next=3
I240314 21:19:02.176306 1303491 13@kv/kvserver/store_snapshot.go:1579 [T1,n1,s1,r50/1:/Table/5{0-1}] 46 streamed INITIAL snapshot 82cb4430 at applied index 15 to (n3,s3):2LEARNER with 700 B in 0.00s @ 4.4 MiB/s: kvs=9 rangeKVs=0, rate-limit: 32 MiB/s, queued: 0.01s
I240314 21:19:02.176952 1303441 13@kv/kvserver/replica_raftstorage.go:480 [T1,n3,s3,r50/2:{-}] 47 applying INITIAL snapshot 82cb4430 from (n1,s1):1 at applied index 15
I240314 21:19:02.177400 1303441 3@pebble/event.go:697 [n3,s3,pebble] 48 [JOB 15] ingesting: sstable created 000015
I240314 21:19:02.177528 1303441 3@pebble/event.go:697 [n3,s3,pebble] 49 [JOB 15] ingesting: sstable created 000020
I240314 21:19:02.177618 1303441 3@pebble/event.go:697 [n3,s3,pebble] 50 [JOB 15] ingesting: sstable created 000016
I240314 21:19:02.177643 1303441 3@pebble/event.go:697 [n3,s3,pebble] 51 [JOB 15] ingesting: sstable created 000017
I240314 21:19:02.177659 1303441 3@pebble/event.go:697 [n3,s3,pebble] 52 [JOB 15] ingesting: sstable created 000018
I240314 21:19:02.177674 1303441 3@pebble/event.go:697 [n3,s3,pebble] 53 [JOB 15] ingesting: sstable created 000019
I240314 21:19:02.177734 1303441 3@pebble/event.go:717 [n3,s3,pebble] 54 [JOB 16] WAL created 000021 (recycled 000002)
I240314 21:19:02.177845 1303523 3@pebble/event.go:677 [n3,s3,pebble] 55 [JOB 17] flushing 1 memtable to L0
I240314 21:19:02.177891 1303523 3@pebble/event.go:697 [n3,s3,pebble] 56 [JOB 17] flushing: sstable created 000022
I240314 21:19:02.178068 1303523 3@pebble/event.go:681 [n3,s3,pebble] 57 [JOB 17] flushed 1 memtable to L0 [000022] (1.7 K), in 0.0s (0.0s total), output rate 9.6 M/s
I240314 21:19:02.178171 1303526 3@pebble/event.go:665 [n3,s3,pebble] 58 [JOB 18] compacting(default) L0 [000022] (1.7 K) + L6 [000013 000012] (2.8 K)
I240314 21:19:02.178319 1303526 3@pebble/event.go:697 [n3,s3,pebble] 59 [JOB 18] compacting: sstable created 000023
I240314 21:19:02.178563 1303526 3@pebble/event.go:669 [n3,s3,pebble] 60 [JOB 18] compacted(default) L0 [000022] (1.7 K) + L6 [000013 000012] (2.8 K) -> L6 [000023] (2.3 K), in 0.0s (0.0s total), output rate 7.0 M/s
I240314 21:19:02.178816 1303441 3@pebble/event.go:705 [n3,s3,pebble] 61 [JOB 15] ingested L0:000015 (1.4 K), L0:000020 (1.1 K), L0:000016 (1.5 K), L6:000017 (1.2 K), L6:000018 (1.1 K), L6:000019 (1.1 K)
I240314 21:19:02.178933 1303529 3@pebble/event.go:701 [n3,s3,pebble] 62 [JOB 18] sstable deleted 000012
I240314 21:19:02.178979 1303529 3@pebble/event.go:701 [n3,s3,pebble] 63 [JOB 18] sstable deleted 000013
I240314 21:19:02.178995 1303529 3@pebble/event.go:701 [n3,s3,pebble] 64 [JOB 18] sstable deleted 000022
I240314 21:19:02.179066 1303409 3@pebble/event.go:665 [n3,s3,pebble] 65 [JOB 20] compacting(default) L0 [000015 000020 000016] (4.1 K) + L6 [000023] (2.3 K)
I240314 21:19:02.179189 1303409 3@pebble/event.go:697 [n3,s3,pebble] 66 [JOB 20] compacting: sstable created 000024
I240314 21:19:02.179309 1303441 kv/kvserver/replica_raftstorage.go:491 [T1,n3,s3,r50/2:/Table/5{0-1}] 67 applied INITIAL snapshot 82cb4430 from (n1,s1):1 at applied index 15 (total=2ms data=757 B ingestion=6@2ms)
I240314 21:19:02.179466 1303409 3@pebble/event.go:669 [n3,s3,pebble] 68 [JOB 20] compacted(default) L0 [000015 000020 000016] (4.1 K) + L6 [000023] (2.3 K) -> L6 [000024] (2.6 K), in 0.0s (0.0s total), output rate 7.8 M/s
I240314 21:19:02.179576 1303557 3@pebble/event.go:665 [n3,s3,pebble] 69 [JOB 21] compacting(elision-only) L6 [000017] (1.2 K) + L6 [] (0 B)
I240314 21:19:02.179676 1303557 3@pebble/event.go:669 [n3,s3,pebble] 70 [JOB 21] compacted(elision-only) L6 [000017] (1.2 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179718 1303558 3@pebble/event.go:665 [n3,s3,pebble] 71 [JOB 22] compacting(elision-only) L6 [000018] (1.1 K) + L6 [] (0 B)
I240314 21:19:02.179780 1303558 3@pebble/event.go:669 [n3,s3,pebble] 72 [JOB 22] compacted(elision-only) L6 [000018] (1.1 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179827 1303560 3@pebble/event.go:665 [n3,s3,pebble] 73 [JOB 23] compacting(elision-only) L6 [000019] (1.1 K) + L6 [] (0 B)
I240314 21:19:02.179901 1303560 3@pebble/event.go:669 [n3,s3,pebble] 74 [JOB 23] compacted(elision-only) L6 [000019] (1.1 K) + L6 [] (0 B) -> L6 [] (0 B), in 0.0s (0.0s total), output rate 0 B/s
I240314 21:19:02.179956 1303559 3@pebble/event.go:701 [n3,s3,pebble] 75 [JOB 22] sstable deleted 000018
I240314 21:19:02.179987 1303556 3@pebble/event.go:701 [n3,s3,pebble] 76 [JOB 20] sstable deleted 000015
I240314 21:19:02.180017 1303556 3@pebble/event.go:701 [n3,s3,pebble] 77 [JOB 20] sstable deleted 000016
I240314 21:19:02.179998 1303532 3@pebble/event.go:701 [n3,s3,pebble] 78 [JOB 24] sstable deleted 000017
I240314 21:19:02.180034 1303556 3@pebble/event.go:701 [n3,s3,pebble] 79 [JOB 20] sstable deleted 000020
I240314 21:19:02.180043 1303532 3@pebble/event.go:701 [n3,s3,pebble] 80 [JOB 24] sstable deleted 000019
I240314 21:19:02.180046 1303556 3@pebble/event.go:701 [n3,s3,pebble] 81 [JOB 20] sstable deleted 000023
I240314 21:19:02.180403 1300521 13@kv/kvserver/replica_command.go:2376 [T1,n1,replicate,s1,r50/1:/Table/5{0-1}] 82 change replicas (add [(n3,s3):2] remove []): existing descriptor r50:/Table/5{0-1} [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=1]
I240314 21:19:02.183884 1300521 13@kv/kvserver/replica_raft.go:381 [T1,n1,s1,r50/1:/Table/5{0-1}] 83 proposing SIMPLE(v2) [(n3,s3):2]: after=[(n1,s1):1 (n3,s3):2] next=3
I240314 21:19:02.184837 1300521 13@kv/kvserver/replicate_queue.go:1283 [T1,n1,replicate,s1,r22/1:/Table/2{1-2}] 84 adding voter n2,s2: [1*:18]
I240314 21:19:02.185345 1300521 13@kv/kvserver/replica_command.go:2376 [T1,n1,replicate,s1,r22/1:/Table/2{1-2}] 85 change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r22:/Table/2{1-2} [(n1,s1):1, next=2, gen=0]
I240314 21:19:02.188777 1300521 13@kv/kvserver/replica_raft.go:381 [T1,n1,s1,r22/1:/Table/2{1-2}] 86 proposing SIMPLE(l2) [(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
--- FAIL: TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid (274.72s)
=== RUN TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid/backup/restore_stage_17_of_19
testcluster.go:429: pq: n3 required, but unavailable
--- FAIL: TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid/backup/restore_stage_17_of_19 (0.82s)
/cc @cockroachdb/sql-foundations
Jira issue: CRDB-36728
The error is n3 required, but unavailable. It happens here, when the upgrades framework uses KV node liveness to check which nodes are up: https://github.com/cockroachdb/cockroach/blob/32622e1b18030bd52529841ec1bb280a5683d5cb/pkg/upgrade/upgradecluster/nodes.go#L31-L62
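For context, the shape of that liveness-based check is roughly the following. This is a simplified, self-contained sketch rather than the actual code in nodes.go; livenessEntry and requiredNodes are hypothetical names.

```go
package main

import "fmt"

// livenessEntry is a toy stand-in for a KV node liveness record
// (hypothetical type, not the actual CRDB struct).
type livenessEntry struct {
	nodeID         int
	live           bool // heartbeat not yet expired
	decommissioned bool
}

// requiredNodes mimics the shape of the check: every known,
// non-decommissioned node must be live, otherwise the upgrade step fails
// with the "nX required, but unavailable" error seen above.
func requiredNodes(entries []livenessEntry) ([]int, error) {
	var nodes []int
	for _, e := range entries {
		if e.decommissioned {
			continue // fully decommissioned nodes are not required
		}
		if !e.live {
			return nil, fmt.Errorf("n%d required, but unavailable", e.nodeID)
		}
		nodes = append(nodes, e.nodeID)
	}
	return nodes, nil
}

func main() {
	entries := []livenessEntry{
		{nodeID: 1, live: true},
		{nodeID: 2, live: true},
		{nodeID: 3, live: false}, // n3 has no visible (or an expired) heartbeat
	}
	if _, err := requiredNodes(entries); err != nil {
		fmt.Println(err) // n3 required, but unavailable
	}
}
```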
In the CRDB logs, it looks like n3 is in fact up, but n1 doesn't think it is.
I240314 21:19:01.921826 1299534 server/node.go:533 ⋮ [T1,n3] 37325 initialized store s3
I240314 21:19:01.921855 1299534 kv/kvserver/stores.go:264 ⋮ [T1,n3] 37326 read 0 node addresses from persistent storage
I240314 21:19:01.921899 1299534 server/node.go:627 ⋮ [T1,n3] 37327 started with engine type ‹2›
I240314 21:19:01.921907 1299534 server/node.go:629 ⋮ [T1,n3] 37328 started with attributes []
I240314 21:19:01.921996 1299534 1@server/server.go:1845 ⋮ [T1,n3] 37329 starting https server at ‹127.0.0.1:44033› (use: ‹127.0.0.1:44033›)
I240314 21:19:01.922009 1299534 1@server/server.go:1850 ⋮ [T1,n3] 37330 starting postgres server at ‹127.0.0.1:38171› (use: ‹127.0.0.1:38171›)
I240314 21:19:01.922017 1299534 1@server/server.go:1853 ⋮ [T1,n3] 37331 starting grpc server at ‹127.0.0.1:41507›
I240314 21:19:01.922024 1299534 1@server/server.go:1854 ⋮ [T1,n3] 37332 advertising CockroachDB node at ‹127.0.0.1:41507›
I240314 21:19:01.970295 1302322 gossip/client.go:124 ⋮ [T1,n3] 37333 started gossip client to n0 (‹127.0.0.1:39703›)
I240314 21:19:01.970759 1302314 1@server/server.go:1683 ⋮ [T1,n3] 37334 node connected via gossip
I240314 21:19:01.970827 1300132 kv/kvserver/stores.go:283 ⋮ [T1,n3] 37335 wrote 1 node addresses to persistent storage
I240314 21:19:01.970863 1300132 kv/kvserver/stores.go:283 ⋮ [T1,n3] 37336 wrote 2 node addresses to persistent storage
I240314 21:19:01.971190 1299630 kv/kvserver/stores.go:283 ⋮ [T1,n1] 37337 wrote 2 node addresses to persistent storage
I240314 21:19:01.973313 1299881 kv/kvserver/stores.go:283 ⋮ [T1,n2] 37338 wrote 2 node addresses to persistent storage
I240314 21:19:01.996971 1302660 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37339 spanconfig-subscriber: established range feed cache
I240314 21:19:01.997327 1299534 1@util/log/event_log.go:32 ⋮ [T1,n3] 37340 ={"Timestamp":1710451141997325089,"EventType":"node_join","NodeID":3,"StartedAt":1710451141920492875,"LastUp":1710451141920492875}
I240314 21:19:01.997433 1302663 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37341 settings-watcher: established range feed cache
I240314 21:19:01.999090 1299534 sql/sqlliveness/slinstance/slinstance.go:434 ⋮ [T1,n3] 37342 starting SQL liveness instance
I240314 21:19:02.000683 1302694 sql/sqlliveness/slstorage/slstorage.go:540 ⋮ [T1,n3] 37343 inserted sqlliveness session 010180857f00f1c05947a2b8c8fdf9f7527f34
I240314 21:19:02.000711 1302694 sql/sqlliveness/slinstance/slinstance.go:258 ⋮ [T1,n3] 37344 created new SQL liveness session 010180857f00f1c05947a2b8c8fdf9f7527f34
I240314 21:19:02.000735 1299534 sql/sqlinstance/instancestorage/instancestorage.go:342 ⋮ [T1,n3] 37345 assigning instance id to rpc addr ‹127.0.0.1:41507› and sql addr ‹127.0.0.1:38171›
I240314 21:19:02.001733 1299534 server/server_sql.go:1530 ⋮ [T1,n3] 37346 bound sqlinstance: Instance{RegionPrefix: gA==, InstanceID: 3, SQLAddr: ‹127.0.0.1:38171›, RPCAddr: ‹127.0.0.1:41507›, SessionID: 010180857f00f1c05947a2b8c8fdf9f7527f34, Locality: ‹region=us-east3›, BinaryVersion: 23.1}
I240314 21:19:02.002158 1302726 sql/temporary_schema.go:486 ⋮ [T1,n3] 37347 skipping temporary object cleanup run as it is not the leaseholder
I240314 21:19:02.002178 1302726 sql/temporary_schema.go:487 ⋮ [T1,n3] 37348 completed temporary object cleanup job
I240314 21:19:02.002192 1302726 sql/temporary_schema.go:639 ⋮ [T1,n3] 37349 temporary object cleaner next scheduled to run at 2024-03-14 21:49:02.002140061 +0000 UTC m=+2242.917556323
I240314 21:19:02.002243 1302727 sql/sqlstats/persistedsqlstats/provider.go:170 ⋮ [T1,n3] 37350 starting sql-stats-worker with initial delay: 10m43.100369651s
I240314 21:19:02.002335 1299534 upgrade/upgrademanager/manager.go:170 ⋮ [T1,n3] 37351 running permanent upgrades up to version: 22.2
I240314 21:19:02.002590 1302698 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37352 system-config-cache: established range feed cache
I240314 21:19:02.015450 1299534 upgrade/upgrademanager/manager.go:238 ⋮ [T1,n3] 37353 the last permanent upgrade (v22.1-42) does not appear to have completed; attempting to run all upgrades
I240314 21:19:02.015803 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37354 running permanent upgrade for version 0.0-2
I240314 21:19:02.018029 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37355 running permanent upgrade for version 0.0-4
I240314 21:19:02.019798 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37356 running permanent upgrade for version 0.0-6
I240314 21:19:02.021752 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37357 running permanent upgrade for version 0.0-8
I240314 21:19:02.023716 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37358 running permanent upgrade for version 0.0-10
I240314 21:19:02.031511 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37359 running permanent upgrade for version 0.0-12
I240314 21:19:02.035368 1299534 upgrade/upgrademanager/manager.go:283 ⋮ [T1,n3] 37360 running permanent upgrade for version 22.1-42
I240314 21:19:02.036857 1299534 server/server_sql.go:1638 ⋮ [T1,n3] 37361 done ensuring all necessary startup migrations have run
I240314 21:19:02.037407 1303079 jobs/job_scheduler.go:407 ⋮ [T1,n3] 37362 waiting 3m0s before scheduled jobs daemon start
I240314 21:19:02.037538 1303083 server/auto_upgrade.go:43 ⋮ [T1,n3] 37363 auto upgrade disabled by testing
I240314 21:19:02.039156 1303099 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37364 tenant-settings-watcher: established range feed cache
I240314 21:19:02.039836 1303057 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 37365 tenant-capability-watcher: established range feed cache
I240314 21:19:02.041329 1303057 multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go:149 ⋮ [T1,n3] 37366 received results of a full table scan for tenant capabilities
I240314 21:19:02.047374 1303081 sql/syntheticprivilegecache/cache.go:204 ⋮ [T1,n3] 37367 warmed privileges for virtual tables in 9.940078ms
I240314 21:19:02.056691 1299534 1@server/server_sql.go:1756 ⋮ [T1,n3] 37368 serving sql connections
I240314 21:19:02.073947 1303241 upgrade/upgrademanager/manager.go:397 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37369 migrating cluster from 22.2 to 22.2-48 (stepping through [22.2-2 22.2-4 22.2-6 22.2-8 22.2-10 22.2-12 22.2-14 22.2-16 22.2-18 22.2-20 22.2-22 22.2-24 22.2-26 22.2-28 22.2-30 22.2-32 22.2-34 22.2-36 22.2-38 22.2-40 22.2-42 22.2-44 22.2-46 22.2-48])
I240314 21:19:02.074000 1303241 upgrade/upgrademanager/manager.go:657 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37370 executing operation validate-cluster-version=22.2-48
W240314 21:19:02.074168 1303241 upgrade/upgrademanager/manager.go:370 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37371 error encountered during version upgrade: n3 required, but unavailable
Checking if KV can investigate why this disagreement might occur.
Is the solution here to improve the UntilClusterStable function so it can tolerate temporary unavailability of a node? I'm not sure if that's just a band-aid we should try to avoid, or if it would be expected usage of the NodeLiveness system, so I'm hoping the KV team can advise.
https://github.com/cockroachdb/cockroach/blob/7fc5635818705c15b53965b8e7cac8f293cd0a7f/pkg/upgrade/upgradecluster/cluster.go#L74-L98
I'm confused why this series of tests started failing recently (<2 weeks ago) without related changes(?). I've attempted to reproduce for an hour with no success on 12a8659895a33f4a29900e6b9d83bd624e120216:
dev test pkg/ccl/schemachangerccl -f TestBackupMixedVersionElements/TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid -v --count=1000
Speculatively, the cause of the failure could be a race between the node joining and heartbeating its liveness, and the upgrade check.
We know from the logs that the node created a liveness record but not that it had heartbeated its liveness.
https://github.com/cockroachdb/cockroach/blob/70f8a738d1e0d0bdbe51cfdd100a01ad217d30bb/pkg/kv/kvserver/liveness/liveness.go#L584-L584
I240318 21:33:38.679735 1264692 kv/kvserver/liveness/liveness.go:617 ⋮ [T1,n1] 1974 created liveness record for n3
I'll continue to try and reproduce.
This hasn't reproduced over 2 hours. I'll try a different linked test.
In the test, the window between n3 being "fully up" and n1 checking if it is alive is very narrow. It probably published liveness a few ms earlier, but the gossip delay could explain this miss.
I240314 21:19:02.056691 1299534 1@server/server_sql.go:1756 ⋮ [T1,n3] 37368 serving sql connections
W240314 21:19:02.074168 1303241 upgrade/upgrademanager/manager.go:370 ⋮ [T1,n1,client=127.0.0.1:59982,hostssl,user=root,migration-mgr] 37371 error encountered during version upgrade: n3 required, but unavailable
These are only ~20ms apart, so it is definitely possible that n1 hadn't seen the liveness record for n3 yet. I'm actually surprised this doesn't fail more often.
It probably published liveness a few ms earlier, but the gossip delay could explain this miss.
It should be consistent since it is scanning the liveness range, as opposed to using the in-memory gossiped version?
https://github.com/cockroachdb/cockroach/blob/6790290f0e6ff613fd57c405e6d13b9befb2766f/pkg/upgrade/upgradecluster/nodes.go#L44-L44
I was assuming the issue was that the first heartbeat, after creating the record, didn't complete before the upgrade began.
Good point, it does read it from KV. However the write to KV happens here:
https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/kv/kvserver/liveness/liveness.go#L798
inside the RunAsyncTaskEx loop, so it is possible the write hadn't happened by the time the check was made. I haven't gone through this in detail, so I'm not sure when n1 decides to do the check.
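To illustrate the suspected ordering in isolation: a toy sketch (not CRDB code) where the asynchronous heartbeat write stands in for the work done inside RunAsyncTaskEx, and the immediate read stands in for the upgrade's KV scan.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// kvLiveness is a toy stand-in for the liveness range in KV.
type kvLiveness struct {
	mu   sync.Mutex
	live map[int]bool
}

func (s *kvLiveness) write(nodeID int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.live[nodeID] = true
}

func (s *kvLiveness) isLive(nodeID int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.live[nodeID]
}

func main() {
	kv := &kvLiveness{live: map[int]bool{1: true, 2: true}}

	// n3 joins: the heartbeat that persists its liveness record to KV runs
	// asynchronously (the RunAsyncTaskEx analogue), with some delay.
	go func() {
		time.Sleep(20 * time.Millisecond)
		kv.write(3)
	}()

	// The upgrade check scans KV right away and may not see n3 yet.
	if !kv.isLive(3) {
		fmt.Println("upgrade check: n3 required, but unavailable")
	}

	time.Sleep(50 * time.Millisecond)
	fmt.Println("later: n3 live =", kv.isLive(3)) // true once the async write lands
}
```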
This appears to be a testing issue afaict, where there's a race between the test cluster starting the upgrade and a node joining the cluster.
I saw the same symptom in this CI run: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/14683931?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true
=== RUN TestLogic_upgrade_skip_version
test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/d344e0c08e5dac50ac55f38c33032127/logTestLogic_upgrade_skip_version3569230637
test_log_scope.go:81: use -show-logs to present logs inline
[16:45:01] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/6688/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/cockroach-go-testserver-23.1/cockroach-go-testserver-23_1_test_/cockroach-go-testserver-23_1_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upgrade_skip_version with config cockroach-go-testserver-23.1: 1 tests, 0 failures
logic.go:4145:
/home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/6688/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/cockroach-go-testserver-23.1/cockroach-go-testserver-23_1_test_/cockroach-go-testserver-23_1_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upgrade_skip_version:10: error while processing
logic.go:4145:
/home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/6688/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/cockroach-go-testserver-23.1/cockroach-go-testserver-23_1_test_/cockroach-go-testserver-23_1_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upgrade_skip_version:10:
expected success, but found
(XXUUU) n2 required, but unavailable
nodes.go:57: in NodesFromNodeLiveness()
cockroach logs captured in: /artifacts/tmp/_tmp/d344e0c08e5dac50ac55f38c33032127/cockroach-logs28120113
panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/d344e0c08e5dac50ac55f38c33032127/logTestLogic_upgrade_skip_version3569230637
--- FAIL: TestLogic_upgrade_skip_version (62.13s)
@kvoli Can the distsender change that is causing this be reverted, or is there some setting we can temporarily disable to avoid these failures? This is failing logic tests several times per week and causing significant toil for SQL Queries, so we'd greatly appreciate any efforts to temporarily alleviate the failures.
I didn't think this was related to any distsender changes. I'll move it to the front of the line to debug.
I've tried reproducing this on a couple different tests:
No success on release-23.1 with TestPauseMigration (#119696):
dev test pkg/upgrade/upgrademanager -f TestPauseMigration -v --stress --cpus=30
...
INFO: Elapsed time: 3191.840s, Critical Path: 89.89s
INFO: 1001 processes: 1 internal, 1000 linux-sandbox.
//pkg/upgrade/upgrademanager:upgrademanager_test PASSED in 89.9s
Stats over 1000 runs: max = 89.9s, min = 32.8s, avg = 76.3s, dev = 4.7s
Executed 1 out of 1 test: 1 test passes.
INFO: Build Event Protocol files produced successfully.
INFO: Build completed successfully, 1001 total actions
Managed 1 repro on master with:
dev testlogic --config='cockroach-go-testserver-23.2' --subtests='upgrade_system_table_changes' --count=1000 --cpus=8
I've bumped some logging to determine if it's this theory:
Speculatively, the cause of the failure could be a race between the node joining and heartbeating its liveness, and the upgrade check.
In the repros I've gotten, the upgrade fails because of a heartbeat timeout. This doesn't appear related to recent changes.
W240417 18:06:16.459469 2646 upgrade/upgrademanager/manager.go:308 ⋮ [T1,Vsystem,n1,peer=‹127.0.0.1:55794›,client=127.0.0.1:55794,hostnossl,user=root,migration-mgr] 384 error encountered during version upgrade: n1 required, but unavailable status=‹NODE_STATUS_LIVE› liveness=liveness(nid:1 epo:2 exp:1713377175.860645000,0) nv=nodeID:1 connected membership:active livenessExpiration:1713377175.860645000,0 livenessEpoch:2 now:1713377176.459255000,0 descUpdateTime:1713377170.864586000,0 descUnavailableTime:0,0
W240417 18:06:15.861164 474 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 382 failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.001s (given timeout 3s): aborted in DistSender: result is ambiguous: context deadline
W240417 18:06:16.458967 314 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r2/1:‹/System/NodeLiveness{-Max}›,raft] 383 ConditionalPut [/System/NodeLiveness/2], EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/2], [txn: cbf5beb8], [can-forward-ts] has held latch for 4560555000 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.
W240417 18:06:16.459469 2646 upgrade/upgrademanager/manager.go:308 ⋮ [T1,Vsystem,n1,peer=‹127.0.0.1:55794›,client=127.0.0.1:55794,hostnossl,user=root,migration-mgr] 384 error encountered during version upgrade: n1 required, but unavailable status=‹NODE_STATUS_LIVE› liveness=liveness(nid:1 epo:2 exp:1713377175.860645000,0) nv=nodeID:1 connected membership:active livenessExpiration:1713377175.860645000,0 livenessEpoch:2 now:1713377176.459255000,0 descUpdateTime:1713377170.864586000,0 descUnavailableTime:0,0 nl=map[1:nodeID:1 connected membership:active livenessExpiration:1713377175.860645000,0 livenessEpoch:2 now:1713377176.459255000,0 descUpdateTime:1713377170.864586000,0 descUnavailableTime:0,0 2:nodeID:2 connected membership:active livenessExpiration:1713377177.838836000,0 livenessEpoch:2 now:1713377176.459260000,0 descUpdateTime:1713377166.838513000,0 descUnavailableTime:0,0 3:nodeID:3 connected membership:active livenessExpiration:1713377179.627089000,0 livenessEpoch:2 now:1713377176.459344000,0 descUpdateTime:1713377173.627188000,0 descUnavailableTime:0,0]
W240417 19:21:53.567723 1560 upgrade/upgrademanager/manager.go:308 ⋮ [T1,Vsystem,n1,peer=‹127.0.0.1:54405›,client=127.0.0.1:54405,hostnossl,user=root,migration-mgr] 392 error encountered during version upgrade: n2 required, but unavailable status=‹NODE_STATUS_DRAINING› liveness=liveness(nid:2 epo:1 exp:1713381712.189543000,0 drain:true membership:active) nv=nodeID:2 draining connected membership:active livenessExpiration:1713381712.189543000,0 livenessEpoch:1 now:1713381713.567653000,1 descUpdateTime:1713381709.324655000,0 descUnavailableTime:0,0
W240417 19:21:49.974772 302 raft/raft.go:1203 ⋮ [T1,Vsystem,n1,s1,r2/1:‹/System/NodeLiveness{-Max}›] 385 1 stepped down to follower since quorum is not active
W240417 19:21:49.974786 305 raft/raft.go:1203 ⋮ [T1,Vsystem,n1,s1,r20/1:‹/Table/1{7-8}›] 386 1 stepped down to follower since quorum is not active
I240417 19:21:49.989576 4459 13@kv/kvserver/replicate_queue.go:767 ⋮ [T1,Vsystem,n1,replicate,s1,r19/1:‹/Table/1{6-7}›] 387 error processing replica: 2 matching stores are currently throttled: [‹s3: suspect› ‹s2: draining›]
E240417 19:21:49.989615 4459 kv/kvserver/queue.go:1202 ⋮ [T1,Vsystem,n1,replicate,s1,r19/1:‹/Table/1{6-7}›] 388 2 matching stores are currently throttled: [‹s3: suspect› ‹s2: draining›]
W240417 19:21:52.018763 466 kv/kvserver/liveness/liveness.go:752 ⋮ [T1,Vsystem,n1,liveness-hb] 389 slow heartbeat took 3.000735417s; err=result is ambiguous: after 2.99s of attempting command: context deadline exceeded
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.001s (given timeout 3s): result is ambiguous: after 2.99s of attempting command: context deadline exceeded
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 +
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 +An inability to maintain liveness will prevent a node from participating in a
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 +cluster. If this problem persists, it may be a sign of resource starvation or
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 +of network connectivity problems. For help troubleshooting, visit:
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 +
W240417 19:21:52.018813 466 kv/kvserver/liveness/liveness.go:668 ⋮ [T1,Vsystem,n1,liveness-hb] 390 + https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
W240417 19:21:53.567259 302 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r2/1:‹/System/NodeLiveness{-Max}›,raft] 391 ConditionalPut [/System/NodeLiveness/3], EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/3], [txn: 696253e1], [can-forward-ts] has held latch for 5101158000 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.
W240417 19:21:53.567723 1560 upgrade/upgrademanager/manager.go:308 ⋮ [T1,Vsystem,n1,peer=‹127.0.0.1:54405›,client=127.0.0.1:54405,hostnossl,user=root,migration-mgr] 392 error encountered during version upgrade: n2 required, but unavailable status=‹NODE_STATUS_DRAINING› liveness=liveness(nid:2 epo:1 exp:1713381712.189543000,0 drain:true membership:active) nv=nodeID:2 draining connected membership:active livenessExpiration:1713381712.189543000,0 livenessEpoch:1 now:1713381713.567653000,1 descUpdateTime:1713381709.324655000,0 descUnavailableTime:0,0 nl=map[1:nodeID:1 connected membership:active livenessExpiration:1713381715.018012000,0 livenessEpoch:2 now:1713381713.567651000,0 descUpdateTime:1713381708.049323000,0 descUnavailableTime:0,0 2:nodeID:2 draining connected membership:active livenessExpiration:1713381712.189543000,0 livenessEpoch:1 now:1713381713.567653000,1 descUpdateTime:1713381709.324655000,0 descUnavailableTime:0,0 3:nodeID:3 connected membership:active livenessExpiration:1713381714.435557000,0 livenessEpoch:2 now:1713381713.567655000,0 descUpdateTime:1713381713.436042000,0 descUnavailableTime:0,0]
W240417 19:21:53.567934 466 kv/kvserver/liveness/liveness.go:752 ⋮ [T1,Vsystem,n1,liveness-hb] 393 slow heartbeat took 1.549076083s; err=<nil>
These would have passed if there were a retry loop around the check for unavailable nodes.
@kvoli just a heads up that we keep on seeing this problem elsewhere (especially in the mixed-version logic tests - see latest linked issues).
I'm out till Monday and we decided to drop this issue in favor of clearing GA blockers.
I'm still surprised this has suddenly started failing tests, without associated changes. There haven't been any updates affecting liveness here recently (or backported to 23.1).
Can we skip these tests in the meantime? The failures I've reproduced are due to overload more than anything else, where one failed heartbeat fails the test.
Unfortunately, it's not a single test that is hitting it - we'd need to disable some logic test configs. I think it's easier for now to keep closing similar-looking failures.
another flake here: https://mesolite.cluster.engflow.com/invocations/default/8210b601-304d-4189-921e-369cbbad2e49
One interesting data point is that we tend to see a cluster of these failures in separate nightlies at about the same time. Sometimes they occur at the same time on different release branches:
What hypotheses would explain this?
I'm still surprised this has suddenly started failing tests, without associated changes. There haven't been any updates affecting liveness here recently (or backported to 23.1).
It worries me that we don't have an explanation for why these failures started in mid March. How confident are we that this isn't a real issue?
Can we skip these tests in the meantime? The failures I've reproduced are due to overload more than anything else, where one failed heartbeat fails the test.
Is it expected that one failed heartbeat should fail a test?
One interesting data point is that we tend to see a cluster of these failures in separate nightlies at about the same time. Sometimes they occur at the same time on different release branches: [...] What hypotheses would explain this?
I suppose machine slowness is one explanation, but it still seems unsatisfactory that it would be so correlated unless it's always the same machine?
It worries me that we don't have an explanation for why these failures started in mid March. How confident are we that this isn't a real issue?
Mostly confident, weren't these tests only re-enabled around then?
Is it expected that one failed heartbeat should fail a test?
This is an interesting point, and something that @andrewbaptist and I talked about recently. It isn't desirable that a single failed heartbeat fails a test, imo; however, we do get some added test coverage from it (albeit small?).
@andrewbaptist we discussed two possibilities to stop these failures recurring:
- Create a test-only node liveness which doesn't fail liveness heartbeats for nodes. This would presumably need to be opt-in, since these tests are failing back to 23.1 and a backport is required.
- Wrap NodesFromNodeLiveness in a retry loop, which allows a retry when a node is marked as unavailable (a rough sketch follows below).
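A rough sketch of what option 2 could look like; nodesFromNodeLiveness here is only a stand-in for the real call, and the attempt count and backoff are placeholders, not a proposed policy:

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// nodesFromNodeLiveness is a stand-in for the real liveness scan, which can
// transiently return "nX required, but unavailable" while a node is joining
// or just missed a single heartbeat.
func nodesFromNodeLiveness(ctx context.Context) ([]int, error) {
	return nil, fmt.Errorf("n3 required, but unavailable")
}

// isTransientUnavailable decides whether the error is worth retrying.
func isTransientUnavailable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "required, but unavailable")
}

// nodesWithRetry retries the scan for a bounded number of attempts so that a
// momentary unavailability does not immediately fail the upgrade.
func nodesWithRetry(ctx context.Context, attempts int, backoff time.Duration) ([]int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		nodes, err := nodesFromNodeLiveness(ctx)
		if err == nil {
			return nodes, nil
		}
		if !isTransientUnavailable(err) {
			return nil, err
		}
		lastErr = err
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}

func main() {
	_, err := nodesWithRetry(context.Background(), 3, 10*time.Millisecond)
	fmt.Println(err) // still unavailable after the retries in this toy example
}
```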
wdyt?
As the tests are written today, many of them will fail on a single heartbeat failure, which is unfortunate. I think the code in upgradeStatus is too strict today and we should consider changing this. There could be a different, and probably better, change: either writing the version into the NodeLiveness record, getting the version directly from gossip (rather than indirectly through NodeLiveness), or removing the IsLive check on nodes (or making it less strict). I'm not sure who owns upgrade today, but there is a simple change to create a testing knob that doesn't fail liveness checks during tests that do upgrades (unless they want to test that).
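For concreteness, one hypothetical reading of "less strict" is to tolerate a heartbeat that expired within a small grace window, so a single slow or failed heartbeat doesn't flip the node to unavailable. This is purely illustrative; the fields below don't match the actual livenesspb types.

```go
package main

import (
	"fmt"
	"time"
)

// vitality is a toy stand-in for a node's liveness state.
type vitality struct {
	expiration     time.Time // when the last heartbeat's lease expires
	draining       bool
	decommissioned bool
}

// isLiveStrict mirrors the strict behavior: any expired heartbeat means not live.
func isLiveStrict(v vitality, now time.Time) bool {
	return !v.decommissioned && !v.draining && now.Before(v.expiration)
}

// isLiveRelaxed tolerates an expiration within a grace window.
func isLiveRelaxed(v vitality, now time.Time, grace time.Duration) bool {
	return !v.decommissioned && !v.draining && now.Before(v.expiration.Add(grace))
}

func main() {
	now := time.Now()
	// A heartbeat that expired half a second ago, e.g. after one slow heartbeat.
	v := vitality{expiration: now.Add(-500 * time.Millisecond)}
	fmt.Println("strict: ", isLiveStrict(v, now))                 // false: upgrade check fails
	fmt.Println("relaxed:", isLiveRelaxed(v, now, 3*time.Second)) // true: tolerated
}
```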
@mgartner do you want to chat more about this option, or is there someone else I should talk to?
@rafiss might have more of an opinion than I do about what would be most appropriate for these tests. My main concerns are that this could be a real issue (though it doesn't sound like it) and the noisiness, which is causing toil for a couple of teams.
As the tests are written today, many of them will fail on a single heartbeat failure, which is unfortunate. I think the code in upgradeStatus is too strict today and we should consider changing this.
Which change do you have in mind? I don't see the connection to upgradeStatus here. I thought the issue was in NodesFromNodeLiveness; that's where the "nX required, but unavailable" error comes from.
There could be a different, and probably better, change: either writing the version into the NodeLiveness record, getting the version directly from gossip (rather than indirectly through NodeLiveness)
I don't know the pros/cons of getting the version directly from gossip. Could you say more about what you propose? Are you talking about this call to ScanNodeVitalityFromKV? https://github.com/cockroachdb/cockroach/blob/4a52da5c0d90e40c6e1d6e2b26f4a72dd04527ef/pkg/upgrade/upgradecluster/nodes.go#L48
or removing the IsLive check on nodes (or making it less strict).
What would a less strict IsLive check look like? I'm wondering which of the conditions here you propose relaxing: https://github.com/cockroachdb/cockroach/blob/f9cd1796e6d2f8668f47ce8f809879ee27540117/pkg/kv/kvserver/liveness/livenesspb/liveness.go#L319-L347
I'm not sure who owns upgrade today, but there is a simple change to create a testing knob that doesn't fail liveness checks during tests that do upgrades (unless they want to test that).
I don't think it would be so simple to add a testing-only knob. Many of the tests that are failing do not use the TestCluster interface -- they actually start a real CRDB binary. I guess an environment variable would work, but I'm not sure if we want to introduce a test-only environment variable. I suppose I'm not fundamentally opposed to that.
@andrewbaptist I implemented a (less than desirable) retry mechanism for when there are unavailable nodes during the upgrade's UntilClusterStable function, which is what these tests are failing on. WDYT about this approach?
https://github.com/cockroachdb/cockroach/pull/124288