roachtest: separate panic node step into panic step and restart step
As previously mentioned in #147641, the panic node mutator did its work in a single step (panicking the node and then restarting it). This change splits that into two separate steps, allowing a node to stay dead for an extended period of time before being restarted in a later step.
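A simplified sketch of the before/after shape, using made-up types rather than the framework's real ones (the actual mutator lives in pkg/cmd/roachtest/roachtestutil/mixedversion/mutators.go):

package main

import "fmt"

// step is a stand-in for the framework's test step type.
type step struct{ desc string }

// Before this change: a single step both panics the node and restarts it, so
// the node is never down across any other step.
func combinedPanicRestartStep(node int) step {
	return step{desc: fmt.Sprintf("panic and restart node %d", node)}
}

// After this change: two independent steps. Whatever the planner interleaves
// between them runs while the node is down.
func panicStep(node int) step {
	return step{desc: fmt.Sprintf("panicking system interface on node %d", node)}
}

func restartStep(node int) step {
	return step{desc: fmt.Sprintf("restarting node %d after panic", node)}
}

func main() {
	before := []step{combinedPanicRestartStep(3)}
	after := []step{
		panicStep(3),
		{desc: "restart tenant server on node 1"}, // node 3 is still down here
		{desc: "set cluster setting on system tenant"},
		restartStep(3),
	}
	for _, plan := range [][]step{before, after} {
		for i, s := range plan {
			fmt.Printf("%d: %s\n", i+1, s.desc)
		}
		fmt.Println("--")
	}
}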
Pending Merge of #147641
Epic: none
Release note: none
I've run a quick analysis to look at the distribution of the following across 1,000,000 plans:
- panicked nodes
- distance between panic and restart
The raw results are telling:
go run plan_parser.go |grep panicking |awk -F" " '{print $NF}'|sort |uniq -c |sort -rn
102208 1
101781 4
101682 3
100987 2
That means each node is equally likely to be the panicked node, as expected. However, only ~40% of the plans had a panic mutator (the counts sum to 406,658 out of 1,000,000).
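Before looking at distances, here is a minimal sketch of the metric, assuming "distance" is simply the gap in step numbers between the panic step and its restart step as they appear in the rendered plan; the real parsing is done by the plan_parser.go script linked below [1], and the names here are made up:

package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	panicRE   = regexp.MustCompile(`panicking system interface on node \d+ \((\d+)\)`)
	restartRE = regexp.MustCompile(`restarting node \d+ after panic \((\d+)\)`)
)

// panicRestartDistance scans a rendered plan and returns the gap in step
// numbers between the panic step and the restart step, or -1 if either is missing.
func panicRestartDistance(plan string) int {
	var panicStep, restartStep int
	for _, line := range strings.Split(plan, "\n") {
		if m := panicRE.FindStringSubmatch(line); m != nil {
			panicStep, _ = strconv.Atoi(m[1])
		}
		if m := restartRE.FindStringSubmatch(line); m != nil {
			restartStep, _ = strconv.Atoi(m[1])
		}
	}
	if panicStep == 0 || restartStep == 0 {
		return -1
	}
	return restartStep - panicStep
}

func main() {
	plan := `
├── panicking system interface on node 3 (24)
├── restart tenant server on node 1 with binary version <current> (26)
├── restarting node 3 after panic (40)`
	fmt.Println("Distance:", panicRestartDistance(plan)) // Distance: 16, as in the example plan below
}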
Now, as to the distance distribution:
go run plan_parser.go |grep Distance |awk -F":" '{print $2}'|sort |uniq -c |sort -rn
351999 1
40245 2
8189 3
3429 4
1362 5
633 6
375 7
173 8
114 9
59 10
34 11
21 12
12 13
7 16
5 14
1 15
There is a somewhat long tail, but the majority is clustered around 1 and 2. The former is trivial since it means no other steps are interleaved; the latter is only slightly better since it interleaves just one step. Here is the CDF plot for the visually oriented:
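The same CDF can be read directly off the raw counts: the distances sum to 406,658 (matching the number of panicking plans above), so 351,999 / 406,658 ≈ 86.6% of panics have the restart immediately after them (distance 1), and (351,999 + 40,245) / 406,658 ≈ 96.5% have distance ≤ 2.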
Here is an example of a plan where the distance between the panic (at step 24) and the restart (at step 40) is the maximum observed in the above distribution, i.e., 16:
Seed: 7832459832577650981
Upgrades: v24.2.5 → v24.3.5 → v25.2.0 → <current>
Deployment mode: system-only
Mutators: preserve_downgrade_option_randomizer, cluster_setting[kv.transaction.write_buffering.enabled], cluster_setting[kv.rangefeed.buffered_sender.enabled], panic_node
Plan:
├── install fixtures for version "v25.2.0" (1)
├── start cluster at version "v25.2.0" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '25.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-seudy with binary version v25.2.0 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '25.2' on mixed-version-tenant-seudy tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-seudy tenant (6)
├── disable KV and tenant(SQL) rate limiter on mixed-version-tenant-seudy tenant (7)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (8)
├── run startup hooks concurrently
│ ├── run "maybe enable tenant features", after 10ms delay (9)
│ ├── run "load TPCC dataset", after 3s delay (10)
│ ├── run "load bank dataset", after 500ms delay (11)
│ └── set cluster setting "kv.transaction.write_buffering.enabled" to 'true' on system tenant, after 0s delay (12)
└── upgrade cluster from "v25.2.0" to "<current>"
├── upgrade storage cluster
│ ├── run following steps concurrently
│ │ ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option`, after 10ms delay (13)
│ │ └── set cluster setting "kv.rangefeed.buffered_sender.enabled" to 'false' on system tenant, after 3s delay (14)
│ ├── upgrade nodes :1-4 from "v25.2.0" to "<current>"
│ │ ├── restart system server on node 1 with binary version <current> (15)
│ │ ├── restart system server on node 2 with binary version <current> (16)
│ │ ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (17)
│ │ ├── restart system server on node 4 with binary version <current> (18)
│ │ ├── run "TPCC workload" (19)
│ │ └── restart system server on node 3 with binary version <current> (20)
│ ├── run "TPCC workload" (21)
│ ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (22)
│ └── run "check TPCC workload" (23)
└── upgrade tenant
├── upgrade nodes :1-4 from "v25.2.0" to "<current>"
│ ├── panicking system interface on node 3 (24)
│ ├── restart mixed-version-tenant-seudy server on node 3 with binary version <current> (25)
│ ├── restart mixed-version-tenant-seudy server on node 1 with binary version <current> (26)
│ ├── run following steps concurrently
│ │ ├── run "TPCC workload", after 18s delay (27)
│ │ └── reset cluster setting "kv.rangefeed.buffered_sender.enabled" on system tenant, after 50ms delay (28)
│ ├── restart mixed-version-tenant-seudy server on node 2 with binary version <current> (29)
│ └── restart mixed-version-tenant-seudy server on node 4 with binary version <current> (30)
├── downgrade nodes :1-4 from "<current>" to "v25.2.0"
│ ├── restart mixed-version-tenant-seudy server on node 3 with binary version v25.2.0 (31)
│ ├── restart mixed-version-tenant-seudy server on node 2 with binary version v25.2.0 (32)
│ ├── run following steps concurrently
│ │ ├── run "TPCC workload", after 3s delay (33)
│ │ └── set cluster setting "kv.transaction.write_buffering.enabled" to 'false' on system tenant, after 500ms delay (34)
│ ├── restart mixed-version-tenant-seudy server on node 4 with binary version v25.2.0 (35)
│ └── restart mixed-version-tenant-seudy server on node 1 with binary version v25.2.0 (36)
├── upgrade nodes :1-4 from "v25.2.0" to "<current>"
│ ├── restart mixed-version-tenant-seudy server on node 3 with binary version <current> (37)
│ ├── restart mixed-version-tenant-seudy server on node 4 with binary version <current> (38)
│ ├── restart mixed-version-tenant-seudy server on node 1 with binary version <current> (39)
│ ├── restarting node 3 after panic (40)
│ ├── run "TPCC workload" (41)
│ └── restart mixed-version-tenant-seudy server on node 2 with binary version <current> (42)
├── run following steps concurrently
│ ├── set cluster setting "kv.rangefeed.buffered_sender.enabled" to 'true' on system tenant, after 3s delay (43)
│ ├── set `version` to <current> on mixed-version-tenant-seudy tenant, after 500ms delay (44)
│ └── run "TPCC workload", after 18s delay (45)
├── wait for all nodes (:1-4) to acknowledge cluster version <current> on mixed-version-tenant-seudy tenant (46)
└── run "check TPCC workload" (47)
The above analysis can be easily reproduced via [1].
[1] https://gist.github.com/srosenberg/f194d7044466ba4ade1c56129799c78e
I think the maximum distance plan Stan sent reveals a bug:
https://github.com/cockroachdb/cockroach/blob/f51bd2d3d999249d7911f574a7d03ac49fe3d214/pkg/cmd/roachtest/roachtestutil/mixedversion/mutators.go#L421-L434
`_, runHook := s.impl.(runHookStep)` should mean that we restart before running any user hooks. However, we can see a user hook being run at step 27. Perhaps this isn't being caught because it's being run concurrently?
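For illustration, here is a self-contained sketch of why that check could miss a hook nested inside a concurrent step; the types are simplified stand-ins, not the mixedversion package's real ones:

package main

import "fmt"

// singleStep is a simplified stand-in for the planner's step interface.
type singleStep interface{ Description() string }

// runHookStep represents a user-provided hook (e.g. a workload).
type runHookStep struct{ name string }

func (r runHookStep) Description() string { return "run " + r.name }

// clusterSettingStep represents a mutator-inserted step.
type clusterSettingStep struct{ desc string }

func (c clusterSettingStep) Description() string { return c.desc }

// concurrentStep wraps several steps that run at the same time.
type concurrentStep struct{ steps []singleStep }

func (c concurrentStep) Description() string { return "run following steps concurrently" }

// isUserHook mirrors the shape of the check above: a type assertion against
// the step itself, which only sees the wrapper.
func isUserHook(s singleStep) bool {
	_, runHook := s.(runHookStep)
	return runHook
}

func main() {
	// Steps 27 and 28 from the example plan are nested inside a concurrent step.
	wrapped := concurrentStep{steps: []singleStep{
		runHookStep{name: "TPCC workload"},
		clusterSettingStep{desc: "reset cluster setting on system tenant"},
	}}
	fmt.Println(isUserHook(wrapped)) // false, even though a user hook is inside
}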
There is a somewhat long tail but majority is clustered around 1 and 2.
I think this makes sense if we think about what can actually run in between our panic and restart: other mutator steps and separate-process rolling restarts. Short of removing the no-user-hook condition (which seems out of scope), I'm not sure how we can improve this number.
We could favor taking the latest possible restart step instead of a random one:
https://github.com/cockroachdb/cockroach/blob/f51bd2d3d999249d7911f574a7d03ac49fe3d214/pkg/cmd/roachtest/roachtestutil/mixedversion/mutators.go#L450-L452
But I suspect that, with the above-mentioned bug fixed, we'd still cap out at ~8 steps and remain clustered around the 1-3 step mark.
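For concreteness, a sketch of that suggestion under the assumption that the mutator currently picks one of the valid insertion indices at random (variable names are hypothetical):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Hypothetical indices in a generated plan where the restart step could
	// legally be inserted (after the panic, before any incompatible step).
	validIdxs := []int{25, 26, 28, 31}

	rng := rand.New(rand.NewSource(1))

	// Roughly the current behavior: pick one of the valid insertion points at random.
	randomIdx := validIdxs[rng.Intn(len(validIdxs))]

	// Suggested alternative: always take the latest valid insertion point,
	// maximizing how long the node stays down.
	latestIdx := validIdxs[len(validIdxs)-1]

	fmt.Println("random:", randomIdx, "latest:", latestIdx)
}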
I also looked at the distribution of interleaved steps. These are all the unique interleaved steps (with node numbers, versions, tenant names, and delays stripped), along with their occurrence counts:
14217 restart tenant server on node with binary version <current>
13585 prevent auto-upgrades on system tenant by setting `preserve_downgrade_option`
12546 allow upgrade to happen on system tenant by resetting `preserve_downgrade_option`
8781 restart tenant server on node with binary version v
6483 prevent auto-upgrades on tenant tenant by setting `preserve_downgrade_option`
5995 after delay
5921 set cluster setting to true on system tenant
5902 set cluster setting to false on system tenant
4265 reset cluster setting on system tenant
3970 allow upgrade to happen on tenant tenant by resetting `preserve_downgrade_option`
1588 run TPCC workload
1224 set `version` to <current> on tenant tenant
1034 set cluster setting to zstd on system tenant
1004 set cluster setting to snappy on system tenant
280 wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant
176 wait for all nodes (:1-4) to acknowledge cluster version <current> on tenant tenant
169 wait for all nodes (:1-4) to acknowledge cluster version on system tenant
123 set `version` to on tenant tenant
47 wait for all nodes (:1-4) to acknowledge cluster version on tenant tenant
11 run check TPCC workload
As Darryl already pointed out, `run` (user hook) steps show up, but so do `wait for all nodes` steps.
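The step names in that list are bucketed after stripping node numbers, version strings, tenant names, and delays; a rough sketch of that normalization, assuming simple regex replacement (the real logic is in the gist's plan_parser.go and differs in detail):

package main

import (
	"fmt"
	"regexp"
)

var (
	stepNumRE = regexp.MustCompile(`\s*\(\d+\)\s*$`)           // trailing "(32)"
	nodeRE    = regexp.MustCompile(`node \d+`)                 // "node 2" -> "node"
	versionRE = regexp.MustCompile(`v\d+\.\d+\.\d+`)           // "v25.2.0" -> "v"
	tenantRE  = regexp.MustCompile(`mixed-version-tenant-\w+`) // random tenant name -> "tenant"
	delayRE   = regexp.MustCompile(`after \d+\S* delay`)       // "after 18s delay" -> "after delay"
)

// normalize collapses a rendered step description into a bucket name so that
// occurrences can be counted across plans.
func normalize(desc string) string {
	desc = stepNumRE.ReplaceAllString(desc, "")
	desc = nodeRE.ReplaceAllString(desc, "node")
	desc = versionRE.ReplaceAllString(desc, "v")
	desc = tenantRE.ReplaceAllString(desc, "tenant")
	desc = delayRE.ReplaceAllString(desc, "after delay")
	return desc
}

func main() {
	fmt.Println(normalize("restart mixed-version-tenant-seudy server on node 2 with binary version v25.2.0 (32)"))
	// Output: restart tenant server on node with binary version v
}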
Here is a newly generated distribution after resolving the issue of inserting panics after incompatible steps.
Raw results for panicking are still as expected, with roughly the same number of panics on each node:
go run plan_parser.go |grep panicking |awk -F" " '{print $NF}'|sort |uniq -c |sort -rn
18553 1
18447 4
18315 3
18232 2
The distance distribution is still closely clustered around 1 and 2 (as expected), with some interleaving of up to ~10 steps when accounting for multiple restarts on the tenant as well as other compatible steps (e.g., setting a cluster setting):
go run plan_parser.go |grep Distance |awk -F":" '{print $2}'|sort |uniq -c |sort -rn
60657 1
10481 2
1675 3
450 4
164 5
76 6
17 7
15 8
6 10
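Working the same percentages out of the new counts: the listed distances sum to roughly 73.5k panicking plans, of which 60,657 (≈ 82%) are at distance 1 and ≈ 97% are at distance ≤ 2, compared with ≈ 86.6% and ≈ 96.5% in the earlier run.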
bors r=DarrylWong