dtest: topology_test.TestTopology.test_crash_during_decommission got unexpected errors: raft::request_aborted (Request is aborted by a caller)
Scylla version 5.5.0~dev-0.20240219.82104b6f50f6
topology_test.TestTopology.test_crash_during_decommission got errors in node1 log:
...
INFO 2024-02-19 21:58:55,351 [shard 0: gms] gossip - Node 127.0.98.2 has restarted, now UP, status = NORMAL
ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - drain rpc failed, proceed to fence old writes: raft::request_aborted (Request is aborted by a caller)
ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - transition_state::write_both_read_old, global_token_metadata_barrier failed, error raft::request_aborted (Request is aborted by a caller)
...
test logs at the same time:
...
21:58:53,400 802 topology_test DEBUG topology_test.py :253 | test_crash_during_decommission: Restarting node2
21:58:54,406 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: Starting scylla: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.98.2', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.98.2', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=True
21:58:55,067 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.98.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.98.2', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.98.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/scylla-jmx-1.0.jar']
21:58:55,521 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: nodetool cmd=status -Dcom.scylladb.apiPort=10000 wait=True timeout=None
21:58:56,960 802 topology_test DEBUG topology_test.py :268 | test_crash_during_decommission: Status as reported by node 127.0.98.2
21:58:56,960 802 topology_test DEBUG topology_test.py :269 | test_crash_during_decommission: Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.98.1 248.86 KB 256 ? 229ab0df-41ae-4094-8951-3ea273483f62 rack1
UN 127.0.98.2 0 bytes 256 ? 51fbf454-4811-4643-9fc2-ca6303d001a3 rack1
UN 127.0.98.3 501.14 KB 256 ? e16ce75a-f599-46fb-bbec-df5caf681746 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
...
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/gating-dtest-release-with-consistent-topology-changes/25/testReport/junit/topology_test/TestTopology/FullDtest___full_split006___test_crash_during_decommission/
Cluster logs:
1708380532773_topology_test.py TestTopology test_crash_during_decommission.zip
It's an intermittent problem; the test passes in most runs. But according to @kbr-scylla there is a bigger problem here than just the exception: the test has unbounded run time.
Yes. In the logs I saw over 80 retries of the decommission operation. The test starts a decommission on one node, then kills another node in a loop and restarts it.
The killed node keeps becoming coordinator after each restart and retries the operation.
This happens about 80 times. Finally, another node (the decommissioning one) is lucky enough to become coordinator, and this time it fails the operation because it tries to RPC the killed node.
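For reference, the loop at fault looks roughly like this. This is an illustrative paraphrase using ccm-style calls from the dtest harness, not the actual test code; which node is decommissioned and the exact helper calls are assumptions:

    import threading

    # Typical dtest idiom: nodes come from the ccm cluster fixture of the test.
    node1, node2, node3 = cluster.nodelist()

    # Start a decommission of one node in the background...
    decommission = threading.Thread(target=node3.decommission)
    decommission.start()

    # ...and crash/restart another node in a loop until the decommission
    # finishes or fails. Nothing bounds this loop, which is where the
    # unbounded run time comes from in raft-topology mode.
    while decommission.is_alive():
        node2.stop(gently=False)                   # hard-kill node2
        node2.start(wait_for_binary_proto=True)    # bring it back and go around again

    decommission.join()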
It looks like the test starts a decommission and restarts some other node in a loop until the decommission fails. Why not rewrite the test to stop the node, wait for the decommission to fail, and only then start it? It would work the same way with gossiper and with raft.
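A sketch of that suggested flow, under the same assumptions as the snippet above (how the test actually observes the decommission result is an assumption here):

    import threading

    result = {}

    def run_decommission():
        try:
            node3.decommission()        # blocks until the decommission finishes
            result["failed"] = False
        except Exception:
            result["failed"] = True     # expected: decommission fails while node2 is down

    decommission = threading.Thread(target=run_decommission)
    decommission.start()

    node2.stop(gently=False)            # crash the other node once and leave it down
    decommission.join()                 # wait for the decommission to fail
    assert result["failed"]
    node2.start(wait_for_binary_proto=True)   # only then restart it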
And now it showed up on gating: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/3/artifact/logs-full.release.006/dtest-gw1.log
@enaydanov I agree with @gleb-cloudius's suggestion. IIUC this should also make the error stop appearing.
I've dug into the history of the test and found the following C* issue: https://issues.apache.org/jira/browse/CASSANDRA-10231. The original idea of the test is to reproduce a situation, discovered by a Jepsen run, where the status of a decommissioning node became null on a crashed node and stayed that way until that node was restarted.
It's not about causing the decommission to fail; it's about updating the status of the decommissioning/decommissioned node on a crashed node.
As mentioned in the C* issue by the original author of the test, reproducing this is not an easy task and took many attempts at the repro steps. That makes this test nondeterministic, and I wonder how it became part of the next_gating suite. From my PoV we should exclude it from next_gating in general, not just for the Raft topology mode.
Nice, thanks for the research. In that case, @enaydanov, we can simply limit the number of crashes of node2. Maybe to something like 10?
In gossip mode, the decommission will usually fail after 1-2 crashes anyway.
In raft-topology mode, the problem is that the coordinator retries the operation (no such retry mechanism exists in gossip mode). It just keeps retrying, and if we're (un)lucky, it may keep retrying for a very long time, until the timing works out so that we eventually crash at the right place or some other node becomes coordinator.
So let's limit the number of crashes so the test doesn't run potentially forever.
After the last crash, we can wait for the decommission to fail before restarting node2.
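Put together, the adjusted test could look roughly like this (again a sketch under the assumptions above; the limit of 10 is just the value suggested here):

    import threading

    MAX_CRASHES = 10   # bound the crash loop so the test cannot run forever

    decommission = threading.Thread(target=node3.decommission)
    decommission.start()

    for i in range(MAX_CRASHES):
        if not decommission.is_alive():
            break                                      # decommission already finished/failed
        node2.stop(gently=False)                       # crash node2
        if i < MAX_CRASHES - 1:
            node2.start(wait_for_binary_proto=True)    # restart and go around again

    # After the last crash, node2 stays down; wait for the decommission to
    # fail (or finish) and only then restart node2.
    decommission.join()
    if not node2.is_running():
        node2.start(wait_for_binary_proto=True)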
As for the request_aborted error: it might happen and it's fine if it happens, so we can ignore a pattern like raft::request_aborted (Request is aborted by a caller) in the logs.
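The exact ignore-list hook depends on the dtest framework, but whatever mechanism is used, the pattern needs escaping because it contains regex metacharacters. A minimal check of the escaped pattern against the logged line:

    import re

    # The error line we want to whitelist; escape it because '(' and ')' are
    # regex metacharacters.
    IGNORED_PATTERN = re.escape("raft::request_aborted (Request is aborted by a caller)")

    logged = ("ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - "
              "drain rpc failed, proceed to fence old writes: "
              "raft::request_aborted (Request is aborted by a caller)")
    assert re.search(IGNORED_PATTERN, logged)

That escaped pattern (or just the raft::request_aborted substring) would then go into whichever per-test list of ignored error patterns the dtest framework consults when scanning node logs after the run.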
Unassigning from myself --- test needs adjustment