dtest: topology_test.TestTopology.test_crash_during_decommission got unexpected errors: raft::request_aborted (Request is aborted by a caller)
Scylla version 5.5.0~dev-0.20240219.82104b6f50f6
topology_test.TestTopology.test_crash_during_decommission got errors in node1 log:
...
INFO 2024-02-19 21:58:55,351 [shard 0: gms] gossip - Node 127.0.98.2 has restarted, now UP, status = NORMAL
ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - drain rpc failed, proceed to fence old writes: raft::request_aborted (Request is aborted by a caller)
ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - transition_state::write_both_read_old, global_token_metadata_barrier failed, error raft::request_aborted (Request is aborted by a caller)
...
test logs at the same time:
...
21:58:53,400 802 topology_test DEBUG topology_test.py :253 | test_crash_during_decommission: Restarting node2
21:58:54,406 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: Starting scylla: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.98.2', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.98.2', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=True
21:58:55,067 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.98.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.98.2', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.98.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-f2tc6s32/test/node2/bin/scylla-jmx-1.0.jar']
21:58:55,521 802 ccm DEBUG cluster.py :762 | test_crash_during_decommission: node2: nodetool cmd=status -Dcom.scylladb.apiPort=10000 wait=True timeout=None
21:58:56,960 802 topology_test DEBUG topology_test.py :268 | test_crash_during_decommission: Status as reported by node 127.0.98.2
21:58:56,960 802 topology_test DEBUG topology_test.py :269 | test_crash_during_decommission: Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.98.1 248.86 KB 256 ? 229ab0df-41ae-4094-8951-3ea273483f62 rack1
UN 127.0.98.2 0 bytes 256 ? 51fbf454-4811-4643-9fc2-ca6303d001a3 rack1
UN 127.0.98.3 501.14 KB 256 ? e16ce75a-f599-46fb-bbec-df5caf681746 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
...
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/gating-dtest-release-with-consistent-topology-changes/25/testReport/junit/topology_test/TestTopology/FullDtest___full_split006___test_crash_during_decommission/
Cluster logs:
1708380532773_topology_test.py TestTopology test_crash_during_decommission.zip
It's an intermittent problem; the test passes in most runs. But according to @kbr-scylla there is a bigger problem here than just the exception: the test has unbounded run time.
Yes. In the logs I saw over 80 retries of the decommission operation. The test starts a decommission on one node, then kills another node in a loop and restarts it.
The killed node keeps becoming coordinator after each restart and retries the operation.
This happens about 80 times. Finally, another node (the decommissioning one) is lucky enough to become coordinator, and this time it fails the operation because it tries to RPC the killed node.
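For reference, the loop at fault looks roughly like this. This is an illustrative paraphrase using ccm-style calls from the dtest harness, not the actual test code; which node is decommissioned and the exact helper calls are assumptions:

    import threading

    # Typical dtest idiom: nodes come from the ccm cluster fixture of the test.
    node1, node2, node3 = cluster.nodelist()

    # Start a decommission of one node in the background...
    decommission = threading.Thread(target=node3.decommission)
    decommission.start()

    # ...and crash/restart another node in a loop until the decommission
    # finishes or fails. Nothing bounds this loop, which is where the
    # unbounded run time comes from in raft-topology mode.
    while decommission.is_alive():
        node2.stop(gently=False)                   # hard-kill node2
        node2.start(wait_for_binary_proto=True)    # bring it back and go around again

    decommission.join()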
It looks like the test starts a decommission and restarts some other node in a loop until the decommission fails. Why not rewrite the test to stop the node, wait for the decommission to fail, and only then start it? It would work the same way with gossiper and with raft.
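A sketch of that suggested flow, under the same assumptions as the snippet above (how the test actually observes the decommission result is an assumption here):

    import threading

    result = {}

    def run_decommission():
        try:
            node3.decommission()        # blocks until the decommission finishes
            result["failed"] = False
        except Exception:
            result["failed"] = True     # expected: decommission fails while node2 is down

    decommission = threading.Thread(target=run_decommission)
    decommission.start()

    node2.stop(gently=False)            # crash the other node once and leave it down
    decommission.join()                 # wait for the decommission to fail
    assert result["failed"]
    node2.start(wait_for_binary_proto=True)   # only then restart it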
And now it showed up on gating: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/3/artifact/logs-full.release.006/dtest-gw1.log
@enaydanov I agree with @gleb-cloudius's suggestion. IIUC this should also make the error stop appearing.
I've dug into the history of the test and found the following C* issue: https://issues.apache.org/jira/browse/CASSANDRA-10231. The original idea of the test is to reproduce a situation, discovered by a Jepsen run, where the status of a decommissioning node became null on a crashed node and stayed that way until that node was restarted.
It's not about causing the decommission to fail; it's about updating the status of the decommissioning/decommissioned node on a crashed node.
As mentioned in the C* issue by the original author of the test, reproducing this is not an easy task and took many attempts at the repro steps. That makes this test nondeterministic, and I wonder how it became part of the next_gating suite. From my PoV we should exclude it from next_gating in general, not just for the Raft topology mode.
Nice, thanks for the research. In that case, @enaydanov, we can simply limit the number of crashes of node2. Maybe to something like 10?
In gossip mode, the decommission will usually fail after 1-2 crashes anyway.
In raft-topology mode, the problem is that the coordinator retries the operation (no such retry mechanism exists in gossip mode). It just keeps retrying, and if we're (un)lucky, it may keep retrying for a very long time, until the timing works out so that we eventually crash at the right place or some other node becomes coordinator.
So let's limit the number of crashes so the test doesn't run potentially forever.
After the last crash, we can wait for the decommission to fail before restarting node2.
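Put together, the adjusted test could look roughly like this (again a sketch under the assumptions above; the limit of 10 is just the value suggested here):

    import threading

    MAX_CRASHES = 10   # bound the crash loop so the test cannot run forever

    decommission = threading.Thread(target=node3.decommission)
    decommission.start()

    for i in range(MAX_CRASHES):
        if not decommission.is_alive():
            break                                      # decommission already finished/failed
        node2.stop(gently=False)                       # crash node2
        if i < MAX_CRASHES - 1:
            node2.start(wait_for_binary_proto=True)    # restart and go around again

    # After the last crash, node2 stays down; wait for the decommission to
    # fail (or finish) and only then restart node2.
    decommission.join()
    if not node2.is_running():
        node2.start(wait_for_binary_proto=True)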
As for the request_aborted error: it might happen and it's fine if it happens, so we can ignore a pattern like raft::request_aborted (Request is aborted by a caller) in the logs.
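The exact ignore-list hook depends on the dtest framework, but whatever mechanism is used, the pattern needs escaping because it contains regex metacharacters. A minimal check of the escaped pattern against the logged line:

    import re

    # The error line we want to whitelist; escape it because '(' and ')' are
    # regex metacharacters.
    IGNORED_PATTERN = re.escape("raft::request_aborted (Request is aborted by a caller)")

    logged = ("ERROR 2024-02-19 21:58:56,126 [shard 0:strm] raft_topology - "
              "drain rpc failed, proceed to fence old writes: "
              "raft::request_aborted (Request is aborted by a caller)")
    assert re.search(IGNORED_PATTERN, logged)

That escaped pattern (or just the raft::request_aborted substring) would then go into whichever per-test list of ignored error patterns the dtest framework consults when scanning node logs after the run.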
Unassigning from myself --- test needs adjustment