scylla-cluster-tests
scylla-cluster-tests copied to clipboard
After error: sdcm.remote.base.RetryableNetworkException next command on with same remoter and ssh session could be not executed on node
Issue description
- [x] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Node3 was decommissioned with Decommission_seed_node
nemesis and new node was added node11. But during cleanup
on node11 issue: https://github.com/scylladb/scylladb/issues/14004 happened with network exception. In several nemesis, node11 was choosen for new nemesis disrupt_restart_with_resharding
. First steps of nemesis are stop scylla update scylla.yaml and start scylla. And according to sct.log, command was executed succesfully on node11.
Sct reported that command was executed on node11:
< t:2024-05-12 05:23:06,708 f:cluster.py l:635 c:sdcm.cluster_aws p:DEBUG > Node longevity-parallel-topology-schema--db-node-2a3bb187-11 [34.244.68.23 | 10.4.10.153]: CPUSET on node longevity-parallel-topology-schema--db-node-2a3bb187-11: CPUSET="--cpuset 1-7 "
< t:2024-05-12 05:23:06,974 f:nemesis.py l:1001 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restart node with resharding. New murmur3_partitioner_ignore_msb_bits value: 15
< t:2024-05-12 05:23:06,974 f:decorators.py l:118 c:sdcm.utils.decorators p:DEBUG > BEGIN: stop_scylla <AWSNode>
< t:2024-05-12 05:23:06,974 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Running command "sudo systemctl stop scylla-server.service"...
< t:2024-05-12 05:23:55,490 f:base.py l:143 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2024-05-12 05:24:00,261 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Running command "sudo systemctl start scylla-server.service"...
< t:2024-05-12 05:24:44,815 f:base.py l:143 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Command "sudo systemctl start scylla-server.service" finished with status 0
But this command is not logged and scylla was not restarted :
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] migration_manager - Gossiping my schema version 82c3b60c-101e-11ef-df50-1b54cd8053fe
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] schema_tables - Schema version changed to 82c3b60c-101e-11ef-df50-1b54cd8053fe
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] schema_tables - Altering scylla_bench.test id=0ef48140-0fef-11ef-a1d9-169281dba5bc version=83986280-101e-11ef-bbd9-8728581320a2
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 5:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 1:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 6:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 2:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 4:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] migration_manager - Gossiping my schema version 83987c20-101e-11ef-930b-cceb725bebf8
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] schema_tables - Schema version changed to 83987c20-101e-11ef-930b-cceb725bebf8
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:comp] compaction - [Compact keyspace1.standard1 7d2d5180-101e-11ef-b3fd-aed309b4fe72] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ejw_47j8w2qjd05txmp7xu-big-Data.db:level=0]. 1GB to 1GB (~86% of original) in 10905ms = 135MB/s. ~1052672 total partitions merged to 905165.
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:comp] compaction - [Compact keyspace1.standard1 83ae8290-101e-11ef-b3fd-aed309b4fe72] Compacting [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ejw_47j8w2qjd05txmp7xu-big-Data.db:level=0:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0eeq_4jbmo2qjd05txmp7xu-big-Data.db:level=0:origin=cleanup]
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 6:strm] compaction_manager - Starting off-strategy compaction for keyspace1.standard1 compaction_group=0, 469 candidates were found
2024-05-12T05:23:23.582+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 2:strm] compaction_manager - Starting off-strategy compaction for keyspace1.standard1 compaction_group=0, 486 candidates were found
2024-05-12T05:24:23.585+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 5:strm] compaction - [Reshape keyspace1.standard1 8a7ce940-101e-11ef-a596-aed409b4fe72] Reshaped 484 sstables to [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ekj_07xhs2inife2kgzt6a-big-Data.db:level=0]. 5MB to 2MB (~48% of original) in 153ms = 34MB/s. ~61952 total partitions merged to 1720.
2024-05-12T05:24:23.585+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 5:strm] compaction_manager - Done with off-strategy compaction for keyspace1.standard1 compaction_group=0
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 0:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 5:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 6:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 2:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11 !INFO | scylla[6772]: [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
The possible reason is node11 was lost connection due to issue: https://github.com/scylladb/scylladb/issues/14004 during previous disruptive nemesis and after that ssh connection was not reconnected properly.
Steps to Reproduce
- Decommission seed with possible networkexception like in issue https://github.com/scylladb/scylladb/issues/14004
- Run disrupt_restart_with_resharding on new node after step1
Expected behavior: [What you expected to happen] disrupt_restart_with_resharding should pass
Actual behavior: [What actually happened]
Failed with error: Exception: Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details
because command for scylla restart was not exectued on node.
Impact
Nemesis disrupt_restart_with_resharding
failed because scylla was not restart
How frequently does it reproduce?
rarely
Installation details
SCT Version: master:latest Scylla version (or git commit hash): Scylla version 5.5.0~dev-0.20240510.28791aa2c1d3
Logs
- test_id: 2a3bb187-5efb-4f8d-87a8-9508cb0a56cf
- Argus: https://argus.scylladb.com/test/e17ff39a-f3f6-4ac6-a161-871267fd3955/runs?additionalRuns[]=2a3bb187-5efb-4f8d-87a8-9508cb0a56cf