scylla-cluster-tests

Upgrade db packages stops all nodes when growing cluster by 3 in parallel (custom db packages)

Open soyacz opened this issue 1 year ago • 2 comments

The test uses custom Scylla db packages (the update_db_packages param is set). When growing the cluster by 3 nodes in parallel, SCT stops all the nodes instead of only the added ones. Culprit line: https://github.com/scylladb/scylla-cluster-tests/blob/066dd0231cd80ccb29e9d503c4f22cc6221a912a/sdcm/cluster.py#L4220

Impact

The test fails due to c-s errors that occur when all the nodes are stopped.

How frequently does it reproduce?

Always when growing in parallel and using custom db packages.

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • elasticity-test-ubuntu-db-node-bc75f3a1-6 (18.202.56.4 | 10.4.1.158) (shards: 7)
  • elasticity-test-ubuntu-db-node-bc75f3a1-5 (34.243.57.142 | 10.4.0.46) (shards: 7)
  • elasticity-test-ubuntu-db-node-bc75f3a1-4 (34.240.37.34 | 10.4.2.211) (shards: 7)
  • elasticity-test-ubuntu-db-node-bc75f3a1-3 (34.245.179.48 | 10.4.2.65) (shards: 7)
  • elasticity-test-ubuntu-db-node-bc75f3a1-2 (3.254.86.25 | 10.4.0.13) (shards: 7)
  • elasticity-test-ubuntu-db-node-bc75f3a1-1 (34.244.12.247 | 10.4.0.137) (shards: 7)

OS / Image: ami-0415b87a177bf40a6 (aws: undefined_region)

Test: scylla-enterprise-perf-regression-latency-650gb-elasticity

Test id: bc75f3a1-389f-4c3e-a84f-ef388d9bd03c

Test name: scylla-staging/lukasz/scylla-enterprise-perf-regression-latency-650gb-elasticity

Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis

Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs bc75f3a1-389f-4c3e-a84f-ef388d9bd03c

Logs:

Jenkins job URL Argus

soyacz avatar Sep 04 '24 06:09 soyacz

@soyacz this logic can go. We don't care about the ordering of starting nodes anymore, so we can remove that if and the whole else branch.

We should just stop/start the nodes that are being asked for; we shouldn't touch any other nodes at that point, it's a mistake.

fruch avatar Sep 08 '24 22:09 fruch

Yes, shouldn't be hard to fix, let's plan it for this sprint.

soyacz avatar Sep 09 '24 06:09 soyacz
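
The fix discussed above (act only on the nodes that were requested, and leave the rest of the cluster alone) can be sketched roughly as follows. This is a minimal illustration, not the actual SCT code: `FakeNode`, `install_packages` and `update_db_packages` here are hypothetical stand-ins, and the real logic in sdcm/cluster.py may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in for a cluster node; not the SCT node class."""
    name: str
    running: bool = True
    events: list = field(default_factory=list)

    def stop(self):
        self.running = False
        self.events.append("stop")

    def install_packages(self):
        # Placeholder for installing the custom db packages.
        self.events.append("install")

    def start(self):
        self.running = True
        self.events.append("start")


def update_db_packages(requested_nodes):
    """Stop, update and restart only the nodes passed in.

    No branch here reaches for the full node list, so nodes that are
    already serving traffic are never stopped.
    """
    for node in requested_nodes:
        node.stop()
        node.install_packages()
        node.start()


# Three existing nodes, three nodes added in parallel (as in this issue).
existing = [FakeNode(f"node-{i}") for i in range(1, 4)]
added = [FakeNode(f"node-{i}") for i in range(4, 7)]

update_db_packages(added)

untouched = all(node.events == [] for node in existing)
updated = all(node.events == ["stop", "install", "start"] for node in added)
```

In this sketch the existing nodes record no events at all, which is the property the issue asks for: the load generator keeps talking to a cluster whose original nodes never went down.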