scylla-cluster-tests
Upgrading db packages stops all nodes when growing the cluster by 3 in parallel (custom db packages)
A test with custom Scylla db packages (the update_db_packages param is set).
When growing the cluster by 3 nodes in parallel, SCT stops all the nodes instead of only the added ones.
Culprit line: https://github.com/scylladb/scylla-cluster-tests/blob/066dd0231cd80ccb29e9d503c4f22cc6221a912a/sdcm/cluster.py#L4220
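For illustration only, a minimal sketch of the pattern that produces this symptom. This is not the actual sdcm/cluster.py code; the class and method names below are hypothetical. The assumption is that the package-update path iterates over the whole cluster instead of only over the nodes that were just added.

```python
# Hypothetical sketch, not the real sdcm/cluster.py implementation.
# Node, update_db_packages and the stop/start methods are illustrative names.

class Node:
    def __init__(self, name):
        self.name = name

    def stop_scylla(self):
        print(f"stopping scylla on {self.name}")

    def start_scylla(self):
        print(f"starting scylla on {self.name}")

    def install_packages(self, path):
        print(f"installing custom packages from {path} on {self.name}")


def update_db_packages(cluster_nodes, new_nodes, packages_path):
    # Buggy pattern: touching every node in the cluster, so a parallel
    # grow-by-3 takes the whole cluster down and c-s reports errors.
    for node in cluster_nodes:  # should iterate over new_nodes only
        node.stop_scylla()
        node.install_packages(packages_path)
        node.start_scylla()


if __name__ == "__main__":
    cluster = [Node(f"db-node-{i}") for i in range(1, 7)]
    added = cluster[3:]  # the 3 nodes added in parallel
    update_db_packages(cluster, added, "/tmp/custom-packages")
```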
Impact
The test fails with c-s errors because all the nodes are stopped.
How frequently does it reproduce?
Always, when growing the cluster in parallel and using custom db packages.
Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- elasticity-test-ubuntu-db-node-bc75f3a1-6 (18.202.56.4 | 10.4.1.158) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-5 (34.243.57.142 | 10.4.0.46) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-4 (34.240.37.34 | 10.4.2.211) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-3 (34.245.179.48 | 10.4.2.65) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-2 (3.254.86.25 | 10.4.0.13) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-1 (34.244.12.247 | 10.4.0.137) (shards: 7)
OS / Image: ami-0415b87a177bf40a6 (aws: undefined_region)
Test: scylla-enterprise-perf-regression-latency-650gb-elasticity
Test id: bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
Test name: scylla-staging/lukasz/scylla-enterprise-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor bc75f3a1-389f-4c3e-a84f-ef388d9bd03c - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
Logs:
- db-cluster-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/db-cluster-bc75f3a1.tar.gz
- sct-runner-events-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-runner-events-bc75f3a1.tar.gz
- sct-bc75f3a1.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-bc75f3a1.log.tar.gz
- loader-set-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/loader-set-bc75f3a1.tar.gz
- monitor-set-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/monitor-set-bc75f3a1.tar.gz
@soyacz this logic can go, we don't care about the ordering of starting nodes anymore, we can remove that if and the whole else branch
we should just start/stop the nodes that are being asked for; we shouldn't touch any other nodes at that point, it's a mistake
Yes, shouldn't be hard to fix, let's plan it for this sprint.
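A minimal sketch of the simplification suggested above, reusing the hypothetical Node class from the earlier sketch (again, not the actual SCT API): drop the start-ordering if/else entirely and only stop/start the nodes that were passed in.

```python
# Hypothetical sketch of the simplified flow discussed above; not the actual
# sdcm/cluster.py code. Only the nodes explicitly passed in are touched.

def update_db_packages(nodes, packages_path):
    for node in nodes:  # only the newly added nodes
        node.stop_scylla()
        node.install_packages(packages_path)
        node.start_scylla()
        # No start-ordering branch: the rest of the cluster is never stopped,
        # so the running c-s load keeps its quorum during a parallel grow.
```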