scylla-cluster-tests
Upgrading db packages stops all nodes when growing the cluster by 3 in parallel (custom db packages)
A test with custom Scylla db packages (the update_db_packages param is set).
When growing the cluster by 3 nodes in parallel, SCT stops all the nodes instead of only the added ones.
Culprit line: https://github.com/scylladb/scylla-cluster-tests/blob/066dd0231cd80ccb29e9d503c4f22cc6221a912a/sdcm/cluster.py#L4220
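For illustration only, a minimal sketch of the pattern that produces this symptom. This is not the actual sdcm/cluster.py code; the class and method names below are hypothetical. The assumption is that the package-update path iterates over the whole cluster instead of only over the nodes that were just added.

```python
# Hypothetical sketch, not the real sdcm/cluster.py implementation.
# Node, update_db_packages and the stop/start methods are illustrative names.

class Node:
    def __init__(self, name):
        self.name = name

    def stop_scylla(self):
        print(f"stopping scylla on {self.name}")

    def start_scylla(self):
        print(f"starting scylla on {self.name}")

    def install_packages(self, path):
        print(f"installing custom packages from {path} on {self.name}")


def update_db_packages(cluster_nodes, new_nodes, packages_path):
    # Buggy pattern: touching every node in the cluster, so a parallel
    # grow-by-3 takes the whole cluster down and c-s reports errors.
    for node in cluster_nodes:  # should iterate over new_nodes only
        node.stop_scylla()
        node.install_packages(packages_path)
        node.start_scylla()


if __name__ == "__main__":
    cluster = [Node(f"db-node-{i}") for i in range(1, 7)]
    added = cluster[3:]  # the 3 nodes added in parallel
    update_db_packages(cluster, added, "/tmp/custom-packages")
```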
Impact
The test fails with c-s errors because all the nodes are stopped.
How frequently does it reproduce?
Always, when growing the cluster in parallel and using custom db packages.
Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- elasticity-test-ubuntu-db-node-bc75f3a1-6 (18.202.56.4 | 10.4.1.158) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-5 (34.243.57.142 | 10.4.0.46) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-4 (34.240.37.34 | 10.4.2.211) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-3 (34.245.179.48 | 10.4.2.65) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-2 (3.254.86.25 | 10.4.0.13) (shards: 7)
- elasticity-test-ubuntu-db-node-bc75f3a1-1 (34.244.12.247 | 10.4.0.137) (shards: 7)
OS / Image: ami-0415b87a177bf40a6 (aws: undefined_region)
Test: scylla-enterprise-perf-regression-latency-650gb-elasticity
Test id: bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
Test name: scylla-staging/lukasz/scylla-enterprise-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor bc75f3a1-389f-4c3e-a84f-ef388d9bd03c - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
Logs:
- db-cluster-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/db-cluster-bc75f3a1.tar.gz
- sct-runner-events-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-runner-events-bc75f3a1.tar.gz
- sct-bc75f3a1.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-bc75f3a1.log.tar.gz
- loader-set-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/loader-set-bc75f3a1.tar.gz
- monitor-set-bc75f3a1.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/monitor-set-bc75f3a1.tar.gz
@soyacz this logic can go, we don't care about the ordering of starting nodes anymore, we can remove that if and the whole else branch
we should just start/stop the nodes that are being asked for; we shouldn't touch any other nodes at that point, it's a mistake
Yes, shouldn't be hard to fix, let's plan it for this sprint.
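A minimal sketch of the simplification suggested above, reusing the hypothetical Node class from the earlier sketch (again, not the actual SCT API): drop the start-ordering if/else entirely and only stop/start the nodes that were passed in.

```python
# Hypothetical sketch of the simplified flow discussed above; not the actual
# sdcm/cluster.py code. Only the nodes explicitly passed in are touched.

def update_db_packages(nodes, packages_path):
    for node in nodes:  # only the newly added nodes
        node.stop_scylla()
        node.install_packages(packages_path)
        node.start_scylla()
        # No start-ordering branch: the rest of the cluster is never stopped,
        # so the running c-s load keeps its quorum during a parallel grow.
```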