scylla-cluster-tests
SLA nemeses fail because a new node does not receive the load due to long Scylla initialisation.
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Nemesis: disrupt_add_remove_dc
Run SLA nemeses in parallel.
SLA nemeses fail because a new node that was added to a new DC does not receive the load. This happens because it takes about 30 minutes for initialisation to complete. SLA nemeses depend on there being load on the node; if there is no load, the nemesis fails.
Actually, this may happen when running any nemesis that adds a new node.
The question here is how to recognise this situation and avoid failing the SLA nemesis. A possible approach is sketched below.
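For illustration only, a minimal sketch of one way to detect the situation: poll the monitoring stack until the new node actually starts serving client requests, and skip (or defer) the SLA verification if it never does. This assumes the SCT monitoring stack exposes the Prometheus HTTP API and that scylla_transport_requests_served is a usable "node serves load" signal; the endpoint, metric name, threshold and helper name are assumptions, not the actual SCT implementation.

```python
# A minimal sketch, assuming a reachable Prometheus API and that
# scylla_transport_requests_served reflects client load on the node.
# All names and thresholds here are illustrative assumptions.
import time

import requests


def wait_for_node_load(prometheus_url: str, node_ip: str,
                       timeout: int = 40 * 60, poll_interval: int = 60) -> bool:
    """Poll Prometheus until the new node starts serving CQL requests, or give up."""
    query = f'rate(scylla_transport_requests_served{{instance=~"{node_ip}.*"}}[5m])'
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{prometheus_url}/api/v1/query",
                            params={"query": query}, timeout=30)
        resp.raise_for_status()
        for series in resp.json().get("data", {}).get("result", []):
            _timestamp, value = series["value"]
            if float(value) > 0:  # some client traffic reached the node
                return True
        time.sleep(poll_interval)
    return False
```

If this returns False after the timeout, the SLA check could be skipped for that node (or retried later) instead of failing the whole nemesis.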
Installation details
Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7
with build-id c7f9855620b984af24957d7ab0bd8054306d182e
Cluster size: 5 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-9 (52.51.53.205 | 10.4.3.119) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-8 (63.32.95.108 | 10.4.0.9) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-7 (54.194.182.240 | 10.4.2.216) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-6 (34.240.166.76 | 10.4.3.0) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-5 (54.76.52.145 | 10.4.3.71) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-4 (63.32.46.198 | 10.4.0.38) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-3 (63.32.93.249 | 10.4.2.254) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-2 (34.247.88.22 | 10.4.1.199) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-18 (3.252.204.12 | 10.4.0.209) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-17 (54.216.124.202 | 10.4.1.233) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-16 (54.75.79.189 | 10.4.1.151) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-15 (34.252.1.20 | 10.4.1.39) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-14 (34.252.206.195 | 10.4.3.92) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-13 (34.243.217.234 | 10.4.0.47) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-12 (3.248.230.240 | 10.4.1.40) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-11 (52.208.46.153 | 10.4.1.26) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-10 (34.245.175.255 | 10.4.0.49) (shards: 7)
- longevity-sla-system-24h-reproduc-db-node-324eb9b1-1 (176.34.81.184 | 10.4.3.118) (shards: 7)
OS / Image: ami-083eb64a6e3b43cc8
(aws: undefined_region)
Test: longevity-sla-system-24h
Test id: 324eb9b1-c30c-4bd9-a183-e61227aee1cb
Test name: scylla-staging/yulia/longevity-sla-system-24h
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 324eb9b1-c30c-4bd9-a183-e61227aee1cb
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 324eb9b1-c30c-4bd9-a183-e61227aee1cb
Logs:
No logs captured during this run.
This may be specific to the add_remove_dc nemesis, as the new node is not getting load because data is not replicated to it (we don't change the existing keyspaces' RF to include the new DC). Can we limit the verification to nodes only from datacenters that the data is replicated to?
How can we know where the data is replicated?
DESCRIBE KEYSPACE should show the DC names.
Do you mean choosing the DC by its number, e.g. dc0, dc1?
Example: sdcm.utils.replication_strategy_utils.ReplicationStrategy gets the datacenters where data is replicated.
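For illustration, a minimal sketch of the "look at the keyspace replication" idea, assuming a cassandra-driver (python-driver) session to the cluster: for NetworkTopologyStrategy, the replication map in system_schema.keyspaces has one "DC name: RF" entry per datacenter that holds replicas. The helper below is hypothetical and is not the actual sdcm ReplicationStrategy code.

```python
# A minimal sketch, assuming a python-driver session. The helper name and
# filtering logic are illustrative assumptions.
from cassandra.cluster import Cluster


def replicated_datacenters(session, keyspace: str) -> set:
    """Return the names of the DCs that hold replicas of the given keyspace."""
    row = session.execute(
        "SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = %s",
        (keyspace,),
    ).one()
    if row is None:
        return set()
    replication = dict(row.replication)
    strategy = replication.pop("class", "")
    if "NetworkTopologyStrategy" not in strategy:
        # SimpleStrategy/LocalStrategy carry no per-DC placement information
        return set()
    # Remaining keys are DC names; values are their replication factors
    return {dc for dc, rf in replication.items() if int(rf) > 0}


# Usage: restrict SLA verification to nodes from DCs that actually hold data
# cluster = Cluster(["10.4.3.119"])
# session = cluster.connect()
# data_dcs = replicated_datacenters(session, "keyspace1")
```

With such a set of DC names, the SLA nemesis could verify only nodes belonging to datacenters that the data is actually replicated to.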