scylla-cluster-tests SLA nemeses fail because a new node does not receive the load due to long Scylla initialisation.

Issue description

[ ] This issue is a regression.
[ ] It is unknown if this issue is a regression.

Nemesis: disrupt_add_remove_dc, run in parallel with the SLA nemeses.

The SLA nemeses fail because a new node that was added to a new DC does not receive load. This happens because the node takes about 30 minutes to complete initialisation. The SLA nemeses depend on load being present on the node; if there is no load, the nemesis fails.

In fact, this can happen when running any nemesis that adds a new node.

The question is how to recognise this situation and avoid failing the SLA nemesis.

Screenshot from 2023-09-04 09-53-18

Installation details

Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7 with build-id c7f9855620b984af24957d7ab0bd8054306d182e

Cluster size: 5 nodes (i3.2xlarge)

Scylla Nodes used in this run:

longevity-sla-system-24h-reproduc-db-node-324eb9b1-9 (52.51.53.205 | 10.4.3.119) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-8 (63.32.95.108 | 10.4.0.9) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-7 (54.194.182.240 | 10.4.2.216) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-6 (34.240.166.76 | 10.4.3.0) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-5 (54.76.52.145 | 10.4.3.71) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-4 (63.32.46.198 | 10.4.0.38) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-3 (63.32.93.249 | 10.4.2.254) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-2 (34.247.88.22 | 10.4.1.199) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-18 (3.252.204.12 | 10.4.0.209) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-17 (54.216.124.202 | 10.4.1.233) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-16 (54.75.79.189 | 10.4.1.151) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-15 (34.252.1.20 | 10.4.1.39) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-14 (34.252.206.195 | 10.4.3.92) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-13 (34.243.217.234 | 10.4.0.47) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-12 (3.248.230.240 | 10.4.1.40) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-11 (52.208.46.153 | 10.4.1.26) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-10 (34.245.175.255 | 10.4.0.49) (shards: 7)
longevity-sla-system-24h-reproduc-db-node-324eb9b1-1 (176.34.81.184 | 10.4.3.118) (shards: 7)

OS / Image: ami-083eb64a6e3b43cc8 (aws: undefined_region)

Test: longevity-sla-system-24h
Test id: 324eb9b1-c30c-4bd9-a183-e61227aee1cb
Test name: scylla-staging/yulia/longevity-sla-system-24h
Test config file(s):

longevity-sla-system-24h.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 324eb9b1-c30c-4bd9-a183-e61227aee1cb
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 324eb9b1-c30c-4bd9-a183-e61227aee1cb

Logs:

No logs captured during this run.

Jenkins job URL Argus

Sep 04 '23 06:09 juliayakovlev

This may be specific to the add/remove DC nemesis: the new node gets no load because data is not replicated to it (we don't change the RF of existing keyspaces to include the new DC). Can we limit verification to nodes in datacenters that the data is replicated to?

Sep 04 '23 07:09 soyacz

This may be specific to the add/remove DC nemesis: the new node gets no load because data is not replicated to it (we don't change the RF of existing keyspaces to include the new DC). Can we limit verification to nodes in datacenters that the data is replicated to?

How can we know where the data is replicated?

Sep 04 '23 07:09 juliayakovlev

DESCRIBE KEYSPACE should show the DC names.

Sep 04 '23 07:09 soyacz

DESCRIBE KEYSPACE should show the DC names.

Do you mean choosing the DC by its number (dc0, dc1)?

Sep 05 '23 07:09 juliayakovlev

Example: sdcm.utils.replication_strategy_utils.ReplicationStrategy gets the datacenters where data is replicated.

Sep 05 '23 08:09 soyacz
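
The idea discussed above (verify load only on nodes in datacenters that a keyspace is replicated to) could look roughly like the sketch below. It does not use the real sdcm ReplicationStrategy API; the helper names, node names, and DC names are hypothetical. It only assumes the replication map has the shape stored in system_schema.keyspaces: a "class" entry plus one "<dc name>": "<RF>" entry per datacenter for NetworkTopologyStrategy, with plain integer RF values.

```python
from typing import Dict, List, Optional, Set, Tuple

# Keys of the replication map that are not datacenter names.
_NON_DC_KEYS = {"class", "replication_factor"}


def replicated_datacenters(replication: Dict[str, str]) -> Optional[Set[str]]:
    """Return the DCs with RF > 0, or None if the strategy is not DC-aware.

    `replication` is assumed to look like the map in system_schema.keyspaces.
    """
    if not replication.get("class", "").endswith("NetworkTopologyStrategy"):
        # e.g. SimpleStrategy: no per-DC information, so no filtering is possible.
        return None
    return {
        dc for dc, rf in replication.items()
        if dc not in _NON_DC_KEYS and int(rf) > 0
    }


def nodes_to_verify(nodes: List[Tuple[str, str]],
                    replication: Dict[str, str]) -> List[str]:
    """Keep only nodes whose DC holds replicas of the keyspace.

    `nodes` is a list of (node name, DC name) pairs (hypothetical shape).
    """
    dcs = replicated_datacenters(replication)
    return [name for name, dc in nodes if dcs is None or dc in dcs]


if __name__ == "__main__":
    # Hypothetical keyspace replicated only to the original DC.
    replication = {
        "class": "org.apache.cassandra.locator.NetworkTopologyStrategy",
        "eu-west": "3",
    }
    nodes = [("db-node-1", "eu-west"), ("db-node-18", "new-dc")]
    print(nodes_to_verify(nodes, replication))
```

With this filter, a node in a DC that was just added by the nemesis would be skipped by the SLA load check until some keyspace's RF is extended to that DC.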
