scylla-cluster-tests

SLA nemeses fail because a new node does not receive the load due to long Scylla initialisation.

Open juliayakovlev opened this issue 1 year ago • 5 comments

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Argus

Nemesis: disrupt_add_remove_dc, with SLA nemeses running in parallel.

SLA nemeses fail because a new node that was added to a new DC does not receive load. This happens because initialisation takes about 30 minutes to complete. SLA nemeses depend on load on the node; if there is no load, the nemesis fails.

This may actually happen when running any nemesis that adds a new node.

The question here is how to recognise this situation and avoid failing the SLA nemesis.

Screenshot from 2023-09-04 09-53-18

Installation details

Kernel Version: 5.15.0-1040-aws

Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7 with build-id c7f9855620b984af24957d7ab0bd8054306d182e

Cluster size: 5 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-9 (52.51.53.205 | 10.4.3.119) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-8 (63.32.95.108 | 10.4.0.9) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-7 (54.194.182.240 | 10.4.2.216) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-6 (34.240.166.76 | 10.4.3.0) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-5 (54.76.52.145 | 10.4.3.71) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-4 (63.32.46.198 | 10.4.0.38) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-3 (63.32.93.249 | 10.4.2.254) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-2 (34.247.88.22 | 10.4.1.199) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-18 (3.252.204.12 | 10.4.0.209) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-17 (54.216.124.202 | 10.4.1.233) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-16 (54.75.79.189 | 10.4.1.151) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-15 (34.252.1.20 | 10.4.1.39) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-14 (34.252.206.195 | 10.4.3.92) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-13 (34.243.217.234 | 10.4.0.47) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-12 (3.248.230.240 | 10.4.1.40) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-11 (52.208.46.153 | 10.4.1.26) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-10 (34.245.175.255 | 10.4.0.49) (shards: 7)
  • longevity-sla-system-24h-reproduc-db-node-324eb9b1-1 (176.34.81.184 | 10.4.3.118) (shards: 7)

OS / Image: ami-083eb64a6e3b43cc8 (aws: undefined_region)

Test: longevity-sla-system-24h

Test id: 324eb9b1-c30c-4bd9-a183-e61227aee1cb

Test name: scylla-staging/yulia/longevity-sla-system-24h

Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 324eb9b1-c30c-4bd9-a183-e61227aee1cb
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 324eb9b1-c30c-4bd9-a183-e61227aee1cb

Logs:

No logs captured during this run.

Jenkins job URL Argus

juliayakovlev, Sep 04 '23 06:09

This may be specific to the add/remove DC nemesis: the new node gets no load because data is not replicated to it (we don't change the RF of existing keyspaces to include the new DC). Can we limit verification to nodes from datacenters where data is replicated?

soyacz, Sep 04 '23 07:09

> This may be specific to the add/remove DC nemesis: the new node gets no load because data is not replicated to it (we don't change the RF of existing keyspaces to include the new DC). Can we limit verification to nodes from datacenters where data is replicated?

How can we know where the data is replicated?

juliayakovlev, Sep 04 '23 07:09

DESCRIBE KEYSPACE should show the DC names.
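A minimal sketch of how the DC names could be pulled out of that output, assuming the keyspace uses NetworkTopologyStrategy and that the statement text has already been fetched. The function name `replicated_datacenters` is hypothetical and is not part of SCT:

```python
import ast
import re
from typing import Dict


def replicated_datacenters(create_keyspace_stmt: str) -> Dict[str, int]:
    """Return {dc_name: replication_factor} parsed from a CREATE KEYSPACE statement.

    Hypothetical helper: assumes a NetworkTopologyStrategy keyspace, where every
    key in the replication map other than 'class' is a datacenter name.
    """
    match = re.search(
        r"replication\s*=\s*(\{.*?\})",
        create_keyspace_stmt,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        raise ValueError("no replication map found in statement")

    # A CQL map literal with single-quoted strings is also a valid Python dict literal.
    options = ast.literal_eval(match.group(1))

    # 'class' and 'replication_factor' are strategy options, not datacenter names.
    reserved = {"class", "replication_factor"}
    return {
        dc: int(rf)
        for dc, rf in options.items()
        if dc not in reserved and int(rf) > 0
    }
```

A datacenter that was added without altering the keyspace is simply absent from the returned mapping. For a SimpleStrategy keyspace the result is empty, so that case would need separate handling.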

soyacz, Sep 04 '23 07:09

> DESCRIBE KEYSPACE should show the DC names.

Do you mean choosing the DC by its number (dc0, dc1)?

juliayakovlev, Sep 05 '23 07:09

For example, sdcm.utils.replication_strategy_utils.ReplicationStrategy gets the datacenters where data is replicated.
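Once the replicated datacenters are known, the load verification could be restricted to nodes in those datacenters. The sketch below illustrates the idea only: `Node` and `nodes_expected_to_have_load` are hypothetical stand-ins, not SCT classes, and the DC names are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a cluster node; not the SCT node class."""
    name: str
    datacenter: str


def nodes_expected_to_have_load(nodes: Iterable[Node], replicated_dcs: Set[str]) -> List[Node]:
    """Keep only nodes whose datacenter holds replicas of the keyspace under load."""
    return [node for node in nodes if node.datacenter in replicated_dcs]


# Example with made-up DC names: the node in the newly added DC is skipped.
cluster = [
    Node("db-node-1", "eu-west"),
    Node("db-node-2", "eu-west"),
    Node("db-node-18", "eu-west-new"),
]
to_verify = nodes_expected_to_have_load(cluster, {"eu-west"})
```

With this filter the SLA check would not fail on a node in a DC that the keyspace does not replicate to. It does not cover a node that is still initialising inside a replicated DC.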

soyacz, Sep 05 '23 08:09