scylla-cluster-tests
SLA - Scheduler runtime validation failed on a node that was unbootstrapped - how to filter out such a node and not validate it?
Scheduler runtime validation failed on a node that was unbootstrapped (during decommission). How can we filter out such nodes and skip validating them there?
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-08-13 13:39:24.122: (TestStepEvent Severity.ERROR) period_type=end event_id=9f5ca403-e5a8-4fc2-91e3-51a6307f43cf
during_nemesis=BootstrapStreamingError,ReplaceServiceLevelUsingDropDuringLoad duration=14m12s: step=Attach service level 'sl800_ac323a2a' with 800 shares to role250_ac323a2a. Validate scheduler runtime during load errors=
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Probably the issue https://github.com/scylladb/scylla-enterprise/issues/2572
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > (Node 10.0.3.223) - Service level sl:sl800_ac323a2a did not get resources unexpectedly. CPU%: 94.79. Runtime per service level group:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl800_ac323a2a (shares 800): 0.0
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl500_ac323a2a (shares 500): 237.13
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Traceback (most recent call last):
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 126, in attach_sl_and_validate_scheduler_runtime
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > self.validate_scheduler_runtime(start_time=start_time,
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 225, in validate_scheduler_runtime
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > raise SchedulerRuntimeUnexpectedValue("".join(result))
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sdcm.sla.libs.sla_utils.SchedulerRuntimeUnexpectedValue:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Probably the issue https://github.com/scylladb/scylla-enterprise/issues/2572
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > (Node 10.0.3.223) - Service level sl:sl800_ac323a2a did not get resources unexpectedly. CPU%: 94.79. Runtime per service level group:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl800_ac323a2a (shares 800): 0.0
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl500_ac323a2a (shares 500): 237.13
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Installation details
Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90
with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-sla-system-24h-master-db-node-0ceb3cea-9 (13.48.48.99 | 10.0.0.37) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-8 (16.171.242.252 | 10.0.3.241) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-7 (16.171.236.88 | 10.0.2.34) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-6 (13.48.70.144 | 10.0.3.223) (shards: -1)
- longevity-sla-system-24h-master-db-node-0ceb3cea-5 (16.16.25.196 | 10.0.3.78) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-4 (13.51.13.172 | 10.0.2.132) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-3 (13.51.13.71 | 10.0.2.178) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-2 (13.53.216.241 | 10.0.0.65) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-10 (13.48.70.54 | 10.0.2.0) (shards: -1)
- longevity-sla-system-24h-master-db-node-0ceb3cea-1 (16.16.185.155 | 10.0.0.96) (shards: 7)
OS / Image: ami-0ce59e86771bcb0ef
(aws: undefined_region)
Test: longevity-sla-system-24h
Test id: 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
Test name: enterprise-2022.2/Reproducers/longevity-sla-system-24h
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
Logs:
- db-cluster-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/db-cluster-0ceb3cea.tar.gz
- sct-runner-events-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/sct-runner-events-0ceb3cea.tar.gz
- sct-0ceb3cea.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/sct-0ceb3cea.log.tar.gz
- loader-set-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/loader-set-0ceb3cea.tar.gz
- monitor-set-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/monitor-set-0ceb3cea.tar.gz
@juliayakovlev maybe add a proper EventFilter to decommission (similar to what we do in DropIndex)?
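For reference, a minimal sketch of that idea, assuming the event filters from sdcm.sct_events.filters are wrapped around the decommission step as context managers; the event class, regex and decommission call below are placeholders, not necessarily what DropIndex uses:

```python
# Sketch only: demote matching SLA validation errors to WARNING while the
# decommission is in progress, so they don't fail the test step.
from sdcm.sct_events import Severity
from sdcm.sct_events.filters import EventsSeverityChangerFilter
# Assumed event class - whichever event the scheduler-runtime validation emits.
from sdcm.sct_events.workload_prioritisation import WorkloadPrioritisationEvent


def decommission_with_filter(nemesis, node):
    with EventsSeverityChangerFilter(
            new_severity=Severity.WARNING,
            event_class=WorkloadPrioritisationEvent,
            regex=".*did not get resources unexpectedly.*",
            extra_time_to_expiration=60):
        nemesis.cluster.decommission(node)  # assumed decommission call
```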
@juliayakovlev just exclude the nodes that have running_nemesis set, for the check (adding back the target_node)
That won't give 100% coverage. If the SLA nemesis starts during a decommission (or similar) nemesis and finishes after it, we won't know about that nemesis, because the validation is performed at the end.
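As a rough sketch of the exclusion approach, assuming SCT nodes expose a running_nemesis attribute (the helper name and the validation call in the comment are illustrative):

```python
# Sketch only: pick the nodes to run the scheduler-runtime validation on,
# skipping any node that is currently a nemesis target (e.g. being decommissioned).
def nodes_to_validate(cluster, target_node=None):
    nodes = [node for node in cluster.nodes if not node.running_nemesis]
    # Add the SLA nemesis' own target node back if it was filtered out.
    if target_node is not None and target_node not in nodes:
        nodes.append(target_node)
    return nodes


# Illustrative usage inside the SLA test:
# for node in nodes_to_validate(self.db_cluster, target_node=self.target_node):
#     self.validate_scheduler_runtime(start_time=start_time, node=node, ...)
```

Note this still has the race described above: a node whose disruptive nemesis finished before the validation runs would no longer have running_nemesis set, yet its runtime numbers could still be skewed.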