scylla-cluster-tests
SLA - Scheduler runtime validation failed on a node that was unbootstrapped - how to filter out such a node and not validate it?
Scheduler runtime validation failed on a node that was unbootstrapped (during decommission). How can we filter out such nodes and skip validating them there?
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-08-13 13:39:24.122: (TestStepEvent Severity.ERROR) period_type=end event_id=9f5ca403-e5a8-4fc2-91e3-51a6307f43cf
during_nemesis=BootstrapStreamingError,ReplaceServiceLevelUsingDropDuringLoad duration=14m12s: step=Attach service level 'sl800_ac323a2a' with 800 shares to role250_ac323a2a. Validate scheduler runtime during load errors=
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Probably the issue https://github.com/scylladb/scylla-enterprise/issues/2572
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > (Node 10.0.3.223) - Service level sl:sl800_ac323a2a did not get resources unexpectedly. CPU%: 94.79. Runtime per service level group:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl800_ac323a2a (shares 800): 0.0
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl500_ac323a2a (shares 500): 237.13
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Traceback (most recent call last):
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 126, in attach_sl_and_validate_scheduler_runtime
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > self.validate_scheduler_runtime(start_time=start_time,
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 225, in validate_scheduler_runtime
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > raise SchedulerRuntimeUnexpectedValue("".join(result))
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sdcm.sla.libs.sla_utils.SchedulerRuntimeUnexpectedValue:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Probably the issue https://github.com/scylladb/scylla-enterprise/issues/2572
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > (Node 10.0.3.223) - Service level sl:sl800_ac323a2a did not get resources unexpectedly. CPU%: 94.79. Runtime per service level group:
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl800_ac323a2a (shares 800): 0.0
< t:2023-08-13 13:39:24,134 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > sl:sl500_ac323a2a (shares 500): 237.13
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Installation details
Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90
with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-sla-system-24h-master-db-node-0ceb3cea-9 (13.48.48.99 | 10.0.0.37) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-8 (16.171.242.252 | 10.0.3.241) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-7 (16.171.236.88 | 10.0.2.34) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-6 (13.48.70.144 | 10.0.3.223) (shards: -1)
- longevity-sla-system-24h-master-db-node-0ceb3cea-5 (16.16.25.196 | 10.0.3.78) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-4 (13.51.13.172 | 10.0.2.132) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-3 (13.51.13.71 | 10.0.2.178) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-2 (13.53.216.241 | 10.0.0.65) (shards: 7)
- longevity-sla-system-24h-master-db-node-0ceb3cea-10 (13.48.70.54 | 10.0.2.0) (shards: -1)
- longevity-sla-system-24h-master-db-node-0ceb3cea-1 (16.16.185.155 | 10.0.0.96) (shards: 7)
OS / Image: ami-0ce59e86771bcb0ef
(aws: undefined_region)
Test: longevity-sla-system-24h
Test id: 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
Test name: enterprise-2022.2/Reproducers/longevity-sla-system-24h
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3
Logs:
- db-cluster-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/db-cluster-0ceb3cea.tar.gz
- sct-runner-events-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/sct-runner-events-0ceb3cea.tar.gz
- sct-0ceb3cea.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/sct-0ceb3cea.log.tar.gz
- loader-set-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/loader-set-0ceb3cea.tar.gz
- monitor-set-0ceb3cea.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0ceb3cea-2cc6-46f1-9d48-95e3cde45bb3/20230814_095238/monitor-set-0ceb3cea.tar.gz
@juliayakovlev maybe add a proper EventFilter to decommission (similar to what we do in DropIndex)?
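For reference, a minimal sketch of that idea, assuming the event filters from sdcm.sct_events.filters are wrapped around the decommission step as context managers; the event class, regex and decommission call below are placeholders, not necessarily what DropIndex uses:

```python
# Sketch only: demote matching SLA validation errors to WARNING while the
# decommission is in progress, so they don't fail the test step.
from sdcm.sct_events import Severity
from sdcm.sct_events.filters import EventsSeverityChangerFilter
# Assumed event class - whichever event the scheduler-runtime validation emits.
from sdcm.sct_events.workload_prioritisation import WorkloadPrioritisationEvent


def decommission_with_filter(nemesis, node):
    with EventsSeverityChangerFilter(
            new_severity=Severity.WARNING,
            event_class=WorkloadPrioritisationEvent,
            regex=".*did not get resources unexpectedly.*",
            extra_time_to_expiration=60):
        nemesis.cluster.decommission(node)  # assumed decommission call
```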
@juliayakovlev just exclude the nodes that have running_nemesis set, for the check (adding back the target_node)
That won't give 100% coverage. If the SLA nemesis starts during a decommission (or similar) nemesis and finishes after it, we won't know about that nemesis, because the validation is performed at the end.
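As a rough sketch of the exclusion approach, assuming SCT nodes expose a running_nemesis attribute (the helper name and the validation call in the comment are illustrative):

```python
# Sketch only: pick the nodes to run the scheduler-runtime validation on,
# skipping any node that is currently a nemesis target (e.g. being decommissioned).
def nodes_to_validate(cluster, target_node=None):
    nodes = [node for node in cluster.nodes if not node.running_nemesis]
    # Add the SLA nemesis' own target node back if it was filtered out.
    if target_node is not None and target_node not in nodes:
        nodes.append(target_node)
    return nodes


# Illustrative usage inside the SLA test:
# for node in nodes_to_validate(self.db_cluster, target_node=self.target_node):
#     self.validate_scheduler_runtime(start_time=start_time, node=node, ...)
```

Note this still has the race described above: a node whose disruptive nemesis finished before the validation runs would no longer have running_nemesis set, yet its runtime numbers could still be skewed.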