scylla-cluster-tests
Once a fast way to validate that a service level has been propagated to all nodes is developed, use it in `sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated`
https://github.com/scylladb/scylla-enterprise/issues/3754
Issue description
There is no fast way to validate that a newly created service level has been propagated to all nodes.
In the SCT test we query scylla_scheduler_shares from the node metrics endpoint.
The response time is long, so sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated failed with a timeout even though the service level itself was added quickly. Because of the long validation we got an error event, as happened in this test run:
#############
t:2023-12-22 00:48:53,159 f:common.py l:1756 c:utils p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl800_e06f287a' WITH shares = 800' ...
t:2023-12-22 00:49:09,644 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-1'
t:2023-12-22 00:49:10,056 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-1'
t:2023-12-22 00:49:10,056 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-2'
t:2023-12-22 00:49:10,663 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-2'
t:2023-12-22 00:49:10,674 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-3'
t:2023-12-22 00:49:33,244 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [2, 30, 79, 33, 81, 4, 75], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 50, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:49:33,950 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-3'
t:2023-12-22 00:49:33,955 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-4'
t:2023-12-22 00:49:56,376 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [33, 85, 68, 74, 22, 1, 3], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 77, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:49:56,513 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-4'
t:2023-12-22 00:49:56,518 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-5'
t:2023-12-22 00:50:18,823 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [68, 71, 67, 71, 73, 81, 70], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 50, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:50:19,547 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-5'
t:2023-12-22 00:50:19,696 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:ERROR > 2023-12-22 00:50:19.552: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=35a20527-28af-4a91-b4d1-f15fc83f1be3 during_nemesis=MgmtBackup,SlaDecreaseSharesDuringLoad, source=SoftTimeout message=operation 'SERVICE_LEVEL_PROPAGATION' took 69.90814685821533s and soft-timeout was set to 15s
#############
The code: https://github.com/scylladb/scylla-cluster-tests/blob/934db64901986651a4db308e09086d304a9aeb22/sdcm/sla/sla_tests.py#L161 https://github.com/scylladb/scylla-cluster-tests/blob/934db64901986651a4db308e09086d304a9aeb22/sdcm/sla/libs/sla_utils.py#L400
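For illustration, here is a minimal sketch (not the actual SCT implementation) of the flow described above: poll each node until the sl:<name> scheduler group shows up in its metrics, then raise a soft-timeout error event if the overall check took longer than the allowed budget, mirroring the SoftTimeoutEvent in the log. The helpers get_scheduler_groups and emit_error_event are hypothetical placeholders.

import time

SOFT_TIMEOUT = 15        # seconds, matching the soft-timeout in the event above
POLL_INTERVAL = 1        # seconds between metric queries
HARD_TIMEOUT = 300       # give up entirely after this long (arbitrary value)

def wait_for_service_level_propagated_sketch(nodes, service_level_name,
                                              get_scheduler_groups,
                                              emit_error_event):
    # Poll every node until the 'sl:<name>' scheduler group appears in its
    # metrics, then compare the total wait time against the soft timeout.
    expected_group = f'sl:{service_level_name}'
    started = time.time()
    for node in nodes:
        while expected_group not in get_scheduler_groups(node):
            if time.time() - started > HARD_TIMEOUT:
                raise TimeoutError(
                    f'{expected_group} did not appear on {node} within {HARD_TIMEOUT}s')
            time.sleep(POLL_INTERVAL)
    elapsed = time.time() - started
    if elapsed > SOFT_TIMEOUT:
        # The propagation itself succeeded; only the validation was slow,
        # which is what produced the SoftTimeoutEvent shown in the log.
        emit_error_event(
            f"operation 'SERVICE_LEVEL_PROPAGATION' took {elapsed:.2f}s "
            f'and soft-timeout was set to {SOFT_TIMEOUT}s')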
Impact
How frequently does it reproduce?
Installation details
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 5 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-sla-system-24h-2024-1-db-node-d38963cd-7 (54.247.4.117 | 10.4.1.83) (shards: -1)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-6 (54.75.40.53 | 10.4.3.10) (shards: -1)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-5 (63.33.216.112 | 10.4.3.242) (shards: 7)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-4 (34.248.24.221 | 10.4.1.7) (shards: 7)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-3 (54.171.145.93 | 10.4.2.124) (shards: 7)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-2 (54.154.137.212 | 10.4.0.175) (shards: 7)
- longevity-sla-system-24h-2024-1-db-node-d38963cd-1 (3.253.42.242 | 10.4.2.48) (shards: 7)
OS / Image: ami-01b6277edc91bfcc7 (aws: undefined_region)
Test: longevity-sla-system-24h-test
Test id: d38963cd-e9ea-45ae-98ce-fa346afb89b8
Test name: enterprise-2024.1/SCT_Enterprise_Features/Workload_Prioritization/longevity-sla-system-24h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor d38963cd-e9ea-45ae-98ce-fa346afb89b8
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs d38963cd-e9ea-45ae-98ce-fa346afb89b8
Logs:
- db-cluster-d38963cd.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d38963cd-e9ea-45ae-98ce-fa346afb89b8/20231222_111207/db-cluster-d38963cd.tar.gz
- sct-runner-events-d38963cd.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d38963cd-e9ea-45ae-98ce-fa346afb89b8/20231222_111207/sct-runner-events-d38963cd.tar.gz
- sct-d38963cd.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d38963cd-e9ea-45ae-98ce-fa346afb89b8/20231222_111207/sct-d38963cd.log.tar.gz
- loader-set-d38963cd.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d38963cd-e9ea-45ae-98ce-fa346afb89b8/20231222_111207/loader-set-d38963cd.tar.gz
- monitor-set-d38963cd.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d38963cd-e9ea-45ae-98ce-fa346afb89b8/20231222_111207/monitor-set-d38963cd.tar.gz
@juliayakovlev what's the SCT issue here? It's not exactly clear what we can do with it...
@fruch I opened it here to stay on top of it and change the code if they provide such an ability.
If so, edit the title and description to reflect that.
Because right now it looks like a bug report for SCT, and it's not clear how it differs from the referenced issue.
@juliayakovlev how fast does it need to be? Maybe it would be enough to query the node's metrics directly instead of Prometheus to get a faster response?
sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated
@soyacz
sdcm.utils.adaptive_timeouts.load_info_store.NodeLoadInfoService._get_metrics
It fetches the metrics from localhost:
def _get_metrics(self, port):
    metrics = self.remoter.run(f'curl -s localhost:{port}/metrics', verbose=False).stdout
    metrics_dict = {}
    for line in metrics.splitlines():
        if line and not line.startswith('#'):
            try:
                key, value = line.rsplit(' ', 1)
                metrics_dict[key] = value
            except ValueError:
                LOGGER.debug("Couldn't parse line: %s", line)
    return metrics_dict
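For reference, a hedged sketch of how the dictionary returned by _get_metrics could be filtered for scylla_scheduler_shares samples on a single node. The label names "group" and "shard" in the key format are an assumption based on the scheduler-group names seen in the log above, not verified against the exporter's actual output.

import re
from collections import defaultdict

# Keys in the metrics dict are assumed to look like:
#   scylla_scheduler_shares{group="sl:sl800_e06f287a",shard="0"}
SHARES_RE = re.compile(r'scylla_scheduler_shares\{[^}]*group="([^"]+)"[^}]*\}')

def scheduler_shares_per_group(metrics_dict):
    # Map scheduler-group name -> list of per-shard share values,
    # similar to the defaultdict dumps in the log above.
    shares = defaultdict(list)
    for key, value in metrics_dict.items():
        match = SHARES_RE.match(key)
        if match:
            shares[match.group(1)].append(int(float(value)))
    return shares

# Checking a single node for a newly created service level could then be:
#   'sl:sl800_e06f287a' in scheduler_shares_per_group(node_metrics)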
What other option do we have?
In SCT test we query scylla_scheduler_shares from Prometheus.
I got confused by that. It should then be 'query the node metrics endpoint', I suppose.
What other option do we have?
So, I don't know another option.
In SCT test we query scylla_scheduler_shares from Prometheus. I got confused by that. It should then be 'query the node metrics endpoint', I suppose.
You are right. Fixed.