scylla-cluster-tests

When a fast way to validate that a service level has propagated to all nodes is developed, use it in `sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated`

Open juliayakovlev opened this issue 1 year ago • 7 comments

https://github.com/scylladb/scylla-enterprise/issues/3754

Issue description

There is no fast way to validate that a newly created service level has propagated to all nodes. In the SCT test we query `scylla_scheduler_shares` from the node metrics endpoint. The response time is long, so `sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated` fails with a timeout even though the service level itself was added quickly, and because of the long validation we get an error event.
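To make the timing issue concrete, here is a minimal sketch of the kind of sequential per-node polling this validation performs. It is illustrative only: `poll_node_metrics` is a hypothetical helper, and the metric/label formats are assumptions, not the actual SCT API (the real code is linked below).

    # Illustrative sketch only -- the real implementation lives in
    # sdcm.sla.libs.sla_utils and NodeLoadInfoService._get_metrics.
    import time

    def poll_node_metrics(node):
        """Hypothetical helper: fetch and parse the node's /metrics endpoint,
        returning e.g. {'scylla_scheduler_shares{...}': '800', ...}."""
        raise NotImplementedError

    def wait_for_sl_on_node(node, service_level_name, timeout=60, poll_interval=1):
        """Poll one node's metrics until a scheduler group for the service level appears."""
        group_name = f"sl:{service_level_name}"
        deadline = time.time() + timeout
        while time.time() < deadline:
            metrics = poll_node_metrics(node)
            if any(group_name in key for key in metrics):
                return True
            time.sleep(poll_interval)
        return False

    def wait_for_service_level_propagated(nodes, service_level_name):
        # The wait is sequential per node, so the total wall time is roughly the
        # sum of per-node scrape-and-parse times; with several nodes taking ~22 s
        # each, the 15 s soft timeout is exceeded even though the CQL statement
        # returns quickly.
        for node in nodes:
            wait_for_sl_on_node(node, service_level_name)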

As it happened in this test run (nodes 3-5 each took ~22 s to validate, so the whole wait ran for ~70 s against a 15 s soft timeout):

############# t:2023-12-22 00:48:53,159 f:common.py l:1756 c:utils p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl800_e06f287a' WITH shares = 800' ...

t:2023-12-22 00:49:09,644 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-1'
t:2023-12-22 00:49:10,056 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-1'

t:2023-12-22 00:49:10,056 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-2'
t:2023-12-22 00:49:10,663 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-2'

t:2023-12-22 00:49:10,674 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-3'
t:2023-12-22 00:49:33,244 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [2, 30, 79, 33, 81, 4, 75], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 50, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:49:33,950 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-3'

t:2023-12-22 00:49:33,955 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-4'
t:2023-12-22 00:49:56,376 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [33, 85, 68, 74, 22, 1, 3], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 77, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:49:56,513 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-4'

t:2023-12-22 00:49:56,518 f:sla_utils.py l:405 c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-5'
t:2023-12-22 00:50:18,823 f:sla_utils.py l:394 c:sdcm.sla.libs.sla_utils p:DEBUG > Found scheduler groups: defaultdict(<class 'list'>, {'atexit': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'background_reclaim': [68, 71, 67, 71, 73, 81, 70], 'commitlog': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'compaction': [50, 50, 50, 50, 50, 50, 50], 'gossip': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'main': [200, 200, 200, 200, 200, 200, 200], 'mem_compaction': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'memtable': [1, 1, 1, 1, 1, 1, 1], 'memtable_to_cache': [200, 200, 200, 200, 200, 200, 200], 'sl:default': [1000, 1000, 1000, 1000, 1000, 1000, 1000], 'sl:sl800_e06f287a': [800, 800, 800, 800, 800, 800, 800], 'sl_deleted:sl500_ab2110fe': [500, 500, 500, 500, 500, 500, 500], 'streaming': [200, 200, 200, 200, 200, 200, 200]})
t:2023-12-22 00:50:19,547 f:sla_utils.py l:425 c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl800_e06f287a' on the node 'longevity-sla-system-24h-2024-1-db-node-d38963cd-5'

t:2023-12-22 00:50:19,696 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:ERROR > 2023-12-22 00:50:19.552: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=35a20527-28af-4a91-b4d1-f15fc83f1be3 during_nemesis=MgmtBackup,SlaDecreaseSharesDuringLoad, source=SoftTimeout message=operation 'SERVICE_LEVEL_PROPAGATION' took 69.90814685821533s and soft-timeout was set to 15s #############

The code:
https://github.com/scylladb/scylla-cluster-tests/blob/934db64901986651a4db308e09086d304a9aeb22/sdcm/sla/sla_tests.py#L161
https://github.com/scylladb/scylla-cluster-tests/blob/934db64901986651a4db308e09086d304a9aeb22/sdcm/sla/libs/sla_utils.py#L400

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe how frequently this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 5 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-sla-system-24h-2024-1-db-node-d38963cd-7 (54.247.4.117 | 10.4.1.83) (shards: -1)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-6 (54.75.40.53 | 10.4.3.10) (shards: -1)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-5 (63.33.216.112 | 10.4.3.242) (shards: 7)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-4 (34.248.24.221 | 10.4.1.7) (shards: 7)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-3 (54.171.145.93 | 10.4.2.124) (shards: 7)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-2 (54.154.137.212 | 10.4.0.175) (shards: 7)
  • longevity-sla-system-24h-2024-1-db-node-d38963cd-1 (3.253.42.242 | 10.4.2.48) (shards: 7)

OS / Image: ami-01b6277edc91bfcc7 (aws: undefined_region)

Test: longevity-sla-system-24h-test
Test id: d38963cd-e9ea-45ae-98ce-fa346afb89b8
Test name: enterprise-2024.1/SCT_Enterprise_Features/Workload_Prioritization/longevity-sla-system-24h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d38963cd-e9ea-45ae-98ce-fa346afb89b8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d38963cd-e9ea-45ae-98ce-fa346afb89b8

Logs:

  • Jenkins job URL
  • Argus

juliayakovlev avatar Jan 02 '24 13:01 juliayakovlev

@juliayakovlev what's the SCT issue here? It's not exactly clear what we can do with it...

fruch avatar Jan 02 '24 14:01 fruch

@fruch I opened it here to be on top of it, and to change the code if they provide such an ability.

juliayakovlev avatar Jan 02 '24 14:01 juliayakovlev

> @fruch I opened it here to be on top of it, and to change the code if they provide such an ability.

If so, edit the title and description to reflect that.

Because it looks like a bug report for SCT, and it's not clear how it differs from the referenced issue.

fruch avatar Jan 02 '24 16:01 fruch

@juliayakovlev how fast does it need to be? Maybe it would be enough to query the node's metrics directly instead of Prometheus to get a faster response?

soyacz avatar Jan 09 '24 13:01 soyacz

`sdcm.sla.libs.sla_utils.SlaUtils.wait_for_service_level_propagated`

@soyacz it uses `sdcm.utils.adaptive_timeouts.load_info_store.NodeLoadInfoService._get_metrics`, which already asks for the metrics from localhost:

    def _get_metrics(self, port):
        metrics = self.remoter.run(f'curl -s localhost:{port}/metrics', verbose=False).stdout
        metrics_dict = {}
        for line in metrics.splitlines():
            if line and not line.startswith('#'):
                try:
                    key, value = line.rsplit(' ', 1)
                    metrics_dict[key] = value
                except ValueError:
                    LOGGER.debug("Couldn't parse line: %s", line)
        return metrics_dict
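For reference, a minimal sketch of how the scheduler shares per group could be pulled out of that dictionary. The metric name `scylla_scheduler_shares` comes from the description above, but the exact label layout (`group=...`, `shard=...`) is an assumption here, not a guaranteed format:

    from collections import defaultdict
    import re

    # Assumed sample key format (illustrative):
    #   scylla_scheduler_shares{group="sl:sl800_e06f287a",shard="3"}
    SHARES_RE = re.compile(r'^scylla_scheduler_shares\{.*?group="([^"]+)".*\}$')

    def scheduler_shares_per_group(metrics_dict):
        """Group scheduler-share samples by scheduler group, one value per shard,
        mirroring the defaultdict printed in the log above."""
        shares = defaultdict(list)
        for key, value in metrics_dict.items():
            match = SHARES_RE.match(key)
            if match:
                shares[match.group(1)].append(float(value))
        return shares

    # A service level 'sl800_e06f287a' counts as propagated on the node once an
    # 'sl:sl800_e06f287a' group appears with the expected shares on every shard.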

What other option do we have?

juliayakovlev avatar Jan 09 '24 13:01 juliayakovlev

> In SCT test we query scylla_scheduler_shares from Prometheus.

I got confused by that. It should then say 'query the node metrics endpoint', I suppose.

> What other option do we have?

So, I don't know of another option.

soyacz avatar Jan 09 '24 14:01 soyacz

> In SCT test we query scylla_scheduler_shares from Prometheus.

> I got confused by that. It should then say 'query the node metrics endpoint', I suppose.

You are right. Fixed.

> What other option do we have?

> So, I don't know of another option.

juliayakovlev avatar Jan 09 '24 14:01 juliayakovlev