scylla-cluster-tests

Start time of a nemesis is incorrectly presented in the list of executed nemeses in Argus

Open dimakr opened this issue 1 year ago • 6 comments

Start time of the mgmt_repair_cli nemesis is 2024-08-03 07:40:35 as per sct.log:

< t:2024-08-03 07:40:35,846 f:nemesis.py      l:1953 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Selecting the next nemesis out of stack ['disrupt_multiple_hard_reboot_node', ...
< t:2024-08-03 07:40:35,870 f:nemesis.py      l:1877 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
< t:2024-08-03 07:40:35,870 f:nemesis.py      l:1141 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Set current_disruption -> MgmtRepairCli
< t:2024-08-03 07:40:35,870 f:nemesis.py      l:408  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-tls-50gb-3d-master-db-node-6f2ed542-4 [99.80.226.107 | 10.4.21.188] with running nemesis: MgmtRepairCli

But in Argus it is reported to start at 2024-08-03 10:37:14 (this is rather the time when the critical event that failed the test occurred; the timestamp of that event is 2024-08-03 10:37:13.780). [Screenshot from 2024-08-08 12-38-14]

Observed in the Argus link

dimakr avatar Aug 08 '24 10:08 dimakr

@dimakr argus link does not work

soyacz avatar Aug 09 '24 09:08 soyacz

Bad copy-and-paste, updated it.

dimakr avatar Aug 09 '24 09:08 dimakr

It's not a problem of a wrong start time - the nemesis indeed started later. The problem is that before sending data to Argus (and collecting & printing details about the start time) we do check_cluster_health(), which took a long time here - and this is what needs to be investigated in this run.
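
To make that ordering concrete, here is a minimal sketch (the function and client names are hypothetical stand-ins, not the actual SCT/Argus API) of why a slow check_cluster_health() shifts the start time that ends up in Argus:

```python
from datetime import datetime, timezone

# Illustrative sketch only -- run_nemesis, cluster and argus_client are
# hypothetical stand-ins, not the real SCT/Argus API.  The point: the
# timestamp reported to Argus is taken only after check_cluster_health()
# returns, so a health check that retries for hours pushes the reported
# nemesis start far past the time sct.log shows.
def run_nemesis(disrupt_method, cluster, argus_client):
    actual_start = datetime.now(timezone.utc)    # what sct.log logs (07:40:35 here)

    cluster.check_cluster_health()               # may retry for hours on a DN node

    reported_start = datetime.now(timezone.utc)  # what reaches Argus (10:37:14 here)
    argus_client.submit_nemesis(name=disrupt_method.__name__,
                                start_time=reported_start)
    disrupt_method()
```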

soyacz avatar Aug 09 '24 09:08 soyacz

@dimakr we need to complete this investigation; just stating that the timing is wrong isn't the issue.

The event that caused this test failure:

2024-08-03 10:37:13.729: (ClusterHealthValidatorEvent Severity.CRITICAL) period_type=one-time event_id=63931095-7d0d-43b0-b3c2-e2e1741a2e3c: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-6f2ed542-7 error=Current node Node longevity-tls-50gb-3d-master-db-node-6f2ed542-7 [79.125.27.189 | 10.4.22.222]. Node Node longevity-tls-50gb-3d-master-db-node-6f2ed542-6 [52.18.47.81 | 10.4.23.44] (not target node) status is DN 

The healthcheck is retrying, probably for a couple of hours. We need to understand why we get this mismatch, and why it stays like that for so long.

fruch avatar Aug 15 '24 19:08 fruch

Looks like there is nested retrying around connecting to a specific node, longevity-tls-50gb-3d-master-db-node-6f2ed542-6:

< t:2024-08-03 09:00:27,276 f:decorators.py   l:74   c:sdcm.utils.decorators p:DEBUG > 'cql_connection_patient_exclusive': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.4.23.44:9042': OSError(None, "Tried connecting to [('10.4.23.44', 9042)]. Last error: timed out")})', retrying [#5]
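
For reference, a minimal sketch (illustrative names only, not the real sdcm code) of how nested retry layers like this add up: an inner per-connection retry wrapped by an outer health-check retry multiplies the waits.

```python
import socket
import time

# Illustrative sketch only (not the actual sdcm.utils.decorators code):
# an inner per-connection retry wrapped by an outer health-check retry
# multiplies the waiting time -- with 30 s timeouts, 10 outer rounds of
# 5 attempts each is already on the order of an hour for a single
# unreachable node.
def connect_with_retry(host, port=9042, attempts=5, timeout=30):
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"connection to {host}:{port} failed with {exc!r}, retrying [#{attempt}]")
            time.sleep(timeout)
    return False

def check_cluster_health(node_ips, rounds=10):
    for _ in range(rounds):                      # outer retry layer
        if all(connect_with_retry(ip) for ip in node_ips):
            return True
    return False
```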

fruch avatar Aug 15 '24 20:08 fruch

@juliayakovlev should be looking into the multiple-network configuration.

Seems like the multiple_hard_reboot nemesis is doing something to our network configuration, since in all failures it happened on the faulty node ~1h before.

fruch avatar Sep 09 '24 12:09 fruch

@dimakr Did you face a similar problem recently?

The PR https://github.com/scylladb/scylla-cluster-tests/pull/8883 made changes to the multiple-network configuration. While testing this change, the nemesis times in Argus were correct: https://argus.scylladb.com/tests/scylla-cluster-tests/61a71bd6-177f-40e6-9d9d-0c8b74a8c03d

Please let me know if you hit this problem again.

juliayakovlev avatar Nov 03 '24 09:11 juliayakovlev

@juliayakovlev Yes, it is still relevant. Not knowing about this bug report, last week I created another one here: https://github.com/scylladb/scylla-cluster-tests/issues/9170

vponomaryov avatar Nov 04 '24 08:11 vponomaryov

Did you face a similar problem recently?

As Valerii mentioned, the lag between nemeses can be observed here and there. But, as explained above, it is probably expected, since it originates from the healthcheck being performed and retried.

This particular issue was raised as one of the symptoms of a problem with the multiple_hard_reboot disruption, so we can probably close the current issue and keep https://github.com/scylladb/scylla-cluster-tests/issues/9170 if we still want to improve the way the healthcheck is reported (e.g. to include it in the nemesis time).
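
If that reporting improvement is pursued in #9170, one possible shape could be the following sketch (purely illustrative; the helper names are hypothetical and not the current SCT code): take the start timestamp before the health check, so the check's duration is folded into the reported nemesis interval instead of shifting its start.

```python
from datetime import datetime, timezone

# Purely illustrative sketch for the improvement mentioned above -- the
# helper names are hypothetical, not the current SCT code.  Taking the
# start timestamp before check_cluster_health() counts the check's
# duration inside the nemesis interval reported to Argus, instead of
# silently shifting the reported start time.
def run_and_report_nemesis(disrupt_method, cluster, argus_client):
    start_time = datetime.now(timezone.utc)      # matches what sct.log shows
    cluster.check_cluster_health()               # counted as part of the nemesis
    try:
        disrupt_method()
    finally:
        end_time = datetime.now(timezone.utc)
        argus_client.submit_nemesis(name=disrupt_method.__name__,
                                    start_time=start_time,
                                    end_time=end_time)
```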

dimakr avatar Nov 04 '24 09:11 dimakr

Closing; we'll keep tracking it in: https://github.com/scylladb/scylla-cluster-tests/issues/9170

fruch avatar Nov 12 '24 08:11 fruch