Start time of a nemesis is incorrectly presented in the list of executed nemeses in Argus
Start time of the mgmt_repair_cli nemesis is 2024-08-03 07:40:35 as per sct.log:
< t:2024-08-03 07:40:35,846 f:nemesis.py l:1953 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: Selecting the next nemesis out of stack ['disrupt_multiple_hard_reboot_node', ...
< t:2024-08-03 07:40:35,870 f:nemesis.py l:1877 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
< t:2024-08-03 07:40:35,870 f:nemesis.py l:1141 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: Set current_disruption -> MgmtRepairCli
< t:2024-08-03 07:40:35,870 f:nemesis.py l:408 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-tls-50gb-3d-master-db-node-6f2ed542-4 [99.80.226.107 | 10.4.21.188] with running nemesis: MgmtRepairCli
But in Argus it is reported to have started at 2024-08-03 10:37:14, which is rather the time when the critical event that failed the test occurred (the timestamp of that event is 2024-08-03 10:37:13.780).
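For reference, the gap between the start time in sct.log and the one shown in Argus is just under three hours; a quick check:

```python
from datetime import datetime

# Timestamps taken from sct.log and from the Argus nemesis list above.
sct_log_start = datetime.strptime("2024-08-03 07:40:35", "%Y-%m-%d %H:%M:%S")
argus_start = datetime.strptime("2024-08-03 10:37:14", "%Y-%m-%d %H:%M:%S")

print(argus_start - sct_log_start)  # 2:56:39
```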
Observed in the Argus link
@dimakr the Argus link does not work.
Bad copy-and-paste, updated it.
It's not a problem of a wrong start time - the nemesis indeed started later.
The problem is that before sending data to Argus (and collecting & printing the details about the start time) we run check_cluster_health(), which took a long time here - and this is what needs to be investigated in this run.
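For context, the ordering looks roughly like the minimal sketch below (check_cluster_health / send_nemesis_info_to_argus are made-up stand-ins, not the actual SCT code): the timestamp that ends up in Argus is only taken after the health check returns, so any time spent retrying inside it shifts the reported start.

```python
import time

def check_cluster_health():
    """Stand-in for the real health check; imagine this spending hours in retries."""
    time.sleep(1)

def send_nemesis_info_to_argus(name, start_time):
    """Stand-in for the Argus reporting call (hypothetical name)."""
    print(f"argus: nemesis={name} start={time.strftime('%H:%M:%S', time.localtime(start_time))}")

def run_disruption(name, disrupt):
    actual_start = time.time()      # the 07:40:35 moment seen in sct.log
    check_cluster_health()          # runs before the Argus record is created;
                                    # any retry time spent here delays everything below
    reported_start = time.time()    # this is what ends up in Argus (~10:37 in the run above)
    send_nemesis_info_to_argus(name, start_time=reported_start)
    disrupt()
    print(f"reported start drifted by {reported_start - actual_start:.0f}s")

run_disruption("mgmt_repair_cli", lambda: None)
```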
@dimakr we need to complete this investigation; just stating that the timing is wrong isn't the issue.
The cause of this test failure:
2024-08-03 10:37:13.729: (ClusterHealthValidatorEvent Severity.CRITICAL) period_type=one-time event_id=63931095-7d0d-43b0-b3c2-e2e1741a2e3c: type=NodeStatus node=longevity-tls-50gb-3d-master-db-node-6f2ed542-7 error=Current node Node longevity-tls-50gb-3d-master-db-node-6f2ed542-7 [79.125.27.189 | 10.4.22.222]. Node Node longevity-tls-50gb-3d-master-db-node-6f2ed542-6 [52.18.47.81 | 10.4.23.44] (not target node) status is DN
The healthcheck is retrying, probably for a couple of hours. We need to understand why we get this mismatch, and why it stays like that for so long.
It looks like there is nested retrying around connecting to a specific node, longevity-tls-50gb-3d-master-db-node-6f2ed542-6:
< t:2024-08-03 09:00:27,276 f:decorators.py l:74 c:sdcm.utils.decorators p:DEBUG > 'cql_connection_patient_exclusive': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.4.23.44:9042': OSError(None, "Tried connecting to [('10.4.23.44', 9042)]. Last error: timed out")})', retrying [#5]
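To see why such nested retrying can stretch into hours, here is a rough, self-contained sketch of the pattern (a generic retry decorator; the retry counts and the ~30s connect timeout are illustrative assumptions, and this is not the actual sdcm.utils.decorators.retrying implementation):

```python
import functools
import time

def retrying(n, sleep_time):
    """Generic retry decorator, only to illustrate how nested retries multiply."""
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            for attempt in range(1, n + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    print(f"{func.__name__!r}: failed with {exc!r}, retrying [#{attempt}]")
                    time.sleep(sleep_time)
            raise TimeoutError(f"{func.__name__} gave up after {n} attempts")
        return inner
    return wrapper

# Inner layer: the CQL connection to 10.4.23.44 keeps timing out (5 attempts).
@retrying(n=5, sleep_time=0.1)
def cql_connection_patient_exclusive():
    raise OSError("Tried connecting to [('10.4.23.44', 9042)]. Last error: timed out")

# Outer layer: the health check retries the whole connection step again (10 attempts assumed).
@retrying(n=10, sleep_time=0.1)
def check_node_status():
    cql_connection_patient_exclusive()

start = time.time()
try:
    check_node_status()
except TimeoutError:
    pass
# 10 outer attempts * 5 inner attempts: with a real ~30s connect timeout per attempt
# plus back-off sleeps, this easily adds up to hours of health-check time.
print(f"burned {time.time() - start:.1f}s of retries even with tiny sleeps")
```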
@juliayakovlev should be looking into the multiple-network configuration here.
It seems like the multiple_hard_reboot nemesis is doing something to our network configuration, since in all failures it runs on the faulty node ~1h before.
@dimakr Have you faced a similar problem recently?
The change in https://github.com/scylladb/scylla-cluster-tests/pull/8883 modified the multiple-network configuration. While testing this change the nemesis time in Argus was correct: https://argus.scylladb.com/tests/scylla-cluster-tests/61a71bd6-177f-40e6-9d9d-0c8b74a8c03d
Please let me know if you hit this problem again.
@juliayakovlev Yes, it is still relevant. Not knowing about this bug report, last week I created another one here: https://github.com/scylladb/scylla-cluster-tests/issues/9170

> Did you face a similar problem recently?

As Valerii mentioned, the lag between nemeses can be observed here and there. But as was explained, it is probably expected, since it originates from the healthcheck being performed and retried.
This particular issue was raised as one of the symptoms of a problem with the multiple_hard_reboot disruption. So we can probably close the current issue, and keep https://github.com/scylladb/scylla-cluster-tests/issues/9170 if we still want to improve the way the healthcheck is reported (e.g. by including it in the nemesis time).
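If the "include the healthcheck into the nemesis time" direction is taken, one possible shape could look like the sketch below (purely illustrative; the client and the submit_nemesis field names are hypothetical, not the real Argus API): take the timestamp before the health check and report the health-check duration explicitly, so the gap is visible instead of silently shifting the start time.

```python
import time

class DummyArgusClient:
    """Stand-in for the real Argus client; submit_nemesis is a hypothetical method name."""
    def submit_nemesis(self, **fields):
        print("argus <-", fields)

def report_nemesis(argus_client, name, disrupt, check_cluster_health):
    start = time.time()                          # matches what sct.log shows
    check_cluster_health()
    argus_client.submit_nemesis(
        name=name,
        start_time=start,                        # no longer shifted by the health check
        health_check_duration=time.time() - start,
    )
    disrupt()

report_nemesis(DummyArgusClient(), "mgmt_repair_cli",
               disrupt=lambda: None,
               check_cluster_health=lambda: time.sleep(1))
```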
Closing; we'll keep tracking it in https://github.com/scylladb/scylla-cluster-tests/issues/9170.