scylla-cluster-tests: GCE tests fail to start due to an SSH connection issue (spot failure wasn't visible in Argus)

Packages

Issue description

[ ] This issue is a regression.
[ ] It is unknown if this issue is a regression.

Error: Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.142.0.75:22' - timed out

Possibly a regression introduced in https://github.com/scylladb/scylla-cluster-tests/pull/7461

Impact

Test fails without execution.

How frequently does it reproduce?


Installation details

Cluster size: 6 nodes (n2-highmem-16)

Scylla Nodes used in this run:

longevity-10gb-3h-master-db-node-107e5ca5-0-6 (34.73.197.55 | 10.142.0.74) (shards: 14)
longevity-10gb-3h-master-db-node-107e5ca5-0-5 (35.196.139.154 | 10.142.0.73) (shards: 14)
longevity-10gb-3h-master-db-node-107e5ca5-0-4 (34.23.238.116 | 10.142.0.72) (shards: 14)
longevity-10gb-3h-master-db-node-107e5ca5-0-3 (34.148.55.75 | 10.142.0.71) (shards: 14)
longevity-10gb-3h-master-db-node-107e5ca5-0-2 (104.196.107.253 | 10.142.0.63) (shards: 14)
longevity-10gb-3h-master-db-node-107e5ca5-0-1 (34.75.26.124 | 10.142.0.62) (shards: 14)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5653798310498836444 (gce: undefined_region)

Test: longevity-10gb-3h-gce-test
Test id: 107e5ca5-3de7-45b3-885c-b647450cde77
Test name: scylla-master/longevity/longevity-10gb-3h-gce-test
Test config file(s):

longevity-10gb-3h.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 107e5ca5-3de7-45b3-885c-b647450cde77
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 107e5ca5-3de7-45b3-885c-b647450cde77

Logs:

db-cluster-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/db-cluster-107e5ca5.tar.gz
sct-runner-events-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/sct-runner-events-107e5ca5.tar.gz
sct-107e5ca5.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/sct-107e5ca5.log.tar.gz
loader-set-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/loader-set-107e5ca5.tar.gz
monitor-set-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/monitor-set-107e5ca5.tar.gz
parallel-timelines-report-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/parallel-timelines-report-107e5ca5.tar.gz

Jenkins job URL Argus

May 25 '24 07:05 soyacz

btw, upgrade tests seem to work, but they use a different OS image.

May 25 '24 07:05 soyacz

03:08:41 raise TestFailure(f"Got critical event: {event}")
03:08:41 sdcm.sct_events.events_analyzer.TestFailure: Got critical event: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=a3558b96-8453-4f5b-aa27-ac02dfa05a30: node=Node longevity-10gb-3h-master-loader-node-107e5ca5-0-1 [35.190.150.148 | 10.142.0.75] message=Instance was preempted.

@soyacz if preemption happens too early, we might get these confusing errors

May 25 '24 19:05 fruch

Uh, I didn't spot that. Is it worth/feasible to fix this confusion?

May 27 '24 06:05 soyacz

Uh, I didn't spot that. Is it worth/feasible to fix this confusion?

yes, it's worth it; we are running into it quite a lot. I think I've opened an issue about it more than once...

May 27 '24 07:05 fruch

seems like the events weren't reported to Argus

the Jenkins job is gone now, so we can't really investigate. Next time this happens, we need to look at it right away to collect the findings

Jul 09 '24 06:07 fruch

found a more recent case of it: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-gce-test/573

Jul 09 '24 06:07 fruch

the send mail stage failed like this:

03:10:22  (no stderr)
03:10:22  ['/home/ubuntu/sct-results/20240612-235801-036873/test_id']
03:10:22  Results file not found
03:10:23  Cleaning SSH agent
03:10:23  Agent pid 7182 killed

Jul 09 '24 06:07 fruch

seems like we are getting preempted during the setup (i.e. while running node benchmarks, which he disabled recently)

======================================================================
ERROR: test_custom_time (longevity_test.LongevityTest)
Run cassandra-stress with params defined in data_dir/scylla.yaml
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 182, in wrapper
    return method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
    res = func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 909, in setUp
    self.init_resources()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1864, in init_resources
    self.get_cluster_gce(loader_info=loader_info, db_info=db_info,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1207, in get_cluster_gce
    self.db_cluster = ScyllaGCECluster(gce_image=self.params.get('gce_image_db').strip(),
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_gce.py", line 553, in __init__
    super().__init__(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3908, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_gce.py", line 273, in __init__
    super().__init__(cluster_uuid=cluster_uuid,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3207, in __init__
    self.run_node_benchmarks()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3726, in run_node_benchmarks
    self.node_benchmark_manager.run_benchmarks()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/benchmarks.py", line 124, in run_benchmarks
    parallel.run(lambda x: x.run_benchmarks(), ignore_exceptions=False)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 483, in run
    result = future.result(time_out)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 453, in result
    self._condition.wait(timeout)
  File "/usr/local/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 276, in critical_failure_handler
    raise CriticalTestFailure("Critical Error has failed the test")  # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test

======================================================================
FAIL: test_custom_time (longevity_test.LongevityTest)
Run cassandra-stress with params defined in data_dir/scylla.yaml
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/events_analyzer.py", line 50, in run
    raise TestFailure(f"Got critical event: {event}")
sdcm.sct_events.events_analyzer.TestFailure: Got critical event: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=f196caa7-9bc5-4501-ad26-944759600d00: node=Node longevity-10gb-3h-master-db-node-537a1c92-0-6 [34.23.91.92 | 10.142.0.65] message=Instance was preempted.

----------------------------------------------------------------------

Jul 09 '24 06:07 fruch

from the SCT log, we can see we got here:

        self.destroy_localhost()
>      self.stop_event_device()
        if self.params.get('collect_logs'):
            self.collect_sct_logs()
        with silence(parent=self, name='Cleaning up SSL config directory'):
            cleanup_ssl_config()

        self.finalize_teardown()
        self.argus_finalize_test_run()
        self.argus_heartbeat_stop_signal.set()

but never got to argus_finalize_test_run(), which sends out the events

killing the test seems to kill the teardown midway as well, at the wrong time; it's not clear how we can prevent it

maybe we should send the events once again, just in case, in a follow-up stage?

Jul 09 '24 07:07 fruch

@k0machi let's send the events again, during the log collection phase

do we have a way to check with Argus whether it got the events yet?

Jul 09 '24 07:07 fruch

did you ever consider sending events as they happen (not only at the end)? I mean only the ERROR/CRITICAL ones

Jul 09 '24 07:07 soyacz

did you ever consider sending events as they happen (not only at the end)? I mean only the ERROR/CRITICAL ones

Argus needs to support it first

Jul 09 '24 11:07 fruch
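
For illustration, a minimal sketch of what forwarding only ERROR/CRITICAL events as they happen could look like on the SCT side. Everything in it is hypothetical: `Severity`, `Event`, `ImmediateEventForwarder` and the `send` callable are stand-ins for the real SCT event classes and for an Argus client call that, per the reply above, is not supported yet.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List


class Severity(IntEnum):
    # Hypothetical ordering; stands in for SCT's own Severity enum.
    NORMAL = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


@dataclass
class Event:
    name: str
    severity: Severity
    message: str


@dataclass
class ImmediateEventForwarder:
    # `send` is an assumed callable that would wrap a future Argus client call.
    send: Callable[[Event], None]
    threshold: Severity = Severity.ERROR
    pending: List[Event] = field(default_factory=list)

    def publish(self, event: Event) -> None:
        if event.severity < self.threshold:
            # Low-severity events keep waiting for the end-of-test submission.
            self.pending.append(event)
            return
        try:
            self.send(event)
        except Exception:
            # Reporting must not fail the test; keep the event for a later retry.
            self.pending.append(event)


sent: List[Event] = []
forwarder = ImmediateEventForwarder(send=sent.append)
forwarder.publish(Event("SpotTerminationEvent", Severity.CRITICAL, "Instance was preempted."))
forwarder.publish(Event("InfoEvent", Severity.NORMAL, "setup started"))
```

With this shape, a preemption during setup would be forwarded before teardown starts, so a teardown killed midway would no longer hide it.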

@k0machi let's send the events again, during the log collection phase

do we have a way to check with Argus whether it got the events yet?

Not at this moment, so we'd have to either add some logic on the backend to deduplicate events, or manually call the get_run_by_id endpoint through the client and then check whether the events field is populated.

Jul 09 '24 22:07 k0machi
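
A rough sketch of the second option from the comment above: query the run and resend only if no events are stored. The client interface is an assumption: `get_run_by_id` is taken from the endpoint name mentioned above, while `submit_events`, the `events` key and the dict-shaped response are hypothetical placeholders, not the actual Argus client API.

```python
from typing import Any, Dict, List, Protocol


class RunClient(Protocol):
    # Hypothetical client surface; not the real Argus client API.
    def get_run_by_id(self, run_id: str) -> Dict[str, Any]: ...
    def submit_events(self, run_id: str, events: List[str]) -> None: ...


def resend_events_if_missing(client: RunClient, run_id: str, events: List[str]) -> bool:
    """Submit events only when the stored run has none; return True if sent."""
    run = client.get_run_by_id(run_id)
    if run.get("events"):
        # Events already reached the backend, so skip to avoid duplicates.
        return False
    client.submit_events(run_id, events)
    return True


class FakeClient:
    """In-memory stand-in used only to exercise the helper."""

    def __init__(self) -> None:
        self.runs: Dict[str, Dict[str, Any]] = {}

    def get_run_by_id(self, run_id: str) -> Dict[str, Any]:
        return self.runs.setdefault(run_id, {"events": []})

    def submit_events(self, run_id: str, events: List[str]) -> None:
        self.runs.setdefault(run_id, {"events": []})["events"].extend(events)


client = FakeClient()
first = resend_events_if_missing(client, "run-1", ["SpotTerminationEvent"])
second = resend_events_if_missing(client, "run-1", ["SpotTerminationEvent"])
```

The check and the submit are not atomic here, so backend-side deduplication would likely still be the more robust of the two options.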

https://github.com/scylladb/argus/pull/426 - Client addition to make it possible for SCT to query the run.

Aug 02 '24 21:08 k0machi
