scylla-cluster-tests
GCE tests fail to start due to an issue with the SSH connection (spot failure wasn't visible in Argus)
Packages
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
error Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.142.0.75:22' - timed out
Possibly a regression introduced in https://github.com/scylladb/scylla-cluster-tests/pull/7461
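For what it's worth, the bare "Failed to connect in 60 seconds" message is exactly what makes an early preemption look like a plain SSH problem. Since the SCT runner sits outside the node, one way to tell the two apart is to ask the Compute API for the instance status before reporting the timeout. A minimal sketch, assuming the google-cloud-compute client library is available; the function names are illustrative, not SCT's actual API:

```python
def likely_preempted(status: str) -> bool:
    """A spot/preemptible VM that disappears mid-setup usually shows up
    as STOPPING or TERMINATED in the Compute API."""
    return status in ("STOPPING", "TERMINATED")


def explain_ssh_timeout(project: str, zone: str, name: str) -> str:
    """Hypothetical helper: annotate an SSH timeout with the VM status.

    Imported lazily so the module works without google-cloud-compute
    installed; the .get() flattened-args call is the library's real API.
    """
    from google.cloud import compute_v1

    instance = compute_v1.InstancesClient().get(
        project=project, zone=zone, instance=name)
    if likely_preempted(instance.status):
        return f"node {name} is {instance.status}: probably preempted"
    return f"node {name} is {instance.status}: SSH problem is elsewhere"
```

With something like this in the connect-retry loop, the failure above could have said "probably preempted" instead of a generic timeout.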
Impact
Test fails without execution.
How frequently does it reproduce?
Installation details
Cluster size: 6 nodes (n2-highmem-16)
Scylla Nodes used in this run:
- longevity-10gb-3h-master-db-node-107e5ca5-0-6 (34.73.197.55 | 10.142.0.74) (shards: 14)
- longevity-10gb-3h-master-db-node-107e5ca5-0-5 (35.196.139.154 | 10.142.0.73) (shards: 14)
- longevity-10gb-3h-master-db-node-107e5ca5-0-4 (34.23.238.116 | 10.142.0.72) (shards: 14)
- longevity-10gb-3h-master-db-node-107e5ca5-0-3 (34.148.55.75 | 10.142.0.71) (shards: 14)
- longevity-10gb-3h-master-db-node-107e5ca5-0-2 (104.196.107.253 | 10.142.0.63) (shards: 14)
- longevity-10gb-3h-master-db-node-107e5ca5-0-1 (34.75.26.124 | 10.142.0.62) (shards: 14)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5653798310498836444 (gce: undefined_region)
Test: longevity-10gb-3h-gce-test
Test id: 107e5ca5-3de7-45b3-885c-b647450cde77
Test name: scylla-master/longevity/longevity-10gb-3h-gce-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 107e5ca5-3de7-45b3-885c-b647450cde77 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 107e5ca5-3de7-45b3-885c-b647450cde77
Logs:
- db-cluster-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/db-cluster-107e5ca5.tar.gz
- sct-runner-events-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/sct-runner-events-107e5ca5.tar.gz
- sct-107e5ca5.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/sct-107e5ca5.log.tar.gz
- loader-set-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/loader-set-107e5ca5.tar.gz
- monitor-set-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/monitor-set-107e5ca5.tar.gz
- parallel-timelines-report-107e5ca5.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/107e5ca5-3de7-45b3-885c-b647450cde77/20240525_000907/parallel-timelines-report-107e5ca5.tar.gz
BTW, upgrade tests seem to work, but they use a different OS image.
03:08:41 raise TestFailure(f"Got critical event: {event}")
03:08:41 sdcm.sct_events.events_analyzer.TestFailure: Got critical event: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=a3558b96-8453-4f5b-aa27-ac02dfa05a30: node=Node longevity-10gb-3h-master-loader-node-107e5ca5-0-1 [35.190.150.148 | 10.142.0.75] message=Instance was preempted.
@soyacz if preemption happens too early, we might get these confusing errors
Uh, I didn't spot that. Is it worth/feasible to fix this confusion?
Yes, it's worth it; we are running into it quite a lot. I think I've opened an issue about it more than once...
Seems like events weren't reported to Argus.
The Jenkins job is gone now, so we can't really investigate. Next time we need to look at such a case right away to collect the findings.
found more recent case of it: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-gce-test/573
The send-mail stage failed like this:
03:10:22 (no stderr)
03:10:22 ['/home/ubuntu/sct-results/20240612-235801-036873/test_id']
03:10:22 Results file not found
03:10:23 Cleaning SSH agent
03:10:23 Agent pid 7182 killed
Seems like we're getting the preemption during setup (i.e. while running node benchmarks, which were disabled recently).
======================================================================
ERROR: test_custom_time (longevity_test.LongevityTest)
Run cassandra-stress with params defined in data_dir/scylla.yaml
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 182, in wrapper
return method(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
res = func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 909, in setUp
self.init_resources()
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1864, in init_resources
self.get_cluster_gce(loader_info=loader_info, db_info=db_info,
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1207, in get_cluster_gce
self.db_cluster = ScyllaGCECluster(gce_image=self.params.get('gce_image_db').strip(),
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_gce.py", line 553, in __init__
super().__init__(
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3908, in __init__
super().__init__(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_gce.py", line 273, in __init__
super().__init__(cluster_uuid=cluster_uuid,
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3207, in __init__
self.run_node_benchmarks()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3726, in run_node_benchmarks
self.node_benchmark_manager.run_benchmarks()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/benchmarks.py", line 124, in run_benchmarks
parallel.run(lambda x: x.run_benchmarks(), ignore_exceptions=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 483, in run
result = future.result(time_out)
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 453, in result
self._condition.wait(timeout)
File "/usr/local/lib/python3.10/threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 276, in critical_failure_handler
raise CriticalTestFailure("Critical Error has failed the test") # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test
======================================================================
FAIL: test_custom_time (longevity_test.LongevityTest)
Run cassandra-stress with params defined in data_dir/scylla.yaml
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/events_analyzer.py", line 50, in run
raise TestFailure(f"Got critical event: {event}")
sdcm.sct_events.events_analyzer.TestFailure: Got critical event: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=f196caa7-9bc5-4501-ad26-944759600d00: node=Node longevity-10gb-3h-master-db-node-537a1c92-0-6 [34.23.91.92 | 10.142.0.65] message=Instance was preempted.
----------------------------------------------------------------------
From the SCT log, we can see we got here:
self.destroy_localhost()
> self.stop_event_device()
if self.params.get('collect_logs'):
self.collect_sct_logs()
with silence(parent=self, name='Cleaning up SSL config directory'):
cleanup_ssl_config()
self.finalize_teardown()
self.argus_finalize_test_run()
self.argus_heartbeat_stop_signal.set()
but we never got into argus_finalize_test_run(), which sends out the events
Killing the test seems to kill the teardown in the middle as well. With the wrong timing, it's not clear how we can prevent it.
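For the exception half of the problem (one failing step aborting the rest of the teardown), each step could run in its own guard so that the Argus update still gets a chance to execute. This is only a sketch with illustrative names, not SCT's actual teardown code, and it obviously can't help against a hard process kill:

```python
import logging

log = logging.getLogger(__name__)


def run_step(name, func):
    """Run one teardown step; never let its failure stop the rest."""
    try:
        func()
        return True
    except Exception:  # teardown must keep going no matter what
        log.exception("teardown step %r failed, continuing", name)
        return False


def tear_down(steps):
    """steps: list of (name, callable). Returns names of failed steps."""
    return [name for name, func in steps if not run_step(name, func)]
```

Ordering the list so the event upload comes before the slow cleanup steps would also shrink the window in which a kill can swallow it.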
Maybe we should send the events once again, just in case, in a follow-up stage?
@k0machi let's send the events again, during the log collection phase
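Since every event already carries an event_id (see the SpotTerminationEvent lines above), a resend during the log collection phase can be made idempotent by skipping IDs that were already submitted. A minimal sketch of that dedup idea; the submit callable and the in-memory set are stand-ins for whatever store the real backend would use:

```python
class EventResender:
    """Resend events idempotently: each event_id is submitted at most once."""

    def __init__(self, submit):
        self._submit = submit   # callable(event_dict) -> None
        self._seen = set()      # event_ids already pushed

    def send(self, event):
        eid = event["event_id"]
        if eid in self._seen:
            return False        # duplicate, skip silently
        self._submit(event)
        self._seen.add(eid)
        return True
```

The same check done server-side would let SCT blindly resend all ERROR/CRITICAL events in a later stage without duplicating anything in Argus.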
Do we have a way to check with Argus whether it got the events yet or not?
Did you ever consider sending events as they happen (not only at the end)? I mean only ERROR/CRITICAL ones.
Argus needs to support it first.
Not at this moment, so we'd have to either add some logic on the backend to deduplicate events, or manually call the get_run_by_id endpoint through the client and then check if the events field is populated.
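With a client-side get_run_by_id (the thread mentions such an endpoint; its exact client signature is an assumption here), a follow-up stage could resend only when the run's events field is still empty. A sketch with hypothetical client methods:

```python
def events_delivered(run: dict) -> bool:
    """True if the Argus run payload already has events attached."""
    return bool(run.get("events"))


def maybe_resend(client, run_id, events):
    """Resend events only if the run in Argus has none; returns the
    number of events submitted.

    client.get_run_by_id mirrors the endpoint named in this thread;
    client.submit_event is a hypothetical submission call.
    """
    run = client.get_run_by_id(run_id)
    if events_delivered(run):
        return 0
    for event in events:
        client.submit_event(run_id, event)
    return len(events)
```

That keeps the happy path (events already delivered via argus_finalize_test_run) a no-op, so the resend stage only matters when teardown was cut short.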
https://github.com/scylladb/argus/pull/426 - Client addition to make it possible for SCT to query the run.