
Gemini verification started during teardown and failed because node quorum was unavailable

Open yarongilor opened this issue 1 year ago • 5 comments

Packages

Scylla version: 6.0.3-20240808.a56f7ce21ad4 with build-id 00ad3169bb53c452cf2ab93d97785dc56117ac3e

Kernel Version: 5.15.0-1067-aws

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

  1. Health check failed (due to an issue with the remove-node nemesis).
  2. Teardown started.
  3. Gemini verification started and failed to get quorum.
< t:2024-08-11 15:47:02,573 f:tester.py       l:2887 c:GeminiTest           p:INFO  > TearDown is starting...
< t:2024-08-11 15:47:35,018 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_results()
2024-08-11 15:47:35.015: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b4a77eac-1b45-4690-9c0a-7e80372055af, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
self.verify_results()
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
self.fail(self.gemini_results['results'])
AssertionError: [{'errors': [{'timestamp': '2024-08-11T15:46:28.472514048Z', 'message': 'Validation failed: unable to load check data from the test store: Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1', 'query': 'SELECT * FROM ks1.table1_mv_0 WHERE col8= AND pk0=674687930108493689 AND pk1=8424091603174626176 ', 'stmt-type': 'SelectStatement'}
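
For context, the "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1" message is the coordinator refusing the read because fewer live replicas remain than the requested consistency level needs. Below is a minimal illustrative sketch of the same failure mode using the DataStax Python driver; it is not SCT or Gemini code, and the contact point, table, and key values are placeholders taken from the log above.

```python
# Illustration only: a QUORUM read against a keyspace with too few live
# replicas raises Unavailable, mirroring the Gemini validation error above.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.12.1.195"])      # placeholder contact point (db-node-1)
session = cluster.connect("ks1")

stmt = SimpleStatement(
    "SELECT * FROM table1_mv_0 WHERE pk0 = %s AND pk1 = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
try:
    session.execute(stmt, (674687930108493689, 8424091603174626176))
except Unavailable as exc:
    # With RF=3, QUORUM needs 2 live replicas; here only 1 is alive.
    print(f"required={exc.required_replicas}, alive={exc.alive_replicas}")
```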

Perhaps it would be best if teardown first notified the other threads, or stopped them, in order to avoid such collisions.
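
A minimal sketch of that idea (hypothetical names, not the current SCT API): teardown first signals the background verification/stress threads via an event, waits for them to finish, and only then dismantles the cluster.

```python
# Sketch only: coordinate teardown with background workers so verification
# never runs against a cluster that is already being taken apart.
import threading

stop_requested = threading.Event()

def gemini_verification_loop():
    # Hypothetical background worker: verify repeatedly until asked to stop.
    while not stop_requested.is_set():
        run_one_verification_round()    # hypothetical helper
        stop_requested.wait(timeout=5)  # back off between rounds

def tear_down(worker_threads):
    # Ask workers to stop and wait for them before touching the cluster.
    stop_requested.set()
    for t in worker_threads:
        t.join(timeout=60)
    dismantle_cluster()                 # hypothetical helper
```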

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • gemini-with-nemesis-3h-normal-6-0-oracle-db-node-5d11f833-1 (34.205.191.210 | 10.12.0.124) (shards: 30)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-3 (44.199.250.238 | 10.12.2.31) (shards: 7)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-2 (3.218.145.168 | 10.12.1.17) (shards: 7)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-1 (3.231.146.121 | 10.12.1.195) (shards: 7)

OS / Image: ami-0c6a6957b89f8504f (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 5d11f833-59fd-4573-ba63-afec8d1b175b
Test name: scylla-6.0/gemini/gemini-3h-with-nemesis-test
Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5d11f833-59fd-4573-ba63-afec8d1b175b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5d11f833-59fd-4573-ba63-afec8d1b175b

Logs:

Jenkins job URL Argus

yarongilor avatar Aug 15 '24 12:08 yarongilor

@fruch, do you have any idea if SCT has already bumped into similar issues? Or is there already any suggested improvement?

yarongilor avatar Aug 15 '24 12:08 yarongilor

> @fruch, do you have any idea if SCT has already bumped into similar issues? Or is there already any suggested improvement?

Are you sure of the order of things?

The test isn't supposed to end before the stress commands are finished.

If it stopped because of the test timeout, something isn't working as expected, or the stress ran longer than it was asked to, or the test timeout is too small.

Even if stress is running during teardown, that's not a reason for nodes to be gone.

fruch avatar Aug 15 '24 12:08 fruch

You are completely barking up the wrong tree; that is an abort during the test, during a nemesis that changes topology.

You clearly lost quorum, and SCT has nothing to do with it.

fruch avatar Aug 15 '24 12:08 fruch

This is not an issue with SCT.

Gemini reports its failure once it finishes; it makes no difference whether that happens during teardown or not.

DB nodes are not stopped on teardown.

fruch avatar Aug 15 '24 12:08 fruch

Looking at it again, one node was lost in disrupt_remove_node_then_add_node and wasn't replaced, because of a failure in removenode,

and then one more node was stopped during the enospc nemesis.

This case has only 3 nodes, and two are gone; guess what, Gemini would fail....

fruch avatar Aug 15 '24 13:08 fruch
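
The arithmetic behind the failure is straightforward; a tiny sketch (assuming RF=3, consistent with the "Requires 2" in the Gemini error):

```python
# QUORUM for RF=3 is floor(3/2) + 1 = 2 live replicas.
rf = 3
required = rf // 2 + 1   # 2, matching "Requires 2, alive 1" in the error
alive = 1                # two of the three data nodes were down
assert alive < required  # hence the CL QUORUM failure
```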

Closing; this is not a Gemini nor an SCT issue,

but other issues caused the test to have fewer nodes than Gemini needs in order to function.

fruch avatar Jan 02 '25 13:01 fruch