
Gemini verification started during teardown and failed because node quorum was unavailable

Open yarongilor opened this issue 1 year ago • 5 comments

Packages

Scylla version: 6.0.3-20240808.a56f7ce21ad4 with build-id 00ad3169bb53c452cf2ab93d97785dc56117ac3e

Kernel Version: 5.15.0-1067-aws

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

  1. Health check failed (due to an issue with the remove-node nemesis).
  2. Teardown started.
  3. Gemini verification started and failed to get quorum.
< t:2024-08-11 15:47:02,573 f:tester.py       l:2887 c:GeminiTest           p:INFO  > TearDown is starting...
< t:2024-08-11 15:47:35,018 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_results()
2024-08-11 15:47:35.015: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b4a77eac-1b45-4690-9c0a-7e80372055af, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
self.verify_results()
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
self.fail(self.gemini_results['results'])
AssertionError: [{'errors': [{'timestamp': '2024-08-11T15:46:28.472514048Z', 'message': 'Validation failed: unable to load check data from the test store: Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1', 'query': 'SELECT * FROM ks1.table1_mv_0 WHERE col8= AND pk0=674687930108493689 AND pk1=8424091603174626176 ', 'stmt-type': 'SelectStatement'}
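
For context, the "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1" message is the coordinator refusing the read because fewer live replicas remain than the requested consistency level needs. Below is a minimal illustrative sketch of the same failure mode using the DataStax Python driver; it is not SCT or Gemini code, and the contact point, table, and key values are placeholders taken from the log above.

```python
# Illustration only: a QUORUM read against a keyspace with too few live
# replicas raises Unavailable, mirroring the Gemini validation error above.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.12.1.195"])      # placeholder contact point (db-node-1)
session = cluster.connect("ks1")

stmt = SimpleStatement(
    "SELECT * FROM table1_mv_0 WHERE pk0 = %s AND pk1 = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
try:
    session.execute(stmt, (674687930108493689, 8424091603174626176))
except Unavailable as exc:
    # With RF=3, QUORUM needs 2 live replicas; here only 1 is alive.
    print(f"required={exc.required_replicas}, alive={exc.alive_replicas}")
```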

Perhaps it would be best if teardown first notified the other threads, or stopped them, in order to avoid such collisions.
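
A minimal sketch of that idea (hypothetical names, not the current SCT API): teardown first signals the background verification/stress threads via an event, waits for them to finish, and only then dismantles the cluster.

```python
# Sketch only: coordinate teardown with background workers so verification
# never runs against a cluster that is already being taken apart.
import threading

stop_requested = threading.Event()

def gemini_verification_loop():
    # Hypothetical background worker: verify repeatedly until asked to stop.
    while not stop_requested.is_set():
        run_one_verification_round()    # hypothetical helper
        stop_requested.wait(timeout=5)  # back off between rounds

def tear_down(worker_threads):
    # Ask workers to stop and wait for them before touching the cluster.
    stop_requested.set()
    for t in worker_threads:
        t.join(timeout=60)
    dismantle_cluster()                 # hypothetical helper
```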

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • gemini-with-nemesis-3h-normal-6-0-oracle-db-node-5d11f833-1 (34.205.191.210 | 10.12.0.124) (shards: 30)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-3 (44.199.250.238 | 10.12.2.31) (shards: 7)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-2 (3.218.145.168 | 10.12.1.17) (shards: 7)
  • gemini-with-nemesis-3h-normal-6-0-db-node-5d11f833-1 (3.231.146.121 | 10.12.1.195) (shards: 7)

OS / Image: ami-0c6a6957b89f8504f (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 5d11f833-59fd-4573-ba63-afec8d1b175b
Test name: scylla-6.0/gemini/gemini-3h-with-nemesis-test
Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5d11f833-59fd-4573-ba63-afec8d1b175b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5d11f833-59fd-4573-ba63-afec8d1b175b

Logs:

Jenkins job URL Argus

yarongilor avatar Aug 15 '24 12:08 yarongilor

@fruch, do you have any idea if SCT has already bumped into similar issues? Or is there already any suggested improvement?

yarongilor avatar Aug 15 '24 12:08 yarongilor

> @fruch, do you have any idea if SCT has already bumped into similar issues? Or is there already any suggested improvement?

Are you sure of the order of things?

The test isn't supposed to end before the stress commands are finished.

If it stopped because of the test timeout, something isn't working as expected, or the stress ran longer than it was asked to, or the test timeout is too small.

Even if stress is running during teardown, that's not a reason for nodes to be gone.

fruch avatar Aug 15 '24 12:08 fruch

You are completely barking up the wrong tree; that is an abort during the test, during a nemesis that changes topology.

You clearly lost quorum, and SCT has nothing to do with it.

fruch avatar Aug 15 '24 12:08 fruch

This is not an issue with SCT.

Gemini reports its failure once it finishes; it makes no difference whether that happens during teardown or not.

DB nodes are not stopped on teardown.

fruch avatar Aug 15 '24 12:08 fruch

Looking at it again, one node was lost in disrupt_remove_node_then_add_node and wasn't replaced, because of a failure in removenode,

and then one more node was stopped during the enospc nemesis.

This case has only 3 nodes, and two are gone; guess what, Gemini would fail....

fruch avatar Aug 15 '24 13:08 fruch
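
The arithmetic behind the failure is straightforward; a tiny sketch (assuming RF=3, consistent with the "Requires 2" in the Gemini error):

```python
# QUORUM for RF=3 is floor(3/2) + 1 = 2 live replicas.
rf = 3
required = rf // 2 + 1   # 2, matching "Requires 2, alive 1" in the error
alive = 1                # two of the three data nodes were down
assert alive < required  # hence the CL QUORUM failure
```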

Closing; this is not a Gemini nor an SCT issue,

but other issues caused the test to have fewer nodes than Gemini needs in order to function.

fruch avatar Jan 02 '25 13:01 fruch