
improvement(tombstone-gc): Added a verification to large-partitions tests

Open yarongilor opened this issue 2 years ago • 20 comments

Assuming some deletion nemeses ran in these tests,
the tombstone-gc-verification thread can validate their deletions.

PR pre-checks (self review)

  • [x] I followed KISS principle and best practices
  • [x] I didn't leave commented-out/debugging code
  • [x] I added the relevant backport labels
  • [ ] New configuration options are added and documented (in sdcm/sct_config.py)
  • [ ] I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • [ ] All new and existing unit tests passed (CI)
  • [ ] I have updated the Readme/doc folder accordingly (if needed)

yarongilor avatar Jan 29 '23 17:01 yarongilor

Comment: longevity-large-partition-4days.yaml is not added in this PR since (for some reason) it doesn't have the required parameter for running deletion nemeses - partition_range_with_data_validation

yarongilor avatar Jan 29 '23 17:01 yarongilor

Assuming some deletion nemeses ran in these tests,
the tombstone-gc-verification thread can validate their deletions.

what will happen if, for some reason, no deletion nemeses ran at all?

fgelcer avatar Jan 29 '23 18:01 fgelcer

@yarongilor, resolve the conflict and address the comment left

fgelcer avatar Feb 02 '23 07:02 fgelcer

Assuming some deletion nemeses ran in these tests,
the tombstone-gc-verification thread can validate their deletions.

what will happen if, for some reason, no deletion nemeses ran at all?

If no tombstones are found, then nothing is checked.

yarongilor avatar Feb 09 '23 14:02 yarongilor

@yarongilor , did we have it passing?

fgelcer avatar Feb 09 '23 21:02 fgelcer

@yarongilor , did we have it passing?

Tested now in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-yaron-test/16/

yarongilor avatar Feb 12 '23 09:02 yarongilor

@yarongilor , did we have it passing?

Tested now in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-yaron-test/16/

link is broken

also, precommit checks are failing

fgelcer avatar Feb 12 '23 12:02 fgelcer

Checks are still failing

roydahan avatar Feb 12 '23 12:02 roydahan

@yarongilor , did we have it passing?

Tested now in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-yaron-test/16/

link is broken

also, precommit checks are failing

@fgelcer, unfortunately I don't have permissions to Argus, so in order to work around the Argus issue with new jobs I have to rename new staging jobs to old existing Argus jobs. Link: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-lwt-500G-3d-gc-verification/17/

It looks like the pre-commit is OK, yet fails with an unrelated issue:

13:21:39      Traceback (most recent call last):
13:21:39        File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
13:21:39          return _run_code(code, main_globals, None,
13:21:39        File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
13:21:39          exec(code, run_globals)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 1519, in <module>
13:21:39          main()
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 1104, in main
13:21:39          create_environment(env_dir, args)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 980, in create_environment
13:21:39          install_node(env_dir, src_dir, args)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 739, in install_node
13:21:39          install_node_wrapped(env_dir, src_dir, args)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 762, in install_node_wrapped
13:21:39          download_node_src(node_url, src_dir, args)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 602, in download_node_src
13:21:39          with ctx as archive:
13:21:39        File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
13:21:39          return next(self.gen)
13:21:39        File "/usr/local/lib/python3.10/site-packages/nodeenv.py", line 573, in tarfile_open
13:21:39          tf = tarfile.open(*args, **kwargs)
13:21:39        File "/usr/local/lib/python3.10/tarfile.py", line 1611, in open
13:21:39          saved_pos = fileobj.tell()
13:21:39      AttributeError: 'bytes' object has no attribute 'tell'
13:21:39      
13:21:39  Check the log at /home/jenkins/.cache/pre-commit/pre-commit.log
13:21:39  Trim Trailing Whitespace.................................................Passed
13:21:39  Fix End of Files.........................................................Passed
13:21:39  Check Yaml...............................................................Passed
13:21:39  Check for added large files..............................................Passed
13:21:39  Check JSON...............................................................Passed
13:21:39  Detect AWS Credentials...................................................Passed
13:21:39  Detect Private Key.......................................................Passed
13:21:48  autopep8.................................................................Passed
13:23:29  pylint...................................................................Passed

yarongilor avatar Feb 12 '23 13:02 yarongilor

The test runs OK with nemeses, but there might be an issue with the tombstone-gc-verification thread. Perhaps it's due to the recent refactor of this code.

ubuntu@ip-10-0-3-71:~/sct-results/latest$ grep -i gc_verification sct.log
< t:2023-02-12 11:04:24,896 f:sct_config.py   l:1769 c:sdcm.sct_config      p:INFO  > run_tombstone_gc_verification: '{"ks_cf": "scylla_bench.test", "interval": 300, "propagation_delay_in_seconds":
< t:2023-02-12 11:32:47,555 f:tombstone_gc_verification_thread.py l:172  c:TombstoneGcVerificationThread p:DEBUG > Executed TombstoneGcVerificationEvent

The nemeses ran OK, including deletion ones. I have to start debugging (possibly the refactored code) to see why the thread ran only once and completed immediately.

yarongilor avatar Feb 12 '23 13:02 yarongilor

@yarongilor , did we have it passing?

Tested now in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-yaron-test/16/

link is broken also, precommit checks are failing

@fgelcer, unfortunately I don't have permissions to Argus, so in order to work around the Argus issue with new jobs I have to rename new staging jobs to old existing Argus jobs. Link: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-lwt-500G-3d-gc-verification/17/

What does Argus have to do with the broken link you pasted above?

fgelcer avatar Feb 13 '23 12:02 fgelcer

@yarongilor , please fix conflicts

fgelcer avatar Mar 05 '23 11:03 fgelcer

@fgelcer, I think we wanted to merge this PR but didn't do it. Perhaps we forgot about it.

yarongilor avatar Jun 25 '23 09:06 yarongilor

@fgelcer, I think we wanted to merge this PR but didn't do it. Perhaps we forgot about it.

if you do, you need to ping reviewers and maintainers to review, approve and merge

fgelcer avatar Jun 29 '23 12:06 fgelcer

@fgelcer, I think we wanted to merge this PR but didn't do it. Perhaps we forgot about it.

if you do, you need to ping reviewers and maintainers to review, approve and merge

@roydahan, this is part of the tombstone-gc-verification coverage that didn't get merged.

yarongilor avatar Jul 02 '23 08:07 yarongilor

Where can I see the run results that show this passing?

roydahan avatar Jul 02 '23 16:07 roydahan

Still want to see full runs on 2023.1 before we merge and backport.

roydahan avatar Jul 02 '23 17:07 roydahan

Still want to see full runs on 2023.1 before we merge and backport.

@yarongilor , do we have runs on 2023.1?

fgelcer avatar Jul 09 '23 06:07 fgelcer

This PR is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 09 '24 00:07 github-actions[bot]

@yarongilor , do we have runs on 2023.1?

@yarongilor Could you please rebase this and try it out on 2024.1 so we can push this over the finish line? (Unless it is no longer needed, of course.)

pehala avatar Sep 04 '24 11:09 pehala

@yarongilor please make sure you fully test it and add the links to Argus here.

roydahan avatar Nov 12 '24 13:11 roydahan

Currently the tombstone-gc thread doesn't run. I have to check what's broken.

yarongilor avatar Nov 12 '24 19:11 yarongilor

@yarongilor please make sure you fully test it and add the links to Argus here.

Fixed and tested in: 200k-pk large partitions 8h

yarongilor avatar Nov 13 '24 15:11 yarongilor

Fixed and tested in:

When you claim this is tested but refer to FAILED, uninvestigated tests with nothing in their "Discussion" tab, what do you expect us to do?

roydahan avatar Nov 14 '24 23:11 roydahan

Fixed and tested in:

When you claim this is tested but refer to FAILED, uninvestigated tests with nothing in their "Discussion" tab, what do you expect us to do?

Right, I moved this PR from ready to draft since I have to check the fixes and verify the tests passed OK. I'll update the results soon and switch the PR back to ready.

yarongilor avatar Nov 27 '24 16:11 yarongilor

Found another thing that should be fixed: this code block:

                query = f"SELECT repair_time from system.repair_history WHERE keyspace_name = '{self.keyspace}' " \
                        f"AND table_name = '{self.table}' ALLOW FILTERING;"
                results = session.execute(query)
                output = results.all()
                self.log.debug('SELECT repair_time results: %s', output)

prints a huge amount of mostly unneeded data to the log, like:

< t:2024-11-13 22:35:26,628 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'SELECT repair_time from system.repair_history WHERE keyspace_name = 'scylla_bench' AND table_name = 'test' ALLOW FILTERING;' ...
< t:2024-11-13 22:35:26,723 f:sstable_utils.py l:173  c:SstableUtils         p:DEBUG > SELECT repair_time results: [Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), Row(repair_time=datetime.datetime(2024, 11, 13, 15, 52, 39)), R

yarongilor avatar Dec 04 '24 17:12 yarongilor

Fixed in two places:

+++ b/sdcm/tombstone_gc_verification_thread.py
@@ -47,7 +47,15 @@ class TombstoneGcVerificationThread:
             return
         table_repair_date = datetime.datetime.strptime(table_repair_date, '%Y-%m-%d %H:%M:%S')
         delta_repair_date_minutes = int(
-            (datetime.datetime.now() - table_repair_date).seconds / 60) - self._sstable_utils.propagation_delay_in_seconds
+            (datetime.datetime.now() - table_repair_date).total_seconds() / 60
+        ) - self._sstable_utils.propagation_delay_in_seconds
+        self.log.debug(
+            "Calculated delta_repair_date_minutes: %d, table_repair_date: %s, propagation_delay: %d",
+            delta_repair_date_minutes,
+            table_repair_date,
+            self._sstable_utils.propagation_delay_in_seconds,
+        )
+
+++ b/sdcm/utils/sstable/sstable_utils.py
@@ -170,10 +170,12 @@ class SstableUtils:
                         f"AND table_name = '{self.table}' ALLOW FILTERING;"
                 results = session.execute(query)
                 output = results.all()
-                self.log.debug('SELECT repair_time results: %s', output)
-                if len(output) == 0:
+                output_length = len(output)
+                if output_length == 0:
                     self.log.debug('No repair history found for %s.%s', self.keyspace, self.table)
                     return None
+                self.log.debug('Number of rows in repair_time results: %d', output_length)
+                self.log.debug('Last row in repair_time results: %s', output[-1])

and retested
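
The .seconds vs .total_seconds() change in the diff above is a classic timedelta pitfall; a minimal standalone illustration (not SCT code):

```python
from datetime import timedelta

# timedelta.seconds is only the seconds *component* of the delta
# (0-86399); whole days are silently dropped. total_seconds()
# returns the full duration, which is what the repair-age
# calculation needs.
delta = timedelta(days=2, seconds=30)
print(delta.seconds)          # 30 -- the two days are ignored
print(delta.total_seconds())  # 172830.0 -- 2 * 86400 + 30
```

With .seconds, a table repaired two days ago would look like it was repaired 30 seconds ago, so the delta-based verification would silently skip stale tombstones.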

yarongilor avatar Dec 04 '24 18:12 yarongilor

After fixing the test issues, I found out that the sstabledump usage is partially obsolete and outdated. It was updated for another test in the PR https://github.com/scylladb/scylla-cluster-tests/pull/6828 by @fruch, so I'll have to adjust it similarly here as well. @pehala, we'll just have to confirm it's OK for this fix to go in as part of this PR rather than a separate one.

A related Scylla error to work around is:

scyllaadm@test-tombstone-gc-thread-master-db-node-baa8b9e9-3:~$ scylla sstable dump-data   /var/lib/scylla/data/scylla_bench/test-0ccfc260b30711ef91160450974f66b6/me-3glt_10ej_33vs028yaawokwr72b-big-Data.db --scylla-yaml-file /etc/scylla/scylla.yaml  --logger-log-level scylla-sstable=debug
DEBUG 2024-12-05 13:14:14,970 [shard 0:main] scylla-sstable - Successfully read scylla.yaml from user provided location of /etc/scylla/scylla.yaml
DEBUG 2024-12-05 13:14:14,970 [shard 0:main] scylla-sstable - Single schema source provided
error processing arguments: could not load schema via schema-tables: std::runtime_error (Failed to find scylla_bench.test in schema tables)

yarongilor avatar Dec 05 '24 13:12 yarongilor

It looks like the tombstone-verification thread runs OK with the new dump-data command, but the tombstones are not found by SCT grepping for "marked_deleted", so there might be a different representation to look for. The sstabledump output now has, for example:

0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"}}}]},{"key":{"token":"-225499587252888099","raw":"0008000000000000007e","value":"126"},"tombstone":{"timestamp":1738230562965937,"deletion_time":"2025-01-30 09:49:23z"}},{"key":{"token":"-197064988801513361","raw":"00080000000000000032","value":"50"},"tombstone":{"timestamp":1738230562965937,"deletion_time":"2025-01-30 09:49:23z"}},{"key":{"token":"-185374451944754211","raw":"000800000000000000b8","value":"184"},"tombstone":{"timestamp":1738230562965937,"deletion_time":"2025-01-30 09:49:23z"}},{"key":{"token":"-149517308796957695","raw":"000800000000000000f1","value":"241"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"00080000000000000000","value":"0"},"marker":{"timestamp":1738230561815345},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1738230561815345,"value":"0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

So this JSON output represents a tombstone in a different way:

"tombstone":{"timestamp":1738230562965937,"deletion_time":"2025-01-30 09:49:23z"}}

Dtest replaced numfound = jsoninfo.count("marked_deleted") with numfound = sum("tombstone" in partition for partition in jsoninfo) in https://github.com/scylladb/scylla-dtest/pull/3444.

@tchaikov, can you please advise: would it be right to keep using the "marked_deleted" search for Scylla versions older than 2023.1 (which use the old "sstabledump" command), and use the new search format otherwise?

@pehala, I think it would be good practice for any Scylla tool/API change to at least be communicated to SCT maintainers. This change wasn't reflected in SCT for more than a year. I guess we might have had a chance to catch it if the test_switch_tombstone_gc_modes test had been executed.
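
A version-gated counting sketch along the lines of the dtest change referenced above (hypothetical helper, not the actual SCT code; the legacy/new split is an assumption pending @tchaikov's answer):

```python
import json

def count_tombstones(dump_output: str, legacy_sstabledump: bool) -> int:
    """Count tombstones in sstable dump output.

    legacy_sstabledump: True for the old `sstabledump` tool, whose text
    output marks tombstones with "marked_deleted"; False for the new
    `scylla sstable dump-data` command, which emits a JSON list of
    partitions, each carrying an optional "tombstone" object.
    """
    if legacy_sstabledump:
        # Old format: a plain substring count is enough.
        return dump_output.count("marked_deleted")
    # New format: count partitions that contain a "tombstone" key.
    partitions = json.loads(dump_output)
    return sum("tombstone" in partition for partition in partitions)

# Trimmed-down example of the new dump-data shape quoted above:
new_format = (
    '[{"key": {"value": "126"}, '
    '"tombstone": {"timestamp": 1738230562965937, '
    '"deletion_time": "2025-01-30 09:49:23z"}}, '
    '{"key": {"value": "50"}}]'
)
print(count_tombstones(new_format, legacy_sstabledump=False))  # 1
```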

yarongilor avatar Jan 30 '25 10:01 yarongilor

Just don't backport any of those changes to 2023.1.

Only to branches that really need it.

Since this PR has been open for a whole year now, I guess it's not really needed in any release branch (it's too late).

As for the notification, that's not enough; someone would need to attend to it.

If that test isn't running regularly, no one will notice. If this test is important enough, make sure it runs regularly.

Regardless, the change sounds quite straightforward and simple (it could even be done quickly during a release).

fruch avatar Jan 30 '25 11:01 fruch