scylla-cluster-tests

fix(nemesis): skip the `mgmt_restore` nemesis as unstable

Open vponomaryov opened this issue 1 year ago • 7 comments

Testing

  • [ ]

PR pre-checks (self review)

  • [x] I added the relevant backport labels
  • [x] I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

vponomaryov, Dec 28 '23 09:12

@mykaul & @tzach, please approve this change. It will drop the current coverage of the mgmt_restore nemesis, which creates instability in regression testing.

Someone from the manager team probably needs to map all the cases in which this nemesis can fail and decide how to proceed. For example:

  • Restoring OSS to Enterprise: should it work, and how far back? (Only matching releases, e.g. 5.4 to 2024.1, or also 5.2 to 2024.1?)
  • Restoring Enterprise to Enterprise, and how far back.
  • Restoring a non-encrypted backup to a fully encrypted cluster.
  • Restoring a backup made in one region to a cluster running in another region (the restore works, but reads fail until someone realizes the keyspace must be altered to the correct region).
  • Etc.
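To illustrate the cross-region case, here is a minimal sketch of the keyspace change it refers to. It assumes the keyspace uses NetworkTopologyStrategy keyed by datacenter name; the helper name, keyspace name, and datacenter name are hypothetical and not taken from scylla-cluster-tests.

```python
def alter_keyspace_replication(keyspace: str, datacenter: str, replication_factor: int) -> str:
    """Build a CQL statement that points a keyspace at the given datacenter.

    Hypothetical helper: it only formats the statement and does not run it.
    """
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{datacenter}': {replication_factor}}}"
    )


# Example with made-up names: the restored keyspace is re-targeted to the
# datacenter of the cluster it was restored into.
statement = alter_keyspace_replication("keyspace1", "us-east", 3)
print(statement)
```

The resulting statement would still have to be executed against the restored cluster before reads succeed.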

roydahan, Dec 28 '23 21:12

The problem is not just the test: we have a real regression in restore that we need to fix in 5.4. The test should be fixed to capture this issue.

tzach, Dec 31 '23 08:12

The test is what found the problem and it's one of the edge cases I mentioned above.

However, the test finds too many problems related to all the questions above, and there is one general issue tracking the test failure (currently assigned to @rayakurl).

roydahan, Dec 31 '23 11:12

@roydahan, @vponomaryov - @dkropachev is already working on fixing the problem with the nemesis. You can monitor the progress in https://github.com/scylladb/scylla-cluster-tests/pull/7029. This PR should be closed.

rayakurl, Dec 31 '23 11:12

Let's merge https://github.com/scylladb/scylla-cluster-tests/pull/7029 instead of disabling the nemesis; it will make it work for Scylla up to 5.2.

For Scylla 5.4, the restore procedure is not working due to https://github.com/scylladb/scylladb/issues/16349, so you would probably want to disable the nemesis for 5.4.
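A per-version skip like the one suggested here could look roughly like the sketch below. The function names and the set of affected release series are assumptions for illustration only; this is not how scylla-cluster-tests selects nemeses.

```python
import re


def release_series(version: str) -> str:
    """Return the 'major.minor' series of a version string such as '5.4.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def should_skip_mgmt_restore(version: str, affected_series=frozenset({"5.4"})) -> bool:
    """Hypothetical gate: skip the nemesis only on series where restore is known to be broken."""
    return release_series(version) in affected_series


for candidate in ("5.2.11", "5.4.1"):
    print(candidate, should_skip_mgmt_restore(candidate))
```

Keeping the affected series in one set makes it easy to add or drop a release once the underlying restore issue is resolved.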

dkropachev, Jan 07 '24 17:01

If so, we do need to disable it on master. FYI, it never reached the 5.2 branch; it wasn't ready when it started.

If it's broken for 5.4 and 2024.1, we should disable it until it is proven to work.

fruch, Jan 07 '24 17:01

We also need to consider the following bug:

  • https://github.com/scylladb/scylla-cluster-tests/issues/7122

It may significantly affect the results, depending on which mgmt snapshot is used for the restore operation.

vponomaryov, Jan 19 '24 17:01