scylla-cluster-tests improvement(test-stages): add ability to skip test stages

The change introduces a feature that allows specific stages of a test to be skipped. This would allow to optionally:

start from a specific test stage (e.g. start from the main load while skipping the prepare load)
skip intermediary test stages (e.g. bypass nemesis disruption stage)

The new functionality is intended to complement the reuse_cluster capability, allowing re-running subset of test stages of initial test scenario. However, it can also be used standalone when certain stages are non-essential for a particular scenario.

Closes: https://github.com/scylladb/qa-tasks/issues/1761

Testing

[ ] :clock1: PR-provision-test with skipped prepare cmd

PR pre-checks (self review)

[ ] I added the relevant backport labels
[x] I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

Sep 17 '24 17:09 dimakr

@fruch draft for the https://github.com/scylladb/qa-tasks/issues/1761 task implementation. Do you think it is something we can continue with? I tried some other options, but they are messy. This one is readable, makes the least amount of changes in the test itself.

Sep 17 '24 17:09 dimakr

@fruch draft for the https://github.com/scylladb/qa-tasks/issues/1761 task implementation. Do you think it is something we can continue with? I tried some other options, but they are messy. This one is readable, makes the least amount of changes in the test itself.

The idea looks quite solid and not too intrusive, we need a bit of examples on how to use it.

We should check how we can override with environment variables, so we would be able to use this also in Jenkins runs

One idea that I've seen is Julia and Roy using a lot in the gradual perf test, running a specific round of a loop, if we can use a similar approach idea for running a specific iteration from a loop. but that can be an extension to this one.

Sep 17 '24 19:09 fruch

I'd propose making it automatic: create a decorator that will:

write to a file info about given function that is completed (e.g line in e.g. /tmp/completed-steps-<test_id> or somewhere in results dir)
If decorator sees that line and is reusing cluster, then skip decorated method.

Only decorate methods that are intended for skipping when reusing (remember about Perf tests methods too).

Sep 18 '24 14:09 soyacz

I'd propose making it automatic: create a decorator that will:

write to a file info about given function that is completed (e.g line in e.g. /tmp/completed-steps-<test_id> or somewhere in results dir)

If decorator sees that line and is reusing cluster, then skip decorated method.

Only decorate methods that are intended for skipping when reusing (remember about Perf tests methods too).

I think implementing the functionality this way would restrict us from some use cases:

using the feature standalone, not only with reuse_cluster. Probably it won't be a common a scenario, but if a user would need to skip some test stages of a test when running it for th 1st time, the feature would help with it.
when reusing a cluster that was created by someone (something) else. E.g. when someone prepares a reproducer, provisions and keeps a cluster and passes it to a dev who needs it for their work (update packages, rerun only subset of test stages, etc.)

What do you think?

Sep 18 '24 17:09 dimakr

I'd propose making it automatic: create a decorator that will:

write to a file info about given function that is completed (e.g line in e.g. /tmp/completed-steps-<test_id> or somewhere in results dir)

If decorator sees that line and is reusing cluster, then skip decorated method.

Only decorate methods that are intended for skipping when reusing (remember about Perf tests methods too).

I think implementing the functionality this way would restrict us from some use cases:

using the feature standalone, not only with reuse_cluster. Probably it won't be a common a scenario, but if a user would need to skip some test stages of a test when running it for th 1st time, the feature would help with it.

Even if such a feature would be needed, then we could add it to this decorator (besides checking a file, check param supplied in params).

when reusing a cluster that was created by someone (something) else. E.g. when someone prepares a reproducer, provisions and keeps a cluster and passes it to a dev who needs it for their work (update packages, rerun only subset of test stages, etc.)

This could be overcomed by keeping info about completed stages on loader (we never destroy loaders).

What do you think?

Generally, what I think is that adding feature doesn't mean people would use it because:

they don't know about it
they don't know how to use
it is not obvious for them which stages might be skipped and which not

Sep 19 '24 06:09 soyacz

I'd propose making it automatic: create a decorator that will:

write to a file info about given function that is completed (e.g line in e.g. /tmp/completed-steps-<test_id> or somewhere in results dir)

If decorator sees that line and is reusing cluster, then skip decorated method.

Only decorate methods that are intended for skipping when reusing (remember about Perf tests methods too).

I think implementing the functionality this way would restrict us from some use cases:

using the feature standalone, not only with reuse_cluster. Probably it won't be a common a scenario, but if a user would need to skip some test stages of a test when running it for th 1st time, the feature would help with it.

Even if such a feature would be needed, then we could add it to this decorator (besides checking a file, check param supplied in params).

regardless of needed or not, I don't want to tie this option with reuse_cluster, and even reuse cluster itself should be tied with local files, consider case I would run it from jenkins, each time from different builder/runner.

when reusing a cluster that was created by someone (something) else. E.g. when someone prepares a reproducer, provisions and keeps a cluster and passes it to a dev who needs it for their work (update packages, rerun only subset of test stages, etc.)

This could be overcomed by keeping info about completed stages on loader (we never destroy loaders).

What do you think?

Generally, what I think is that adding feature doesn't mean people would use it because:

they don't know about it

they don't know how to use

it is not obvious for them which stages might be skipped and which not

those statements are valid for every single feature we develop, if we'll fail to document and advertise it properly.

As for how common is this need, for working on perf test is quite common, and for any development we need to experiment a lot, and for reproductions of issues as well.

this was asking in multiple variations, as the work you had for using jupyter notebooks to exercise some specific SCT code, which people don't know either and don't use it.

Sep 19 '24 07:09 fruch

I'd propose making it automatic: create a decorator that will:

write to a file info about given function that is completed (e.g line in e.g. /tmp/completed-steps-<test_id> or somewhere in results dir)

If decorator sees that line and is reusing cluster, then skip decorated method.

Only decorate methods that are intended for skipping when reusing (remember about Perf tests methods too).

I think implementing the functionality this way would restrict us from some use cases:

using the feature standalone, not only with reuse_cluster. Probably it won't be a common a scenario, but if a user would need to skip some test stages of a test when running it for th 1st time, the feature would help with it.

Even if such a feature would be needed, then we could add it to this decorator (besides checking a file, check param supplied in params).

regardless of needed or not, I don't want to tie this option with reuse_cluster, and even reuse cluster itself should be tied with local files, consider case I would run it from jenkins, each time from different builder/runner.

when reusing a cluster that was created by someone (something) else. E.g. when someone prepares a reproducer, provisions and keeps a cluster and passes it to a dev who needs it for their work (update packages, rerun only subset of test stages, etc.)

This could be overcomed by keeping info about completed stages on loader (we never destroy loaders).

What do you think?

Generally, what I think is that adding feature doesn't mean people would use it because:

they don't know about it

they don't know how to use

it is not obvious for them which stages might be skipped and which not

those statements are valid for every single feature we develop, if we'll fail to document and advertise it properly.

As for how common is this need, for working on perf test is quite common, and for any development we need to experiment a lot, and for reproductions of issues as well.

this was asking in multiple variations, as the work you had for using jupyter notebooks to exercise some specific SCT code, which people don't know either and don't use it.

I agree, it's too hard for most of the people. That's why I proposed something easier to use and more automatic. Just a proposal, we can go the way @dimakr proposed.

Sep 19 '24 07:09 soyacz

@fruch This change adds the underlying mechanism for skipping test stages, and uses it in longevity tests to mark the appropriate test stage methods/calls. I suggest to have a separate PR(s) for enabling the mechanism in tests which structure differ from longevities (performance, manager, tests), as this PR already introduces changes to several dozens of files.

Sep 30 '24 12:09 dimakr

I think one of the main use cases for this are perf regression tests where we play with load a lot and want to skip preloading data. Please check performance_regression_gradual_grow_throughput.PerformanceRegressionPredefinedStepsTest and performance_regression_test.PerformanceRegressionTest

@soyacz I wanted to address non-longevity kind of tests - performance, manager - in the next separatePR, after this one is approved and merged. This one is already hard to maintain.

Oct 10 '24 09:10 dimakr

scylla-cluster-tests scylla-cluster-tests copied to clipboard

improvement(test-stages): add ability to skip test stages

Testing

PR pre-checks (self review)

Reminders

scylla-cluster-tests
scylla-cluster-tests copied to clipboard