Optimize test setup time
Test setup includes several steps: provisioning instances and setting up db nodes, loaders, and the monitor. Some steps perform installations that may not be necessary or could be replaced with an image (e.g. a monitor image), and some could run in parallel but are currently done serially because they need input from a previous step, etc.
This task should start by mapping run times for a typical longevity setup, per step and per node, noting which steps run in parallel to each other and which are serial, and which run in parallel across all nodes and which are per-node.
Then, we should start creating the sub-tasks to optimize each step.
https://github.com/scylladb/scylla-cluster-tests/issues/7025 is one such example.
Referencing the prepare monitor image PR: https://github.com/scylladb/scylla-cluster-tests/pull/7305 - it has a great breakdown of the timings.
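As a starting point, here is a minimal sketch (not existing SCT code) of how per-step durations could be captured for such a mapping; `timed_step` and `STEP_DURATIONS` are hypothetical names:

```python
# Hypothetical helper for mapping setup run times per step, per node.
# Names (timed_step, STEP_DURATIONS) are illustrative, not existing SCT APIs.
import time
from contextlib import contextmanager

STEP_DURATIONS: dict[str, float] = {}

@contextmanager
def timed_step(name: str):
    """Record the wall-clock duration of a setup step under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STEP_DURATIONS[name] = time.perf_counter() - start

# Example usage around (pretend) setup phases:
# with timed_step("init_resources"):
#     cluster.init_resources()
# with timed_step("init_db_cluster"):
#     cluster.wait_for_init()
```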
Please find my analysis of test setup timings (numbers in nested trees are parts of the root duration):

- SCT start (first log line) - out of scope (possibly we can save some time in the preceding Jenkins steps)
- TEST_START event (`message=TEST_START`) - 15s
  - `SCTConfiguration()` - 5s
  - events devices setup - 6s
- update certificates - <1s
- init_resources - 297s (hardware)
  - getting security groups and subnets - 1s
  - init 6 db nodes (`Init nodes`) - 200s, sequentially:
    - detect distro, set hostname etc. - 2s
    - syslogng/ssh setup (mostly syslogng) - 29s per node
  - init 2 loader nodes (`Init nodes`) - 47s
    - similar, but a bit faster than db nodes
  - init 1 monitor node (`Init nodes`) - 45s
- Init db cluster (`Class instance: Cluster`) - 1207s
  - init wrap: 1167s (`TestConfig duration`) (parallel node setup)
  - update scylla.yaml - set seeds and seed provider, sequentially - 23s
  - check node health - 6s
  - set auth rf - 12s
- init loaders - 333s
  - init wrap: 332s (`TestConfig duration`)
- init monitor nodes - 892s
  - init wrap: 891s (`TestConfig duration`)
- Argus collect packages - 22s
- Argus get scylla version - 1s
- validate seeds, raft - 20s

Setup altogether: ~46m30s (2790s)
My proposal:
- init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
- run init of db cluster/loaders/monitor nodes in parallel - so it takes half the time (= the time to spin up the db cluster, which takes the longest)
- Argus collect packages/version, validate seeds/raft etc. can run in a background one-off thread while the test proceeds
- investigate further why init db cluster takes 20m
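A rough sketch of the second bullet above (parallel init of db cluster/loaders/monitor), assuming the three existing init phases are independent enough to run in separate threads; the `init_*` callables are placeholders, not actual SCT function names:

```python
# Sketch only: run the three init phases concurrently so total setup time
# approaches the slowest phase (the db cluster) instead of their sum.
from concurrent.futures import ThreadPoolExecutor

def parallel_setup(init_db_cluster, init_loaders, init_monitors):
    # The three callables are placeholders for the existing init steps.
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            "db_cluster": executor.submit(init_db_cluster),
            "loaders": executor.submit(init_loaders),
            "monitors": executor.submit(init_monitors),
        }
        # .result() re-raises any exception from the corresponding phase.
        return {name: future.result() for name, future in futures.items()}
```

The Argus collection/validation bullet could similarly be pushed to a daemon thread that the test does not wait on.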
I agree with the above proposal and direction.
> My proposal:
> - init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
We are doing it via cloud-init already; we repeat the whole thing for cases where the cloud-init setup wasn't working, and we need to see if that is still needed or not.
> - run init of db cluster/loaders/monitor nodes in parallel - so it takes half the time (= the time to spin up the db cluster, which takes the longest)
> - Argus collect packages/version, validate seeds/raft etc. can run in a background one-off thread while the test proceeds
> - investigate further why init db cluster takes 20m
As Roy said, this is a good plan to continue with
> My proposal:
> - init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
>
> We are doing it via cloud-init already; we repeat the whole thing for cases where the cloud-init setup wasn't working, and we need to see if that is still needed or not.
I think I saw syslogng being reinstalled - I'll investigate why the script does not detect that it's already installed and skip it.
I wonder (did not check) - can you replace syslogng with Vector (https://vector.dev/)? We've implemented it in the cloud and it should be in the base AMI (@d-helios, @yaronkaikov might be able to provide more information). This would reduce the need to install syslog in the first place and allow testing Vector while at it. It also has a good client-side filtering capability, which might be useful for your needs (unsure).
> - init monitor nodes - 892s
>   - init wrap: 891s (`TestConfig duration`)
Do we have logs for this?
> - init monitor nodes - 892s
>   - init wrap: 891s (`TestConfig duration`)
>
> Do we have logs for this?
Yes, and in another PR we shaved more time off it by using monitoring images (which we have for AWS/GCP): https://github.com/scylladb/scylla-cluster-tests/pull/7305#issuecomment-2030531485
The `setup in parallel` change is merged. Unfortunately, due to a different issue I couldn't get stats on how it helped the daily sanity jobs; we need to wait a couple of days to get a broader view. Another run (with the issue fixed) - longevity-5gb-1h - is shortened by ~20 minutes.
I'll update @fruch with time savings for the other jobs when I have results.
In the meantime, I'll proceed with next steps:
- Disable swap on monitor
- Analyze DB setup time, find opportunities to shorten it
Another place for savings is possibly teardown, and also other Jenkins pipeline steps - but this will be handled after the two items above.
- CI job without the `setup in parallel` commit:
  - https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-alternator-ttl-big-dataset-test/13/consoleFull
  - Diff between config print (`10:30:25,306`) and test start (`11:09:23,897`): 38m58s
- The same CI job with the `setup in parallel` commit:
  - https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-alternator-ttl-big-dataset-test/14/consoleFull
  - Diff between config print (`12:02:40,664`) and test start (`12:20:56,959`): 18m16s
So, merging the `setup in parallel` commit saved 20m42s out of the 38m58s - more than half of the setup time.
@soyacz great job!
Further db setup analysis (1-node setup for convenience):
- setup events devices - 0.02s
- wait ssh up - unknown
- disable iptables/firewalld (rhel-like only)
- check pro status (ubuntu only) - 0.55s
- update repo cache - 2.5s
- install lsof, net-tools - ~2s
- is_scylla_installed - 0.5s
- disable daily triggers - ~68s
- install syslog-ng-exporter - ~146s
- get nic devices - 0.04s
- waiting for preinstalled scylla - ~0.04s
- save_kallsyms_map - 2.5s
- backtrace decoding setup - 17s
- config_setup:
  - scylla yaml - 4.5s
  - process scylla args - 1s
  - fix systemd config - 1.7s
- _scylla_post_install - only if installing scylla (not by default)
- prepare saslauthd (not by default)
- stop_scylla_server - 0.5s
- clean scylla data - 2.2s
- remove /etc/scylla/ami_disabled - 0.5s
- additional data volume setup (not by default)
- setup hybrid raid (not by default)
- overriding scylla.d files (not by default)
- increase_jmx_heap_memory (not a problem for new scylla installations = not by default)
- install scylla manager (08:57:23,750 → 08:57:53,602) - ~30s

Setup all together: ~280 seconds
Startup process:
- get io.conf - 0.5s
- start scylla server - 4.5s (not waiting for startup to complete, so the items below run in parallel to scylla startup)
  - adaptive timeout start - 0.5s
  - get io.conf - 0.5s
  - start/enable scylla manager agent - 2s
- wait db up (one-node setup):
  - wait for port occupied - 60.2s (long because of the 1m poll interval)
  - _report_housekeeping_uuid - 3s
- get node status - 8.5s
- check_nodes_status - 5s (oddly, log messages from this method show up after setup is finished)

Startup: ~86 seconds
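For reference, the 60.2s "wait for port occupied" above is dominated by the 1-minute poll interval; here is a minimal sketch of polling the CQL port with a shorter interval (illustrative only, not the actual SCT wait helper; 9042 assumed as the CQL port):

```python
# Sketch: poll the CQL port with a short interval instead of a 60s one,
# so a node that is ready after e.g. 70s is detected at ~70s, not ~120s.
import socket
import time

def wait_for_port(host: str, port: int = 9042,
                  timeout: float = 600, interval: float = 10) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```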
I think installing syslog-ng-exporter, disabling the daily triggers and installing scylla-manager could be postponed to a later stage and/or run in the background - this would save ~4m (out of ~20m currently).
Possibly they could run in parallel to the cluster setup (a bit risky if setup installs packages) / startup, or be partially moved to cloud-init.
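A minimal sketch of the "postpone and run in the background" idea, assuming the deferred steps don't contend with the main setup for the package manager; the step callables and names are placeholders, not SCT APIs:

```python
# Sketch: kick off non-critical node-setup steps in a background thread
# and only join before the point where the test actually needs them.
import threading

def start_background_setup(node, steps):
    """steps: list of callables, e.g. lambda: install_syslog_ng_exporter(node)."""
    def run_all():
        for step in steps:
            step()
    thread = threading.Thread(target=run_all, name=f"bg-setup-{node}", daemon=True)
    thread.start()
    return thread  # caller can thread.join() once the results are required
```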
@soyacz and @vponomaryov thanks for the details.
First, let's remember the end goal: to deploy a full test cluster (db nodes, monitor and loader) as fast as possible. So, once we reach the point where all three entities are set up in parallel, it doesn't matter if we shave a few more seconds here and there, as long as we shorten the longest one (which should be the db nodes) to the minimum (hence, for example, it's pointless to deal now with removing the swap from the monitor, which would surely cause us issues).
Back to focusing on db nodes, as @soyacz wrote:
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
- disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
- For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
> Back to focusing on db nodes, as @soyacz wrote:
>
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
>
> These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
>
> - disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
There are multiple services to stop and disable - each takes time. Possibly it can be sped up by stopping/disabling multiple services in one command (I don't know why they were separated, so I need to check first). Stopping/disabling services should be no problem, which is why I would consider disabling them in the cloud-init script (multiple in one command).
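For reference, systemctl accepts several units in one invocation, so the per-service calls could probably be collapsed; a sketch assuming the usual Ubuntu daily-trigger units (the exact unit list used by SCT may differ):

```python
# Sketch: disable and stop several daily-trigger units with one systemctl call
# instead of one call per service. Unit names are illustrative.
import subprocess

DAILY_TRIGGER_UNITS = [
    "apt-daily.timer",
    "apt-daily-upgrade.timer",
    "unattended-upgrades.service",
]

def disable_daily_triggers():
    subprocess.run(
        ["sudo", "systemctl", "disable", "--now", *DAILY_TRIGGER_UNITS],
        check=True,
    )
```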
> - For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
syslog-ng-exporter is a new thing for monitoring syslogng itself. Node setup seemed like a good place for it, but from the test perspective it's not critical and can be postponed to a later stage. I'd propose to do it in parallel to node startup (in the init wrapper).
> Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
Yes. Currently a 6-node cluster startup takes ~370 seconds and nodes are started serially. In my measurements I wanted to highlight the steps that are done for each node. E.g. we verify nodetool status, which in my opinion is redundant (we should do it only at the end), and wait_db_up could use a finer-grained polling step (like 10s instead of 60s), as otherwise we can't get below 60s/node (actually this is an easy fix I'll propose today).
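A tiny sketch of the "verify status only at the end" idea; `start_node` and `verify_nodes_up` are placeholders for the existing per-node start and a final nodetool-status-style check:

```python
# Sketch: drop the per-node status check during serial startup and verify the
# whole cluster once at the end, saving one nodetool round-trip per node.
def start_cluster(nodes, start_node, verify_nodes_up):
    for node in nodes:
        start_node(node)       # wait only for this node's CQL port
    verify_nodes_up(nodes)     # single cluster-wide status check at the end
```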
> Back to focusing on db nodes, as @soyacz wrote:
>
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
>
> These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
>
> - disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
>
> There are multiple services to stop and disable - each takes time. Possibly it can be sped up by stopping/disabling multiple services in one command (I don't know why they were separated, so I need to check first). Stopping/disabling services should be no problem, which is why I would consider disabling them in the cloud-init script (multiple in one command).
OK, makes sense if it's possible.
> - For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
>
> syslog-ng-exporter is a new thing for monitoring syslogng itself. Node setup seemed like a good place for it, but from the test perspective it's not critical and can be postponed to a later stage. I'd propose to do it in parallel to node startup (in the init wrapper).

Oh, OK, so in that case it makes sense to do it later in the process and in the background.

> Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
>
> Yes. Currently a 6-node cluster startup takes ~370 seconds and nodes are started serially. In my measurements I wanted to highlight the steps that are done for each node. E.g. we verify nodetool status, which in my opinion is redundant (we should do it only at the end), and wait_db_up could use a finer-grained polling step (like 10s instead of 60s), as otherwise we can't get below 60s/node (actually this is an easy fix I'll propose today).
Let's examine each suggested change and make sure we test it of course. I'm sure we can optimize some of these.
closing this one in favor of https://github.com/scylladb/qa-tasks/issues/1650 where we track the different tasks we have for this effort