Optimize test setup time
Test setup includes several steps: provisioning instances and setting up db nodes, loaders, and the monitor. Some steps perform installations that may not be necessary or could be replaced with an image (e.g. a monitor image), and some could run in parallel but are currently done serially because they need input from a previous step, etc.
This task should start by mapping run times for a typical longevity setup, per step and per node, noting which steps run in parallel to each other and which are serial, and which run in parallel across all nodes and which are per-node.
Then, we should start creating the sub-tasks to optimize each step.
https://github.com/scylladb/scylla-cluster-tests/issues/7025 is one such example.
Referencing the prepare monitor image PR: https://github.com/scylladb/scylla-cluster-tests/pull/7305 - it has a great breakdown of the timings.
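As a starting point, here is a minimal sketch (not existing SCT code) of how per-step durations could be captured for such a mapping; `timed_step` and `STEP_DURATIONS` are hypothetical names:

```python
# Hypothetical helper for mapping setup run times per step, per node.
# Names (timed_step, STEP_DURATIONS) are illustrative, not existing SCT APIs.
import time
from contextlib import contextmanager

STEP_DURATIONS: dict[str, float] = {}

@contextmanager
def timed_step(name: str):
    """Record the wall-clock duration of a setup step under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STEP_DURATIONS[name] = time.perf_counter() - start

# Example usage around (pretend) setup phases:
# with timed_step("init_resources"):
#     cluster.init_resources()
# with timed_step("init_db_cluster"):
#     cluster.wait_for_init()
```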
Please find my analysis of test setup timings (numbers in nested trees are parts of the root duration):

- SCT start (first log line) - out of scope (possibly we can save some time in the preceding Jenkins steps)
- TEST_START event (`message=TEST_START`) - 15s
  - `SCTConfiguration()` - 5s
  - events devices setup - 6s
- update certificates - <1s
- init_resources - 297s (hardware)
  - getting security groups and subnets - 1s
  - init 6 db nodes (`Init nodes`) - 200s, sequentially:
    - detect distro, set hostname etc. - 2s
    - syslogng/ssh setup (mostly syslogng) - 29s per node
  - init 2 loader nodes (`Init nodes`) - 47s
    - similar, but a bit faster than db nodes
  - init 1 monitor node (`Init nodes`) - 45s
- Init db cluster (`Class instance: Cluster`) - 1207s
  - init wrap: 1167s (`TestConfig duration`) (parallel node setup)
  - update scylla.yaml - set seeds and seed provider, sequentially - 23s
  - check node health - 6s
  - set auth rf - 12s
- init loaders - 333s
  - init wrap: 332s (`TestConfig duration`)
- init monitor nodes - 892s
  - init wrap: 891s (`TestConfig duration`)
- Argus collect packages - 22s
- Argus get scylla version - 1s
- validate seeds, raft - 20s

Setup altogether: ~46m30s (2790s)
My proposal:
- init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
- run init of db cluster/loaders/monitor nodes in parallel - so it takes half the time (= the time to spin up the db cluster, which takes the longest)
- Argus collect packages/version, validate seeds/raft etc. can run in a background one-off thread while the test proceeds
- investigate further why init db cluster takes 20m
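A rough sketch of the second bullet above (parallel init of db cluster/loaders/monitor), assuming the three existing init phases are independent enough to run in separate threads; the `init_*` callables are placeholders, not actual SCT function names:

```python
# Sketch only: run the three init phases concurrently so total setup time
# approaches the slowest phase (the db cluster) instead of their sum.
from concurrent.futures import ThreadPoolExecutor

def parallel_setup(init_db_cluster, init_loaders, init_monitors):
    # The three callables are placeholders for the existing init steps.
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            "db_cluster": executor.submit(init_db_cluster),
            "loaders": executor.submit(init_loaders),
            "monitors": executor.submit(init_monitors),
        }
        # .result() re-raises any exception from the corresponding phase.
        return {name: future.result() for name, future in futures.items()}
```

The Argus collection/validation bullet could similarly be pushed to a daemon thread that the test does not wait on.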
I agree with the above proposal and direction.
> My proposal:
> - init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
We are doing it via cloud-init already; we repeat the whole thing for cases where the cloud-init setup wasn't working, and we need to see if that is still needed or not.
> - run init of db cluster/loaders/monitor nodes in parallel - so it takes half the time (= the time to spin up the db cluster, which takes the longest)
> - Argus collect packages/version, validate seeds/raft etc. can run in a background one-off thread while the test proceeds
> - investigate further why init db cluster takes 20m
As Roy said, this is a good plan to continue with
> My proposal:
> - init resources: do all nodes in parallel (should be easy) or move the syslogng/ssh setup to cloud-init (like in Azure), so the initial gap for SCTConfig and creating events devices is not wasted.
>
> We are doing it via cloud-init already; we repeat the whole thing for cases where the cloud-init setup wasn't working, and we need to see if that is still needed or not.
I think I saw syslogng being reinstalled - I'll investigate why the script does not detect that it's already installed and skip it.
I wonder (did not check) - can you replace syslogng with Vector (https://vector.dev/)? We've implemented it in the cloud and it should be in the base AMI (@d-helios, @yaronkaikov might be able to provide more information). This would reduce the need to install syslog in the first place and allow testing Vector while at it. It also has a good client-side filtering capability, which might be useful for your needs (unsure).
> - init monitor nodes - 892s
>   - init wrap: 891s (`TestConfig duration`)
Do we have logs for this?
> - init monitor nodes - 892s
>   - init wrap: 891s (`TestConfig duration`)
>
> Do we have logs for this?
Yes, and in another PR we shaved more time off it by using monitoring images (which we have for AWS/GCP): https://github.com/scylladb/scylla-cluster-tests/pull/7305#issuecomment-2030531485
The `setup in parallel` change is merged. Unfortunately, due to a different issue I couldn't get stats on how it helped the daily sanity jobs; we need to wait a couple of days to get a broader view. Another run (with the issue fixed) - longevity-5gb-1h - is shortened by ~20 minutes.
I'll update @fruch with time savings for the other jobs when I have results.
In the meantime, I'll proceed with next steps:
- Disable swap on monitor
- Analyze DB setup time, find opportunities to shorten it
Another place for savings is possibly teardown, and also other Jenkins pipeline steps - but this will be handled after the two items above.
- CI job without the `setup in parallel` commit:
  - https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-alternator-ttl-big-dataset-test/13/consoleFull
  - Diff between config print (`10:30:25,306`) and test start (`11:09:23,897`): 38m58s
- The same CI job with the `setup in parallel` commit:
  - https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-alternator-ttl-big-dataset-test/14/consoleFull
  - Diff between config print (`12:02:40,664`) and test start (`12:20:56,959`): 18m16s
So, merging the `setup in parallel` commit saved 20m42s out of the 38m58s - more than half of the setup time.
@soyacz great job!
Further db setup analysis (1-node setup for convenience):
- setup events devices - 0.02s
- wait ssh up - unknown
- disable iptables/firewalld (rhel-like only)
- check pro status (ubuntu only) - 0.55s
- update repo cache - 2.5s
- install lsof, net-tools - ~2s
- is_scylla_installed - 0.5s
- disable daily triggers - ~68s
- install syslog-ng-exporter - ~146s
- get nic devices - 0.04s
- waiting for preinstalled scylla - ~0.04s
- save_kallsyms_map - 2.5s
- backtrace decoding setup - 17s
- config_setup:
  - scylla yaml - 4.5s
  - process scylla args - 1s
  - fix systemd config - 1.7s
- _scylla_post_install - only if installing scylla (not by default)
- prepare saslauthd (not by default)
- stop_scylla_server - 0.5s
- clean scylla data - 2.2s
- remove /etc/scylla/ami_disabled - 0.5s
- additional data volume setup (not by default)
- setup hybrid raid (not by default)
- overriding scylla.d files (not by default)
- increase_jmx_heap_memory (not a problem for new scylla installations = not by default)
- install scylla manager (08:57:23,750 → 08:57:53,602) - ~30s

Setup all together: ~280 seconds
Startup process:
- get io.conf - 0.5s
- start scylla server - 4.5s (not waiting for startup to complete, so the items below run in parallel to scylla startup)
  - adaptive timeout start - 0.5s
  - get io.conf - 0.5s
  - start/enable scylla manager agent - 2s
- wait db up (one-node setup):
  - wait for port occupied - 60.2s (long because of the 1m poll interval)
  - _report_housekeeping_uuid - 3s
- get node status - 8.5s
- check_nodes_status - 5s (oddly, log messages from this method show up after setup is finished)

Startup: ~86 seconds
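For reference, the 60.2s "wait for port occupied" above is dominated by the 1-minute poll interval; here is a minimal sketch of polling the CQL port with a shorter interval (illustrative only, not the actual SCT wait helper; 9042 assumed as the CQL port):

```python
# Sketch: poll the CQL port with a short interval instead of a 60s one,
# so a node that is ready after e.g. 70s is detected at ~70s, not ~120s.
import socket
import time

def wait_for_port(host: str, port: int = 9042,
                  timeout: float = 600, interval: float = 10) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```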
I think installing syslog-ng-exporter, disabling the daily triggers and installing scylla-manager could be postponed to a later stage and/or run in the background - this would save ~4m (out of ~20m currently).
Possibly they could run in parallel to the cluster setup (a bit risky if setup installs packages) / startup, or be partially moved to cloud-init.
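A minimal sketch of the "postpone and run in the background" idea, assuming the deferred steps don't contend with the main setup for the package manager; the step callables and names are placeholders, not SCT APIs:

```python
# Sketch: kick off non-critical node-setup steps in a background thread
# and only join before the point where the test actually needs them.
import threading

def start_background_setup(node, steps):
    """steps: list of callables, e.g. lambda: install_syslog_ng_exporter(node)."""
    def run_all():
        for step in steps:
            step()
    thread = threading.Thread(target=run_all, name=f"bg-setup-{node}", daemon=True)
    thread.start()
    return thread  # caller can thread.join() once the results are required
```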
@soyacz and @vponomaryov thanks for the details.
First, let's remember the end goal: to deploy a full test cluster (db nodes, monitor and loader) as fast as possible. So, once we reach the point where all three entities are set up in parallel, it doesn't matter if we shave a few more seconds here and there, as long as we shorten the longest one (which should be the db nodes) to the minimum (hence, for example, it's pointless to deal now with removing the swap from the monitor, which would surely cause us issues).
Back to focusing on db nodes, as @soyacz wrote:
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
- disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
- For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
> Back to focusing on db nodes, as @soyacz wrote:
>
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
>
> These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
>
> - disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
There are multiple services to stop and disable - each takes time. Possibly it can be sped up by stopping/disabling multiple services in one command (I don't know why they were separated, so I need to check first). Stopping/disabling services should be no problem, which is why I would consider disabling them in the cloud-init script (multiple in one command).
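For reference, systemctl accepts several units in one invocation, so the per-service calls could probably be collapsed; a sketch assuming the usual Ubuntu daily-trigger units (the exact unit list used by SCT may differ):

```python
# Sketch: disable and stop several daily-trigger units with one systemctl call
# instead of one call per service. Unit names are illustrative.
import subprocess

DAILY_TRIGGER_UNITS = [
    "apt-daily.timer",
    "apt-daily-upgrade.timer",
    "unattended-upgrades.service",
]

def disable_daily_triggers():
    subprocess.run(
        ["sudo", "systemctl", "disable", "--now", *DAILY_TRIGGER_UNITS],
        check=True,
    )
```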
> - For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
syslog-ng-exporter is a new thing for monitoring syslogng itself. Node setup seemed like a good place for it, but from the test perspective it's not critical and can be postponed to a later stage. I'd propose to do it in parallel to node startup (in the init wrapper).
> Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
Yes. Currently a 6-node cluster startup takes ~370 seconds and nodes are started serially. In my measurements I wanted to highlight the steps that are done for each node. E.g. we verify nodetool status, which in my opinion is redundant (we should do it only at the end), and wait_db_up could use a finer-grained polling step (like 10s instead of 60s), as otherwise we can't get below 60s/node (actually this is an easy fix I'll propose today).
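A tiny sketch of the "verify status only at the end" idea; `start_node` and `verify_nodes_up` are placeholders for the existing per-node start and a final nodetool-status-style check:

```python
# Sketch: drop the per-node status check during serial startup and verify the
# whole cluster once at the end, saving one nodetool round-trip per node.
def start_cluster(nodes, start_node, verify_nodes_up):
    for node in nodes:
        start_node(node)       # wait only for this node's CQL port
    verify_nodes_up(nodes)     # single cluster-wide status check at the end
```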
> Back to focusing on db nodes, as @soyacz wrote:
>
> - disable daily triggers - ~68s
> - install syslog-ng-exporter - ~146s
>
> These are the main items we may want to focus on, but there are reasons and limitations for why they are there:
>
> - disable daily triggers caused us issues in the past locking the apt.lock IIRC, but why is it taking so long? is it running apt update?
>
> There are multiple services to stop and disable - each takes time. Possibly it can be sped up by stopping/disabling multiple services in one command (I don't know why they were separated, so I need to check first). Stopping/disabling services should be no problem, which is why I would consider disabling them in the cloud-init script (multiple in one command).
OK, makes sense if it's possible.
> - For syslog-ng-exporter there was a suggestion to switch to the new logging tool that cloud is using and possibly will be installed in our image. In any case, we must do it early in the process so we can start getting logs.
>
> syslog-ng-exporter is a new thing for monitoring syslogng itself. Node setup seemed like a good place for it, but from the test perspective it's not critical and can be postponed to a later stage. I'd propose to do it in parallel to node startup (in the init wrapper).

Oh, OK, so in that case it makes sense to do it later in the process and in the background.

> Last, in the example above you used one node, but what actually takes the longest is starting a cluster of nodes. Maybe consistent-topology-changes improved it, but let's check first how long it takes.
>
> Yes. Currently a 6-node cluster startup takes ~370 seconds and nodes are started serially. In my measurements I wanted to highlight the steps that are done for each node. E.g. we verify nodetool status, which in my opinion is redundant (we should do it only at the end), and wait_db_up could use a finer-grained polling step (like 10s instead of 60s), as otherwise we can't get below 60s/node (actually this is an easy fix I'll propose today).
Let's examine each suggested change and make sure we test it of course. I'm sure we can optimize some of these.
closing this one in favor of https://github.com/scylladb/qa-tasks/issues/1650 where we track the different tasks we have for this effort