e2e: minimal scale test
Goal: establish a reproducible, tested baseline for the number of systems that can be deployed/upgraded concurrently via the Elemental operator.
Outline of needed changes to the e2e testsuite:
- allow deploying/upgrading an arbitrary number of VMs concurrently. Initially this could mean copy-pasting a block of code; eventually it should become a single numeric configuration parameter (see the sketch after this list)
- change test pass conditions so that all deployed/upgraded VMs are checked
- change the definition of deployment VMs down to the minimum requirements for RKE2, to increase density (e.g. use 2 vCPUs/3 GB RAM per VM, down from the current 4 vCPUs/4 GB RAM)
- potentially, change the definition of the GCP hypervisor instance to allow more VMs to be tested. Educated best guesses are:
- to downsize the current N2 16vCPU/64GB RAM instance to 8vCPU/32GB RAM to fit up to 10 VMs (~halves hourly cost)
- to keep the current N2 16vCPU/64GB RAM instance to fit up to 20 VMs
- to upsize the current N2 16vCPU/64GB RAM instance to 32vCPU/128GB RAM to fit up to 40 VMs (~doubles hourly cost)
- extrapolating, it should be possible to go above 150 VMs while still keeping the same infrastructure (~8x hourly cost)
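As an illustration of the "single numeric configuration parameter" idea, the loop could look something like the sketch below (VM_COUNT, deploy_vm and check_vm are hypothetical names, not part of the current testsuite):

```bash
# Hypothetical sketch: one numeric parameter drives both provisioning and verification.
# deploy_vm/check_vm stand in for whatever the e2e testsuite actually does per VM.
VM_COUNT="${VM_COUNT:-20}"

# provision all VMs concurrently
for i in $(seq 1 "$VM_COUNT"); do
  deploy_vm "node-${i}" &
done
wait   # block until every provisioning job has finished

# pass condition: every single deployed/upgraded VM must pass its checks
for i in $(seq 1 "$VM_COUNT"); do
  check_vm "node-${i}" || exit 1
done
```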
20 VMs is probably a good starting point, but should of course be validated. The minimal scale test should only be run sporadically (e.g. before releases) for the time being, so ideally pipeline components should be organized to share as much code as possible with the current e2e testsuite. Scaling up to ~150 parallel deployments should only be considered if there is demand; scaling above 150 parallel deployments probably requires a different infrastructure, which is not in the scope of this effort and would require different considerations.
Refactoring the e2e testsuite to use GCP VMs to deploy Rancher itself (as opposed to using a hypervisor instance) is also not in scope of this issue, although it could be a follow-up (and could potentially reuse some of the work in this repo).
@ldevulder @juadk: I hope I captured all relevant details from our discussion accurately here. Please point out any discrepancies or potential problems from your perspective.
@kkaempf, @pgonin please consider this card for the next planning, any comment from you is also welcome. Feel free to CC any other relevant person in the team.
@moio - just add the card to the Elemental project board and raise your voice at the next planning :wink:
Things to do in CI here:
- OS provisioning in parallel (this should stress elemental-operator a bit)
- cluster deployment (K3s and RKE2) on multiple nodes in parallel
JFYI: beware of https://github.com/k3s-io/k3s/issues/2306 when deploying k3s/rke2 clusters with multiple nodes in parallel. You might need to add master nodes gradually or leader election might fail.
Crude workaround example: https://github.com/moio/scalability-tests/blob/ce046f199a28106dec3072b92d63980bf85a08b7/rke2/install_rke2.sh#L5-L6
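In spirit, the workaround amounts to serializing the control-plane joins, roughly as in the sketch below (stock k3s install commands; FIRST_SERVER_IP, TOKEN and the fixed sleep are placeholders/assumptions, adapt accordingly for RKE2):

```bash
# Sketch: initialize the cluster on the first server node only
curl -sfL https://get.k3s.io | sh -s - server --cluster-init --token "${TOKEN}"

# on each additional server node, join one at a time instead of all in parallel,
# so etcd leader election is not disturbed by several members appearing at once
curl -sfL https://get.k3s.io | sh -s - server \
  --server "https://${FIRST_SERVER_IP}:6443" --token "${TOKEN}"
sleep 30   # crude: give etcd time to settle before the next server joins
```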
Yes, the parallelization will be more for the OS installation.
@moio FYI, first step to be able to add more nodes easily: https://github.com/rancher/elemental/pull/484
@moio test with 160 nodes: https://github.com/rancher/elemental/actions/runs/4439953126.
Last step is to be able to provision multiple clusters. After this we should have a good scalability test.
Do we have any monitoring during these tests? I would be really interested in some graphs for the management cluster.
I'm thinking mapping node/cluster counts against just the basics would be enough to start seeing emergent patterns:
- mem/cpu
- etcd call rate & latency
- kubeapi call rate & latency
- pod restart events
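For illustration, those basics map onto metrics that the Prometheus shipped with Rancher Monitoring already scrapes; a rough query sketch follows (service name, namespace and queries are assumptions based on a default kube-prometheus-stack style install):

```bash
# Sketch: pull the basic metrics listed above from the rancher-monitoring Prometheus.
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090 &
sleep 5
PROM="http://localhost:9090/api/v1/query"
q() { curl -sG "$PROM" --data-urlencode "query=$1"; echo; }

q 'sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'                                        # CPU usage
q 'sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)'                           # memory usage
q 'sum(rate(apiserver_request_total[5m]))'                                                     # kubeapi call rate
q 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))' # kubeapi latency
q 'sum(rate(etcd_request_duration_seconds_count[5m]))'                                         # etcd call rate (apiserver view)
q 'histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le))'      # etcd latency
q 'sum(increase(kube_pod_container_status_restarts_total[1h]))'                                # pod restarts, last hour
```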
Closing as done. Multiple clusters are an extension (and tracked in the todo).
@agracey Not that I'm aware of. I'm not a Rancher Manager expert at all and have no knowledge of how to do this. If you know how, you can add the information here. But this request looks more related to Rancher Manager than to Elemental.
@ldevulder JFYI, once a cluster is registered to Rancher there is a super easy way to add monitoring to it: Homepage -> Cluster Name -> Cluster Tools -> Monitoring -> Install.
A few minutes later a new menu row appears on the left pane: Monitoring -> Grafana. From there you will see several pretty detailed dashboards containing a lot of information, including everything @agracey referred to.
It could be useful to automate the installation of this cluster tool and somehow download produced data/graphs after the test is concluded (or leave the cluster running for manual inspection).
That does seem like a separate work item to me, up to you and the team to determine how important it is compared to other things you have on your table.
Thanks for this effort and have fun!
This test is not done with the UI but entirely with the CLI (and fully automated). If you have the same with kubectl or the rancherctl command, it would be easier 😉
I do not have it right now, but chances are high I will get to the same problem for an unrelated project I'll be starting... Today. Will keep you posted if I do manage to solve it UI-less!
@ldevulder FYI: installing Rancher Monitoring can be done via Helm:
https://github.com/moio/scalability-tests/blob/cd9ab138df85e21a7b589ef552a20af54557b57e/bin/setup.mjs#L55-L105
(note: the helm_install method is nothing magical, it just calls helm with a few parameters that make sense in my environment):
https://github.com/moio/scalability-tests/blob/cd9ab138df85e21a7b589ef552a20af54557b57e/bin/lib/common.mjs#L49
Note also that I am setting up tolerations because I want all the monitoring stuff to run on a dedicated node; this might or might not be something you need. I am also setting up a separate Mimir server for long-term result storage, but that is still WIP at the moment.
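Stripped of the JavaScript wrapper, the core amounts to a couple of helm commands along these lines (a sketch; repository URL, chart names and namespace assume a standard Rancher Monitoring setup, and the retention value is just an example):

```bash
# Sketch: install Rancher Monitoring via Helm, roughly what the linked setup.mjs automates.
helm repo add rancher-charts https://charts.rancher.io
helm repo update

# CRDs first, then the monitoring stack itself
helm install rancher-monitoring-crd rancher-charts/rancher-monitoring-crd \
  --namespace cattle-monitoring-system --create-namespace

helm install rancher-monitoring rancher-charts/rancher-monitoring \
  --namespace cattle-monitoring-system \
  --set prometheus.prometheusSpec.retention=30d   # example value, tune as needed
```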
HTH