e2e: minimal scale test
Goal: establish a reproducible, tested baseline for the number of systems that can be deployed/upgraded concurrently via the Elemental operator.
Outline of needed changes to the e2e testsuite:
- allow deploying/upgrading an arbitrary number of VMs concurrently. Initially this could mean copy-pasting a block of code; eventually it should become a single numeric configuration parameter (see the sketch after this list)
- change test pass conditions so that all deployed/upgraded VMs are checked
- change the definition of deployment VMs down to the minimum requirements for RKE2, to increase density (e.g. use 2 vCPUs/3 GB RAM per VM, down from the current 4 vCPUs/4 GB RAM)
- potentially, change the definition of the GCP hypervisor instance to allow more VMs to be tested. Educated best guesses are:
- to downsize the current N2 16vCPU/64GB RAM instance to 8vCPU/32GB RAM to fit up to 10 VMs (~halves hourly cost)
- to keep the current N2 16vCPU/64GB RAM instance to fit up to 20 VMs
- to upsize the current N2 16vCPU/64GB RAM instance to 32vCPU/128GB RAM to fit up to 40 VMs (~doubles hourly cost)
- extrapolating, it should be possible to go above 150 VMs while still keeping the same infrastructure (~8x hourly cost)
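As an illustration of the "single numeric configuration parameter" idea, the loop could look something like the sketch below (VM_COUNT, deploy_vm and check_vm are hypothetical names, not part of the current testsuite):

```bash
# Hypothetical sketch: one numeric parameter drives both provisioning and verification.
# deploy_vm/check_vm stand in for whatever the e2e testsuite actually does per VM.
VM_COUNT="${VM_COUNT:-20}"

# provision all VMs concurrently
for i in $(seq 1 "$VM_COUNT"); do
  deploy_vm "node-${i}" &
done
wait   # block until every provisioning job has finished

# pass condition: every single deployed/upgraded VM must pass its checks
for i in $(seq 1 "$VM_COUNT"); do
  check_vm "node-${i}" || exit 1
done
```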
20 VMs is probably a good starting point, but should of course be validated. The minimal scale test should only be run sporadically (e.g. before releases) for the time being, so ideally pipeline components should be organized to share as much code as possible with the current e2e testsuite. Scaling up to ~150 parallel deployments should only be considered if there is demand; scaling above 150 parallel deployments probably requires a different infrastructure, which is not in the scope of this effort and would require different considerations.
Refactoring the e2e testsuite to use GCP VMs to deploy Rancher itself (as opposed to using a hypervisor instance) is also not in scope of this issue, although it could be a follow-up (and could potentially reuse some of the work in this repo).
@ldevulder @juadk: I hope I captured all relevant details from our discussion accurately here. Please point out any discrepancies or potential problems from your perspective.
@kkaempf, @pgonin please consider this card for the next planning, any comment from you is also welcome. Feel free to CC any other relevant person in the team.
@moio - just add the card to the Elemental project board and raise your voice at the next planning :wink:
Things to do in CI here:
- OS provisioning in parallel (this should stress elemental-operator a bit)
- cluster deployment (K3s and RKE2) on multiple nodes in parallel
JFYI: beware of https://github.com/k3s-io/k3s/issues/2306 when deploying k3s/rke2 clusters with multiple nodes in parallel. You might need to add master nodes gradually or leader election might fail.
Crude workaround example: https://github.com/moio/scalability-tests/blob/ce046f199a28106dec3072b92d63980bf85a08b7/rke2/install_rke2.sh#L5-L6
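In spirit, the workaround amounts to serializing the control-plane joins, roughly as in the sketch below (stock k3s install commands; FIRST_SERVER_IP, TOKEN and the fixed sleep are placeholders/assumptions, adapt accordingly for RKE2):

```bash
# Sketch: initialize the cluster on the first server node only
curl -sfL https://get.k3s.io | sh -s - server --cluster-init --token "${TOKEN}"

# on each additional server node, join one at a time instead of all in parallel,
# so etcd leader election is not disturbed by several members appearing at once
curl -sfL https://get.k3s.io | sh -s - server \
  --server "https://${FIRST_SERVER_IP}:6443" --token "${TOKEN}"
sleep 30   # crude: give etcd time to settle before the next server joins
```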
Yes, the parallelization will be more for the OS installation.
@moio FYI, first step to be able to add more nodes easily: https://github.com/rancher/elemental/pull/484
@moio test with 160 nodes: https://github.com/rancher/elemental/actions/runs/4439953126.
Last step is to be able to provision multiple clusters. After this we should have a good scalability test.
Do we have any monitoring during these tests? I would be really interested in some graphs for the management cluster.
I'm thinking mapping node/cluster counts against just the basics would be enough to start seeing emergent patterns:
- mem/cpu
- etcd call rate & latency
- kubeapi call rate & latency
- pod restart events
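For illustration, those basics map onto metrics that the Prometheus shipped with Rancher Monitoring already scrapes; a rough query sketch follows (service name, namespace and queries are assumptions based on a default kube-prometheus-stack style install):

```bash
# Sketch: pull the basic metrics listed above from the rancher-monitoring Prometheus.
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090 &
sleep 5
PROM="http://localhost:9090/api/v1/query"
q() { curl -sG "$PROM" --data-urlencode "query=$1"; echo; }

q 'sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'                                        # CPU usage
q 'sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)'                           # memory usage
q 'sum(rate(apiserver_request_total[5m]))'                                                     # kubeapi call rate
q 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))' # kubeapi latency
q 'sum(rate(etcd_request_duration_seconds_count[5m]))'                                         # etcd call rate (apiserver view)
q 'histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le))'      # etcd latency
q 'sum(increase(kube_pod_container_status_restarts_total[1h]))'                                # pod restarts, last hour
```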
Closing as done. Multiple clusters are an extension (and tracked in the todo).
@agracey Not that I'm aware of. I'm not a Rancher Manager expert at all and have no knowledge of how to do this. If you know how, you can add the information here. But this request looks more related to Rancher Manager than to Elemental.
@ldevulder JFYI, once a cluster is registered to Rancher there is a super easy way to add monitoring to it: Homepage -> Cluster Name -> Cluster Tools -> Monitoring -> Install.
A few minutes later a new menu row appears on the left pane: Monitoring -> Grafana. From there you will see several pretty detailed dashboards containing a lot of information, including everything @agracey referred to.
It could be useful to automate the installation of this cluster tool and somehow download produced data/graphs after the test is concluded (or leave the cluster running for manual inspection).
That does seem like a separate work item to me, up to you and the team to determine how important it is compared to other things you have on your table.
Thanks for this effort and have fun!
This test is not done with the UI but entirely with the CLI (and fully automated). If you have the same with kubectl or the rancherctl command, it would be easier 😉
I do not have it right now, but chances are high I will get to the same problem for an unrelated project I'll be starting... Today. Will keep you posted if I do manage to solve it UI-less!
@ldevulder FYI: installing Rancher Monitoring can be done via Helm:
https://github.com/moio/scalability-tests/blob/cd9ab138df85e21a7b589ef552a20af54557b57e/bin/setup.mjs#L55-L105
(note: the helm_install method is nothing magical, it just calls helm with a few parameters that make sense in my environment):
https://github.com/moio/scalability-tests/blob/cd9ab138df85e21a7b589ef552a20af54557b57e/bin/lib/common.mjs#L49
Note also that I am setting up tolerations because I want all the monitoring stuff to run on a dedicated node; this might or might not be something you need. I am also setting up a separate Mimir server for long-term result storage, but that is still WIP at the moment.
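Stripped of the JavaScript wrapper, the core amounts to a couple of helm commands along these lines (a sketch; repository URL, chart names and namespace assume a standard Rancher Monitoring setup, and the retention value is just an example):

```bash
# Sketch: install Rancher Monitoring via Helm, roughly what the linked setup.mjs automates.
helm repo add rancher-charts https://charts.rancher.io
helm repo update

# CRDs first, then the monitoring stack itself
helm install rancher-monitoring-crd rancher-charts/rancher-monitoring-crd \
  --namespace cattle-monitoring-system --create-namespace

helm install rancher-monitoring rancher-charts/rancher-monitoring \
  --namespace cattle-monitoring-system \
  --set prometheus.prometheusSpec.retention=30d   # example value, tune as needed
```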
HTH