Automated full stack test
If possible, it would be nice to automate a test of the full stack, including creating a cluster and installing a few common tools (such as Longhorn and NeuVector, but potentially others).
I imagine the test would look something like:
- Install the operator and configure it
- Build the bootstrap ISO
- Boot machines with the ISO, register them, then reboot
- Install k3s using Rancher provisioning
- Install Longhorn and NeuVector
- Smoke test Longhorn with a workload that can save and recall files from a volume
- Smoke test NeuVector by adding a rule that quarantines a workload
- Upgrade
- Repeat the smoke tests
- Tear down and report
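The Longhorn smoke test above could be as simple as a pod that writes a file to a Longhorn-backed volume and reads it back. A minimal sketch (names are illustrative, and it assumes Longhorn's default `longhorn` StorageClass):

```yaml
# PVC backed by Longhorn (assumes the default "longhorn" StorageClass)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: smoke-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
# Pod that saves a file to the volume and reads it back; re-running it
# after the upgrade step checks that the data survived
apiVersion: v1
kind: Pod
metadata:
  name: smoke-test-writer
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo smoke > /data/probe && cat /data/probe"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: smoke-test-pvc
```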
> Install k3s using Rancher provisioning

K3s is Tech Preview in Rancher; better to use RKE2 (the E2E tests have been modified for this).

> Install Longhorn and NeuVector

I'm not sure we currently have enough runners to use Longhorn on them. And is it really a useful use case currently?
> K3s is Tech Preview in Rancher; better to use RKE2.

My thinking for K3s is that it is what we expect our edge users to use primarily. It may be marked as Tech Preview, but we intend to support both for the edge.

> Not sure that we currently have enough runners to use Longhorn.

Yes, we need to provide a storage layer that allows workloads to move between nodes (the local-path provisioner adds node constraints, because the storage lives on that specific node).
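To illustrate that constraint: the local-path provisioner creates PersistentVolumes with a `nodeAffinity` that pins any consumer of the volume to the node holding the data, so the workload cannot be rescheduled elsewhere. A Longhorn-backed PV has no such affinity. An excerpt of the kind of PV local-path produces (node name illustrative):

```yaml
# Excerpt of a PV created by the local-path provisioner: the nodeAffinity
# pins any pod using this volume to the node that holds the data
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example-local-path
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1   # the workload is stuck on this node
```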
Why do we need this? Once we have installed RKE2/K3s and those deployments are up, isn't their usage out of Elemental's scope?
If they are installed using the official deployments from Rancher, Elemental has nothing to do there; unless some specific paths or similar are needed, they should work like a manually installed cluster.
They rely on host-level capabilities. For example, Longhorn needs the iSCSI packages on the host.
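As an illustration of that host-level dependency, a cloud-init style fragment could make sure the iSCSI bits Longhorn needs are present on each node (package and service names are the usual openSUSE/SLE ones, shown here as an assumption):

```yaml
#cloud-config
# Host prerequisite for Longhorn: the iSCSI initiator must be installed
# and running on every node before the Longhorn chart is deployed
packages:
  - open-iscsi
runcmd:
  - systemctl enable --now iscsid
```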
My rationale for the tests is to make sure we can be confident that something trivial like that doesn't get missed, causing a bad upgrade to be distributed to deployments that are hard to repair if something goes wrong. The automated rollback should kick in for anything that's hard to recover from, so this is just a second line of protection.
It's not high priority, but I think this type of test is worthwhile even if it's just a sanity check that we expect to never fail.
Interesting. Do we have a list of products that we would support 100%, i.e. the Rancher products (Longhorn, Harvester, etc.)?
We could start with that: having a list of the Rancher products that are usually deployed and that we offer support for, so we can ensure they are properly supported.
I think NeuVector and Longhorn would be the ones I'm most concerned with. Harvester is a bit of an outlier there, as it's currently built into its own appliance stack and I expect that they would do the E2E testing.
It likely doesn't need to run on every push to main, but it should run before a new image is published.
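In CI terms, that cadence could mean gating the workflow on tag/release events rather than pushes to main. A hypothetical GitHub Actions trigger sketch:

```yaml
# Hypothetical workflow triggers: run the full-stack test before an image
# is published (on release tags or manual dispatch) instead of on every
# push to main
on:
  push:
    tags:
      - "v*"           # release candidates about to be published
  workflow_dispatch: {} # allow manual runs for debugging
```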