mantle
mantle copied to clipboard
kola: Add --max-machines
This is only implemented for qemu at the moment, though it'd be a mostly mechanical change to propagate it to the other providers.
For our pipeline testing, we need to have a hard cap on the number of qemu instances we spawn, otherwise we can go over the RAM allocated to the pod.
Actually the FCOS pipeline today doesn't impose a hard cap, and my test pipeline in the coreosci (nested GCP virt) ended up bringing down the node via the OOM killer.
There were a few bugs here; first we were leaking the spawned
qemu instance. We also need to invoke Wait() synchronously in
destruction.
Then, add a dependency on the golang/x/semaphore library, and
use it to implement a max limit.
Closes: https://github.com/coreos/mantle/issues/1157
Tweaked this to default --max-machines to the value of --parallel.
A couple questions:
What happens if:
- The max-machine count is less than the required amount of machines for a given test (e.x. the test needs 3 machines and max-machines is 2
- We're running tests serially with a max-machines count of 1 and we hit a test like the NFS test that spawns machines inside of the test.
Nice, good catch!
What happens if:
Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use. Seems good practice anyway for e.g. estimating resource needs when allocating the pod that will run kola.
I think we could at least get in the leaks and fixes for now, right? That alone should greatly reduce memory stress on the nodes we're allocated on. WDYT about splitting those out as a separate PR?
Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.
Yeah, something like that.
WDYT about splitting those out as a separate PR?
OK done in
https://github.com/coreos/mantle/pull/1163
But, I think we'll need this PR in order to make the pipeline truly reliable in the face of hard memory caps.
Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.
Yeah, something like that.
Added a few reviewers, though it sounds like the above is needed so we avoid timeouts before this is ready. Is it fair to mark this WIP for now?
And I think this might might sense to be redirected to the cosa repo.