mantle icon indicating copy to clipboard operation
mantle copied to clipboard

kola: Add --max-machines

Open cgwalters opened this issue 5 years ago • 6 comments

This is only implemented for qemu at the moment, though it'd be a mostly mechanical change to propagate it to the other providers.

For our pipeline testing, we need to have a hard cap on the number of qemu instances we spawn, otherwise we can go over the RAM allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and my test pipeline in the coreosci (nested GCP virt) ended up bringing down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned qemu instance. We also need to invoke Wait() synchronously in destruction.

Then, add a dependency on the golang/x/semaphore library, and use it to implement a max limit.

Closes: https://github.com/coreos/mantle/issues/1157

cgwalters avatar Jan 14 '20 14:01 cgwalters

Tweaked this to default --max-machines to the value of --parallel.

cgwalters avatar Jan 14 '20 15:01 cgwalters

A couple questions:

What happens if:

  • The max-machine count is less than the required amount of machines for a given test (e.x. the test needs 3 machines and max-machines is 2
  • We're running tests serially with a max-machines count of 1 and we hit a test like the NFS test that spawns machines inside of the test.

arithx avatar Jan 14 '20 16:01 arithx

Nice, good catch!

What happens if:

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use. Seems good practice anyway for e.g. estimating resource needs when allocating the pod that will run kola.

I think we could at least get in the leaks and fixes for now, right? That alone should greatly reduce memory stress on the nodes we're allocated on. WDYT about splitting those out as a separate PR?

jlebon avatar Jan 14 '20 17:01 jlebon

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

Yeah, something like that.

WDYT about splitting those out as a separate PR?

OK done in

https://github.com/coreos/mantle/pull/1163

But, I think we'll need this PR in order to make the pipeline truly reliable in the face of hard memory caps.

cgwalters avatar Jan 14 '20 17:01 cgwalters

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

Yeah, something like that.

Added a few reviewers, though it sounds like the above is needed so we avoid timeouts before this is ready. Is it fair to mark this WIP for now?

ashcrow avatar May 12 '20 21:05 ashcrow

And I think this might might sense to be redirected to the cosa repo.

ashcrow avatar May 12 '20 21:05 ashcrow