openvmm icon indicating copy to clipboard operation
openvmm copied to clipboard

Support multiple Github actions runners on the same machine

Open tjones60 opened this issue 1 year ago • 13 comments

Use separate rustup and cargo directories for each Github runner in case there are multiple on the same machine, and introduce retries for apt-get to avoid errors related to multiple processes trying to install things. Also makes sure that GCC is installed on Linux.

tjones60 avatar Oct 18 '24 23:10 tjones60

This way lies madness. If we need to support multiple concurrent jobs on the same physical machine, then we need to run those jobs in containers or VMs.

jstarks avatar Oct 21 '24 17:10 jstarks

John brings up a good point, and its one I probably should've realized sooner when discussing this work with you...

Lets hold off on this PR until we align on the broader goals we're trying to achieve here

daprilik avatar Oct 21 '24 17:10 daprilik

This way lies madness. If we need to support multiple concurrent jobs on the same physical machine, then we need to run those jobs in containers or VMs.

The idea here is to make better use of the physical ARM test machines that we have, since nested virtualization is not supported yet. I've tested this with 4 concurrent runners on the same machine and it works well. Also, this might help mitigate some occasional failures on our current 1ES CI machines due to apt. Also, I think these changes are a good idea whether or not we have multiple runners on the same machine, in case there are other processes running on them.

tjones60 avatar Oct 21 '24 17:10 tjones60

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land. While it's certainly a bit suspect that we keep seeing those dpkg lock failures on CI machines that are ostensibly wiped between runs, it does seems like a common enough issue reported my many folks that I'm willing to chalk it up to CI weirdness, and work around it.

I think these changes are a good idea whether or not we have multiple runners on the same machine, in case there are other processes running on them.

I think this falls under the umbrella of "this way lies madness". We wouldn't want this happening in CI, and on a user's local box, I think its reasonable to assume folks are rational actors aren't trying to mutiplex system-wide tool installs on a regular basis

since nested virtualization is not supported yet

Ahh... is this so that the same ARM host box can run multiple VMM test jobs from multiple pipeline run submissions?

daprilik avatar Oct 21 '24 17:10 daprilik

Ahh... is this so that the same ARM host box can run multiple VMM test jobs from multiple pipeline run submissions?

Yes this is the primary motivation for this change. We could go without it, but CI will be slower.

tjones60 avatar Oct 21 '24 17:10 tjones60

We're going to cross compile from x64, right? So this is just for unit and VMM tests.

Can we just up the concurrency limits on ARM tests to better utilize the machines (and free them up the for the next user sooner), rather than try to run multiple jobs concurrently?

jstarks avatar Oct 21 '24 17:10 jstarks

For Linux, we could also run in containers. For Windows that isn't an option, really, due to the lack of WHP support in Windows containers. (Maybe we should get that fixed.)

jstarks avatar Oct 21 '24 17:10 jstarks

We're going to cross compile from x64, right? So this is just for unit and VMM tests.

Can we just up the concurrency limits on ARM tests to better utilize the machines (and free them up the for the next user sooner), rather than try to run multiple jobs concurrently?

We could cross compile from x64, but the machines have lots of extra resources, so I was thinking of using them for some of the compilation jobs as well. Those jobs could use VMs or containers though, even on Windows. The unit tests probably also don't need virtualization support, so this would really just be for VMM tests.

Increase concurrency limits might help a little in the future, but right now most of the time is spent doing things that aren't easily parallelized within the context of a single job.

tjones60 avatar Oct 21 '24 17:10 tjones60

@jstarks, trevor shared some of the preliminary numbers from running build jobs on the ARM boxes, and the results are really good. We are talking 2x improvements over the existing 1ES runners. I'm all for leveraging them for our compile jobs, including for cross-compiling x86 from ARM.

@tjones60, can we improve VMM test execution speed by caching VHDs on a shared persistent disk? there'd be some flowey tweaks to get persistent_dir working with these runners, but all the logic for caching is there.

daprilik avatar Oct 21 '24 17:10 daprilik

OK, great. Then we can even containers on Windows for the build steps.

Concurrent jobs on the same OS instance is not something we want to do.

jstarks avatar Oct 21 '24 18:10 jstarks

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land.

Maybe we should remove the retry loop for the local case (does this run in the local case?)

jstarks avatar Oct 21 '24 18:10 jstarks

The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land.

Maybe we should remove the retry loop for the local case (does this run in the local case?)

It does run in the local case, where it'll prompt the user to install required packages during build-igvm invocations and such.

I think its reasonable to only run once when using the local backend, and have it run N times on ADO / GH, with a clear comment explaining that it is a workaround for CI infra

daprilik avatar Oct 21 '24 18:10 daprilik

I opened a new PR with just the apt retry changes: https://github.com/microsoft/openvmm/pull/174

tjones60 avatar Oct 21 '24 22:10 tjones60