
Self hosted runners on ITS VMs

Open mondus opened this issue 2 years ago • 7 comments

I have requested this from ITS and started this issue to capture thoughts to share with @willfurnass.

Below is the info requested by Will, with my initial thoughts.

Are you wanting Repository or Organization-level runners (https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners)? Will the associated repository/repositories be public or private and who will be able to commit to them and trigger GitHub Actions (https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security)? If using an Org-level runner are all members required to have MFA enabled?

Organization-level, to support CI for example models etc. Repos will be public. The organization is restricted to specific users within the FGPU2 team. I have just enforced two-factor authentication, which has resulted in Mozhgan being removed.

By 'infosec approval for Exemption' do you mean a firewall exemption? I don't think that's required as runners establish connections to GitHub.com, not the other way round: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#communication-between-self-hosted-runners-and-github

Not required, but @ptheywood suggested that without it the runner can only poll, so the integration and progress updates are not great. The link you have sent suggests this should actually be fine.

In terms of hardware resources (RAM, cores, storage, GPU RAM, Compute Capability) and OS what do you think you need? Are you able to specify GPU resource requirements in terms of MIG slices of an A100? How long will the VM be needed for? Any particular requirements re device driver version? Who will need 'ownership' inc SSH access?

We actually don't need a huge amount. @ptheywood will be able to give a better idea of any resource requirements for building and executing. I suspect it could be done with MIG slices rather than a full A100. We would require the VM indefinitely. The latest driver is preferable, but as long as it supports CUDA 11.0 or newer it should be fine. Paul, Pete, Matt and Rob should have access. Either me or Pete as owners.

Also, out of curiosity how are you planning on providing a clean environment to each new job (if that's needed)? Podman/Docker/Singularity? And will any in-bound connectivity be needed on the VM hosting the GHA runner? I'm guessing not.

Yes, it would be best to containerize this. Inbound connectivity is probably not needed given the link above.
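For reference, a rough sketch of what a containerised job on the self-hosted runner might look like (the labels, CUDA image and steps below are placeholders, not a settled setup):

```yaml
jobs:
  gpu-test:
    # "self-hosted" plus whatever custom labels are assigned when registering the runner
    runs-on: [self-hosted, linux, gpu]
    # Run the job inside a CUDA container; --gpus all passes the GPU(s) through
    # (requires the NVIDIA Container Toolkit on the VM's Docker install)
    container:
      image: nvidia/cuda:11.8.0-devel-ubuntu22.04
      options: --gpus all
    steps:
      - uses: actions/checkout@v3
      - name: Report visible GPUs
        run: nvidia-smi
```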

Finally, be aware that we're currently piloting GPU-backed VMs as a service for researchers: there may need to be downtime for maintenance, driver upgrades, config changes etc as we refine the setup.

Fine.

mondus avatar Sep 02 '22 18:09 mondus

Repos will be public

How will you guard against malicious code execution?

willfurnass avatar Sep 05 '22 08:09 willfurnass

How will you guard against malicious code execution?

Afaik, "first time contributors" code isn't pushed to CI automatically. Atleast that was the case here recently.

https://github.blog/changelog/2021-04-22-github-actions-maintainers-must-approve-first-time-contributor-workflow-runs/

It should be possible to make this mandatory for all "outside contributors" at the org level, if it's not already set up that way.

https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization#configuring-required-approval-for-workflows-from-public-forks
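As a belt-and-braces measure on top of that org setting, any job that targets the self-hosted runner could also be guarded in the workflow itself so it never runs for fork PRs. A minimal sketch (the job name and labels are made up):

```yaml
jobs:
  gpu-ci:
    runs-on: [self-hosted, gpu]
    # Skip the job entirely when the PR head branch comes from a fork,
    # so untrusted code never reaches the self-hosted runner without review.
    if: github.event.pull_request.head.repo.full_name == github.repository
    steps:
      - uses: actions/checkout@v3
      - run: echo "Trusted (non-fork) PR - safe to build and test here"
```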

Robadob avatar Sep 05 '22 09:09 Robadob

Not required, but @ptheywood suggested that without it the runner can only poll, so the integration and progress updates are not great. The link you have sent suggests this should actually be fine.

This seems like a big change compared to when I tested self-hosted runners in the past, though I don't seem to have written down my notes from back then (it could have been Jenkins that needed the port exception for more than just pull behaviour, and I'm misremembering).

In terms of hardware resources (RAM, cores, storage, GPU RAM, Compute Capability) and OS what do you think you need? Are you able to specify GPU resource requirements in terms of MIG slices of an A100? How long will the VM be needed for?

The test suite jobs are all small-scale tests so that they can run reliably on any GPU (i.e. the memory footprint should be very small; if it isn't, we need to change the test), so 1/7th of an A100 (the smallest MIG slice) should be fine. We don't make use of any NVDEC/NVJPEG/OFA engines afaik, so a 1g.5gb or 1g.10gb slice should be fine (subject to which A100 model it is).

For the host, the memory we use scales with the number of cores used to compile (with the exception of a couple of very large compilation units) and the number of GPU architectures we build for. The GitHub-hosted runners we use for compilation are 2 cores with 7GB of memory, so anything at least that large should be fine. A few extra cores wouldn't hurt, but it's only CI so not strictly required. (12c/24t with 32 GiB can lock up from swapping, but 16 threads is usually within budget.)
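To keep the build inside that sort of RAM budget, the workflow could pin the number of target architectures and the build parallelism explicitly. A rough sketch using the standard CMake options (FLAME GPU's own CMake options may differ):

```yaml
jobs:
  build:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v3
      - name: Configure for a single CUDA architecture to reduce compile memory
        run: cmake -B build -DCMAKE_CUDA_ARCHITECTURES=70
      - name: Build with bounded parallelism so peak RAM stays near the 7GB budget
        run: cmake --build build --parallel 2
```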

It should be possible to make this mandatory for all "outside contributors" at the org level, if it's not already set up that way.

This is currently set to require approval for first-time contributors, but we could change that setting (though it would be preferable if that were configurable).

ptheywood avatar Sep 05 '22 09:09 ptheywood

An alternative to this is dynamically provisioned AWS instances. There is an action for this: https://github.com/machulav/ec2-github-runner

Rather than run this from a private account, the preferred method would be to run it on the university's AWS account and have them provide an IAM role/user with appropriate restrictions.
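For reference, the action's documented start / run / stop pattern looks roughly like the sketch below; the AMI, instance type, subnet/security-group IDs and region are placeholders, and the credentials would come from whatever IAM setup the university provides:

```yaml
jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start.outputs.label }}
      ec2-instance-id: ${{ steps.start.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-2                     # placeholder region
      - id: start
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0123456789abcdef0       # placeholder AMI with drivers/deps baked in
          ec2-instance-type: g4dn.xlarge            # a cheap NVIDIA (T4) instance type
          subnet-id: subnet-0123456789abcdef0       # placeholder
          security-group-id: sg-0123456789abcdef0   # placeholder

  gpu-tests:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}
    steps:
      - uses: actions/checkout@v3
      - run: nvidia-smi   # build / test steps would go here

  stop-runner:
    needs: [start-runner, gpu-tests]
    if: always()   # stop the instance even if the tests failed
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-2
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```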

mondus avatar Nov 10 '22 10:11 mondus

To limit cost of AWS use, we can:

  • Use the cheapest NVIDIA GPU instance possible, with at least a Kepler-generation GPU. We currently keep the test suite small, so memory shouldn't be an issue, and we don't (yet) have performance regression testing.
  • Package all our dependencies into the base image to avoid installing CUDA etc. at runtime.
  • Potentially only run this after regular CI, using pre-compiled artifacts from those jobs and copying them to the AWS instance (i.e. no need to use 10 minutes of GPU time building pyflamegpu, just install via a wheel).
  • Selectively run the AWS-generating action(s) using one of several options:
    • Only run on PRs with a given label? (see the sketch below)
    • Only run when requested by a manual trigger, or by an issue-comment action or similar?
    • Add it to (one of) our current CI jobs, as a separate build step which depends on the first (and some other conditions), so we only run it on AWS if it builds on the regular GitHub-hosted runners and meets some criteria.

We could then add a broader / more expensive, manually invoked (somehow) workflow prior to release, which spins up multiple AWS instances with different compiler / host / GPU / Python versions for thorough testing.
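For the label option above, a minimal sketch of how the AWS-starting job could be gated (the "aws-gpu-ci" label and the "build" job name are hypothetical):

```yaml
jobs:
  start-aws-runner:
    # Only spin up an AWS instance when the regular GitHub-hosted build job
    # ("build" is a stand-in for one of our existing CI jobs) has passed and
    # a maintainer has applied the hypothetical "aws-gpu-ci" label to the PR.
    needs: build
    if: contains(github.event.pull_request.labels.*.name, 'aws-gpu-ci')
    runs-on: ubuntu-latest
    steps:
      - run: echo "Start the EC2 runner here (see the start/stop sketch above)"
```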

ptheywood avatar Nov 10 '22 10:11 ptheywood

By the way, if you want a no-maintenance way to achieve that, you can use cirun.io.

aktech avatar Nov 17 '22 19:11 aktech

GitHub now has larger runners with GPU backends, for a (hefty) price. See: https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/managing-larger-runners
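Targeting one from a workflow would just be a matter of using whatever label is assigned when the runner is created in the org settings; a minimal sketch with a made-up label:

```yaml
jobs:
  gpu-test:
    # "gpu-4core-t4" is a hypothetical label assigned when creating the
    # GPU-backed larger runner in the organisation's runner settings.
    runs-on: gpu-4core-t4
    steps:
      - uses: actions/checkout@v4
      - run: nvidia-smi   # build / test steps would go here
```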

mondus avatar Jun 19 '24 13:06 mondus