community
community copied to clipboard
Migrating Dask GPU CI from Jenkins to GitHub Actions
Currently, several of the Dask projects run a subset of their tests on GPUs, through a Jenkins-based infrastructure put together by the ops team at RAPIDS (relevant [archived] docs).
Currently, the RAPIDS projects on GitHub are undergoing a migration from this Jenkins setup to GitHub Actions (docs outlining this setup) for a variety of reasons, most relevant to Dask developers being increased visibility/control over the files controlling the GPU build matrix - in short, public GHA workflow files over private Groovy files only accessible by members of RAPIDS. Once this migration is complete, RAPIDS intends to decommission all its Jenkins-based testing infrastructure.
In making this migration, there would be several changes that would need to happen both at the Dask org and developer level to get things working smoothly, so I’m hoping to use this issue as a place for discussion around those potential changes and coordination of the efforts to complete the migration.
Requirements
In short, to get things running we would need to:
- Have admins of the
dask
anddask-contrib
organizations install the GitHub applications used to manage GPU CI:- Install the
nvidia-runners
GitHub application for each org we intend to migrate, and enable it on all repos we intend to run GPU tests on; this will handle dispatching of self-hosted GPU runners - Install the
copy-pr-bot
GitHub application for each org we intend to migrate, and enable it on all repos we intend to run GPU tests on; this will automate the process of copying/deleting PR source code to/from the upstream repo
- Install the
- Add a configuration file for
copy-pr-bot
to each repo we intend to run GPU tests on; this outlines the trusted NVIDIA users and admins for that repo:
# .github/copy-pr-bot.yaml
enabled: true
additional_trustees:
- nvidia_employee_gh_username # non NVIDIA employees on this list are ignored
additional_vetters:
- nvidia_employee_gh_username # non NVIDIA employees on this list are ignored
Once this is done, we should be able to run jobs on RAPIDS self-hosted runners by adding a workflow file looking something like this:
name: Test Self Hosted Runners
on:
push:
branches:
- "pull-request/[0-9]+"
jobs:
build:
runs-on: "linux-amd64-gpu-t4-latest-1"
container:
# must run in a container
image: nvidia/cuda:11.8.0-base-ubuntu22.04
env:
# must set this container env variable
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
steps:
- name: hello
run: |
echo "hello"
nvidia-smi
And from there we can work on migrating the existing setup files in each repo over to GHA workflows.
Differences between Jenkins & GHA
Here’s a quick outline of some of the key differences between the Jenkins and GHA setups that will be relevant to developers:
- Instead of running tests through an external CI platform that reports back to GitHub, tests would be run on self-hosted runners, allowing some degree of customization to the CPU/GPU instances being used as outlined here
- Pull requests approved to run on GPU would automatically be pushed to a branch on the upstream repo of the form
pull-request/<PR_NUMBER>
as part of the process of triggering a self-hosted runner to run the tests; this branch would be automatically deleted once the PR is closed or merged- These
pull-request/*
branches can be ignored in local git operations by running the following:
- These
git config \
--global \
--add "remote.upstream.fetch" \
'^refs/heads/pull-request/*'
- The once internal concepts of the GPU CI “allowlist” and “adminlist” would now be handled on a per-repo basis through the copy-pr-bot concepts of trusted users and vetters, respectively
- Commits without GPG/SSH signatures will be considered untrusted and PRs from trusted users containing them will require vetters to leave an
/ok to test
comment for approval - The bot command
rerun tests
would no longer exist, being replaced with the standard method of rerunning GHA workflow runs
Migrating GPU image builds to GHA
An additional piece of Dask’s GPU CI infrastructure that runs on Jenkins and would need to be migrated is GHA is the automated builds of the GPU containers used for each repo; some benefits I think would come with this migration are:
- Running these builds through GHA would greatly increase the flexibility of where the setup files can be hosted, so that they could be moved from the relatively unknown dask-build-environment repo to a place more visible to Dask maintainers
- Running the scheduled builds through GHA allows those watching the repos hosting the builds to receive notifications when they fail, making it a lot more obvious when things break
The main blockers here would be establishing where we would like these workflow files to exist - my first thoughts would be either dask-docker, a standalone repo, or in the respective repos the images are intended for.
Thoughts?
I’m interested in soliciting opinions from Dask maintainers on what parts of this migration seem appealing/unappealing. My first impressions of some specific pain points are:
- This setup would require new branches to get pushed to the upstream repo frequently, which would contribute some noise to local git operations if the branches aren’t manually ignored
- Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis by vetters on the relevant repo; depending on who this is, this could put increased pressure on a small group to be constantly watching new PRs and consequently disincentivize running GPU tests on all PRs
But it would be great to hear from others on how we can make this process relatively smooth.
I'm generally positive about this.
- If the RAPIDS group say this is better for a number of reasons, that's good enough for me
- I never really liked the Jenkins interface anyway. We would sometimes get GPU CI failures and I'd click over to the workflow results on Jenkins and struggle to find useful information about why the test actually failed (maybe the info is there, just hidden in a weird place? I don't know, usually at that point I'd ask John to figure it out)
The major disadvantage here is:
Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis
This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.
This setup would require new branches to get pushed to the upstream repo frequently, which would contribute some noise to local git operations if the branches aren’t manually ignored
Usually I use the github CLI to checkout specific pull request branches for review (gh pr checkout $PR_NUMBER
). So I don't think I'd care about the extra pull request branches being created, especially if they do eventually get deleted automatically.
But it would be great to hear from others on how we can make this process relatively smooth.
To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by who, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees. It's a good overview, but I'm a little confused about what I would need to do and when, since it seems some steps depend on others too.
This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.
I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little verified
tag too.
I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little verified tag too.
Maybe I misunderstood. Charles says this above:
Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis
So I thought that meant:
- Non-Dask org members - always need to have their PR commits approved
- Dask org members without SSH signed commits - always need to have their PR commits approved
- Dask org members with SSH signed commits - do not need to have anything approved, everything is good here.
Is that incorrect?
It's category 1 I'm thinking about. But perhaps newcomers might not actually feel frustrated they have to wait on us, and maybe it's totally fine for the GPU CI to run and pass only once at the end of the PR process.
Maybe I misunderstood.
Nope, your understanding of the system is correct
This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.
I (along with other Dask-RAPIDS developers) share this sentiment, though my concern lies less with confusion/frustration from potential newcomers (who I assume would largely ignore the automated messages) and more with the likelihood of GPU-breaking code going in due to GPU tests not getting run as often. For now, I've generally accepted these limitations as a necessity to assuage security concerns related to the migration with hopes that we can loosen these requirements later on, but ultimately can't make any solid promises there 😕
To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by who, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees
Good point, will give that section a second pass to make things clearer
I think for a first-time contributor it wouldn't be surprising to see that some CI approval is needed. I guess it may be more frustrating for a return contributor who is not a Dask-org member.
Maybe we should be more liberal and add all contributors to the org when their PR is merged? I expect we could automate that with a GitHub Action. We give out membership pretty openly anyway.