tasks and discussion for new GPU CI queue
This issue is to document tasks and todos for the new GPU CI queue.
To-Do Items:
- [x] decide on software to provision the CI (decided on Drone)
  - best candidates from internal discussions are Drone or Azure
- [ ] develop a process to permission feedstocks on the CI
  - [ ] we will want an allow list of feedstocks that are permitted to use the queue
  - [ ] we will want to add a job to conda-forge/admin-requests to add feedstocks to this list and provision them with the proper keys/permissions to access the CI (see the sketch after this list)
  - [ ] add an allow list of users
- [ ] put in changes to smithy to allow separate build and test phases in the CI config files
  - [ ] make sure the build phase does not tie up a GPU in the CI system
  - [ ] separate queues for build and test
  - [ ] move jobs from building on CPU to testing on GPU
- [ ] put in monitoring for the load on the queues
  - we have existing tools that will be able to output the load in five-minute increments to the conda-forge status page
  - we may want more than this, however
- [ ] establish and document a clear process for who to contact when things fail or break
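As a sketch of how the allow-list gating above could work (see the admin-requests item), here is a minimal Python illustration. The file name `gpu_ci_allowlist.yml`, its layout, and the function names are all placeholders assumed for illustration, not anything that has been decided; it only shows the shape of the check a provisioning job or dispatcher would run.

```python
# Hypothetical sketch: gate GPU CI provisioning on an allow list.
# Assumes a YAML file like:
#
#   feedstocks:
#     - some-feedstock
#   users:
#     - some-maintainer
#
# None of these names/paths are decided yet; PyYAML is assumed to be available.
import yaml


def load_allowlist(path="gpu_ci_allowlist.yml"):
    """Read the allow list and return (feedstocks, users) as sets."""
    with open(path) as fh:
        data = yaml.safe_load(fh) or {}
    return set(data.get("feedstocks", [])), set(data.get("users", []))


def may_use_gpu_queue(feedstock, requesting_user, path="gpu_ci_allowlist.yml"):
    """Return True only if both the feedstock and the requesting user are allowed."""
    feedstocks, users = load_allowlist(path)
    return feedstock in feedstocks and requesting_user in users


if __name__ == "__main__":
    if may_use_gpu_queue("some-feedstock", "some-maintainer"):
        print("provision GPU CI keys/permissions for this feedstock")
    else:
        print("reject: feedstock or user is not on the allow list")
```

Keeping both the feedstock and user lists in a single reviewed file would let the admin-requests workflow remain the one place where access is granted or revoked.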
cc @dharhas @jakirkham @kkraus14 @viniciusdc
@mariusvniekerk for Azure stuff
I missed Ray and Kent on the GitHub handles. Can someone ping them here?
hackmd: https://hackmd.io/QCX9xMnzS2WeobW0athINA
cc @ocefpaf @mike-wendt @raydouglass @teoliphant
cc @h-vetinari (Axel)
Should add I also don't know Kent's or others' GH handles. So please cc others as needed. Thanks! 😄
Ahhhh thanks! I could not find Axel's GitHub handle. See this one too for the other Azure stuff: https://github.com/conda-forge/conda-forge.github.io/issues/1273
cc @Kllewelyn @rtiwariops from openteams.
Very exciting!
Thanks for opening this @beckermr!
Linking my closing comment from #1062 for reference. TL;DR:
In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low-cost) CI infra to enable conda-forge to handle the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.
Adding @leej3 from Quansight.
cc @aktech (as it looks like you have been doing work in this area as well)
@beckermr have we decided upon Azure as the software to provision the CI for the GPU tasks?
Hey Folks,
Looks like this is finally happening. We expect hardware to arrive in 2-4 weeks. There are still a lot of unanswered questions about the software stack we should run on it and how to get it set up and managed, etc. So I wanted to restart the discussion.
Also adding @jaimergp to the conversation.
Would it make sense to have a meeting?
I think a coordination meeting makes sense. We probably need some higher bandwidth time to get broad strokes of what this will look like sorted out.
Ok. Went ahead and created a poll for us to figure out the best time to meet in the next 2 weeks. Also please make sure to configure your timezone before filling out the poll. Will share the results here and we can go from there.
May I invite myself? 😛
Appreciate the general enthusiasm around this work! 😄
Anyone is welcome. Though my guess is this will be focused on technical issues around integration into conda-forge. So I doubt this will be of interest outside of those planning to do that work. That said, we can take notes, raise new issues, and summarize here for broader community awareness.
Awesome news! Would love to participate, but on holidays for the next two weeks 😅
hey @jakirkham! I totally missed this poll. If there is a time for the next meeting already, that is fine. Otherwise, I have filled out the poll.
Ok, every time has some conflicts for someone. That said, the least conflicting time is 27 July at 9a US Pacific / 11a US Central / 12p US Eastern / 5p UK / 6p European. We can take notes and summarize here for those who miss it. Will send out an invite and we can go from there.
Alright have sent that out 📬
Think I got everyone who responded to the poll. Though feel free to forward it to others that I may have missed.
Also set it up with Microsoft Teams since that's what I have easy access to. Though if people prefer to use something different, feel free to propose (and be ready to set it up 😉). Otherwise we will stick with Teams.
Thanks all! 😄
I'm not totally sure about the feasibility of this, but Drone seems to have an admin management feature for queues.
Just noticed that Drone's open source version is fairly hobbled vs. their paid version.
https://www.drone.io/enterprise/opensource/#features
Not sure if we need any of the features that are not present in the OSS version, but I thought I'd raise it.
Thanks for this. We'll have to find out by doing I imagine.
btw what's the GPU model that the CI would use? I was under the wrong impression yesterday that MIG would work out of the box for any existing model, but it looks like only certain Ampere GPUs support this feature.
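In case it helps once the hardware lands: a quick way to see what the cards report is to query the MIG mode through nvidia-smi. This is just a convenience sketch, assuming a driver recent enough to know about MIG; on GPUs that don't support it the field typically comes back as `[N/A]`.

```python
# Sketch: report whether each visible GPU has MIG enabled/supported.
# Relies on `nvidia-smi --query-gpu=...`, available in recent NVIDIA drivers.
import subprocess


def mig_status():
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,mig.mode.current",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, name, mig_mode = (field.strip() for field in line.split(","))
        print(f"GPU {index} ({name}): MIG mode = {mig_mode}")


if __name__ == "__main__":
    mig_status()
```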
The server under Quansight is currently maintained using OpenStack, and we will be receiving an account for admin management. That said, the overall architecture we ended up with will split the GPUs across VMs (each containing 2 GPUs); we can change that later on if needed, as OpenStack uses configurable profiles (they call them Flavors).
The idea would then be to use Drone to manage the webhook requests from GitHub and choose one of the VMs (each will have runners installed). There is support for OpenStack in Drone already, so the implementation might be easier than what we had previously assumed.
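One detail that applies whatever the final setup looks like: every webhook delivery from GitHub should have its HMAC signature verified before anything is dispatched to a GPU VM. Drone's own GitHub integration already handles this; the snippet below is only a standard-library illustration of the check, in case we end up putting a thin custom dispatcher in front of the VMs.

```python
# Illustration only: verify GitHub's X-Hub-Signature-256 header on a webhook delivery.
# GitHub computes HMAC-SHA256 over the raw request body using the shared webhook secret
# and sends it as "sha256=<hexdigest>".
import hashlib
import hmac


def verify_github_signature(raw_body: bytes, secret: str, signature_header: str) -> bool:
    """Return True if the signature header matches the HMAC of the body."""
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid leaking timing information.
    return hmac.compare_digest(expected, signature_header)


if __name__ == "__main__":
    body = b'{"action": "opened"}'
    secret = "not-a-real-secret"  # placeholder; the real secret lives with the webhook config
    header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    print(verify_github_signature(body, secret, header))  # True
```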
We will need to think about how we will trigger these special jobs as well. Are we going to create some new flags for those feedstocks? Should we whitelist those as well?
@beckermr about the current status of the CI-run integration, do you think we can have a feedstock run some tests and check how the permissioning will be handled?
I do not.
Should we add a test suite somewhere in the bot's tests to evaluate the GPU builds before enabling it? I am open to any suggestions for testing this integration.
We need to hear back from the CI-run folks.
@jaimergp would you be able to share an update on where things stand at the meeting ~~tomorrow~~ later today? 🙂