tasks and discussion for new GPU CI queue
This issue is to document tasks and todos for the new GPU CI queue.
To-Do Items:
- [x] decide on software to provision the CI (decided on Drone)
  - best candidates from internal discussions are Drone or Azure
- [ ] develop a process to permission feedstocks on the CI
  - [ ] we will want an allow list of feedstocks that are permitted to use the queue
  - [ ] we will want to add a job to conda-forge/admin-requests to add feedstocks to this list and provision them with the proper keys/permissions to access the CI (see the sketch after this list)
  - [ ] add an allow list of users
- [ ] put in changes to smithy to allow separate build and test phases in the CI config files
  - [ ] make sure the build phase does not tie up a GPU in the CI system
  - [ ] separate queues for build and test
  - [ ] move jobs from building on CPU to testing on GPU
- [ ] put in monitoring for the load on the queues
  - we have existing tools that will be able to output the load in five-minute increments to the conda-forge status page
  - we may want more than this, however
- [ ] establish and document a clear process for who to contact when things fail or break
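As a sketch of how the allow-list gating above could work (see the admin-requests item), here is a minimal Python illustration. The file name `gpu_ci_allowlist.yml`, its layout, and the function names are all placeholders assumed for illustration, not anything that has been decided; it only shows the shape of the check a provisioning job or dispatcher would run.

```python
# Hypothetical sketch: gate GPU CI provisioning on an allow list.
# Assumes a YAML file like:
#
#   feedstocks:
#     - some-feedstock
#   users:
#     - some-maintainer
#
# None of these names/paths are decided yet; PyYAML is assumed to be available.
import yaml


def load_allowlist(path="gpu_ci_allowlist.yml"):
    """Read the allow list and return (feedstocks, users) as sets."""
    with open(path) as fh:
        data = yaml.safe_load(fh) or {}
    return set(data.get("feedstocks", [])), set(data.get("users", []))


def may_use_gpu_queue(feedstock, requesting_user, path="gpu_ci_allowlist.yml"):
    """Return True only if both the feedstock and the requesting user are allowed."""
    feedstocks, users = load_allowlist(path)
    return feedstock in feedstocks and requesting_user in users


if __name__ == "__main__":
    if may_use_gpu_queue("some-feedstock", "some-maintainer"):
        print("provision GPU CI keys/permissions for this feedstock")
    else:
        print("reject: feedstock or user is not on the allow list")
```

Keeping both the feedstock and user lists in a single reviewed file would let the admin-requests workflow remain the one place where access is granted or revoked.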
cc @dharhas @jakirkham @kkraus14 @viniciusdc
@mariusvniekerk for Azure stuff
I missed Ray and Kent on the GitHub handles. Can someone ping them here?
hackmd: https://hackmd.io/QCX9xMnzS2WeobW0athINA
cc @ocefpaf @mike-wendt @raydouglass @teoliphant
cc @h-vetinari (Axel)
Should add I also don't know Kent's or others' GH handles. So please cc others as needed. Thanks! 😄
Ahhhh thanks! I could not find Axel's GitHub handle. See this one too for the other Azure stuff: https://github.com/conda-forge/conda-forge.github.io/issues/1273
cc @Kllewelyn @rtiwariops from openteams.
Very exciting!
Thanks for opening this @beckermr!
Linking my closing comment from #1062 for reference. TL;DR:
In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low-cost) CI infra to enable conda-forge to handle the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.
Adding @leej3 from Quansight.
cc @aktech (as it looks like you have been doing work in this area as well)
@beckermr have we decided upon Azure as the software to provision the CI for the GPU tasks?
Hey Folks,
Looks like this is finally happening. We expect hardware to arrive in 2-4 weeks. There are still a lot of unanswered questions about the software stack we should run on it and how to get it set up and managed, etc. So I wanted to restart the discussion.
Also adding @jaimergp to the conversation.
Would it make sense to have a meeting?
I think a coordination meeting makes sense. We probably need some higher bandwidth time to get broad strokes of what this will look like sorted out.
Ok. Went ahead and created a poll for us to figure out the best time to meet in the next 2 weeks. Also please make sure to configure your timezone before filling out the poll. Will share the results here and we can go from there.
May I invite myself? 😛
Appreciate the general enthusiasm around this work! 😄
Anyone is welcome. Though my guess is this will be focused on technical issues around integration into conda-forge. So I doubt this will be of interest outside of those planning to do that work. That said, we can take notes, raise new issues, and summarize here for broader community awareness.
Awesome news! Would love to participate, but on holidays for the next two weeks 😅
hey @jakirkham! I totally missed this poll. If there is a time for the next meeting already, that is fine. Otherwise, I have filled out the poll.
Ok, every time has some conflicts for someone. That said, the least conflicting time is 27 July at 9a US Pacific / 11a US Central / 12p US Eastern / 5p UK / 6p European. We can take notes and summarize here for those who miss it. Will send out an invite and we can go from there.
Alright have sent that out 📬
Think I got everyone who responded to the poll. Though feel free to forward it to others that I may have missed.
Also set it up with Microsoft Teams since that's what I have easy access to. Though if people prefer to use something different, feel free to propose (and be ready to set it up 😉). Otherwise we will stick with Teams.
Thanks all! 😄
I'm not totally sure about the feasibility of this, but Drone seems to have an admin management feature for queues.
Just noticed that Drone's open source version is fairly hobbled vs. their paid version.
https://www.drone.io/enterprise/opensource/#features
Not sure if we need any of the features that are not present in the OSS version, but I thought I'd raise it.
Thanks for this. We'll have to find out by doing I imagine.
btw what's the GPU model that the CI would use? I was under the wrong impression yesterday that MIG would work out of the box for any existing model, but it looks like only certain Ampere GPUs support this feature.
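In case it helps once the hardware lands: a quick way to see what the cards report is to query the MIG mode through nvidia-smi. This is just a convenience sketch, assuming a driver recent enough to know about MIG; on GPUs that don't support it the field typically comes back as `[N/A]`.

```python
# Sketch: report whether each visible GPU has MIG enabled/supported.
# Relies on `nvidia-smi --query-gpu=...`, available in recent NVIDIA drivers.
import subprocess


def mig_status():
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,mig.mode.current",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, name, mig_mode = (field.strip() for field in line.split(","))
        print(f"GPU {index} ({name}): MIG mode = {mig_mode}")


if __name__ == "__main__":
    mig_status()
```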
The server under Quansight is currently maintained using OpenStack, and we will be receiving an account for admin management. That said, the overall architecture we ended up with will split the GPUs across VMs (each containing 2 GPUs); we can change that later on if needed, as OpenStack uses configurable profiles (they call them Flavors).
The idea would then be to use Drone to manage the webhook requests from GitHub and choose one of the VMs (each will have runners installed). There is support for OpenStack in Drone already, so the implementation might be easier than what we had previously assumed.
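One detail that applies whatever the final setup looks like: every webhook delivery from GitHub should have its HMAC signature verified before anything is dispatched to a GPU VM. Drone's own GitHub integration already handles this; the snippet below is only a standard-library illustration of the check, in case we end up putting a thin custom dispatcher in front of the VMs.

```python
# Illustration only: verify GitHub's X-Hub-Signature-256 header on a webhook delivery.
# GitHub computes HMAC-SHA256 over the raw request body using the shared webhook secret
# and sends it as "sha256=<hexdigest>".
import hashlib
import hmac


def verify_github_signature(raw_body: bytes, secret: str, signature_header: str) -> bool:
    """Return True if the signature header matches the HMAC of the body."""
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid leaking timing information.
    return hmac.compare_digest(expected, signature_header)


if __name__ == "__main__":
    body = b'{"action": "opened"}'
    secret = "not-a-real-secret"  # placeholder; the real secret lives with the webhook config
    header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    print(verify_github_signature(body, secret, header))  # True
```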
We will need to think about how we will trigger these special jobs as well. Are we going to create some new flags for those feedstocks? Should we whitelist those as well?
@beckermr about the current status of the CI-run integration, do you think we can have a feedstock run some tests and check how the permissioning will be handled?
I do not.
Should we add a test suite somewhere in the bot's tests to evaluate the GPU builds before enabling it? I am open to any suggestions for testing this integration.
We need to hear back from the CI-run folks.
@jaimergp would you be able to share an update on where things stand at the meeting ~~tomorrow~~ later today? 🙂