[RFC] [ci] remove Azure DevOps CI jobs?
Description
@shiyu1994 informed me this week (in the private maintainer chat) that he has left Microsoft.
As a result, there are now 0 active contributors to LightGBM employed by Microsoft.
That poses a serious risk to development on this project. Development in this project has been disrupted many times over the last few years with problems of the form "someone needs to take an administrative action that only Microsoft employees have permission to do".
Most recently, the Azure DevOps Linux CI jobs did not work for 2 months: #6918
Even if/when Microsoft assigns new maintainers to the project, based on my experience over the last few years, I am not optimistic that they'll be very responsive. For example, my requests for financial/development support to build and test aarch64 macOS wheels received 0 response from Microsoft for multiple years: #5328. And other than @shiyu1994 (who was also actively developing LightGBM), I have seen very little other involvement from Microsoft in this project's day-to-day maintenance over the last few years.
To try to reduce the risk of disruption to development, I'm proposing that we remove all Azure DevOps CI jobs in the project.
Benefits of this work
- reduces the risk of lengthy periods where commits cannot be merged in this project
Acceptance criteria
- LightGBM CI does not use Azure DevOps
Notes
The Azure DevOps CI jobs have been most useful for allowing LightGBM to run more jobs concurrently than we could on only GitHub-hosted runners.
A few years ago, we were worried about being limited to 20 concurrent GitHub Actions jobs across the entire repo (https://github.com/microsoft/LightGBM/pull/3672#issuecomment-753561780). I think that limit should be much higher now.
Table from https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration
I don't know for sure, but I strongly suspect that the microsoft organization is on an Enterprise GitHub plan. If I'm right about that, our limit would be 500 total concurrent jobs and 50 concurrent macOS jobs... I think we could make LightGBM, at its current needs and level of activity, work within those limits, with most of the jobs from Azure DevOps transitioned to GitHub Actions.
Approach
For all of the jobs in https://github.com/microsoft/LightGBM/blob/1684419f6d2221bc355f883f27c84581359d126b/.vsts-ci.yml, do some mix of the following:
- run them on GitHub Actions (with GitHub-hosted runners)
- run them on AppVeyor (with AppVeyor-hosted runners)
- stop running them completely
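As a rough illustration of the first option, a job moved to a GitHub-hosted runner could look something like the sketch below. This is not taken from the repository: the workflow name, the job name, and the placeholder build step are all hypothetical, and the real jobs would carry over whatever commands the corresponding Azure DevOps job ran.

```yaml
# Sketch only: workflow name, job name, and the build step are hypothetical.
name: example-migrated-job

on:
  pull_request:
  push:
    branches:
      - master

jobs:
  example:
    # GitHub-hosted runner, so no Microsoft-administered agent pool is involved
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # assumption: the job needs the repository's git submodules
          submodules: true
      - name: Build and test
        # placeholder for the commands the Azure DevOps job used to run
        run: echo "replace with the commands from the migrated job"
```

Every job added this way counts against the concurrent-job limits discussed in the notes above.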
Progress tracking:
- [x] remove references to AZURE env variable in CI scripts (https://github.com/microsoft/LightGBM/pull/7024)
- [x] swig jobs (https://github.com/microsoft/LightGBM/pull/7024)
- [x] if-else jobs (https://github.com/microsoft/LightGBM/pull/7033)
- [x] cpp-tests jobs (https://github.com/microsoft/LightGBM/pull/7033)
- [x] r-package jobs (https://github.com/microsoft/LightGBM/pull/7032)
- [ ] Python package (bdist, gpu, regular, sdist) jobs
  - [x] Windows (#7086)
  - [ ] Linux / Linux_latest - bdist, regular and sdist (#7096)
  - [x] Linux / Linux_latest - mpi (#7089)
  - [ ] Linux / Linux_latest - GPU
- [ ] move lib_lightgbm.{dll,dylib,so} creation to GitHub Actions
- [ ] move NuGet package creation to GitHub Actions
- [x] move commit.txt and LightGBM tarball creation to GitHub Actions (#7092)
- [ ] fully remove configuration, code, etc. related to Azure DevOps here
- [x] implement release requirements (e.g. producing artifacts for new releases) (#7080)
- [x] update docs on getting nightly packages (#7080)
- [ ] automate uploading artifacts to release (ref: https://github.com/microsoft/LightGBM/pull/7080#pullrequestreview-3470357799)
Tagging other maintainers for your thoughts:
@StrikerRUS @guolinke @shiyu1994 @jmoralez @borchero
I am +1
I'm +1 as well
The main problem with such a migration will be artifact creation, I guess. But OK, let's try to do this.
I know :/
But at least we'd be able to fix any issues ourselves and not be blocked the way we were in #6918
Ok thank you all, I will start on this this week. I'll break it up into a few PRs.
Little late to the party but I agree that, under the circumstances, it's a smart move to move off of Azure CI jobs. Thanks @jameslamb!
I'm temporarily taking care of the CI runners for lightgbm in Microsoft (one Azure DevOps CPU pool (8-core), one GitHub Actions GPU runner (V100)).
I agree that the project should move fully to GitHub Actions and use the hosted runners; it will simplify things.
~~However, this won't work for the GPU runner which backs the CUDA GitHub Actions workflow (https://github.com/microsoft/LightGBM/blob/master/.github/workflows/cuda.yml). Unfortunately (for cost attribution reasons), using the separately paid GPU Larger Runner of GitHub (T4 GPUs) is not an option. I'm currently checking who can own the V100 runner inside Microsoft. If you see any issues with the GPU runner, feel free to tag me.~~ Edit: I managed to work something out and this repo now has access to GPU Larger Runners.
Nice to meet you @letmaik, thanks very much for the help! I will try to make some progress on moving the other (non-GPU) jobs off of Azure DevOps.
And please @ me any time for help on moving the CUDA jobs to the GPU Larger Runner over in #6958 (glad that seems like it might be an option now!).
@letmaik was our Azure DevOps pool removed????
I just saw the following on CI for #6979 (Azure DevOps build link).
There was a resource authorization issue: "The pipeline is not valid. Could not find a pool with name lgb_agent_pool_ado. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz. Could not find a pool with name lgb_agent_pool_ado. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz."
I clicked "authorize resources" there:
A box popped up that said:
Resources were successfully authorized
Then I clicked "Re-run failed jobs". Nothing new started... an error box popped up in the Azure DevOps UI with this text:
No plan found for identifier 520448be-17c0-418c-bd26-12da1f195a74.
So I pushed a new commit to #6979, thinking that might be necessary to trigger a new build. That triggered https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17877&view=results .... which immediately failed in exactly the same way.
@jameslamb Before Yu Shi left, he migrated the pool to a new location. It looks like he never tested it, and recently we removed the old pool. The new pool has the name lightgbm_agent_pool_ado, please use that in the YAML.
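For readers following along, the rename amounts to a one-line change wherever the pipeline YAML names the pool. The fragment below is only a sketch: the job name and the step are hypothetical and not copied from .vsts-ci.yml; only the two pool names come from this thread.

```yaml
# Sketch only: job name and step are hypothetical, not copied from .vsts-ci.yml.
jobs:
  - job: Linux
    # new pool name from the comment above (previously lgb_agent_pool_ado)
    pool: lightgbm_agent_pool_ado
    steps:
      - script: echo "running on the new pool"
```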
Ok, thank you for the quick response.
It's unfortunate that the "old" pool was removed before LightGBM was converted to using something else. Now we have to fix this under time pressure to retain full CI coverage, and I have to explain to external contributors that there are some CI errors which they can safely ignore... I would have preferred to migrate from one working state to another without CI being broken 😫
The new pool has the name lightgbm_agent_pool_ado, please use that in the YAML.
I just pushed a commit to #6979 and the pipeline is now stuck in "pending" with this message:
This pipeline needs permission to access a resource before this run can continue
ref: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17880&view=results
@letmaik can you please go to https://dev.azure.com/lightgbm-ci/lightgbm-ci/_settings/agentqueues?queueId=69&view=security and grant permission for our CI pipelines to use that pool? I don't have sufficient permissions.
@jameslamb Done. Jobs are running on https://github.com/microsoft/LightGBM/pull/6979 now.
@jameslamb Once your PR is merged I'd suggest going to the broken PRs and just pressing "Update Branch" so they pick up the new name.
Excellent, thank you so much for the quick response!!
Can confirm that it looks like things are working there and all jobs passed using the new pool: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17880&view=results
Once your PR is merged I'd suggest going to the broken PRs and just pressing "Update Branch" so they pick up the new name.
Yep sounds good to me, will do.
I haven't had much time to spend on LightGBM the last few weeks... I have more now, so will try to make some progress soon on moving these jobs to GitHub-hosted runners so that we don't have to bother you or any of your colleagues for things like this in the future.
I'll keep trying to make progress on this when I can, to hopefully help stabilize CI here. I've put up a list in the issue description, breaking this down into smaller tasks.
Added a checklist item about updating docs for nightly packages:
https://github.com/microsoft/LightGBM/blob/6368375b621070821290b9e3df3bdabd8038f8b8/docs/Installation-Guide.rst?plain=1#L33-L35
That location and process will change when we move all these jobs to GitHub Actions.