
[URGENT] Reducing our usage of GitHub Runners

Open lupyuen opened this issue 1 year ago • 48 comments

Hi All: We have an ultimatum to reduce (drastically) our usage of GitHub Actions. Or our Continuous Integration will halt totally in Two Weeks. Here's what I'll implement within 24 hours for nuttx and nuttx-apps repos:

  1. When we submit or update a Complex PR that affects All Architectures (Arm, RISC-V, Xtensa, etc): the CI Workflow shall run only half the jobs. Previously the CI Workflow would run arm-01 to arm-14; now we will run only arm-01 to arm-07. (This will reduce GitHub Cost by 32%)

  2. When the Complex PR is Merged: CI Workflow will still run all jobs arm-01 to arm-14

    (Simple PRs with One Single Arch / Board will build the same way as before: arm-01 to arm-14)

  3. For NuttX Admins: Our Merge Jobs are now at github.com/NuttX/nuttx. We shall have only Two Scheduled Merge Jobs per day

    I shall quickly Cancel any Merge Jobs that appear in nuttx and nuttx-apps repos. Then at 00:00 UTC and 12:00 UTC: I shall start the Latest Merge Job at nuttxpr. ~~(This will reduce GitHub Cost by 17%)~~

  4. macOS and Windows Jobs (msys2 / msvc): They shall be totally disabled until we find a way to manage their costs. (GitHub charges 10x premium for macOS runners, 2x premium for Windows runners!)

    Let's monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.

    (This must be done for BOTH nuttx and nuttx-apps repos. Sadly the ASF Report for GitHub Runners doesn't break down the usage by repo, so we'll never know how much macOS and Windows Jobs are contributing to the cost. That's why we need https://github.com/apache/nuttx/pull/14377)

    (Wish I could run NuttX CI Jobs on my M2 Mac Mini. But the CI Script only supports Intel Macs sigh. Buy a Refurbished Intel Mac Mini?)
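Item 3 above (cancelling stale Merge Jobs, then starting only the latest one on schedule) could be scripted with the GitHub CLI roughly like this. This is a sketch, not the actual script: the repo name, workflow file name and the `RUN_MAIN` guard are assumptions for illustration, and `gh` must be installed and authenticated.

```shell
#!/bin/sh
# Hypothetical sketch: keep only the newest queued/running Merge Job and
# cancel the rest. REPO and WORKFLOW below are placeholders.

REPO="${REPO:-NuttX/nuttx}"
WORKFLOW="${WORKFLOW:-build.yml}"

# List active run IDs for the workflow, newest first.
list_active_runs() {
  gh run list --repo "$REPO" --workflow "$WORKFLOW" \
    --json databaseId,status \
    --jq '.[] | select(.status == "queued" or .status == "in_progress") | .databaseId'
}

# Cancel every active run except the newest one.
cancel_superseded() {
  list_active_runs | tail -n +2 | while read -r run_id; do
    gh run cancel "$run_id" --repo "$REPO"
  done
}

# Run only when invoked with RUN_MAIN=1, so the functions can be sourced safely.
if [ "${RUN_MAIN:-0}" = "1" ]; then
  cancel_superseded
fi
```

Something like this could then be invoked from a scheduler at 00:00 and 12:00 UTC, just before starting the latest Merge Job.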

We have done an Analysis of CI Jobs over the past 24 hours:

https://docs.google.com/spreadsheets/d/1ujGKmUyy-cGY-l1pDBfle_Y6LKMsNp7o3rbfT1UkiZE/edit?gid=0#gid=0

Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that eventually get superseded and cancelled

[Screenshot: 2024-10-17 1:18 PM]

When we Halve the CI Jobs: We reduce the wastage of GitHub Runners

[Screenshot: 2024-10-17 1:15 PM]

Scheduled Merge Jobs will also reduce wastage of GitHub Runners, since most Merge Jobs don't complete (only 1 completed yesterday)

[Screenshot: 2024-10-17 1:16 PM]

See the ASF Policy for GitHub Actions

lupyuen avatar Oct 17 '24 05:10 lupyuen

As commented by @xiaoxiang781216:

can we reduce the boards on the Linux host to keep macOS/Windows? it's very easy to break these hosts without this basic coverage.

I suggest that we monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.

lupyuen avatar Oct 17 '24 07:10 lupyuen

One of the methods proposed (by @btashton, if I remember correctly) is to replace many simple configurations for some boards (mostly for peripheral testing) with one large jumbo config activating everything possible. This won't work for chips with low memory, but it will save some CI resources anyway.
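A rough sketch of the jumbo-config idea: several small defconfig fragments could be merged into one build, with later fragments winning on conflicting symbols. The fragment names below are made up, and a real NuttX config would still need dependency resolution (e.g. `make olddefconfig`) afterwards.

```shell
# Hypothetical helper: merge defconfig fragments into one jumbo config.
# On conflicting CONFIG_ symbols, the last fragment wins.
merge_defconfigs() {
  cat "$@" | awk -F= '
    /^CONFIG_/ { val[$1] = $0 }       # remember the last value per symbol
    END { for (k in val) print val[k] }
  ' | sort
}
```

Usage would be something like `merge_defconfigs spi.defconfig i2c.defconfig adc.defconfig > boards/.../jumbo/defconfig`.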

raiden00pl avatar Oct 17 '24 08:10 raiden00pl

@raiden00pl Yep I agree. Or we could test a complex target like board:lvgl?

lupyuen avatar Oct 17 '24 08:10 lupyuen

Here's another comment about macOS and Windows by @yamt: https://github.com/apache/nuttx/pull/14377#issuecomment-2418914068

lupyuen avatar Oct 17 '24 08:10 lupyuen

sorry, let me ask a dumb question. what plan are we using (https://github.com/pricing)? is apache paying for it?

yamt avatar Oct 17 '24 08:10 yamt

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

Update: More info here: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners

If your project uses GitHub Actions, you share a queue with all other Apache projects using Github Actions, which can quickly lead to frustration for everyone involved. Builds can be stuck in "queued" for 6+ hours.

One option (if you want to stick with GitHub and don't want to use the Infra-managed Jenkins) is for your project to create its own self-hosted runners, which means your jobs will run on a virtual machine (VM) under your project's control. However this is not something to tackle lightly, as Infra will not manage or secure your VM - that is up to you.

Update 2: This sounds really complicated. I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

lupyuen avatar Oct 17 '24 09:10 lupyuen

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

do you know if the macos/windows premium applies as usual? the policy page seems to have no mention about it.

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

yea, i guess projects have very different sizes/demands. (i feel nuttx is using too much anyway though :-)

yamt avatar Oct 17 '24 09:10 yamt

...I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

Is there any merit in "farming out" CI tests to those with boards? I think there was a discussion about NuttX owning a suite of boards, but I'm not sure where that got to - and it would depend on just 1 or 2 people managing it.

As an aside, is there a guide to self-running CI? As I work on a custom board it would be good for me to do this occasionally, but I have no idea where to start!

TimJTi avatar Oct 17 '24 09:10 TimJTi

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

lupyuen avatar Oct 17 '24 09:10 lupyuen

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

TimJTi avatar Oct 17 '24 10:10 TimJTi

[like] Jerpelea, Alin reacted to the message above.

jerpelea avatar Oct 17 '24 10:10 jerpelea

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

These work, but they do not describe the entire CI, just how to run pytest checks for the sim:citest configuration.

michallenc avatar Oct 17 '24 11:10 michallenc

Yes, let's cut what we can (but keep at least minimal functional configure, build and syntax testing) and see what the cost reduction is. We need to show Apache we are working on the problem. So far optimizations did not cut the usage and we are in danger of losing all CI :-(

On the other hand, it seems unfair to share the same CI quota as small projects. NuttX is a fully featured RTOS working on ~1000 different devices. In order to keep project code quality we need the CI.

Maybe it's time to rethink / redesign the CI test architecture and implementation from scratch?

cederom avatar Oct 17 '24 11:10 cederom

Another problem is that people very often send unfinished, undescribed PRs that are then updated without a comment or request, which triggers the whole big CI process several times :-(

Some changes are sometimes required and we cannot avoid that; it is part of the process. But maybe we can make something more "adaptive", so that only minimal CI is launched by default, preferably only in the area that was changed; then, with all approvals, we make one manually triggered final big check before merge?

Long story short: We can switch CI test runs to manual trigger for now to see how it reduces costs. I would see two buttons to start Basic and Advanced (maybe also Full = current setup) CI.

cederom avatar Oct 17 '24 11:10 cederom

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?
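A sketch of this idea, assuming a hypothetical `Build-Config:` field in the PR description (this field does not exist in the NuttX PR template today):

```shell
# Hypothetical pre-flight check: build only the config named in the PR body
# before launching the full CI matrix. "Build-Config:" is an invented field.

# Pull e.g. "rv-virt:nsh" out of a PR body fed on stdin.
extract_build_config() {
  sed -n 's/^Build-Config:[[:space:]]*//p' | head -n 1
}

# Configure and build that single target; skip quietly if the field is absent.
smoke_build() {
  config="$1"
  if [ -z "$config" ]; then
    echo "no Build-Config field, skipping smoke build"
    return 0
  fi
  ./tools/configure.sh "$config" && make
}
```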

lupyuen avatar Oct 17 '24 11:10 lupyuen

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

People often can't fill in even one single sentence to describe Summary, Impact, Testing :D This may be detected automatically... or we could just see which architecture is the cheapest and use it for all basic tests..?

cederom avatar Oct 17 '24 11:10 cederom

Another problem is that people very often send unfinished, undescribed PRs that are then updated without a comment or request, which triggers the whole big CI process several times :-(

Often contributors use CI to test all configurations instead of testing changes locally. On one hand I understand this, because compiling all configurations on a local machine takes a lot of time; on the other hand, I'm not sure CI is meant for this purpose (especially when we have limits on its use).

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

It won't work. Users are lazy, and in order to choose correctly what needs to be compiled, you need comprehensive knowledge of the entire NuttX, which is not that easy. The only reasonable option is to automate this process.

raiden00pl avatar Oct 17 '24 12:10 raiden00pl

So it looks like, for now, when dramatic steps need to be taken, we need to mark all PRs as drafts and start CI by hand when we are sure everything is ready for merge? o_O

cederom avatar Oct 17 '24 14:10 cederom

[like] Jerpelea, Alin reacted to the message above.

jerpelea avatar Oct 17 '24 14:10 jerpelea

Stats for the past 24 hours: We consumed 61 Full-Time Runners, still a long way from our target of 25 Full-Time Runners (otherwise ASF will halt our servers in 12 days)

  • Our Merge Jobs are now at github.com/NuttX/nuttx
  • ~~We have switched to Four Scheduled Merge Jobs per day. New Merge Jobs will now run for a few seconds before getting auto-killed by our script, via the GitHub CLI. (See the Merge Jobs)~~
  • nuttx-apps has stopped macOS and Windows Jobs. But not much impact, since we don't compile nuttx-apps often
    https://github.com/apache/nuttx-apps/pull/2750
  • Still waiting for nuttx repo to stop macOS and Windows Jobs (Update: merged!)
    https://github.com/apache/nuttx/pull/14377
  • Also waiting for nuttx repo to Halve The Jobs (Update: merged!)
    https://github.com/apache/nuttx/pull/14386
  • And for nuttx-apps to Halve The Jobs (probably not much impact, since we don't compile nuttx-apps often) (Update: merged!)
    https://github.com/apache/nuttx-apps/pull/2753
  • Will wait for the above to be merged, then we monitor some more (Update: All merged! Thanks Tomek :-)
  • If our Full-Time Runners don't reduce significantly after 24 hours: We shall further reduce our jobs, halving the jobs for RISC-V / Xtensa / Simulator when we Create / Modify a Complex PR. Also: Reduce the Daily Merge Jobs from 4 to 2.
  • We shall close this issue only when we reach our target of 25 Full-Time Runners per day. (And ASF won't shut us down)

[Screenshot: 2024-10-18 6:14 AM]

lupyuen avatar Oct 17 '24 22:10 lupyuen

Okay, it's 00:00 UTC. We are really short on time. I have merged the changes. Let's monitor the usage now for 24 hours; we need metrics. We can always revert the commits.

Looking at the pie chart, 99.7% of the usage comes from the builds; other tasks are barely visible. So we need to focus on the builds :-)

cederom avatar Oct 18 '24 00:10 cederom

Sorry, no clue why it closed in my name o_O

Ah, GH seems to close issues on its own when a related PR gets merged. Probably this also happened before, when Xiang merged the prior PR :D

cederom avatar Oct 18 '24 00:10 cederom

The builds are so much faster today, yay! https://github.com/apache/nuttx/actions/runs/11395811301 [Screenshot: 2024-10-18 9:36 AM]

lupyuen avatar Oct 18 '24 01:10 lupyuen

We can also disable CI checks for draft pull requests. I think it doesn't make much sense to run them, as further commits / force pushes are expected.
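In GitHub Actions this is typically a job-level condition such as `if: github.event.pull_request.draft == false`. The same check can be sketched with the GitHub CLI; the wrapper names below are illustrative, not an existing script:

```shell
# Hypothetical guard: skip CI work for draft PRs.
is_draft() {
  gh pr view "$1" --repo "$2" --json isDraft --jq '.isDraft'
}

maybe_run_ci() {
  if [ "$(is_draft "$1" "$2")" = "true" ]; then
    echo "draft PR, skipping CI"
  else
    echo "running CI"
  fi
}
```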

michallenc avatar Oct 18 '24 06:10 michallenc

Something That Bugs Me: Timeout Errors cost us precious GitHub Minutes. The remaining jobs get killed, and restarting these remaining jobs from scratch consumes extra GitHub Minutes. (The restart below cost us 6 extra GitHub Runner Hours.) (1) How do we retry these Timeout Errors? (2) Can we have Restartable Builds? It doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2). (3) Or should xtensa2 wait for the others to finish before it declares a timeout and dies? Hmmm...

Configuration/Tool: esp32s2-kaluga-1/lvgl_st7789
curl: (28) Failed to connect to github.com port 443 after 133994 ms: Connection timed out

https://github.com/apache/nuttx/actions/runs/11395811301/attempts/1
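On point (1): the GitHub CLI can rerun only the jobs that failed, which avoids rebuilding the targets that already passed. A minimal sketch, with the run ID taken from the workflow URL above (check that the `gh` version in use supports `gh run rerun --failed`):

```shell
# List the jobs that failed in a given workflow run.
list_failed_jobs() {
  gh run view "$1" --repo "${2:-apache/nuttx}" --json jobs \
    --jq '.jobs[] | select(.conclusion == "failure") | .name'
}

# Rerun only the failed jobs of that run, keeping the passing builds.
rerun_failed_jobs() {
  gh run rerun "$1" --repo "${2:-apache/nuttx}" --failed
}
```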

lupyuen avatar Oct 18 '24 07:10 lupyuen

@lupyuen: (2) Can we have Restartable Builds? Doesn't quite make sense to build everything from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)

It is possible to restart only failed tasks on GitHub :-)

If you mean that a task could restart where it left off... I'm not sure that's possible, because the underlying configuration could change after the update being verified, and things would need to be started from scratch? :-)

cederom avatar Oct 18 '24 12:10 cederom

11 Days To Doomsday: But we're doing much better already! In the past 24 hours, we consumed 36 Full-Time GitHub Runners. We're getting closer to the ASF Target of 25 Full-Time Runners! Today we shall:

  • Halve the Jobs for RISC-V, Xtensa and Simulator for Complex PRs
    https://github.com/apache/nuttx/pull/14400

  • Do the same for nuttx-apps repo
    https://github.com/apache/nuttx-apps/pull/2758

  • Our Merge Jobs are now at github.com/nuttxpr/nuttx

    ~~Reduce the Scheduled Merge Jobs to Two Per Day at 00:00 / 12:00 UTC (down from Four Per Day)~~

Hopefully we'll reach the ASF Target tomorrow, and ASF won't kill our servers no more! Thanks!

[Screenshot: 2024-10-19 7:15 AM]

lupyuen avatar Oct 18 '24 23:10 lupyuen

When NuttX merges our PR, the Merge Job won't run until 00:00 UTC and 12:00 UTC. How can we be really sure that our PR was merged correctly?

Let's create a GitHub Org (at no cost), fork the NuttX Repo and trigger the CI Workflow. (Which won't charge any extra GitHub Runner Minutes to NuttX Project!)

  • https://github.com/apache/nuttx/issues/14407

(I think this might also work if ASF shuts down our CI Servers. We can create many many orgs actually)
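A sketch of the fork-and-verify idea, assuming the fork lives under a separate org and its CI workflow declares a `workflow_dispatch` trigger (the org and workflow names are placeholders):

```shell
# Hypothetical: after a merge to master, sync a fork under another org and
# trigger its CI there, so the runner minutes are billed to that org, not ASF.
trigger_fork_ci() {
  fork="${1:-NuttX/nuttx}"
  gh repo sync "$fork" --source apache/nuttx &&
    gh workflow run build.yml --repo "$fork" --ref master
}
```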

lupyuen avatar Oct 19 '24 00:10 lupyuen

Sounds like a repo clone that will verify nuttx and nuttx-apps master independently twice a day?

cederom avatar Oct 19 '24 00:10 cederom

@cederom You read my mind :-)

lupyuen avatar Oct 19 '24 00:10 lupyuen