Cirrus CI free usage going away - job runtime issues & credits

Open rgommers opened this issue 2 years ago • 38 comments

This is getting kinda annoying: [skip cirrus] is broken and the wheel build and other CI jobs are running way too often. It was probably broken by the addition of other logic, like always triggering on PRs with a build-related label.

rgommers avatar Jul 28 '23 16:07 rgommers

What do you want to happen with the build label? I use it when I want to test wheel builds. It is a bit annoying, as simply adding a label will retrigger the builds.

charris avatar Jul 28 '23 16:07 charris

Labels are for indicating what a PR or issue is about, so I add it to build-related PRs. And running wheel builds by default is fine, but the [skip cirrus] should supersede that (just like it does for [skip actions] on GHA).

rgommers avatar Jul 28 '23 16:07 rgommers

but the [skip cirrus] should supersede

Fair enough, but some labels do trigger wheel builds, so they aren't only about what the PR or issue is about. Not sure how we could change that unless we want to add a [build wheels] option.

charris avatar Jul 28 '23 16:07 charris

Why [skip cirrus] doesn't work is explained by https://github.com/numpy/numpy/blob/7fc72776b972bfbfdb909e4b15feb0308cf8adba/.cirrus.star#L23-L25

It looks like it's not the build label though, we run wheel builds on Cirrus always. E.g. look at this doc-only PR with no labels: gh-24277. This wastes a huge amount of compute time.

rgommers avatar Jul 28 '23 20:07 rgommers

Cirrus is a bit different. One problem is that it cannot be manually run, I think it needs something in the *.yml file for that. I also think (only) wheels are built and tested, I'm not sure how a build can be made without that. It is on my problem list, but not annoying enough to spend time on it. Maybe @andyfaff has some thoughts.

charris avatar Jul 28 '23 20:07 charris

.cirrus.star is triggered to run with a lot of GH events on the numpy/numpy repo, e.g. if there are commits to PRs, commits to branches, PRs are opened, labels are attached to PRs, tags are pushed, merges, etc.

Here is the current logic for numpy's cirrus CI, as contained in .cirrus.star:

  • only run jobs on the numpy/numpy repo.
  • (do a wheels build if a 'nightly' cron job is requested)
  • for all other triggers (and there can be a lot of them) the configuration script looks at the commit message of the SHA, provided by CIRRUS_CHANGE_IN_REPO, for the event. If there is [skip cirrus] or [skip ci] in the message then don't run any CI.
  • otherwise do the wheel build and the macosx_arm64 runs.

This logic will dictate that the wheels always get built if there is no [skip cirrus] in the commit message.

[skip cirrus] is broken and the wheel build and other CI jobs are running way too often.

I'm pretty sure [skip cirrus] works for commits to PRs. But for other events if the commit message for the SHA doesn't contain those words then the wheel build and macosx_arm64 will run. For example adding labels to a PR will trigger these runs if the last commit to the PR doesn't contain the magic words.

It's not clear to me from previous comments what additional logic is being requested. The following environment variables may be useful in reducing the number of runs made:

  • CIRRUS_PR | PR number if current build was triggered by a PR - we may be able to query Github as to what occurred in the PR to trigger cirrus-ci. We already query GH in cirrus.star.
  • CIRRUS_PR_LABELS | comma-separated list of PR's labels if current build was triggered by a PR - examine the labels to see if 36 - Build is present. e.g. if you look in the debugging info for https://cirrus-ci.com/build/4758553810960384 you'll see that label present.
  • CIRRUS_TAG | Tag name if current build was triggered by a new tag.

Possible extra logic that could be done:

  • don't automatically do a wheel build unless:
      • it's a nightly job
      • scan SHA message for [wheel build], do wheel build if present
      • scan CIRRUS_PR_LABELS for 36 - Build, do wheel build if present
      • build wheels for a tag event
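The extra logic listed above could be sketched as a single predicate. This is only an illustration in plain Python, not the actual .cirrus.star code: the function name should_build_wheels is hypothetical, env is assumed to be a dict of Cirrus environment variables, and it assumes the nightly cron job is exposed as CIRRUS_CRON == "nightly".

```python
def should_build_wheels(env, commit_message):
    """Return True if any of the proposed wheel-build triggers applies.

    `env` is assumed to be a dict of Cirrus environment variables and
    `commit_message` the message of the commit behind the event.
    """
    # Nightly cron job (assumes the cron job is named "nightly").
    if env.get("CIRRUS_CRON", "") == "nightly":
        return True

    # Explicit request in the commit message.
    if "[wheel build]" in commit_message:
        return True

    # CIRRUS_PR_LABELS is a comma-separated list of the PR's labels.
    labels = [label.strip() for label in env.get("CIRRUS_PR_LABELS", "").split(",")]
    if "36 - Build" in labels:
        return True

    # CIRRUS_TAG is only set when the build was triggered by a new tag.
    if env.get("CIRRUS_TAG", ""):
        return True

    return False
```

With this shape, a plain push with none of the triggers present returns False, so no wheel build would be scheduled by default.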

Relevant links:

  • CIRRUS_CHANGE_IN_REPO
  • numpy's .cirrus.star config
  • numpy's wheel build config
  • numpy's macosx_arm64 config
  • cirrus environment variables

EDIT: [skip cirrus] wasn't working, fixed in #24285.

andyfaff avatar Jul 29 '23 00:07 andyfaff

Having just said that I see an issue at https://github.com/numpy/numpy/blob/main/.cirrus.star#L27. I'll open a PR.

andyfaff avatar Jul 29 '23 00:07 andyfaff

You can examine what we requested from the GH API in 24282. The github request is: https://api.github.com/repos/numpy/numpy/git/commits/cad8595a8c86c173285d82b61f6797ff24324364.

This is what is returned:

{
  "sha": "cad8595a8c86c173285d82b61f6797ff24324364",
  "node_id": "C_kwDOAA3dP9oAKGNhZDg1OTVhOGM4NmMxNzMyODVkODJiNjFmNjc5N2ZmMjQzMjQzNjQ",
  "url": "https://api.github.com/repos/numpy/numpy/git/commits/cad8595a8c86c173285d82b61f6797ff24324364",
  "html_url": "https://github.com/numpy/numpy/commit/cad8595a8c86c173285d82b61f6797ff24324364",
  "author": {
    "name": "Andrew Nelson",
    "email": "[email protected]",
    "date": "2023-07-29T00:15:13Z"
  },
  "committer": {
    "name": "Andrew Nelson",
    "email": "[email protected]",
    "date": "2023-07-29T00:15:13Z"
  },
  "tree": {
    "sha": "1c9f90fbbcff3162542b6663e0fe75a86e819bb4",
    "url": "https://api.github.com/repos/numpy/numpy/git/trees/1c9f90fbbcff3162542b6663e0fe75a86e819bb4"
  },
  "message": "CI: correct URL in cirrus.star [skip cirrus]",
  "parents": [
    {
      "sha": "422854fa8dc501e5fcbd713093fdee04e7e9ebb8",
      "url": "https://api.github.com/repos/numpy/numpy/git/commits/422854fa8dc501e5fcbd713093fdee04e7e9ebb8",
      "html_url": "https://github.com/numpy/numpy/commit/422854fa8dc501e5fcbd713093fdee04e7e9ebb8"
    }
  ],
  "verification": {
    "verified": false,
    "reason": "unsigned",
    "signature": null,
    "payload": null
  }
}

The cirrus CI didn't run because [skip cirrus] is in dct['message'].
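A minimal sketch of that check, written as plain Python rather than Starlark. The helper name should_skip is hypothetical, and the payload below is a trimmed copy of the response shown above that keeps only the fields the check needs.

```python
import json

# Markers that suppress the Cirrus run, as described in this thread.
SKIP_MARKERS = ("[skip cirrus]", "[skip ci]")


def should_skip(commit_payload):
    """Return True if the commit message contains one of the skip markers."""
    message = commit_payload.get("message", "")
    return any(marker in message for marker in SKIP_MARKERS)


# Trimmed version of the GitHub API response for the commit above.
payload = json.loads(
    """
    {
      "sha": "cad8595a8c86c173285d82b61f6797ff24324364",
      "message": "CI: correct URL in cirrus.star [skip cirrus]"
    }
    """
)

print(should_skip(payload))  # True
```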

andyfaff avatar Jul 29 '23 00:07 andyfaff

I think we should be able to limit the wheel build to whatever is in the GHA wheels build, if that's desired.

EDIT: apart from manual trigger, not sure how to do that.

andyfaff avatar Jul 29 '23 00:07 andyfaff

Thanks for the fix @andyfaff!

Given the major reduction in free resources available as of 1 Sep (see https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/), I think we have a lot more work to do here unfortunately (and may consider buying some credits).

Regarding the label-based trigger, I think there are two things wrong with it:

  • the Build label is wrong for this, it should be a dedicated and clearly named label like trigger-cirrus
  • it's a problem that this label, once added to a PR, tends to stay on it and then the wheel builds run for every subsequent push. Typically what's intended is a one-off "check wheel builds". It doesn't really make sense (resource-usage wise) to run the full battery of wheel builds on every push to a PR.

Given the above, and given that our resource usage at the current rate (see screenshot below) is completely unsustainable and would run at ~$2,500/month (or ~$1,400 after the upcoming price reductions also announced in the blog post linked above) if we had to pay for it from 1 Sep, I'd much prefer to get rid of label-based triggering completely. Manual wheel build triggers should be rare and reserved for maintainers who know what they are doing and are able to push an empty commit with the correct commands in the commit message.

image

CPU usage is also bad on cibuildwheel jobs (and note that credits go per CPU-minute, i.e. per core rather than per job); we need to make sure pytest uses 2 cores:

image

I'll note that on jobs with 2 CPUs, using -n auto also isn't great: pytest-xdist translates that to -n4 rather than -n2, and its parallel scaling is terrible, with performance already tending to get worse from 4 workers. -n2 on 2-core jobs is already far from linear, 3 is getting questionable but typically still gains, and >=4 quickly decreases performance.

Example log from a recent macos_arm64_test run showing we get 4 pytest-xdist workers:

$ /Users/admin/numpy-dev/bin/python3.10 -m pytest --rootdir=/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cirrus-ci-build/build-install/usr/lib/python3.10/site-packages -n auto -m 'not slow' numpy
============================= test session starts ==============================
platform darwin -- Python 3.10.6, pytest-7.4.0, pluggy-1.2.0
rootdir: /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cirrus-ci-build/build-install/usr/lib/python3.10/site-packages
configfile: ../../../../../pytest.ini
plugins: hypothesis-6.82.0, xdist-3.3.1
created: 4/4 workers
4 workers [34198 items]
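One way to avoid relying on -n auto is to pass an explicit worker count derived from the number of CPUs the task was configured with. The sketch below is only an illustration: the helper names and the default cap of 2 are assumptions based on the scaling observations above, not something taken from numpy's CI config.

```python
def xdist_workers(task_cpus, cap=2):
    """Pick an explicit pytest-xdist worker count instead of `-n auto`.

    `task_cpus` is the CPU count the CI task was configured with (an
    assumption: it is passed in explicitly because auto-detection can
    report more cores than the task was given). `cap` limits the
    worker count, since scaling degrades with more workers.
    """
    return max(1, min(task_cpus, cap))


def pytest_args(task_cpus):
    """Build the pytest argument list with an explicit worker count."""
    return ["-n", str(xdist_workers(task_cpus)), "-m", "not slow", "numpy"]


print(pytest_args(4))  # ['-n', '2', '-m', 'not slow', 'numpy']
```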

Here is the full list of jobs and runtimes for a single run:

image

That's a total of ~222 CPU minutes for wheel builds per run, split into

  • 77.5 * 2 = 144 min on Linux aarch64 = $0.145 per run
  • 33.5 * 4 = 134 min on macOS = $0.67 per run

So each wheel build costs about $0.80 each time it's triggered - this is a lot. We also have issues with some tests in the full test suite that need fixing (e.g., the slow typing tests shouldn't be run by default, they're the same on all platforms and take well over a minute). But most importantly, we should not be triggering wheel builds so much, they're only very rarely useful.

rgommers avatar Jul 30 '23 09:07 rgommers

As you can see in gh-24289, that PR, which only tweaked a code comment in a meson.build file and to which I added the 04 - Documentation label, still triggered a full set of wheel builds on Cirrus (I cancelled them manually after they started running).

rgommers avatar Jul 30 '23 09:07 rgommers

And then after a merge to main, it's running yet again: https://cirrus-ci.com/build/5871472732798976.

rgommers avatar Jul 30 '23 10:07 rgommers

This is the relevant code in .cirrus.star:

    # Obtain commit message for the event. Unfortunately CIRRUS_CHANGE_MESSAGE
    # only contains the actual commit message on a non-PR trigger event.
    # For a PR event it contains the PR title and description.
    SHA = env.get("CIRRUS_CHANGE_IN_REPO")
    url = "https://api.github.com/repos/numpy/numpy/git/commits/" + SHA
    dct = http.get(url).json()
    # if "[wheel build]" in dct["message"]:
    #     return fs.read("ci/cirrus_wheels.yml")

    if "[skip cirrus]" in dct["message"] or "[skip ci]" in dct["message"]:
        return []

    # add extra jobs to the cirrus run by += adding to config
    config = fs.read("tools/ci/cirrus_wheels.yml")
    config += fs.read("tools/ci/cirrus_macosx_arm64.yml")

    return config

I don't see any label-based triggers here, nor in tools/ci/cirrus_*. I think what we need here is to (a) uncomment the lines with [wheel build], and (b) delete the line config = fs.read("tools/ci/cirrus_wheels.yml").
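A rough sketch of how the selection could look after changes (a) and (b), in plain Python rather than Starlark. The function select_configs is hypothetical and returns config file paths instead of calling fs.read so that it runs on its own. It keeps the order of checks from the snippet above and assumes the wheel config lives at tools/ci/cirrus_wheels.yml, whereas the commented-out line uses ci/cirrus_wheels.yml.

```python
WHEELS_CONFIG = "tools/ci/cirrus_wheels.yml"
MACOSX_ARM64_CONFIG = "tools/ci/cirrus_macosx_arm64.yml"


def select_configs(commit_message):
    """Return the config files to load for a given commit message."""
    # (a) Wheel builds only run when explicitly requested.
    if "[wheel build]" in commit_message:
        return [WHEELS_CONFIG]

    # Skip markers suppress everything else.
    if "[skip cirrus]" in commit_message or "[skip ci]" in commit_message:
        return []

    # (b) The wheel config is no longer added by default, so only
    # the macosx_arm64 job remains.
    return [MACOSX_ARM64_CONFIG]
```

Under this sketch, an ordinary commit only triggers the macosx_arm64 job, and the full battery of wheel builds needs the explicit commit-message marker.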

rgommers avatar Jul 30 '23 10:07 rgommers

@rgommers, see #24286

andyfaff avatar Jul 30 '23 10:07 andyfaff

I'll try to fix some of the test suite invocation and runtime issues. EDIT: see gh-24291

rgommers avatar Jul 30 '23 10:07 rgommers

I'm currently experimenting with ccache for scipy builds (which use meson). Would the numpy macosx_arm64 benefit from this?

andyfaff avatar Aug 01 '23 01:08 andyfaff

I think so - not by much though, given that the whole build is less than a minute and ~10 seconds of that is the configure stage. So if ccache helps by a factor of ~2x, it may save 20 sec or so.

rgommers avatar Aug 01 '23 08:08 rgommers

Is there a way to tie the cirrus CI builds into the successful run of the smoke test from github actions?

mattip avatar Aug 01 '23 09:08 mattip

@mattip , I'm not sure. It might be possible to have manual triggering if desired, https://cirrus-ci.org/guide/writing-tasks/#manual-tasks

andyfaff avatar Aug 01 '23 12:08 andyfaff

I think this can probably be closed now

andyfaff avatar Aug 12 '23 09:08 andyfaff

The skip/run logic is fixed (thanks!), but gh-24291 still needs finishing and then we need to deal with Cirrus CI credits. So let me re-title this issue rather than close it.

rgommers avatar Aug 12 '23 09:08 rgommers

Current state after 12 days in August - this is looking pretty good, ~3x over the free limit:

image

We haven't done many wheel builds in August though, and we do need those soon for the 1.26.x releases. Finishing up gh-24291 should be useful there. And then we'll probably end up with an O($150/month) bill; we can then figure out whether we're happy with that and, if so, the logistics of paying it.

rgommers avatar Aug 12 '23 09:08 rgommers

We're 19 credits away from an outage, so at this rate another 5 days or so. I'll have a look at buying some credits or wiring up a credit card tomorrow.

rgommers avatar Sep 06 '23 20:09 rgommers

Cirrus upped the free credits from 40 to 50, and we're at 41 now - so no problems so far. I've bought a bunch of credits and opened gh-24695 to enable using them.

rgommers avatar Sep 13 '23 16:09 rgommers

After 1.5 days of usage we used 1.02 credits. It's actually quite nice that you get to see how much each run costs:

image

macOS arm64 is a little expensive 🤔. This is what the docs say right now:

  • 1000 minutes of 1 virtual CPU for Linux platform for 3 compute credits
  • 1000 minutes of 1 virtual CPU for FreeBSD platform for 3 compute credits
  • 1000 minutes of 1 virtual CPU for Windows platform for 4 compute credits
  • 1000 minutes of 1 Apple Silicon CPU for 15 compute credits

I had interpreted the Apple Silicon CPU as being the whole CPU, not a CPU core. Since you can only get an instance with 4 cores, I thought pricing would end up similar to that for Windows - but it's 4x more.
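To make the per-core pricing concrete, here is a small sketch using the rates quoted from the docs above. The function name job_credits and the 10-minute, 4-core example job are made up for illustration.

```python
# Compute credits per 1000 CPU-minutes, as quoted from the docs above.
CREDITS_PER_1000_CPU_MINUTES = {
    "linux": 3,
    "freebsd": 3,
    "windows": 4,
    "macos_arm64": 15,
}


def job_credits(platform, wall_minutes, cpus):
    """Credits consumed by one job, billed per CPU-minute (per core)."""
    cpu_minutes = wall_minutes * cpus
    return cpu_minutes * CREDITS_PER_1000_CPU_MINUTES[platform] / 1000


# Hypothetical 10-minute job on a 4-core instance.
macos = job_credits("macos_arm64", 10, 4)
windows = job_credits("windows", 10, 4)
print(macos, windows, macos / windows)
```

For the same wall time and core count, the macOS job comes out at 15/4 = 3.75 times the Windows cost, which is the roughly 4x difference mentioned above.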

These jobs were all from maintainers; the external contributor PRs continue to run jobs but don't consume credits. I guess we'll have to see what happens when Cirrus starts to enforce the free credit limit.

So right now the $1/day consumption isn't too worrying, but if consumption goes up we may have to think about redoing how we trigger the macOS job.

rgommers avatar Sep 15 '23 09:09 rgommers

For the record, the NumPy Steering Council signed off on my proposal to spend credits - up to max $200/month.

My goal would be to stay below $100/month, and that seems to be feasible. And the invoicing and consumption reporting seems reasonably smooth, so all good so far.

rgommers avatar Sep 15 '23 09:09 rgommers

There are a couple more things to try: manual triggering of the Mac run, or a cron job, e.g. every couple of days.

andyfaff avatar Sep 15 '23 12:09 andyfaff

We used $45 in the first 28 days. I just added $99, so we're good for quite a while now. The consumption is close to what I estimated before.
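As a back-of-the-envelope check on how long a top-up lasts, the sketch below projects the runway from the average daily spend. The helper runway_days is hypothetical, and it assumes the $99 just added is the whole available balance and that consumption stays at the observed rate.

```python
def runway_days(balance, spent, days_elapsed):
    """Days until `balance` runs out at the observed average daily spend."""
    daily_rate = spent / days_elapsed
    return balance / daily_rate


# $45 spent over the first 28 days, $99 of credits available.
print(round(runway_days(99, 45, 28), 1))  # 61.6
```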

rgommers avatar Oct 11 '23 20:10 rgommers

We're at $0 now. There's a couple of issues:

  • Credit consumption went too fast. We did not use 99 credits in the last 35 days, more like 40.
  • SciPy has the opposite issue, zero credits were subtracted in the last 6 weeks. Since I paid both with the same credit card, it looks like the SciPy credit usage got subtracted from the NumPy funds 🤔.
  • I cannot add more credits right now, somehow Cirrus is not liking my credit card anymore.

I have to follow up with them, but in the meantime CI jobs may stop running.

rgommers avatar Nov 16 '23 21:11 rgommers

If worst comes to worst, I also have a credit card :wink:

charris avatar Nov 16 '23 22:11 charris