
Add optionality to presubmits

GMNGeoffrey opened this issue 2 years ago

Currently, the set of presubmits that run on PRs is hardcoded (all of them except those involving GPUs). Instead, we want the PR author to be able to specify whether they want the non-default set to run. Use cases:

  1. PR modifying only docs runs no CI
  2. PR modifying GPU code generation runs GPU tests
  3. PR affecting performance runs benchmarks (benchmarks not yet migrated to GitHub actions)

Another consideration is that some CI jobs may use resources where we're more concerned about RCE than we are on our generic Linux builders (e.g. physical benchmark devices, expensive GPU VMs), and we want some way to limit who is allowed to run those jobs. For now, all GitHub Actions runs require the author to be a collaborator or to get approval from one before they run, so that is sufficient. We'd be interested in relaxing that restriction to also allow previous contributors, though, in which case we will need to come up with our own solution.

Proposed solution:

Use git trailers in the PR description to indicate which jobs to run. These would be ci-extra, ci-skip, and ci-only, each containing a comma-separated list of job names. The first two would be mutually exclusive with the last. If ci-only is specified, that is the set of jobs to run, otherwise, ci-extra is added to the default set and then anything in ci-skip is removed. If the same job is specified in both ci-extra and ci-skip, we would return an error, as this is almost certainly a mistake.
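
To make the combination rules concrete, here's a minimal sketch of that resolution logic (the job names and the DEFAULT_JOBS set are made up for illustration, not the real job list):

```python
# Hypothetical job names, just for illustration.
DEFAULT_JOBS = {"build_all", "test_all", "build_docs"}


def resolve_jobs(ci_only, ci_extra, ci_skip):
    """Resolves the parsed trailer lists into the set of jobs to run."""
    if ci_only and (ci_extra or ci_skip):
        raise ValueError("ci-only is mutually exclusive with ci-extra/ci-skip")
    if ci_only:
        return set(ci_only)
    conflict = set(ci_extra) & set(ci_skip)
    if conflict:
        # Almost certainly a mistake, so fail loudly.
        raise ValueError(f"jobs listed in both ci-extra and ci-skip: {sorted(conflict)}")
    return (DEFAULT_JOBS | set(ci_extra)) - set(ci_skip)
```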

Wildcards: we could support Unix-shell-style wildcards in the job names, so someone could specify ci-extra: *gpu*. If we allow this, then it's really easy to specify no CI with ci-skip: *, but GitHub descriptions are markdown, which means that * would have a different meaning. I think we need a special magic string like ci-skip: all instead. There's some question of what this should mean when combined with the other trailers. What happens with ci-extra: foo + ci-skip: all? It could run just foo, run nothing, or error. Above I described adding ci-extra and then removing ci-skip, which would mean it runs nothing, but that seems unlikely to be what's wanted. Running only foo could be written ci-only: foo, but if we instead removed the ci-skip jobs first and then added ci-extra, we could get rid of ci-only entirely. I'm not sure what is clearest.
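
If we did go with shell-style wildcards, the standard-library fnmatch module would cover the matching; a rough sketch (job names again hypothetical), including the "all" magic string instead of a bare *:

```python
import fnmatch


def expand_patterns(patterns, all_jobs):
    """Expands shell-style patterns (e.g. '*gpu*') against the known job names."""
    matched = set()
    for pattern in patterns:
        if pattern == "all":  # special magic string instead of a bare '*'
            return set(all_jobs)
        matched.update(fnmatch.filter(all_jobs, pattern))
    return matched


# e.g. expand_patterns(["*gpu*"], {"build_all", "test_gpu"}) -> {"test_gpu"}
```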

GMNGeoffrey avatar Aug 09 '22 22:08 GMNGeoffrey

@pzread and @ScottTodd opinions?

GMNGeoffrey avatar Aug 09 '22 23:08 GMNGeoffrey

The proposed solution sounds reasonable to me. I imagine there would be a few common recipes modeled after the use cases you listed at the top (docs -> skip CI or just check that the website build works, run GPU or other platform tests, trigger benchmarks).

The labels that we use for benchmark triggering are an alternative, possibly complementary solution. Editing a PR description feels more intrusive than editing labels. I'd also consider anything in the PR description or git commits to be persistent (not really edited after authoring), while labels can be added and removed as needed (trigger once as a sanity check, then remove after results look good, possibly add new commits that don't run the extra checks).

ScottTodd avatar Aug 09 '22 23:08 ScottTodd

Yeah, labels don't scale well with the number of jobs though. I think it's also good to document "this commit went through these presubmits", so the persistence is a feature. For your use case, I would probably put a comment in the PR description pointing to the previous run (though I agree it's a bit awkward).

GMNGeoffrey avatar Aug 09 '22 23:08 GMNGeoffrey

For the trailers, I'm wondering if we can use only ci-enable and ci-skip. So

  1. If there is no trailer, we run the default set
  2. If ci-enable is present, we disable the default set, and only run the items specified in the list. ci-enable: defaults is equivalent to no trailer. ci-enable: defaults,extra items is equivalent to ci-extra: extra items in your proposal
  3. ci-skip removes whatever is in the enabled list or the default set (a sketch of this scheme follows this list)
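
A minimal sketch of that two-trailer scheme, just to pin down the semantics (job names are placeholders):

```python
DEFAULT_JOBS = {"build_all", "test_all", "build_docs"}  # placeholder names


def resolve_enabled_jobs(ci_enable, ci_skip):
    """ci_enable/ci_skip are the parsed trailer lists (empty if absent)."""
    if not ci_enable:
        enabled = set(DEFAULT_JOBS)  # no trailer: run the default set
    else:
        enabled = set(ci_enable)
        if "defaults" in enabled:  # 'defaults' expands to the default set
            enabled.remove("defaults")
            enabled |= DEFAULT_JOBS
    return enabled - set(ci_skip)
```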

I also feel that it's important to keep the history of trailers, especially for benchmarking. Right now we only post one benchmark report and replace it when rerun. I think ideally we should post one report for each trailer change, and link the report back to the trailer that triggered the run.

Would it be better if we required authors to post a new comment to change the trailers?

pzread avatar Aug 10 '22 16:08 pzread

Adding P1 here, please downgrade if needed!

allieculp avatar Aug 10 '22 18:08 allieculp

Your suggestion seems reasonable @pzread

ci-skip removes whatever is in the enabled list or the default set

Not sure if I'm understanding correctly, but I'd expect it to be ci-skip: foo, bar, not

ci-enable: foo, bar
ci-skip

which seems confusing.

We discussed this a bit more in our dev meeting and I think this is likely more complicated than we actually need. The main use cases we have are the ones listed above. Those can be solved by making GPU tests enabled by default and adding a "skip-ci" label (note that labels can only be applied by people with triage access, which should be fine for this use case).

GMNGeoffrey avatar Aug 10 '22 18:08 GMNGeoffrey

The GPU jobs take ~14 minutes (TF integrations) and ~5-7 minutes (main build). So that's about 20 min of a100 machine time per presubmit run. The jobs running on cpu machines take a combined ~100 minutes. An n1-standard-96 (the cpu machines we're using) costs ~$3.33 hourly and the a100 machines cost ~$2.93 hourly (all public information from the cloud console). We actually get a discounted rate in some internal accounting system, but the exact amount is non-public and I can't even look it up. I'll assume it's proportional to the public prices though (no idea if this is true). So this means that the GPU machines actually aren't way more expensive. It also means that each presubmit run costs $6.53. That's more expensive than it seems like it should be, but acceptable. In a week, we had 191 runs, costing about $1250. That is acceptable and we can run all the tests on presubmit and continue to work on efficiency, but mostly because it's worth it for developer productivity.
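
Spelling out that arithmetic with the public prices quoted above:

```python
A100_HOURLY = 2.93            # public hourly price of the a100 machines
N1_STANDARD_96_HOURLY = 3.33  # public hourly price of the cpu machines

gpu_cost = (20 / 60) * A100_HOURLY              # ~20 min of GPU time  -> ~$0.98
cpu_cost = (100 / 60) * N1_STANDARD_96_HOURLY   # ~100 min of CPU time -> ~$5.55
per_run = gpu_cost + cpu_cost                   # ~$6.53 per presubmit run
weekly = 191 * per_run                          # ~$1250 for 191 runs in a week
```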

GMNGeoffrey avatar Aug 10 '22 19:08 GMNGeoffrey

Oh also note that the main reason I was cagey about using a100 machines before was because we had pretty limited quota and so there was a hard cap independent of cost. I've secured us more quota though, so that's less of a concern now. We were also building the whole project on the GPU machine, but now we've got multi-stage builds, so we're not doing that. (just to document that I wasn't just being totally silly about this stuff :stuck_out_tongue:)

GMNGeoffrey avatar Aug 10 '22 19:08 GMNGeoffrey

So we have an issue with the label. When a PR is created with a label, the GitHub events are that it is opened and then it is labeled. That makes it impossible to open a PR and have it not run CI. That's a pretty important use case...

Options we have:

  1. Use a string in the PR description to skip CI instead of a label
  2. Don't run CI when a PR is opened. This means that someone would have to take some other action to trigger the CI. In some ways, that's nice because it can avoid running CI unnecessarily, but people might find that annoying. Events we can use are:
    • assigned
    • unassigned
    • labeled
    • unlabeled
    • opened
    • edited
    • closed
    • reopened
    • synchronize [basically on push to the PR branch]
    • converted_to_draft
    • ready_for_review
    • locked
    • unlocked
    • review_requested
    • review_request_removed
    • auto_merge_enabled
    • auto_merge_disabled

https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request

Option 1 is obviously easier to implement right now. We could keep the label as well, but I think it's more confusing to have multiple ways to do things, so I'd rather delete it.

Option 2 gets us into a slightly different world where we have two classes of presubmits: those that always run on each commit, and those that only run when requested and on some special events. I actually kind of like world 2 better in the long run. It potentially avoids running CI unnecessarily when people push all the time, and it allows creating a distinction between different classes of presubmit. In particular, it would allow us to draw a line between things that are small enough to run on GitHub-hosted infra, which we can run automatically for external contributors and on every commit, and things that require our own runners and can only be triggered by labeling the PR (which gives us some security guarantees).

In that world, we could run the smoke tests on push as well so that it wouldn't be necessary to create a PR in order to test them. We had some wonky things with duplicate statuses when we had stuff running on both pull_request and push events, so we'd need to fiddle with that, but it should be doable. I think we could run the larger set of tests on the labeled event, look for a certain label being added (e.g. "presubmit"), and have the action also remove the label (so it could be added again to re-trigger). The "skip-ci" label would also trigger a run that posted the success status. It could also run on auto_merge_enabled. I think it unfortunately couldn't run on ready_for_review or review_requested, because those can be triggered by anyone.
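
For illustration, a sketch of what the labeled-event handling could look like as a small script run by the privileged action (the "presubmit" label is from the paragraph above; everything else, including how the token is provided, is a hypothetical setup):

```python
import json
import os
import urllib.request

API = "https://api.github.com"


def remove_label(repo, pr_number, label, token):
    """Removes a label from the PR so it can be re-added to re-trigger CI."""
    url = f"{API}/repos/{repo}/issues/{pr_number}/labels/{label}"
    req = urllib.request.Request(url, method="DELETE", headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Gate: only proceed when the expected label was just added.
event = json.load(open(os.environ["GITHUB_EVENT_PATH"]))
if event.get("action") == "labeled" and event["label"]["name"] == "presubmit":
    remove_label(
        os.environ["GITHUB_REPOSITORY"],
        event["pull_request"]["number"],
        "presubmit",
        os.environ["GITHUB_TOKEN"],  # assumes the workflow exports a token
    )
    # ... then kick off the larger set of presubmit jobs.
```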

Another wrinkle is that we would probably need to use a separate check status instead of the "summary", because filtering based on label would post a "skipped" status, which counts as passing for some reason. It would also override a previous run. To avoid the individual CI jobs posting statuses that override things, we probably need to have them triggered by something other than pull_request, like via a workflow_dispatch. A pipeline with write access would then need to be triggered when the run completed and post (or not post) the correct status. That could be done either by busy-waiting on the workflow or with a separate workflow triggered on workflow_run. But then there's another problem: workflow_dispatch can only be run using a workflow file from the main repo, which means anything editing the workflow would have to be tested on a branch in the main repo :confused:
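
For reference, posting such a separate status would go through the commit status API; a sketch (the "ci-summary" context name is made up):

```python
import json
import urllib.request


def post_status(repo, sha, state, context, token, description=""):
    """Posts a commit status: state is 'success', 'failure', 'pending', or 'error'."""
    url = f"https://api.github.com/repos/{repo}/statuses/{sha}"
    body = json.dumps({"state": state, "context": context,
                       "description": description}).encode()
    req = urllib.request.Request(url, data=body, method="POST", headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 201 on success


# e.g. post_status("openxla/iree", head_sha, "success", "ci-summary", token)
```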

Some stuff with complicated workflows here: https://securitylab.github.com/research/github-actions-preventing-pwn-requests/

So having written all that, I think it's clear we want to go with option 1 for now and revisit option 2 later if desired

GMNGeoffrey avatar Aug 15 '22 20:08 GMNGeoffrey

Use a string in the PR description to skip CI instead of a label

This SGTM. I don't think splitting the CI into two classes of presubmits should be necessary for solving "I just want to edit some experimental / markdown / etc. files that I know shouldn't affect presubmits". It might be useful for "I want to run more expensive GPU tests / benchmarks" though.

ScottTodd avatar Aug 16 '22 17:08 ScottTodd

@GMNGeoffrey do you have more plans for this? The git trailers and current filtering both seem to be working pretty well now.

ScottTodd avatar Nov 11 '22 23:11 ScottTodd

I still frequently find myself wanting to only run one CI job or wanting to retrigger CI without having to push again (e.g. I added the "skip-ci" tag). This definitely is lower priority now though. I'm ok closing it too.

GMNGeoffrey avatar Nov 11 '22 23:11 GMNGeoffrey

Dumping the notes from an offline discussion:

As https://github.com/openxla/iree/issues/10042#issuecomment-1215759942 mentioned, using a separate check status instead of the "summary" is now the main blocker for adding optionality to CI; without it, it is hard, if not impossible, to show the correct check status on the PR (because skipped jobs will overwrite the previous results).

We need that to support (among other things):

  • Provide a more friendly way to re-trigger the benchmark pipeline (update git trailer, update labels)
  • No need to push an empty commit for skip-ci
  • Partially run the CI workflow if the new commit only changes docs
  • Better organize the check statuses (having 20 statuses in the check list makes it hard to find the failing jobs)

We also want to avoid using a runner to decide which workflows/jobs should run, as this logic might run very frequently (PR description updates, label changes, ...).

pzread avatar Feb 22 '23 19:02 pzread

After some experiments and discussions, we decided to follow the path below to achieve a more flexible CI workflow.

TL;DR

Our P0 is to support manually re-triggering the benchmark workflow instead of pushing a new commit. The first step is to change the main CI workflow (the workflow that builds and tests everything) to fetch the latest pull request description every time it starts. Therefore, in case you forget to add benchmark trailers or make a typo, you will be able to edit the trailers and use the GitHub UI to re-run the workflow.
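
A sketch of what "fetch the latest pull request description" could look like from inside the workflow (the benchmarks: trailer name is from this comment; the helpers themselves are illustrative, not the actual implementation):

```python
import json
import re
import urllib.request


def fetch_pr_description(repo, pr_number, token):
    """Fetches the current PR body so trailer edits made after the push are seen."""
    url = f"https://api.github.com/repos/{repo}/pulls/{pr_number}"
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("body") or ""


def parse_benchmark_trailer(body):
    """Returns the comma-separated values of a 'benchmarks:' trailer, if present."""
    match = re.search(r"^benchmarks:\s*(.+)$", body, re.MULTILINE | re.IGNORECASE)
    return [v.strip() for v in match.group(1).split(",")] if match else []
```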

First, edit the benchmark trailer benchmarks: ... in your pull request description.

Then go to the Checks page of your pull request. [screenshot]

Click the CI workflow on the left panel. [screenshot]

On the CI workflow page, if the workflow is finished, you can click Re-run all jobs to trigger it again. [screenshot]

If the workflow is still running, you can Cancel workflow first. Once it stops (might take ~50s), you can click the Re-run all jobs button (not Re-run failed jobs). [screenshots]

The workflow will still repeat all steps that have been done in the previous run, which is slow and wasteful, but we have plans to improve this.

This change should happen in a few days (PR #12383). It is far from ideal, so we planned the next steps below.

Next Step (P0.5)

The next step is to support easier ways to re-trigger the benchmarks. We will likely implement this as "more ways to re-run the main CI workflow", with the benchmark workflow remaining part of the CI workflow, unless we figure out a better solution. Here are potential events that would trigger a re-run:

  • Post a new comment on the pull request with benchmarks: xxx
  • Benchmark trailer in the description is edited and changed
  • Add/remove benchmark labels
  • Support using workflow dispatch UI button to trigger the benchmark run

These triggers can be implemented by having a privileged workflow monitor the pull request and automatically re-run the CI workflow when the conditions are satisfied (through the rerequest API).
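
A sketch of the re-run via the check-suite rerequest endpoint (this assumes the privileged workflow has already looked up the check suite id for the head commit):

```python
import urllib.request


def rerequest_check_suite(repo, check_suite_id, token):
    """Asks GitHub to re-run all check runs in the given check suite."""
    url = (f"https://api.github.com/repos/{repo}"
           f"/check-suites/{check_suite_id}/rerequest")
    req = urllib.request.Request(url, data=b"", method="POST", headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 201 on success
```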

More visible UI feedback can also be posted by the privileged workflow, for example adding emoji to the benchmark comment to show that the benchmark request has been taken care of.

Some of these features should be implemented in 1~2 weeks.

Long-term Plan (P2)

Re-running the main CI workflow and repeating all of the unrelated, already-finished steps is very wasteful. So we want to implement logic to fetch the status from the previous run and skip the successful steps, reusing the artifacts built in the previous run. We can further extend this approach to reuse artifacts even when a new commit is pushed but doesn't actually affect the build outcome.

Why The Solution is So Strange

An intuitive design for this kind of system would be:

  • Define a standalone benchmark workflow
  • When the CI workflow is done, trigger the benchmark workflow and pass the benchmark artifacts
  • Have a manual button / other triggers to trigger the benchmark workflow

But there are two key missing features in GitHub Actions:

  1. No API to wait for another workflow
  2. No API to trigger a workflow and run it in the pull request context

If there were an API for feature 1, we could have the standalone benchmark workflow listen to a variety of trigger events and then wait on the main CI workflow to finish its build steps. You could also re-run just the standalone benchmark workflow from the UI when needed. There is a workaround: loop and poll the status of the other workflow every few seconds, but that occupies a runner, which is quite wasteful.
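
The polling workaround mentioned here would look roughly like this; it works, but the runner executing it sits idle the whole time:

```python
import json
import time
import urllib.request


def wait_for_workflow_run(repo, run_id, token, poll_seconds=30):
    """Polls a workflow run until it completes and returns its conclusion."""
    url = f"https://api.github.com/repos/{repo}/actions/runs/{run_id}"
    while True:
        req = urllib.request.Request(url, headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        })
        with urllib.request.urlopen(req) as resp:
            run = json.load(resp)
        if run["status"] == "completed":
            return run["conclusion"]  # e.g. "success", "failure", "cancelled"
        time.sleep(poll_seconds)
```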

With feature 2, instead of the benchmark workflow waiting on the main CI workflow, the main CI workflow could trigger the benchmark workflow when the build steps are done. If the benchmark workflow is triggered by other events while the main CI workflow is still running, it could simply do nothing and wait for the CI workflow to trigger it later.

Feature 2 seems doable with the GitHub API but actually has lots of issues. There are basically 3 methods to trigger another workflow:

  1. workflow_dispatch: An API to trigger a new workflow run (see the sketch after this list). The problem is that the callee workflow can only be run in a repo branch/tag context.
  2. workflow_run: An event that triggers a workflow when another workflow finishes. The problem is that the callee workflow can only run in the default branch context.
  3. Call a reusable workflow: Call another workflow from within a workflow. The problem is that it basically just inlines the other workflow into the current workflow.
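
For reference, method 1 boils down to a call like the one below; note that ref has to be a branch or tag in the repo, which is exactly the context limitation described next (the workflow file name and inputs are made up):

```python
import json
import urllib.request


def dispatch_workflow(repo, workflow_file, ref, inputs, token):
    """Triggers a workflow_dispatch run of workflow_file on branch/tag ref."""
    url = (f"https://api.github.com/repos/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    body = json.dumps({"ref": ref, "inputs": inputs}).encode()
    req = urllib.request.Request(url, data=body, method="POST", headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 204 on success


# e.g. dispatch_workflow("openxla/iree", "benchmark.yml", "main",
#                        {"pr_number": "123"}, token)  # hypothetical workflow/inputs
```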

Methods 1 and 2 don't work for our use case because the callee workflow runs in the context of the IREE repo's branches. The context is part of the GitHub Actions security model that controls permissions and isolates caches, and it depends on which branch the workflow runs on. A workflow triggered by a pull_request event from a forked pull request runs in its own context and only has read permissions. Running workflows/code from forked pull requests in the repo's branch contexts is strongly discouraged, as those contexts are privileged (see Preventing pwn requests).

Method 3 doesn't do what we want because the benchmark workflow still runs as part of the main CI workflow. Cancelling the benchmark workflow jobs would cancel the CI workflow and fail it.

Considering that the other methods are also complicated, it doesn't look too bad to re-run the main CI workflow when we want to re-trigger the benchmarks and to implement a smart way to reuse the outcomes from the previous run. The latter part can also be extended to build an efficient CI that reuses artifacts across different commits.

pzread avatar Mar 01 '23 02:03 pzread

This still feels like we're trying to fit a square peg into a round hole by tightly coupling build/test CI with benchmarking CI.

Our P0 is to support manually re-triggering the benchmark workflow instead of pushing a new commit. The first step is to change the main CI workflow (the workflow that builds and tests everything) to fetch the latest pull request description every time it starts. Therefore, in case you forget to add benchmark trailers or make a typo, you will be able to edit the trailers and use the GitHub UI to re-run the workflow.

This would be solved just as well by workflow_dispatch-ing a separate benchmark workflow that fetches the build artifacts from a PR, no? I'm worried that we're twisting how GitHub Actions works far too much in order to make something "fully automated". If someone already needs to edit a PR description and click through some UI, what do we gain by staying within the same workflow file?

ScottTodd avatar Mar 01 '23 18:03 ScottTodd

This still feels like we're trying to fit a square peg into a round hole by tightly coupling build/test CI with benchmarking CI.

Our P0 is to support manually re-triggering the benchmark workflow instead of pushing a new commit. The first step is to change the main CI workflow (the workflow that builds and tests everything) to fetch the latest pull request description every time it starts. Therefore, in case you forget to add benchmark trailers or make a typo, you will be able to edit the trailers and use the GitHub UI to re-run the workflow.

This would be solved just as well by workflow_dispatch-ing a separate benchmark workflow that fetches the build artifacts from a PR, no? I'm worried that we're twisting how GitHub Actions works far too much in order to make something "fully automated". If someone already needs to edit a PR description and click through some UI, what do we gain by staying within the same workflow file?

One major problem with having a separate workflow is that we need a mechanism to maintain the job dependencies between the workflows. Within a single workflow, job dependencies are supported. With a separate workflow, you either need an API or loop polling to wait on the other workflow, or you just let the workflow fail if its dependencies haven't finished.

As we mentioned above, we didn't find an API to wait on a workflow, and loop polling is wasteful. If you let the benchmark workflow fail because the CI workflow isn't finished, then either the CI workflow needs to trigger the benchmark workflow again when it finishes, which is not well supported by the GitHub API due to the context issues, or users need to trigger the workflow again themselves after the CI workflow is done.

Personally I think asking users to always trigger the benchmark workflow themselves is not ideal, since once the benchmark workflow is triggered, it takes another ~30 minutes to finish. Users might waste time if they miss the moment the CI workflow finishes, and then they need to wait another 30 minutes for the benchmark results.

pzread avatar Mar 01 '23 18:03 pzread

Unassigning myself from issues that I'm not actively working on

GMNGeoffrey avatar Apr 19 '23 22:04 GMNGeoffrey