test-infra icon indicating copy to clipboard operation
test-infra copied to clipboard

Stop CI jobs on PyTorch PRs with humongous stacks

Open huydhn opened this issue 1 year ago • 4 comments

We have been impacted by this several times when devs unknowingly create a large stack of size 10+:

  • The stack will hook up a big chunk, may be all of CI resources, preventing other people to run their jobs.
  • In addition, a large stack like https://github.com/pytorch/pytorch/pull/116745 could also cause failures further down the line in diff train, i.e. D52549880

We need to take more active action here to prevent the issue from happening. Some thoughts:

  • We could create a CI check running before other jobs stopping them from running in the first place when the stack size is larger than a threshold
  • or updating ghstack to automatically add [no ci] into the commit message for large stack https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs

In any cases, a clear message will need to be communicated with the owner of the stack about what happen (CI not running) and what they could do to fix it (squash commits, breaking up the stack)

cc @malfet @seemethere @izaitsevfb @kit1980 @clee2000 @PaliC

huydhn avatar Jan 05 '24 03:01 huydhn

I think it's correct to not mess with ghstack as it's technically not in our control and this problem is specific to pytorch.

Another option is to have pytorchbot look for ghstack prs which are over 10 commits or so and cancel all workflows on those PRs.

PaliC avatar Jan 09 '24 00:01 PaliC

One of the reasons why I submitted all those PRs in a stack was because I wanted signal on all of them at the same time. If we are going to limit resources, then we should do it for a reasonable stack size (e.g. over 10). Otherwise devs will be incentivized to submit 10 different PRs instead of in a stack, and then we have the same problem.

zou3519 avatar Feb 01 '24 15:02 zou3519

@zou3519 there should be a reasonable balance between CI overhead associated with rebasing such PRs and easy of development. I run some stats: both mean and median stack length was under 5. See https://github.com/pytorch/builder/blob/add4488dcc3504b9e58bd470a1041501a294e7d2/analytics/github_analyze.py#L376C43-L377C1 if you want to run the analysis yourself

malfet avatar Feb 01 '24 18:02 malfet

We probably don't want mean/median, because that means 50% of people will need to worry about not getting signals on their diff stacks? How does the p90 or p80 look?

zou3519 avatar Feb 01 '24 23:02 zou3519