Cancelled jobs on CI are still not handled correctly

Open huydhn opened this issue 11 months ago • 6 comments

There is feedback that cancelled signals on CI are still showing up. This causes confusion and also blocks merges. For example, https://github.com/pytorch/pytorch/pull/121522#issuecomment-1986286960

Looking at the workflow summary https://github.com/pytorch/pytorch/actions/runs/8207746512, it's clear that the workflow was cancelled by its concurrency rule:

Canceling since a higher priority waiting request for 'linux-binary-libtorch-pre-cxx11-ciflow/trunk/121522-false-false' exists

If we can query this information, it should be a reliable way to handle cancelled signals on CI.
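As a rough sketch of what "querying this information" could look like once the message text is in hand: the function below only parses the message string quoted above. How the message is retrieved from GitHub is out of scope here, and the function name and the assumption that the wording stays stable are illustrative, not part of test-infra.

```python
import re
from typing import Optional

# Pattern copied from the message quoted above. The exact wording is an
# assumption; GitHub may change it, so this is a heuristic, not a contract.
_CONCURRENCY_CANCEL_RE = re.compile(
    r"^Canceling since a higher priority waiting request for '(?P<group>[^']+)' exists"
)


def concurrency_group_from_message(message: str) -> Optional[str]:
    """Return the concurrency group name if the message reports a
    concurrency-rule cancellation, otherwise None (hypothetical helper)."""
    match = _CONCURRENCY_CANCEL_RE.match(message.strip())
    return match.group("group") if match else None
```

A non-`None` result would indicate the workflow was superseded by its concurrency rule rather than cancelled by a person.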

huydhn avatar Mar 11 '24 23:03 huydhn

AI: Cancelled signals should be surfaced in a clear way: 1) if the job is cancelled by the user, we should say that the merge is cancelled; 2) if the job is cancelled because a higher priority job runs, it shouldn't show up as a failure.

huydhn avatar Mar 12 '24 21:03 huydhn

To clarify the ask, would it be correct to say we want:

  1. Jobs cancelled by the user should be marked as "cancelled by user" by Dr. CI and mergebot
  2. Jobs cancelled due to infra reasons (like higher priority jobs) should be marked as "cancelled by infra" in Dr. CI and mergebot
  3. For both of the above, mergebot should continue failing the merge, but give a more precise reason for why it's blocking the merge

ZainRizvi avatar Apr 05 '24 21:04 ZainRizvi

The first point is correct: when jobs are cancelled by the user, we want them to show up as failures and block the merge. However, if the jobs are cancelled by a higher priority request, as in the example above, we don't want them to show up on Dr. CI. Instead, we need to use the status of the newer set of jobs.
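A minimal sketch of that rule, assuming each job record already carries a cancellation reason. The `Job` type, `CancelReason` enum, and `started_at` ordering key are hypothetical names for illustration and do not reflect the actual Dr. CI or mergebot data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class CancelReason(Enum):
    NONE = "none"
    USER = "user"
    CONCURRENCY = "concurrency"  # cancelled by a higher priority request


@dataclass(frozen=True)
class Job:
    name: str
    started_at: int  # hypothetical ordering key; larger means newer
    conclusion: str  # e.g. "success", "failure", "cancelled"
    cancel_reason: CancelReason = CancelReason.NONE


def jobs_to_report(jobs: Iterable[Job]) -> List[Job]:
    """Drop concurrency-cancelled jobs, then keep the newest job per name."""
    newest: Dict[str, Job] = {}
    for job in jobs:
        if job.cancel_reason is CancelReason.CONCURRENCY:
            continue  # superseded by a newer run, so not a real signal
        current = newest.get(job.name)
        if current is None or job.started_at > current.started_at:
            newest[job.name] = job
    return list(newest.values())


def blocks_merge(job: Job) -> bool:
    """Failures and user-cancelled jobs block the merge."""
    return job.conclusion in ("failure", "cancelled")
```

Under this sketch, a job name whose only run was cancelled by concurrency is simply absent from the report until its replacement run shows up.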

huydhn avatar Apr 05 '24 22:04 huydhn

IIRC if we just show cancelled jobs as cancelled, when the new job kicks off we'd automatically show the status of the now-running job, right?

ZainRizvi avatar Apr 05 '24 23:04 ZainRizvi

Related Issue: https://github.com/pytorch/test-infra/issues/4644

ZainRizvi avatar Apr 05 '24 23:04 ZainRizvi

IIRC if we just show cancelled jobs as cancelled, when the new job kicks off we'd automatically show the status of the now-running job, right?

Yeah, that's what I think too. This issue is kind of hard to track and reproduce.

huydhn avatar Apr 06 '24 00:04 huydhn