Cancelled jobs on CI are still not handled correctly
We have received feedback that cancelled signals on CI are still showing up. This causes confusion and also blocks merges. For example, see https://github.com/pytorch/pytorch/pull/121522#issuecomment-1986286960
Looking at the workflow summary https://github.com/pytorch/pytorch/actions/runs/8207746512, it's clear that the workflow was cancelled by its concurrency rule:
Canceling since a higher priority waiting request for 'linux-binary-libtorch-pre-cxx11-ciflow/trunk/121522-false-false' exists
If we can query this information, it should be a reliable way to handle cancelled signals on CI.
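As a rough illustration of what "querying this information" could look like, here is a minimal sketch that scans a run's messages for the concurrency cancellation text quoted above. It assumes the messages have already been fetched as plain strings (how they are fetched is not covered here), and `superseded_group` is a hypothetical helper name, not an existing test-infra function.

```python
import re
from typing import Iterable, Optional

# Message format copied from the workflow summary quoted above.
_SUPERSEDED_RE = re.compile(
    r"Canceling since a higher priority waiting request for '(?P<group>[^']+)' exists"
)


def superseded_group(messages: Iterable[str]) -> Optional[str]:
    """Return the concurrency group named in the first matching message.

    Returns None when no message indicates that the run was cancelled
    by a higher priority request.
    """
    for message in messages:
        match = _SUPERSEDED_RE.search(message)
        if match:
            return match.group("group")
    return None
```

A non-None result would mean the cancellation came from the concurrency rule rather than from a user.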
AI: Cancelled signals should be surfaced in a clear way: 1) if the job is cancelled by the user, we should say that the merge is cancelled; 2) if the job is cancelled because a higher priority job runs, it shouldn't show up as a failure.
To clarify the ask, would it be correct to say we want:
- Jobs cancelled by the user should be marked as "cancelled by user" by Dr. CI and mergebot
- Jobs cancelled due to infra reasons (like higher priority jobs) should be marked as "cancelled by infra" in Dr. CI and mergebot
- For both of the above, mergebot should continue failing the merge, but give a more precise reason for why it's blocking the merge
The first point is correct: when the jobs are cancelled by the user, we want them to show up as failures and block the merge. However, if the jobs are cancelled by a higher priority request, as in the example above, we don't want them to show up on Dr. CI at all. Instead, we need to use the status of the newer set of jobs.
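The policy described here could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Job` record, its fields (including the `superseded` flag), and the helpers `jobs_to_report` and `blocks_merge` are all hypothetical and do not reflect how Dr. CI or mergebot actually store or process jobs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    started_at: int           # hypothetical: any increasing start timestamp
    conclusion: str           # e.g. "success", "failure", "cancelled"
    superseded: bool = False  # hypothetical: cancelled by a higher priority request


def jobs_to_report(jobs: List[Job]) -> List[Job]:
    """Drop superseded jobs, then keep the newest remaining job per name."""
    latest: Dict[str, Job] = {}
    for job in sorted(jobs, key=lambda j: j.started_at):
        if job.superseded:
            continue  # never surface jobs cancelled by the concurrency rule
        latest[job.name] = job  # later jobs overwrite earlier ones
    return sorted(latest.values(), key=lambda j: j.name)


def blocks_merge(job: Job) -> bool:
    """User-cancelled jobs are treated like failures and block the merge."""
    return job.conclusion in ("failure", "cancelled")
```

With this shape, a superseded job never reaches `blocks_merge`, so only user cancellations and real failures block the merge.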
IIRC if we just show cancelled jobs as cancelled, when the new job kicks off we'd automatically show the status of the now-running job, right?
Related Issue: https://github.com/pytorch/test-infra/issues/4644
> IIRC if we just show cancelled jobs as cancelled, when the new job kicks off we'd automatically show the status of the now-running job, right?
Yeah, that's what I think too. This issue is kind of hard to track and reproduce.