test-infra
Investigate/fix DrCI flaky build classification
Starting from https://github.com/pytorch/pytorch/pull/122350, where two build failures clearly caused by the PR were marked as flaky.
cc @ZainRizvi
To document what I have found: this is a clear example of how the accuracy of Dr.CI classification depends on the log classifier.
The build failure as captured by the log classifier was a generic error:
{
  workflowId: 8601773675,
  workflowUniqueId: 16535519,
  id: 23578796771,
  runnerName: 'i-058fb9e0227cfbbbc',
  authorEmail: '[email protected]',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-04-08T19:09:52Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8601773675/job/23578796771',
  head_branch: 'ciflow/trunk/122350',
  pr_number: 122350,
  head_sha: '247c64f9c271d4abd5f1d7e30f5deb59cc0ea979',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  time: '2024-04-08T19:09:57.620867Z'
}
Given this information, this was exactly the same as an actual flaky build failure from https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783, which the bot retried successfully. The flaky failure captured by the log classifier was:
{
  workflowId: 8473868798,
  id: 23219108783,
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-03-28T21:38:47Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783',
  head_sha: '53c6a0301c4d68d5afcdccd746ce4d1f667f71e9',
  head_branch: 'ciflow/trunk/122331',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  authorEmail: ''
}
The two look exactly the same, including the failure_context guardrail. The merge base of https://github.com/pytorch/pytorch/pull/122350 was also 11 days old, which increases the chance of getting false positives (the older the base commit, the higher the chance).
Some thoughts:
- As a mitigation, I could add a check to skip/tighten the flaky classification when the base commit is older than a certain threshold, maybe 3 days (see the sketch after this list).
- In the long run, we plan to improve the log classifier's accuracy. These two failures were completely different, but the log classifier treated them the same.
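A minimal sketch of that age check, assuming a hypothetical isMergeBaseTooOld helper and a 3-day default; the actual Dr.CI gating may look different:

```ts
// Hypothetical helper: skip or tighten the flaky classification when the merge
// base is older than a threshold (3 days here, for illustration only).
const MAX_MERGE_BASE_AGE_DAYS = 3;

function isMergeBaseTooOld(
  mergeBaseDate: Date,
  now: Date = new Date(),
  maxAgeDays: number = MAX_MERGE_BASE_AGE_DAYS
): boolean {
  const ageDays =
    (now.getTime() - mergeBaseDate.getTime()) / (24 * 60 * 60 * 1000);
  return ageDays > maxAgeDays;
}

// Usage: only attempt the flaky similarity match when the base commit is recent,
// e.g. if (!isMergeBaseTooOld(new Date(mergeBaseCommitDate))) { ...flaky check... }
```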
@malfet: we should never treat build failures as flaky / flaky build failures should never be allowed to merge without -f. E.g. an infra failure -> do not merge, regardless of whether the failure is due to the PR or not.
- AI: Add builds to the list of jobs never marked as flaky
- AI: Do not mark as flaky when the log classifier result is really generic
- AI: Category for "not your fault but you can't ignore this"
The first and second AIs make sense to me and could be implemented easily. The last one, however, needs more clarification I think. Devs would likely force merge the PR when a failure is "not your fault but you can't ignore this". We have some mechanisms to help with this AI:
- To retry infra failures on PR
- To limit the number of failures devs can bypass on their PR to the current value of 10 (configurable)
The intention of the third point is that devs should explicitly force merge the PR if they don't want to figure out some way to rerun the job.
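For the first two AIs, a minimal sketch of what the gating could look like, with assumed names (canBeMarkedFlaky, GENERIC_CAPTURES) rather than the actual torchci code:

```ts
// Assumed list of captures that are too generic to identify a specific failure.
const GENERIC_CAPTURES = new Set<string>([
  "Process completed with exit code 1.",
]);

function canBeMarkedFlaky(jobName: string, failureCaptures: string[]): boolean {
  // AI 1: build jobs are never marked as flaky.
  if (jobName.endsWith(" / build")) {
    return false;
  }
  // AI 2: do not mark as flaky when every capture is too generic to trust.
  if (
    failureCaptures.length === 0 ||
    failureCaptures.every((capture) => GENERIC_CAPTURES.has(capture))
  ) {
    return false;
  }
  return true;
}
```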
Do legitimate flaky failures tend to have generic error captures like "##[error]Process completed with exit code 1."?
Using a generic output like that to determine flakiness seems really risky, even if we implement the other mitigations mentioned here.
Do legitimate flaky failures tend to have generic error captures like "##[error]Process completed with exit code 1."?
Yup, that one and "Command docker exec -t ..." from Nova, e.g. https://hud.pytorch.org/pytorch/executorch/commit/b3ac5332e72a909ca900d6dc6feb21bedce753e4
At the moment, there are 2 heuristics in place to reduce the chance of false positives:
- The error is not compared alone, but together with the error context of the N shell commands before it (https://github.com/pytorch/test-infra/blob/main/torchci/lib/jobUtils.ts#L291), so an exit code 1 due to a build failure and an exit code 1 due to GitHub infra flakiness look different (a rough sketch follows this list).
- The fix mentioned in https://github.com/pytorch/test-infra/pull/5073 narrows down the chance of false positives further.
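A rough sketch of that context heuristic, using field names mirroring the log classifier output shown above; the real comparison lives in torchci/lib/jobUtils.ts and may differ:

```ts
interface JobFailure {
  jobName: string;
  failure_captures: string[];
  failure_context: string[];
}

// Compare the N shell commands closest to the error (the exact N is assumed here).
const NUM_CONTEXT_COMMANDS = 10;

function looksLikeSameFailure(a: JobFailure, b: JobFailure): boolean {
  // Same job and same captured error...
  if (a.jobName !== b.jobName) {
    return false;
  }
  if (a.failure_captures.join("\n") !== b.failure_captures.join("\n")) {
    return false;
  }
  // ...and the same shell commands leading up to the error, so an exit code 1
  // from a build error and one from an infra hiccup are less likely to be conflated.
  const contextA = a.failure_context.slice(0, NUM_CONTEXT_COMMANDS).join("\n");
  const contextB = b.failure_context.slice(0, NUM_CONTEXT_COMMANDS).join("\n");
  return contextA === contextB;
}
```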
We could also mitigate this by making it easier to add new heuristic rules for the log classifier. The PR route is a bit unyielding IMO, so we don't create new rules often, while I think we should.
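For example, a more data-driven rule format could make adding a rule a small config change. This is only an illustration of the idea, not the log classifier's actual rule format:

```ts
interface ClassifierRule {
  name: string;
  pattern: RegExp; // matched against log lines
  priority: number; // higher-priority rules win when several match
}

const rules: ClassifierRule[] = [
  {
    name: "Generic exit code",
    pattern: /##\[error\]Process completed with exit code \d+\./,
    priority: 1,
  },
  {
    name: "Docker exec failure",
    pattern: /Command docker exec -t /,
    priority: 5,
  },
];

function classifyLine(line: string): ClassifierRule | undefined {
  return rules
    .filter((rule) => rule.pattern.test(line))
    .sort((a, b) => b.priority - a.priority)[0];
}
```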
+1 same thing for backward compat job!
I have a fix ready here: https://github.com/pytorch/test-infra/pull/5106. It's weird that I don't see @ZainRizvi's comment about lint on the issue, although I think it's there.