
Investigate/fix DrCI flaky build classification

Open malfet opened this issue 10 months ago • 9 comments

Starting from https://github.com/pytorch/pytorch/pull/122350, where two build failures clearly caused by the PR were marked as flaky.

malfet avatar Apr 05 '24 16:04 malfet

cc @ZainRizvi

atalman avatar Apr 09 '24 18:04 atalman

To document what I have found: this is a clear example of the relationship between the accuracy of Dr.CI classification and that of the log classifier.

The build failure as captured by the log classifier was a generic error:

{
  workflowId: 8601773675,
  workflowUniqueId: 16535519,
  id: 23578796771,
  runnerName: 'i-058fb9e0227cfbbbc',
  authorEmail: '[email protected]',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-04-08T19:09:52Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8601773675/job/23578796771',
  head_branch: 'ciflow/trunk/122350',
  pr_number: 122350,
  head_sha: '247c64f9c271d4abd5f1d7e30f5deb59cc0ea979',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  time: '2024-04-08T19:09:57.620867Z'
}

Given this information, it looked exactly the same as an actual flaky build failure from https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783, which the bot successfully retried. The flaky failure captured by the log classifier was:

{
  workflowId: 8473868798,
  id: 23219108783,
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-03-28T21:38:47Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783',
  head_sha: '53c6a0301c4d68d5afcdccd746ce4d1f667f71e9',
  head_branch: 'ciflow/trunk/122331',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  authorEmail: ''
}

The two look exactly the same, including the failure_context guardrail. The merge base of https://github.com/pytorch/pytorch/pull/122350 was also 11 days old, which increases the chance of getting false positives (the older the base commit, the higher the chance).

Some thoughts:

  1. As a mitigation, I could add a check to skip or tighten flaky classification when the base commit is older than a certain threshold, maybe 3 days.
  2. In the long run, we plan to improve the log classifier's accuracy. These 2 failures were completely different, but the log classifier treated them as the same.
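The first mitigation above could be sketched roughly as follows. This is an illustrative sketch, not the actual Dr.CI code; the function name and the 3-day threshold are assumptions based on the comment above.

```typescript
// Hypothetical guard for mitigation (1): skip the flaky classification
// when the PR's merge base is older than a threshold (assumed 3 days here).
const MAX_MERGE_BASE_AGE_DAYS = 3;

function isMergeBaseTooOld(
  mergeBaseDate: Date,
  now: Date = new Date(),
  maxAgeDays: number = MAX_MERGE_BASE_AGE_DAYS
): boolean {
  const ageMs = now.getTime() - mergeBaseDate.getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}

// In the classifier, flaky matching would only run when the base is fresh:
// if (!isMergeBaseTooOld(mergeBase.date)) { /* run flaky matching */ }
```

For the PR in this issue, an 11-day-old merge base would trip this guard and the build failure would never have been compared against known flaky failures.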

huydhn avatar Apr 09 '24 19:04 huydhn

@malfet: we should never treat build failures as flaky; flaky build failures should never be allowed to merge without -f. E.g., an infra failure -> do not merge, regardless of whether the failure is due to the PR or not.

AIs:

  • Add builds to the list of jobs that are never marked as flaky
  • Do not mark a job as flaky when the log classifier result is really generic
  • Add a category for "not your fault but you can't ignore this"
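The first two action items above could be combined into a simple pre-filter before the flaky classification runs. This is a sketch under assumed names; the job patterns and the list of generic captures are illustrative, not the actual Dr.CI configuration.

```typescript
// Hypothetical pre-filter: build jobs are never flaky, and a capture that is
// too generic carries too little signal to justify a flaky classification.
const NEVER_FLAKY_JOB_PATTERNS: RegExp[] = [/\/ build$/];

const GENERIC_FAILURE_CAPTURES: Set<string> = new Set([
  "Process completed with exit code 1.",
]);

function canBeMarkedFlaky(jobName: string, failureCaptures: string[]): boolean {
  // AI 1: never mark build jobs as flaky.
  if (NEVER_FLAKY_JOB_PATTERNS.some((re) => re.test(jobName))) {
    return false;
  }
  // AI 2: if every captured line is generic, skip flaky matching.
  if (failureCaptures.every((c) => GENERIC_FAILURE_CAPTURES.has(c))) {
    return false;
  }
  return true;
}
```

Either check alone would have caught the case in this issue: the job name ends in `/ build` and its only capture is the generic exit-code-1 line.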

clee2000 avatar Apr 16 '24 21:04 clee2000

The first and second AIs make sense to me and could be implemented easily. The last one, however, needs more clarification, I think. Devs would likely force merge the PR when a failure is "not your fault but you can't ignore this". We already have some mechanisms to help with this:

  1. To retry infra failures on PR
  2. To limit the number of failures devs can bypass on their PR to the current value of 10 (configurable)

huydhn avatar Apr 17 '24 04:04 huydhn

The intention of the third point is that devs should explicitly force merge the PR if they don't want to figure out some way to rerun the job

clee2000 avatar Apr 17 '24 15:04 clee2000

Do legitimate flaky failures tend to have generic error captures like ##[error]Process completed with exit code 1.?

Using a generic output like that to determine flakiness seems really risky, even if we implement the other mitigations mentioned here.

ZainRizvi avatar Apr 17 '24 21:04 ZainRizvi

Do legitimate flaky failures tend to have generic error captures like ##[error]Process completed with exit code 1.?

Yup, that one and Command docker exec -t ... from Nova, i.e. https://hud.pytorch.org/pytorch/executorch/commit/b3ac5332e72a909ca900d6dc6feb21bedce753e4

Atm, there are 2 heuristics in place to reduce the chance of false positives (FP):

  • The error is not compared alone, but together with the error context of the N shell commands before it (https://github.com/pytorch/test-infra/blob/main/torchci/lib/jobUtils.ts#L291). So an exit code 1 due to a build failure and an exit code 1 due to GitHub infra flakiness look different.
  • The fix mentioned in https://github.com/pytorch/test-infra/pull/5073 narrows the chance of false positives further.
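The context heuristic from the first bullet can be sketched as below. This is illustrative only (the real logic lives in torchci/lib/jobUtils.ts); the function name and the value of N are assumptions.

```typescript
// Sketch: two failures match only when both the captured error AND the
// last N shell commands before it agree. N is an assumed value here.
const NUM_CONTEXT_COMMANDS = 10;

interface FailureInfo {
  failure_captures: string[];
  failure_context: string[];
}

function sameFailure(a: FailureInfo, b: FailureInfo): boolean {
  const sameCapture =
    JSON.stringify(a.failure_captures) === JSON.stringify(b.failure_captures);
  const ctxA = a.failure_context.slice(0, NUM_CONTEXT_COMMANDS);
  const ctxB = b.failure_context.slice(0, NUM_CONTEXT_COMMANDS);
  return sameCapture && JSON.stringify(ctxA) === JSON.stringify(ctxB);
}
```

Note that this guardrail still fails on the case in this issue: the two JSON excerpts above have identical failure_captures and identical failure_context, so a context comparison of any depth would still report them as the same failure.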

We could also mitigate this by making it easier to add new heuristic rules to the log classifier. The PR route is a bit unwieldy IMO, so we don't create new rules often, though I think we should.

huydhn avatar Apr 17 '24 21:04 huydhn

+1 same thing for backward compat job!

albanD avatar Apr 19 '24 18:04 albanD

+1 same thing for backward compat job!

I have a fix ready here: https://github.com/pytorch/test-infra/pull/5106. It's weird that I don't see @ZainRizvi's comment about lint on the issue, though I think it's there.

huydhn avatar Apr 19 '24 20:04 huydhn