[Dr CI] Wrong classification for XLA

Open clee2000 opened this issue 1 year ago • 2 comments

Pretty sure the xla failure on this was real

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/124920

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: You can merge normally! (2 Unrelated Failures)

As of commit a6516ea6789e12a1a80a8c8cc7ce63698d443821 with merge base 59a1f1f308545e3ac1d81940a51f8dc0db3d82d4:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

clee2000 avatar Apr 26 '24 02:04 clee2000

I think it matched against https://github.com/pytorch/pytorch/actions/runs/8819380894/job/24214658473 which was recent and has a different error trace but the same test name. However, it doesn't show up on the main branch afaict. Are flaky failures checking all branches? Can it be changed to only be against main?
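To make the question concrete, here is a minimal sketch of the stricter matching being asked for. It is not Dr. CI's actual implementation: the `Failure` record, its fields, and `is_likely_flaky` are hypothetical names used only for illustration. The assumption is that a failure counts as flaky only when a recent failure on a trusted branch (here just `main`) shares both the test name and the error line.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Failure:
    """Hypothetical record of one failed job; not a real Dr. CI type."""
    branch: str
    test_name: str
    error_line: str


def is_likely_flaky(
    failure: Failure,
    recent_failures: Iterable[Failure],
    trusted_branches: Tuple[str, ...] = ("main",),
) -> bool:
    """Return True only if a recent failure on a trusted branch has the
    same test name AND the same error line as `failure`."""
    return any(
        past.branch in trusted_branches
        and past.test_name == failure.test_name
        and past.error_line == failure.error_line
        for past in recent_failures
    )


# Illustrative data (made up): same test name, different branch and error.
pr_failure = Failure("my-pr", "test_example", "AssertionError: tensors differ")
other_pr = Failure("other-pr", "test_example", "RuntimeError: device not found")
on_main = Failure("main", "test_example", "AssertionError: tensors differ")
```

With this rule, `is_likely_flaky(pr_failure, [other_pr])` is false because the only candidate is on another PR branch and has a different error, while `is_likely_flaky(pr_failure, [on_main])` is true.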

cc @huydhn

clee2000 avatar Apr 26 '24 15:04 clee2000

I'm trying to figure out a solution for this case. From what I see, XLA error matching is sometimes poor because I have not been paying much attention to what runs on the XLA side, so the log classifier does not yet have good support for it. One common mismatch is `ModuleNotFoundError: No module named 'torch.version'`, which appears on all XLA test jobs.
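One possible way to avoid matching on such a line, sketched here under assumptions and not taken from the actual log classifier: skip any error line that shows up in most jobs, since a line present everywhere cannot tell two failures apart. The function name `pick_signature_line` and the `max_share` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def pick_signature_line(
    job_error_lines: Sequence[str],
    all_jobs_error_lines: Sequence[Sequence[str]],
    max_share: float = 0.5,
) -> Optional[str]:
    """Return the first error line of a job that appears in at most
    `max_share` of all jobs, or None if every line is too common."""
    total = len(all_jobs_error_lines)
    # Count each line once per job, so repeats inside one log do not inflate it.
    counts = Counter(
        line for lines in all_jobs_error_lines for line in set(lines)
    )
    for line in job_error_lines:
        if total == 0 or counts[line] / total <= max_share:
            return line
    return None


# Illustrative data (made up), using the noisy line quoted above.
NOISE = "ModuleNotFoundError: No module named 'torch.version'"
jobs: List[List[str]] = [
    [NOISE, "AssertionError: tensors differ"],
    [NOISE],
    [NOISE, "RuntimeError: device not found"],
]
```

Here the noisy line appears in all three jobs and is skipped, so `pick_signature_line(jobs[0], jobs)` falls through to the assertion error, and a job whose only error line is the noisy one yields `None` rather than a misleading match.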

huydhn avatar Apr 26 '24 17:04 huydhn

This has been fixed by https://github.com/pytorch/test-infra/pull/5151, so I will close this.

huydhn avatar May 14 '24 21:05 huydhn