Feature Proposal: Where possible, make auto-rerun tests run on a different machine

Open adamfarley opened this issue 1 year ago • 4 comments

Summary

Proposal for the auto-rerun test feature to mandate (where possible) that the reruns happen on a different host.

Details

If a specific unit test failure is caused by something specific to a particular host, this change allows us to avoid that problem.

Statistics Summary

Tests which failed and reran on the same host: 84
...of which this many failed: 78
Tests which failed and reran on a different host: 173
...of which this many failed: 124

Statistics Source

RerunStatsByHost.groovy.txt

adamfarley, Nov 20 '24 15:11

For context, this issue came out of a discussion in the retrospective, where I suggested that we could grab the hostname of the machine where the initial run occurred and set ADDITIONAL_LABEL=!hostname. Before making such a change, though, we could look at some of the metrics around auto_reruns in TRSS (related: https://github.com/adoptium/aqa-tests/issues/5121) to see whether it is actually needed. It may be that many of the auto_reruns naturally land on different machines already, or that the intermittent failures we have at the project are less likely to have machine-related causes.
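As a rough illustration of the ADDITIONAL_LABEL=!hostname idea, here is a minimal sketch of how the exclusion could be combined with whatever label a job already uses. It relies only on the `!` and `&&` operators of Jenkins label expressions. The function name, the example label, and the host name are all hypothetical, and the sketch does not cover the "where possible" fallback for when no other machine matches.

```python
def exclude_host_label(base_label: str, initial_host: str) -> str:
    """Build a label expression that matches base_label but not initial_host.

    Hypothetical helper: base_label is whatever label the job already uses,
    initial_host is the machine the first (failed) run landed on.
    """
    exclusion = f"!{initial_host}"
    base_label = base_label.strip()
    if not base_label:
        # No existing label: the rerun may go anywhere except the initial host.
        return exclusion
    # Parenthesise the existing expression so any || inside it keeps its meaning.
    return f"({base_label})&&{exclusion}"


# Made-up label and host name, for illustration only.
print(exclude_host_label("ci.role.test&&hw.arch.x86", "test-host-1"))
```

The parentheses are there so that an existing label containing `||` is not accidentally regrouped by the added `&&`.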

smlambert, Nov 20 '24 21:11

I've written a program to provide some numbers for/against this proposal.

It looks at the last 10 pipelines per LTS version, identifies all rerun tests per build, and compares the host names (and also logs the pass/fail of the rerun).
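The counting step described above can be sketched roughly as follows. This is not the attached Groovy program, just a small Python illustration of the same tally; the record type, its field names, and the sample data are made up, and fetching the records from the pipelines is left out.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Rerun(NamedTuple):
    # Hypothetical record: one failed test and its automatic rerun.
    test_name: str
    initial_host: str
    rerun_host: str
    rerun_passed: bool


def tally_reruns(reruns: Iterable[Rerun]) -> Counter:
    """Count reruns and rerun failures, split by same vs. different host."""
    counts = Counter()
    for rerun in reruns:
        where = "same" if rerun.rerun_host == rerun.initial_host else "different"
        counts[f"{where}_host_reruns"] += 1
        if not rerun.rerun_passed:
            counts[f"{where}_host_failures"] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    Rerun("testA", "host-1", "host-1", False),
    Rerun("testB", "host-1", "host-2", True),
    Rerun("testC", "host-3", "host-2", False),
]
for key, value in sorted(tally_reruns(sample).items()):
    print(f"{key}: {value}")
```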

The program is taking a while to run, but I can see the progress it's making and will update this issue with the results in a minute.

Here is the source: RerunStatsByHost.groovy.txt

And here is the output:

Tests which failed and reran on the same host: 84
...of which this many failed: 78
Tests which failed and reran on a different host: 173
...of which this many failed: 124

So the percentages seem to indicate that a different host is best.
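For reference, the percentages behind that statement can be worked out directly from the four counts above. The helper name is just for illustration.

```python
def failure_rate(failed: int, total: int) -> float:
    """Percentage of reruns that failed again."""
    return 100.0 * failed / total


same_host = failure_rate(78, 84)         # reruns on the same host
different_host = failure_rate(124, 173)  # reruns on a different host

print(f"Same host:      {same_host:.1f}% of reruns failed")       # 92.9%
print(f"Different host: {different_host:.1f}% of reruns failed")  # 71.7%
```

So roughly 93% of same-host reruns failed again, against roughly 72% of different-host reruns. The counts alone do not say why a given rerun failed.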

adamfarley, Nov 21 '24 15:11

Please add me to this task as an assignee, and change the project to the Q4 one.

Also, I'll be training tomorrow, so others can feel free to add their names too for further discussion and/or PR creation.

Ta very much. :)

adamfarley, Nov 21 '24 16:11

Great :)

One other dimension to this is that we have now enabled taking 'problem machines' offline if a certain type of failure occurs, so such a machine would not be available to send the rerun job to. It'd be good to look at the reruns that were sent to the same machine and failed again, to see the nature of the failures (would those failures now trigger taking those machines offline?).

smlambert, Nov 21 '24 22:11