
Improvements related to rerun test jobs

Open sophia-guo opened this issue 1 year ago • 13 comments

Rerun test jobs were recently enabled at adoptium, which definitely helped in the latest releases. Here are some thoughts and issues we ran into during the releases:

  • [ ] for openjdk tests, rerunning failed targets can be costly. Only a few test cases may fail, but the test target may include a few hundred test cases. https://github.com/adoptium/aqa-tests/issues/5241
  • [ ] new intermittent failures can appear during the rerun of a test target, i.e. the formerly failing test cases pass in the rerun while other, formerly passing test cases fail. In that case the status of the rerun build is unstable, which needs a manual check and hence causes extra work.
  • [x] rerun iterations: for now at adoptium this is set to 3. Tests are rerun on the same machine 3 times, and if a test target fails once the job status is marked as unstable. I wonder if 1 is enough: if the test failures are related to machine issues, there is no need to rerun multiple times on the same machine.
  • [x] tap files are not archived to parent jobs https://github.com/adoptium/aqa-tests/issues/5015
  • [ ] rerun job shows unstable, however the TEST TARGETS SUMMARY shows passed. Might this be an issue with TKG when iteration > 1? https://ci.adoptium.net/job/Test_openjdk11_hs_sanity.system_x86-64_mac_rerun/1/
Screenshot 2024-01-29 at 5 33 09 PM

sophia-guo avatar Jan 29 '24 22:01 sophia-guo

Regarding 3 versus 1, I think we will adjust to 1 at ci.adoptium.net, since the failures we have are often machine related, so there is no value in rerunning 3x on the same machine. Since this was our 'trial use' of this feature, we set it to 3 to see how it would work.
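
To make the 3 versus 1 point concrete, here is a minimal Python sketch of the status rule described in the issue (a target that fails in any iteration marks the rerun job unstable). `rerun_status` and `machine_bound` are illustrative names, not TKG or Jenkins code, and the failure model is an assumption.

```python
from typing import Callable, Iterable, Set, Tuple


def rerun_status(targets: Iterable[str],
                 run_target: Callable[[str, int], bool],
                 iterations: int) -> Tuple[str, Set[str]]:
    """Run every target `iterations` times; a single failure marks the job UNSTABLE."""
    targets = list(targets)
    failed = {t for i in range(iterations) for t in targets if not run_target(t, i)}
    return ("UNSTABLE" if failed else "SUCCESS"), failed


# Assumed failure model: jdk_util_0 fails on every iteration because the
# problem is tied to the machine, not to the test itself.
def machine_bound(target: str, iteration: int) -> bool:
    return target != "jdk_util_0"


targets = ["jdk_lang_0", "jdk_util_0"]
one = rerun_status(targets, machine_bound, iterations=1)
three = rerun_status(targets, machine_bound, iterations=3)
```

Under that assumption `one` and `three` carry the same information, so the two extra iterations only add run time.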

smlambert avatar Jan 29 '24 23:01 smlambert

After doing triage for the Jan 2024 CPU, there are several updates to TRSS that I intend to add. One is that the list of failed openjdk testcases (just as we add to TAP files) should be tracked in the TRSS database. I have to investigate, but this could be done by changing how we configure jtreg, or by actively printing the TAP file contents to the console and grabbing them at the end of the job.
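
As a rough sketch of the second option: if the TAP contents were echoed to the console, failed entries could be picked out by their `not ok` prefix. The sample TAP text and the function name `failed_from_tap` are made up for illustration; how the real jobs would emit or store this is not decided here.

```python
import re
from typing import List

# A TAP test line that failed starts with "not ok", optionally followed by
# a test number and a "- description" part.
NOT_OK = re.compile(r"^\s*not ok\b\s*(\d+)?\s*-?\s*(.*)$")


def failed_from_tap(tap_text: str) -> List[str]:
    """Return the descriptions of all 'not ok' lines in TAP-formatted text."""
    failed = []
    for line in tap_text.splitlines():
        match = NOT_OK.match(line)
        if match:
            failed.append(match.group(2).strip())
    return failed


# Hypothetical console excerpt, not taken from a real job.
sample = """\
1..3
ok 1 - jdk_lang_0
not ok 2 - jdk_util_0
not ok 3 - jdk_util_1
"""
failed = failed_from_tap(sample)
```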

smlambert avatar Jan 30 '24 00:01 smlambert

The rerun feature is ideally suited for environments that are more stable than ci.adoptium.net, but at the same time, if we wait for stability we may never get to try any new features.

smlambert avatar Jan 30 '24 00:01 smlambert

I have been seeing extended test durations with recent builds, and have done a bit of digging into an example: x64AlpineLinux sanity.openjdk is now taking on average about 17 hours to complete: https://ci.adoptium.net/job/Test_openjdk22_hs_sanity.openjdk_x86-64_alpine-linux

  • The main test run takes about 5 hours
  • Typically the following fail and are submitted for re-run: jdk_lang_0,jdk_util_0,jdk_util_1,jdk_foreign_1 (https://ci.adoptium.net/job/Test_openjdk22_hs_sanity.openjdk_x86-64_alpine-linux_rerun/3/)
    • Of those the following always fail: jdk_util_0,jdk_util_1
  • These failures cause 3 re-runs of ALL the four targets, thus effectively all jdk_lang_0,jdk_util_0,jdk_util_1,jdk_foreign_1 get re-run 3 times
  • The net effect, given that these 4 take up most of the sanity.openjdk duration, is that the whole test run is taking 4 times longer!
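
The numbers above can be checked with a small back-of-the-envelope calculation. The 5-hour main run and the roughly 17-hour total come from this comment; the duration of one rerun pass over the four targets is inferred from them, not measured.

```python
def total_hours(main_run: float, rerun_pass: float, iterations: int) -> float:
    """Total job time: the main run plus `iterations` passes over the failed targets."""
    return main_run + iterations * rerun_pass


main_run = 5.0         # hours, from the comment above
observed_total = 17.0  # hours, from the comment above
iterations = 3         # rerun iterations at the time

# Inferred: time for one rerun pass over the four failed targets.
rerun_pass = (observed_total - main_run) / iterations

with_three = total_hours(main_run, rerun_pass, iterations=3)
with_one = total_hours(main_run, rerun_pass, iterations=1)
```

With these figures a single rerun pass is about 4 hours, so dropping to one iteration would bring the total to about 9 hours.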

The issue is even worse for https://ci.adoptium.net/job/Test_openjdk22_hs_extended.openjdk_x86-64_alpine-linux/ which is typically taking 2 days if it gets that far.

I'm not sure as it currently stands that amount of extra test run time is effective?

@sophia-guo @smlambert Thoughts? Can we just re-run the "testcases" ? Should we do a blanket exclude of the failing tests?

The problem seems most highlighted for Alpine Linux.

andrew-m-leonard avatar Feb 14 '24 09:02 andrew-m-leonard

Do we understand what the failures are and whether they are system specific? That would seem to be the important thing to do the root cause analysis on. @Haroon-Khel are these on your radar? I thought we only ever did one rerun for each job (but that may be wrong based on what you've said), so I'm surprised if we're getting four.

If they're taking longer than expected (and since it's happening on sanity and extended that seems likely) then it could be another example of the concurrency detection issues we've been seeing in containers.

sxa avatar Feb 14 '24 10:02 sxa

> I'm not sure as it currently stands that amount of extra test run time is effective?

If we let it run to completion then we know we have a complete picture of the situation which should assist debugging. Also since we're only running one build a week it shouldn't cause as much of a problem as it did when we were running stuff nightly 🤷 But it did need to be understood, and probably as quite a high priority.

sxa avatar Feb 14 '24 10:02 sxa

> I'm not sure as it currently stands that amount of extra test run time is effective?
>
> If we let it run to completion then we know we have a complete picture of the situation which should assist debugging. Also since we're only running one build a week it shouldn't cause as much of a problem as it did when we were running stuff nightly 🤷 But it did need to be understood, and probably as quite a high priority.

The failing tests are quite clear from the 2 re-runs, no need to wait for the subsequent 2 re-runs!

I think it does effectively highlight the problem :-) which is a bonus. It looks like it's mainly an Alpine issue, with extended.openjdk and sanity.openjdk, which between them seem to take 2 days to run.

I'm going to examine the rogue tests and raise an exclude for them.

andrew-m-leonard avatar Feb 14 '24 10:02 andrew-m-leonard

For the record, I have also dropped rerunIterations from 3 to 1 in our build pipeline code (via https://github.com/adoptium/ci-jenkins-pipelines/pull/929).

smlambert avatar Feb 14 '24 17:02 smlambert

For openjdk tests, it should be possible to rerun only the failed testcases when the number of failed testcases is not big.
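
A minimal sketch of that decision, assuming the failed testcases per target are already known. The threshold `max_testcases`, the function name `plan_rerun`, and the testcase paths are hypothetical and not part of aqa-tests or TKG.

```python
from typing import Dict, List, Tuple


def plan_rerun(failed: Dict[str, List[str]],
               max_testcases: int = 10) -> Tuple[str, List[str]]:
    """Choose rerun granularity.

    `failed` maps each failed target to its failed testcase names.
    Rerun individual testcases when every failed target reports its
    testcases and their total count is small; otherwise rerun whole targets.
    """
    testcases = sorted(tc for tcs in failed.values() for tc in tcs)
    every_target_has_cases = all(failed.values())
    if testcases and every_target_has_cases and len(testcases) <= max_testcases:
        return "testcases", testcases
    return "targets", sorted(failed)


# Hypothetical input: two failed targets with three failed testcases in total.
plan = plan_rerun({
    "jdk_util_0": ["java/util/ExampleA.java", "java/util/ExampleB.java"],
    "jdk_util_1": ["java/util/ExampleC.java"],
})
```

Falling back to whole targets when a target reports no testcases keeps failures such as crashes or timeouts from being silently skipped.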

sophia-guo avatar Feb 14 '24 17:02 sophia-guo

Also related as another suggested improvement to automatic reruns is https://github.com/adoptium/aqa-tests/issues/4874 (use of EXIT_SUCCESS flag).

smlambert avatar Feb 14 '24 19:02 smlambert

Also related as another suggested improvement to automatic reruns is https://github.com/adoptium/aqa-tests/issues/4379 (acknowledge and skip test targets tagged as notRerun in playlist).

smlambert avatar Feb 14 '24 19:02 smlambert

Example: rerunning 4 targets takes 1.5 hours, while rerunning 6 testcases takes 43 seconds.

https://github.com/adoptium/aqa-tests/issues/5016#issuecomment-1944289530

sophia-guo avatar Apr 10 '24 16:04 sophia-guo

If the rerun build is unstable, the deep history of the failed targets is still helpful. If the rerun build is successful, there is no need to provide the deep history of the failed targets in the rerun's parent job.

Currently, if the rerun build succeeds, some of the failed targets in the rerun's parent job are still listed with their deep history (Example A), but some aren't (Example B). If the rerun build is unstable, the deep history is available for some failed targets (Example C) but not for others (Example D). I'm not sure why, and it is confusing. Examples B and C are the expected behavior. The examples come from https://trss.adoptium.net/resultSummary?parentId=66157115879917006ef59450

Example D: Test_openjdk22_hs_extended.openjdk_x86-64_alpine-linux ⚠️ UNSTABLE ⚠️

Test_openjdk22_hs_extended.openjdk_x86-64_alpine-linux_rerun ⚠️ UNSTABLE ⚠️ Rerun failed

Example C: Test_openjdk22_hs_extended.openjdk_x86-64_linux ⚠️ UNSTABLE ⚠️

Test_openjdk22_hs_extended.openjdk_x86-64_linux_rerun ⚠️ UNSTABLE ⚠️ Rerun failed

java -version Test_openjdk22_hs_extended.openjdk_x86-64_linux_testList_0 ⚠️ UNSTABLE ⚠️
  • jdk_tools_1 => deep history 0/3 passed | possible issues
  • jdk_build_0 => deep history 13/15 passed | possible issues

Test_openjdk22_hs_extended.openjdk_x86-64_linux_testList_2 ⚠️ UNSTABLE ⚠️
  • jdk_build_1 => deep history 3/5 passed | possible issues

Example A: Test_openjdk22_hs_extended.openjdk_x86-64_mac ⚠️ UNSTABLE ⚠️

Test_openjdk22_hs_extended.openjdk_x86-64_mac_rerun ✅ SUCCESS ✅ Rerun all

java -version Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_1 ⚠️ UNSTABLE ⚠️
  • jdk_security3_1 => deep history 0/1 passed | possible issues
  • jdk_jfr_1 => deep history 0/1 passed | possible issues

Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_2 ⚠️ UNSTABLE ⚠️
  • jdk_net_1 => deep history 7/8 passed | possible issues
  • jdk_nio_1 => deep history 5/8 passed | possible issues

Example B: Test_openjdk22_hs_extended.openjdk_ppc64_aix ⚠️ UNSTABLE ⚠️

Test_openjdk22_hs_extended.openjdk_ppc64_aix_rerun ✅ SUCCESS ✅ Rerun all
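
The expected behavior can be written down as a one-line rule and checked against the four examples. This is only a sketch of the rule as described in this comment; `show_deep_history` is an illustrative name, not a TRSS function, and the `observed` table is transcribed from Examples A to D.

```python
def show_deep_history(rerun_status: str) -> bool:
    """Expected rule: list deep history in the parent job only if the rerun did not succeed."""
    return rerun_status != "SUCCESS"


# (rerun build status, whether deep history was listed), per example.
observed = {
    "A": ("SUCCESS", True),
    "B": ("SUCCESS", False),
    "C": ("UNSTABLE", True),
    "D": ("UNSTABLE", False),
}

# Examples whose observed behavior differs from the expected rule.
unexpected = sorted(
    name for name, (status, shown) in observed.items()
    if shown != show_deep_history(status)
)
```

Here `unexpected` contains A and D, which are the two cases that need an explanation.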

sophia-guo avatar Apr 11 '24 15:04 sophia-guo

Closing this as most concerns have been resolved.

The only remaining item no longer has valid information; if it happens again, a separate, specific issue can be opened.

  • [ ] rerun job shows unstable, however the TEST TARGETS SUMMARY shows passed. Might this be an issue with TKG when iteration > 1? https://ci.adoptium.net/job/Test_openjdk11_hs_sanity.system_x86-64_mac_rerun/1/

sophia-guo avatar Aug 16 '24 15:08 sophia-guo