
[Automation Enhancement] Mechanism to close the created Gradle Check AUTOCUT flaky test issues.

Open prudhvigodithi opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe

Background

Following the initial implementation in https://github.com/opensearch-project/OpenSearch/issues/13950, the automation described in the DEVELOPER_GUIDE identifies flaky tests and creates flaky test report issues based on test failures in the post-merge actions. The data used to create these issues comes from the OpenSearch Metrics Project (for more details, refer to the Gradle Check Metrics Dashboard). The initial goal of finding flaky tests and creating a detailed issue report has been met.

Problem Statement

Currently, an issue auto-created by the automation can only be closed once its failures have been absent from the post-merge actions for 30 days, because the query executed on the metrics cluster filters for tests that failed in the past 30 days. An example is the AUTOCUT issue created for RemoteStoreClusterStateRestoreIT. Even though the failure was identified and fixed promptly, there is no way for a user to close the issue: the automation flags RemoteStoreClusterStateRestoreIT again and re-opens the issue, because the test was identified as failing within the past 30 days. As a result, the issue remains open for the next 30 days (provided the test does not fail again in post-merge action builds), even though the user has fixed the flaky test.

Describe the solution you'd like

Proposed Solution

Solution 1

As proposed here

If the issue is closed (on the assumption that the user has fixed the flaky test), the automation should not re-open it unless the data differs from what is shown in the issue body. If anything in the issue body is different after the issue was closed, the automation should re-open it. The data to compare is the markdown table, not the linked PRs, because failures during PR creation can sometimes be genuine. In other words, re-open when a new failure (with a different post-merge commit) is seen after the issue is closed. This also covers the case where a flaky test is believed to be fixed but recurs: the new occurrence re-opens the issue with the new data.

This solution is a simple comparison against the test names and git references in the existing issue body, which decides whether to re-open the issue (once the user has closed it) or keep it closed.
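The comparison in Solution 1 could be sketched roughly as follows. This is only an illustration, not the automation's actual code: the function name `should_reopen` and the representation of each markdown table row as a (test name, git reference) pair are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: one failure = (test name, post-merge git reference).
Failure = Tuple[str, str]


def should_reopen(recorded: Iterable[Failure], latest: Iterable[Failure]) -> bool:
    """Return True if the latest run reports a failure that is not already
    listed in the closed issue's table (assumed parsed into `recorded`)."""
    recorded_set: Set[Failure] = set(recorded)
    # Any (test name, git reference) pair not seen before counts as new data.
    return any(failure not in recorded_set for failure in latest)
```

Under this sketch, re-running the query and getting the same rows keeps the issue closed, while the same test failing on a different post-merge commit re-opens it.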

Solution 2

This solution maintains a database of events and uses those events to decide whether to open a new issue or keep the issue closed.

Create a new index, gradle-check-flaky-tests, from the flaky test names identified in the OpenSearch Gradle Check Metrics, as part of the FetchPostMergeFailedTestClass automation. Then create a new document for each test name, associated with a test_class and a git_reference. For example:

{
  "_index": "gradle-check-flaky-tests",
  "_id": "yrZzNpAB0YKBsy3HQg9I",
  "_version": 1,
  "_score": null,
  "_source": {
    "test_class": "RemoteStoreClusterStateRestoreIT",
    "test_name": "org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata",
    "git_reference": "a06afef1fc63cab9ab9fc1b84215a575a91a12d8",
    "flaky": true,
    "flaky_identified_at": 
    "updated_at": 1718898508731,
    "fixed_at":
    "issue_number": 
    "time_open_in_days":
    "time_closed_in_days": 
  },
  "fields": {
    "updated_at": [
      "2024-06-20T15:48:28.731Z"
    ]
  },
  "sort": [
    1718898508731
  ]
}

The flaky_identified_at field is the date when the document was first created. The updated_at field is the time when the daily automation was last triggered. (Optional) The time_open_in_days field is the difference (updated_at - flaky_identified_at). (Optional) The time_closed_in_days field is the difference (updated_at - flaky_identified_at) once flaky is set to false. The flaky field is set to false once the user closes the issue. The fixed_at field is set to the current updated_at after flaky is set to false (approximately the time when the issue was closed). The issue_number field is the GitHub issue number created for the test_class (for example, https://github.com/opensearch-project/OpenSearch/issues/14326).
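As a rough illustration of how these fields might be derived on each daily run, here is a minimal sketch. The function `refresh_document`, the `issue_closed` flag, and the choice to leave already-fixed documents untouched are assumptions for the example; timestamps are assumed to be epoch milliseconds, matching the updated_at value shown in the document.

```python
from typing import Any, Dict

MS_PER_DAY = 24 * 60 * 60 * 1000


def refresh_document(doc: Dict[str, Any], now_ms: int, issue_closed: bool) -> Dict[str, Any]:
    """Return a copy of the document with its derived fields refreshed."""
    if not doc.get("flaky", True):
        # Assumption: a document already marked as fixed is left unchanged.
        return dict(doc)

    updated = dict(doc)
    updated["updated_at"] = now_ms
    # Whole days between first identification and this run.
    elapsed_days = (now_ms - updated["flaky_identified_at"]) // MS_PER_DAY

    if issue_closed:
        # The user closed the issue: mark the test as fixed.
        updated["flaky"] = False
        updated["fixed_at"] = now_ms
        # As described in the text: updated_at - flaky_identified_at at close time.
        updated["time_closed_in_days"] = elapsed_days
    else:
        updated["time_open_in_days"] = elapsed_days
    return updated
```

The function returns a new dictionary rather than mutating its input, so the same logic can be checked without touching the index.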

In subsequent automation runs, if the automation finds the test_name for the same git_reference with "flaky": false, it should not re-open the issue. If it finds the test_name for a different git_reference, the test has failed for another post-merge commit (git_reference) even though it was previously fixed, so the automation should create a new document and a new issue flagging the test as flaky for that commit. For open issues, the automation continues to update the issue body and the document fields above, keeping "flaky": true.
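The decision rule above could be sketched like this. It is a hypothetical illustration only: `decide_action` and the three action labels are invented for the example, and the documents are assumed to be plain dictionaries already fetched from the gradle-check-flaky-tests index.

```python
from typing import Any, Dict, List


def decide_action(documents: List[Dict[str, Any]], test_name: str, git_reference: str) -> str:
    """Pick what the automation does for one failing test in the current run."""
    for doc in documents:
        if doc["test_name"] == test_name and doc["git_reference"] == git_reference:
            if doc["flaky"]:
                # Issue is still open: keep updating the issue body and document.
                return "update_issue"
            # Same test and same commit, already fixed: do not re-open.
            return "keep_closed"
    # No document for this test on this commit: treat it as a new flaky failure.
    return "create_document_and_issue"
```

Keying the lookup on both test_name and git_reference is what lets a recurrence on a different post-merge commit produce a new document and a new issue.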

The assumption here is that the user will only close the issue once all the test names in the issue (for example, https://github.com/opensearch-project/OpenSearch/issues/14381) are fixed. The framework maintains one GitHub issue per test class, covering all of its test failures, and separate documents in the cluster, one per test name.

With this solution, we can also build trends from these flaky test documents using the OpenSearch Metrics Dashboard.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

prudhvigodithi avatar Jun 20 '24 17:06 prudhvigodithi

[Triage] Adding @andrross @reta @dblock @msfroh @shiv0408 @getsaurabh02 to review the proposed solutions.

prudhvigodithi avatar Jun 20 '24 17:06 prudhvigodithi

I think the 1st option is pretty simple and straightforward, thanks @prudhvigodithi !

reta avatar Jun 20 '24 19:06 reta

Agree that the 1st option is the simpler one and probably worth trying first.

andrross avatar Jun 20 '24 22:06 andrross

Thanks, solution 1 is in place now. Here is an example: https://github.com/opensearch-project/OpenSearch/issues/14499#issuecomment-2195414929. Related library change PR: https://github.com/opensearch-project/opensearch-build-libraries/pull/448. Related Jenkins change PR: https://github.com/opensearch-project/opensearch-build/pull/4805.

Thank you

prudhvigodithi avatar Jun 27 '24 18:06 prudhvigodithi

Closing this issue, as we now have a mechanism to close the created Gradle Check AUTOCUT flaky test issues.

prudhvigodithi avatar Jul 03 '24 17:07 prudhvigodithi