opensearch-build [Feature request] Detect flaky distribution build failures and integration test failures

Is your feature request related to a problem? Please describe

The GitHub issues created at the distribution level for build failures and integration test failures cannot tell whether the build or the tests are flaky. Currently, the logic blindly closes an issue as soon as the build passes for, say, one distribution, and opens a new one when it fails for another platform. Example: https://github.com/opensearch-project/cross-cluster-replication/issues?q=is%3Aissue++%5BAUTOCUT%5D+Distribution+Build+Failed+for+cross-cluster-replication-3.0.0+

Describe the solution you'd like

The GH issue creation should be smart enough to detect the following:

Is the build flaky: Are the failures consistent for a particular platform and architecture? Add a comment to the issue instead of closing it and creating a new one.
Are the integration tests flaky: Are the failures consistent for a particular platform and architecture? Add a comment to the issue instead of closing it and creating a new one.

If yes, it should label the issue, or comment on it, to say that the failure is flaky and that the issue should not be closed until it is addressed.

The time span for flagging an issue as flaky can be 3-4 hours, assuming 3-4 runs within that time frame.
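The detection described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the existing opensearch-build logic: the `BuildRun` record, the `classify_failures` function, the verdict strings, and the default 4-hour window with a minimum of 3 runs are all hypothetical names and values chosen to match the 3-4 hours / 3-4 runs suggestion.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BuildRun:
    """One build or integration test run (hypothetical record shape)."""
    platform: str
    architecture: str
    finished_at: datetime
    passed: bool


def classify_failures(runs, now, window=timedelta(hours=4), min_runs=3):
    """Classify each (platform, architecture) pair from runs inside the window.

    Mixed results count as "flaky"; failures on every run count as
    "consistent-failure". Both thresholds are assumptions, not project policy.
    """
    outcomes_by_target = defaultdict(list)
    for run in runs:
        age = now - run.finished_at
        # Ignore runs that are older than the window or dated in the future.
        if timedelta(0) <= age <= window:
            outcomes_by_target[(run.platform, run.architecture)].append(run.passed)

    verdicts = {}
    for target, outcomes in outcomes_by_target.items():
        if len(outcomes) < min_runs:
            verdicts[target] = "insufficient-data"
        elif all(outcomes):
            verdicts[target] = "passing"
        elif not any(outcomes):
            verdicts[target] = "consistent-failure"
        else:
            verdicts[target] = "flaky"
    return verdicts
```

A caller could keep the issue open and add a comment or label for any target reported as "flaky", and treat "consistent-failure" as a genuine platform-specific break.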

Describe alternatives you've considered

No response

Additional context

No response

Oct 24 '23 21:10 gaiksaya

To avoid creating and closing multiple issues, we should introduce a circuit breaker in the createGithubIssue library. Before creating an issue, it should query for AUTOCUT issues for the release version and, if one was closed less than 24-48 hours ago, reopen it and update it with the failed build information.

Example: https://github.com/opensearch-project/cross-cluster-replication/issues?q=is%3Aissue+%5BAUTOCUT%5D+Distribution+Build+Failed+for+cross-cluster-replication-3.0.0+is%3Aclosed+closed%3A2023-10-15..2023-10-22+ Take the latest issue, reopen it, and update it with the build failure.
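The circuit breaker decision could look something like the sketch below. It is written in Python purely for illustration; it is not the createGithubIssue library code, and the `Issue` record, the `choose_action` function, the action names, and the 48-hour default are all hypothetical. It assumes the matching issues have already been fetched, so it makes no GitHub API calls.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Issue:
    """Minimal view of a GitHub issue (hypothetical record shape)."""
    number: int
    title: str
    state: str  # "open" or "closed"
    closed_at: Optional[datetime] = None


def choose_action(
    issues: Sequence[Issue],
    title: str,
    now: datetime,
    reopen_window: timedelta = timedelta(hours=48),
) -> Tuple[str, Optional[int]]:
    """Decide whether to comment on, reopen, or create an AUTOCUT issue."""
    matching = [issue for issue in issues if issue.title == title]

    # An open issue already tracks this failure: add a comment to it.
    open_issues = [issue for issue in matching if issue.state == "open"]
    if open_issues:
        return ("comment", max(open_issues, key=lambda i: i.number).number)

    # Otherwise reopen the most recently closed issue, if it is recent enough.
    recently_closed = [
        issue
        for issue in matching
        if issue.state == "closed"
        and issue.closed_at is not None
        and now - issue.closed_at <= reopen_window
    ]
    if recently_closed:
        latest = max(recently_closed, key=lambda i: i.closed_at)
        return ("reopen", latest.number)

    # Nothing suitable exists: fall back to creating a new issue.
    return ("create", None)
```

Keeping the decision separate from the API calls makes the window easy to tune between 24 and 48 hours and easy to test without touching a real repository.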

Oct 25 '23 02:10 prudhvigodithi

[Untriage] The library is now updated to re-open AUTOCUT issues instead of just creating a new one. https://github.com/opensearch-project/common-utils/issues/556#issuecomment-1788145550

Screenshot 2023-10-31 at 3 57 26 PM

@gaiksaya take a look and close this issue if you think this solves the problem.

Thank you

Nov 01 '23 16:11 prudhvigodithi

Thanks @prudhvigodithi. Looks good. The comment needs more details, but that can be tracked in another issue. Closing the issue.

Nov 01 '23 17:11 gaiksaya

We should add a flaky-test label when a test passes and fails between different runs. CC: @prudhvigodithi @gaiksaya
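As a rough sketch of that rule, a test could be flagged when it has both passed and failed across a set of runs. The `find_flaky_tests` function and the input shape (one dict of test name to pass/fail per run) are hypothetical, and the sketch assumes the runs being compared cover the same code, so that a mixed result points at the test rather than at a real change.

```python
from collections import defaultdict


def find_flaky_tests(run_results):
    """Return the sorted names of tests that both passed and failed.

    run_results is a list with one dict per run, mapping a test name to
    True (passed) or False (failed). This input shape is an assumption.
    """
    seen_outcomes = defaultdict(set)
    for results in run_results:
        for test_name, passed in results.items():
            seen_outcomes[test_name].add(bool(passed))

    # A test is flagged only if both outcomes were observed.
    return sorted(
        name for name, outcomes in seen_outcomes.items()
        if outcomes == {True, False}
    )
```

The returned names are the candidates for a flaky-test label; tests that fail on every run are left out because they look like consistent failures.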

Feb 29 '24 20:02 bbarani

@rishabh6788 is going to work on a POC to record, track and surface flaky integration tests for OpenSearch core before implementing it for plugins.

Note: We will currently focus only on Gradle-based projects.

Mar 05 '24 17:03 bbarani

We now have the Gradle Check insights on failed and flaky tests in the OpenSearch Gradle Check Metrics dashboard. https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#gradle-check-metrics-dashboard

As required moving forward, we can have a similar setup and metrics for distribution build and integration test failures. Based on this data and trend (part of the metrics initiative), we can go with the solution @gaiksaya described of creating, updating, or commenting on issues.

@getsaurabh02 @dblock

Jun 04 '24 16:06 prudhvigodithi
