opensearch-build
[Feature request] Detect flaky distribution build failures and integration test failures
Is your feature request related to a problem? Please describe
The GitHub issues created at the distribution level for build failures and integration test failures lack the intelligence to detect whether the build or tests are flaky. Currently, the logic blindly closes the issue if the build passes for, say, one distribution and opens a new one if it fails for another platform. Example: https://github.com/opensearch-project/cross-cluster-replication/issues?q=is%3Aissue++%5BAUTOCUT%5D+Distribution+Build+Failed+for+cross-cluster-replication-3.0.0+
Describe the solution you'd like
The GH issue creation should be smart enough to detect the following:
- Is the build flaky: are the failures consistent for a particular platform and architecture? Add a comment to the issue instead of closing it and creating a new one.
- Are the integration tests flaky: are the failures consistent for a particular platform and architecture? Add a comment to the issue instead of closing it and creating a new one.
If yes, it should label the issue or comment on it, saying that this is flaky and should not be closed until it is addressed.
The time span for detecting an issue as flaky can be 3-4 hours, assuming 3-4 runs within that time frame.
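One possible reading of the detection rule above, sketched in Python for illustration only. The `BuildRun` type, the `classify_runs` function, and the status names are hypothetical and are not part of any existing library; the 4-hour window and the minimum of 3 runs are taken from the proposal. Here a platform/architecture pair is treated as flaky when it both passed and failed within the window, and as consistently failing when every run in the window failed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BuildRun:
    """Hypothetical record of one distribution build or integ-test run."""
    finished_at: datetime
    platform: str
    architecture: str
    passed: bool


def classify_runs(runs, now, window=timedelta(hours=4), min_runs=3):
    """Classify each (platform, architecture) pair seen inside the window.

    Returns a dict mapping the pair to one of:
    'flaky', 'failing', 'passing', 'insufficient-data'.
    """
    recent = {}
    for run in runs:
        # Only consider runs that finished inside the detection window.
        if now - window <= run.finished_at <= now:
            key = (run.platform, run.architecture)
            recent.setdefault(key, []).append(run.passed)

    result = {}
    for key, outcomes in recent.items():
        if len(outcomes) < min_runs:
            result[key] = "insufficient-data"
        elif all(outcomes):
            result[key] = "passing"
        elif not any(outcomes):
            result[key] = "failing"
        else:
            # Mixed pass/fail outcomes for the same target: treat as flaky.
            result[key] = "flaky"
    return result
```

With a classification like this, the issue automation could label or comment on the existing issue for `flaky` targets rather than closing it after a single passing run.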
Describe alternatives you've considered
No response
Additional context
No response
To avoid creating and closing multiple issues, we should introduce a circuit breaker in the createGithubIssue library: before creating an issue, it should query for AUTOCUT issues for the release version and, if one was closed within the last 24-48 hours, reopen it and update it with the failed build information.
Example: https://github.com/opensearch-project/cross-cluster-replication/issues?q=is%3Aissue+%5BAUTOCUT%5D+Distribution+Build+Failed+for+cross-cluster-replication-3.0.0+is%3Aclosed+closed%3A2023-10-15..2023-10-22+ Take the latest issue, reopen it, and update it with the build failure.
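The circuit-breaker decision described above could look roughly like the following sketch. It is written in Python for illustration and is not the createGithubIssue implementation; the `Issue` type and `choose_issue_action` function are hypothetical, the issue list is assumed to have been fetched already, and the 48-hour window is the upper end of the range suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Issue:
    """Hypothetical, minimal view of a GitHub issue."""
    number: int
    title: str
    state: str  # "open" or "closed"
    closed_at: Optional[datetime] = None


def choose_issue_action(issues, title, now, reopen_window=timedelta(hours=48)):
    """Decide what to do for a failed build with the given issue title.

    Returns ("comment", number), ("reopen", number), or ("create", None).
    """
    matching = [i for i in issues if i.title == title]

    # An open issue already exists: comment on it instead of creating another.
    open_issues = [i for i in matching if i.state == "open"]
    if open_issues:
        return ("comment", max(open_issues, key=lambda i: i.number).number)

    # Otherwise, reopen the most recently closed issue if it is recent enough.
    recently_closed = [
        i for i in matching
        if i.state == "closed"
        and i.closed_at is not None
        and now - i.closed_at <= reopen_window
    ]
    if recently_closed:
        latest = max(recently_closed, key=lambda i: i.closed_at)
        return ("reopen", latest.number)

    # Nothing suitable found: fall back to creating a new issue.
    return ("create", None)
```

Keeping the decision as a pure function over an already-fetched issue list makes the window easy to tune and to unit test without calling the GitHub API.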
[Untriage] The library has now been updated to reopen AUTOCUT issues instead of just creating a new one. https://github.com/opensearch-project/common-utils/issues/556#issuecomment-1788145550
@gaiksaya take a look and close this issue if you think this solves the problem.
Thank you
Thanks @prudhvigodithi, looks good. The comment needs more details, but that can be tracked in another issue. Closing the issue.
We should add a flaky-test label when a test passes and fails between different runs. CC: @prudhvigodithi @gaiksaya
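As a rough illustration of the "passes and fails between different runs" criterion, the sketch below collects the tests that show both outcomes. The `find_flaky_tests` function and its input shape (one dict per run, mapping test name to a pass/fail boolean) are assumptions made for this example, not an existing API.

```python
def find_flaky_tests(run_results):
    """Return the sorted names of tests that both passed and failed.

    run_results: list of dicts, one per run, mapping a test name to
    True (passed) or False (failed).
    """
    outcomes = {}
    for results in run_results:
        for test_name, passed in results.items():
            outcomes.setdefault(test_name, set()).add(passed)
    # A test is a flaky-test label candidate only if both outcomes were seen.
    return sorted(name for name, seen in outcomes.items() if seen == {True, False})
```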
@rishabh6788 is going to work on a POC to record, track and surface flaky integration tests for OpenSearch core before implementing it for plugins.
Note: For now, we will focus only on Gradle-based projects.
We now have the Gradle Check insights on failed and flaky tests in the OpenSearch Gradle Check Metrics dashboard. https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#gradle-check-metrics-dashboard
Moving forward, as required, we can set up similar metrics for distribution build and integration test failures. Based on this data and its trends (part of the metrics initiative), we can go with the solution @gaiksaya described of creating, updating, or commenting on issues.
@getsaurabh02 @dblock