Stabilize :plugins:repository-azure:internalClusterTest by waiting foβ¦
Description
Stabilizes :plugins:repository-azure:internalClusterTest, which occasionally failed
in CI before executing any JUnit tests due to the Azure docker fixture not being
fully ready when the test suite began.
Fix
- Added
assertBusy()+ensureGreen()to wait for Azure fixture readiness - Added test teardown to remove all indices and repositories and prevent stale CI state
- No production behavior has been modified β test-only reliability fix
Testing
- 20 consecutive successful local runs:
for i in {1..20}; do ./gradlew :plugins:repository-azure:internalClusterTest; done
Issue
Fixes #20124
Summary by CodeRabbit
- Tests
- Improved Azure integration test reliability by adding a bounded readiness check before tests, a stronger teardown that fully cleans indices and repositories after each run, and a reusable retry helper to wait for transient conditions.
βοΈ Tip: You can customize this high-level summary in your review settings.
Walkthrough
Adds JUnit lifecycle methods to the Azure repository internalClusterTest to wait for an Azure container to be ready before tests and to list/delete blobs after tests; also makes setUp/tearDown public overrides that ensure cluster health and remove indices and repositories.
Changes
| Cohort / File(s) | Summary |
|---|---|
Azure test lifecycle plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java |
Added @BeforeClass waitForAzureFixtureReady() to validate required test.azure.* system properties, initialize AzureStorageService and container client, and poll container existence; added @AfterClass cleanupAzureBlobs() to list and delete blobs and close the service; added public setUp() and tearDown() overrides to ensure cluster green state and delete indices (using IndicesOptions.LENIENT_EXPAND_OPEN_CLOSED_HIDDEN) and repositories. Added imports: BeforeClass, TimeUnit, IndicesOptions. |
Sequence Diagram(s)
sequenceDiagram
autonumber
participant JUnit as JUnit Runner
participant TestClass as AzureStorageCleanupThirdPartyTests
participant AzureSvc as AzureStorageService
participant Container as Azure Container
Note over JUnit,TestClass: `@BeforeClass` β readiness
JUnit->>TestClass: call waitForAzureFixtureReady()
TestClass->>AzureSvc: init(credentials from sysprops)
TestClass->>Container: getContainerClient()
loop poll until ready or timeout
TestClass->>Container: container.exists()
Container-->>TestClass: exists? (true/false)
end
Note over JUnit,TestClass: Tests execute...
Note over JUnit,TestClass: `@AfterClass` β cleanup
JUnit->>TestClass: call cleanupAzureBlobs()
TestClass->>AzureSvc: listBlobs(container)
loop for each blob
TestClass->>Container: deleteBlob(blob)
Container-->>TestClass: deletion result
end
TestClass->>AzureSvc: close()
Estimated code review effort
π― 3 (Moderate) | β±οΈ ~25 minutes
- Inspect system-property validation and error messages in waitForAzureFixtureReady().
- Verify polling/backoff, timeout choices and use of assertBusy for CI stability.
- Review blob listing/deletion loop for error handling and resource closing.
- Confirm tearDown index/repository deletion ordering and IndicesOptions usage.
Suggested reviewers
- andrross
Poem
π I waited till the container glowed,
polled the sky where SAS keys stowed.
When tests had hopped through every part,
I swept the blobs and closed my cart.
Cluster green β I thumped my heart.
Pre-merge checks and finishing touches
β Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | β οΈ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
β Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | β Passed | The title accurately describes the main fix: addressing flakiness in the Azure repository test by implementing fixture readiness checks and state cleanup. |
| Description check | β Passed | The description includes main sections matching the template: a clear description of the change, related issue reference (#20124), and verification of testing. All essential elements are present. |
| Linked Issues check | β Passed | The PR directly addresses the objective in #20124 by implementing fixes to eliminate the flaky test failures through fixture readiness validation and state cleanup. |
| Out of Scope Changes check | β Passed | All code changes are scoped to test reliability improvements: lifecycle methods for Azure fixture readiness and cleanup, setup/teardown methods for proper test state management. |
β¨ Finishing touches
- [ ] π Generate docstrings
π§ͺ Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
π Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
π₯ Commits
Reviewing files that changed from the base of the PR and between 6d236017943a3c6a60def59f1801a2e040fd1ab4 and 04129bd808ceb40b9e0208319d43dba83ac63ee4.
π Files selected for processing (1)
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java(3 hunks)
π§° Additional context used
π§ Learnings (2)
π Learning: 2025-12-12T18:40:08.452Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20017
File: plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ssl/SecureReactorNetty4HttpServerTransportTests.java:256-256
Timestamp: 2025-12-12T18:40:08.452Z
Learning: In the OpenSearch ReactorNetty4 secure HTTP transport tests (plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ssl/SecureReactorNetty4HttpServerTransportTests.java), URI limit validation has been moved from the protocol layer to the transport layer, making it protocol-agnostic. The random protocol selection in ReactorHttpClient.https(settings) is intentional to ensure all tests validate correct behavior across HTTP/1.1, HTTP/2, and HTTP/3.
Applied to files:
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java
π Learning: 2025-12-02T22:44:14.799Z
Learnt from: prudhvigodithi
Repo: opensearch-project/OpenSearch PR: 20112
File: server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java:73-81
Timestamp: 2025-12-02T22:44:14.799Z
Learning: In OpenSearch integration tests extending OpenSearchIntegTestCase, using `LuceneTestCase.SuppressCodecs("*")` triggers special handling that selects a random production codec from the CODECS array, while `SuppressCodecs("Asserting")` or other specific codec suppressions still allow Lucene's default codec randomization which may include the asserting codec. Use `SuppressCodecs("*")` when you need to completely avoid asserting codecs (e.g., for cross-thread StoredFieldsReader usage) while maintaining production codec test coverage.
Applied to files:
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
π Additional comments (4)
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java (4)
67-103: Well-structured cleanup with appropriate defensive patterns.The
cleanupAzureBlobsmethod correctly handles the CI teardown scenario:
- Early returns prevent NPE when fixture properties are not set
- try-with-resources ensures the
AzureStorageServiceis properly closed- The existence check before cleanup prevents failures on non-existent containers
- The broad exception catch with "CI teardown" comment is appropriate for best-effort cleanup in CI environments
The empty string fallback in line 83 is reasonable hereβif credentials are unavailable, the Azure SDK call will fail gracefully and the exception will be caught and ignored, which is the intended behavior for optional cleanup.
110-150: Excellent two-phase fixture readiness check.The
waitForAzureFixtureReadymethod effectively addresses the flaky test issues described in the PR objectives by:
- First validating that all required system properties are set (lines 112-124)
- Then confirming the Azure container is actually accessible (lines 126-149)
The try-with-resources pattern correctly prevents resource leaks, and the 60-second timeout is appropriate for Docker fixture startup. The two-phase approach is well-designedβproperty validation is fast, while the connection test may legitimately require retries as the fixture initializes.
152-156: LGTM: Proper test setup with cluster health check.The
setUpoverride correctly callssuper.setUp()first and then ensures the cluster is green before each test begins. This provides good test isolation and prevents flakiness from cluster state issues.
158-167: LGTM: Comprehensive per-test cleanup with proper safeguards.The
tearDownoverride correctly implements cleanup to prevent stale state between tests:
- Uses
LENIENT_EXPAND_OPEN_CLOSED_HIDDENto protect system indices- Removes all repositories to ensure clean state for next test
- Uses try-finally to guarantee
super.tearDown()always executesThis addresses the "cleaning stale state" objective mentioned in the PR description and follows OpenSearch testing best practices.
Comment @coderabbitai help to get the list of available commands and usage tips.
:x: Gradle check result for 49ca72f89968e9623d5fe5008ff478472b2bfa49: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 72ed735d523f21461a6c51290a67b9092c92af11: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for da47b310f60b6c3fd76a642de3c74b177fd8c43a: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
The failure appears unrelated to this change. The failing task is
:distribution:bwc:maintenance:buildBwcLinuxTar, not related to repository-azure.
All Azure tests passed, spotless and compilation passed. Requesting a CI re-run when convenient. π
Thanks @Gautam-aman for fixing this test. Can you ensure your commits are signed off correctly. This should give you context on how to: https://github.com/opensearch-project/OpenSearch/pull/20171/checks?check_run_id=57257172591
Hi @sandeshkr419, The Jenkins re-run failed again in :distribution:bwc:maintenance:buildBwcLinuxTar while building BWC artifacts for checkout-2.19 (build 68311). There are no test or compilation failures related to this PR; this appears to be an infra/pipeline issue. Could you please re-run or override the required check when convenient? Thanks.
:x: Gradle check result for da47b310f60b6c3fd76a642de3c74b177fd8c43a: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
This failure is unrelated to the changes in this PR.
The failing jobs are Jenkins / BWC infra:
- distribution:bwc:maintenance:buildBwcLinuxTar
- Gradle Check (Jenkins)
There are no test or compilation failures introduced by this PR. This appears to be a flaky infra issue.
@Gautam-aman Looks like a compiler failure to me:
> Task :plugins:repository-azure:compileInternalClusterTestJava FAILED
/var/jenkins/workspace/gradle-check/search/plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java:128: error: cannot find symbol
final AtomicReference<T> result = new AtomicReference<>();
Also check out the failed DCO workflow and make sure all commits are properly signed.
Addressed the compile-time error pointed out earlier by correcting the AtomicReference<T> usage. Rebased, signed off, and force-pushed to the PR branch. Requesting re-review.
:x: Gradle check result for 9e8ae0384f72421a59df8aa6a66a210528d7a3d8: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 4c38bfc1eb100d593ca7cea379cddc28e035b585: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for d4ebd9affb899b3e8a62fbe67d1b239025b71119: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 413bfdde877e5db202d4d4c0fc403e0b5ff1d14f: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for d7fb0cfb4b065b75821ee7e30c7663598a6d65d2: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Spotless is failing with: Failed to load eclipse jdt formatter: java.io.EOFException
This happens when Spotless runs under Java 25 and is a known formatter compatibility issue. Local formatting and spotless checks pass under JDK 21, which is the intended build JDK for OpenSearch. Please verify CIβs JDK or formatter tooling.
:x: Gradle check result for 66f33d7fd0eeca518b830a02797ce9cbeb7eba9e: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for bc0b0a7592117037058be5cee444aacfbdf6c8c3: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@Gautam-aman Looks like a test failure related to this change:
AzureStorageCleanupThirdPartyTests > classMethod FAILED
java.lang.IllegalArgumentException: Setting [azure.client.default.account] is a secure setting and must be stored inside the OpenSearch keystore, but was found inside opensearch.yml
at __randomizedtesting.SeedInfo.seed([AE1EA76EF69A93A4]:0)
at org.opensearch.common.settings.SecureSetting.get(SecureSetting.java:108)
at org.opensearch.repositories.azure.AzureStorageSettings.getConfigValue(AzureStorageSettings.java:481)
at org.opensearch.repositories.azure.AzureStorageSettings.getClientSettings(AzureStorageSettings.java:431)
at org.opensearch.repositories.azure.AzureStorageSettings.load(AzureStorageSettings.java:415)
at org.opensearch.repositories.azure.AzureStorageService.<init>(AzureStorageService.java:143)
at org.opensearch.repositories.azure.AzureStorageCleanupThirdPartyTests.lambda$waitForAzureFixtureReady$1(AzureStorageCleanupThirdPartyTests.java:96)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1171)
at org.opensearch.repositories.azure.AzureStorageCleanupThirdPartyTests.waitForAzureFixtureReady(AzureStorageCleanupThirdPartyTests.java:87)
Also there are still unsigned commits so the DCO check is still failing.
:x: Gradle check result for bef96582c6521d396473507cc3042063f646597a: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 3435b0082f526f8a92d4e8c788a11be957905fb8: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 4026dac10762638b2cdb4748cdfaa6d8a3c5f476: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 6481297367fe1acd44a5e84abab051042aa07845: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 8be670dfc9e8f166858bde43f6b8ef0c17e2ea74: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 1437bc289b1380a354a9b22ff119dd28638d493b: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 681e085f63070a21d645b7d62025ca0cc0b3a724: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 6d236017943a3c6a60def59f1801a2e040fd1ab4: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 04129bd808ceb40b9e0208319d43dba83ac63ee4: null
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Hello maintainers, This PR is blocked at the Gradle check step because the Jenkins workflow needs explicit approval for external execution (the jq parse errors are due to the job not having been started yet). Could someone with maintainer permissions please approve and run the workflow? Once that has been done, I will monitor the job and address any real test failures.
Thank you.