OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

Stabilize :plugins:repository-azure:internalClusterTest by waiting fo…

Open Gautam-aman opened this issue 3 weeks ago β€’ 5 comments

Description

Stabilizes :plugins:repository-azure:internalClusterTest, which occasionally failed in CI before executing any JUnit tests due to the Azure docker fixture not being fully ready when the test suite began.

Fix

  • Added assertBusy() + ensureGreen() to wait for Azure fixture readiness
  • Added test teardown to remove all indices and repositories and prevent stale CI state
  • No production behavior has been modified β€” test-only reliability fix

Testing

  • 20 consecutive successful local runs: for i in {1..20}; do ./gradlew :plugins:repository-azure:internalClusterTest; done

Issue

Fixes #20124

Summary by CodeRabbit

  • Tests
    • Improved Azure integration test reliability by adding a bounded readiness check before tests, a stronger teardown that fully cleans indices and repositories after each run, and a reusable retry helper to wait for transient conditions.

✏️ Tip: You can customize this high-level summary in your review settings.

Gautam-aman avatar Dec 05 '25 13:12 Gautam-aman

Walkthrough

Adds JUnit lifecycle methods to the Azure repository internalClusterTest to wait for an Azure container to be ready before tests and to list/delete blobs after tests; also makes setUp/tearDown public overrides that ensure cluster health and remove indices and repositories.

Changes

Cohort / File(s) Summary
Azure test lifecycle
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java
Added @BeforeClass waitForAzureFixtureReady() to validate required test.azure.* system properties, initialize AzureStorageService and container client, and poll container existence; added @AfterClass cleanupAzureBlobs() to list and delete blobs and close the service; added public setUp() and tearDown() overrides to ensure cluster green state and delete indices (using IndicesOptions.LENIENT_EXPAND_OPEN_CLOSED_HIDDEN) and repositories. Added imports: BeforeClass, TimeUnit, IndicesOptions.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant JUnit as JUnit Runner
    participant TestClass as AzureStorageCleanupThirdPartyTests
    participant AzureSvc as AzureStorageService
    participant Container as Azure Container

    Note over JUnit,TestClass: `@BeforeClass` β€” readiness
    JUnit->>TestClass: call waitForAzureFixtureReady()
    TestClass->>AzureSvc: init(credentials from sysprops)
    TestClass->>Container: getContainerClient()
    loop poll until ready or timeout
        TestClass->>Container: container.exists()
        Container-->>TestClass: exists? (true/false)
    end

    Note over JUnit,TestClass: Tests execute...

    Note over JUnit,TestClass: `@AfterClass` β€” cleanup
    JUnit->>TestClass: call cleanupAzureBlobs()
    TestClass->>AzureSvc: listBlobs(container)
    loop for each blob
        TestClass->>Container: deleteBlob(blob)
        Container-->>TestClass: deletion result
    end
    TestClass->>AzureSvc: close()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect system-property validation and error messages in waitForAzureFixtureReady().
  • Verify polling/backoff, timeout choices and use of assertBusy for CI stability.
  • Review blob listing/deletion loop for error handling and resource closing.
  • Confirm tearDown index/repository deletion ordering and IndicesOptions usage.

Suggested reviewers

  • andrross

Poem

πŸ‡ I waited till the container glowed,
polled the sky where SAS keys stowed.
When tests had hopped through every part,
I swept the blobs and closed my cart.
Cluster green β€” I thumped my heart.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
βœ… Passed checks (4 passed)
Check name Status Explanation
Title check βœ… Passed The title accurately describes the main fix: addressing flakiness in the Azure repository test by implementing fixture readiness checks and state cleanup.
Description check βœ… Passed The description includes main sections matching the template: a clear description of the change, related issue reference (#20124), and verification of testing. All essential elements are present.
Linked Issues check βœ… Passed The PR directly addresses the objective in #20124 by implementing fixes to eliminate the flaky test failures through fixture readiness validation and state cleanup.
Out of Scope Changes check βœ… Passed All code changes are scoped to test reliability improvements: lifecycle methods for Azure fixture readiness and cleanup, setup/teardown methods for proper test state management.
✨ Finishing touches
  • [ ] πŸ“ Generate docstrings
πŸ§ͺ Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

πŸ“œ Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 6d236017943a3c6a60def59f1801a2e040fd1ab4 and 04129bd808ceb40b9e0208319d43dba83ac63ee4.

πŸ“’ Files selected for processing (1)
  • plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
πŸ“š Learning: 2025-12-12T18:40:08.452Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20017
File: plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ssl/SecureReactorNetty4HttpServerTransportTests.java:256-256
Timestamp: 2025-12-12T18:40:08.452Z
Learning: In the OpenSearch ReactorNetty4 secure HTTP transport tests (plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ssl/SecureReactorNetty4HttpServerTransportTests.java), URI limit validation has been moved from the protocol layer to the transport layer, making it protocol-agnostic. The random protocol selection in ReactorHttpClient.https(settings) is intentional to ensure all tests validate correct behavior across HTTP/1.1, HTTP/2, and HTTP/3.

Applied to files:

  • plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java
πŸ“š Learning: 2025-12-02T22:44:14.799Z
Learnt from: prudhvigodithi
Repo: opensearch-project/OpenSearch PR: 20112
File: server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java:73-81
Timestamp: 2025-12-02T22:44:14.799Z
Learning: In OpenSearch integration tests extending OpenSearchIntegTestCase, using `LuceneTestCase.SuppressCodecs("*")` triggers special handling that selects a random production codec from the CODECS array, while `SuppressCodecs("Asserting")` or other specific codec suppressions still allow Lucene's default codec randomization which may include the asserting codec. Use `SuppressCodecs("*")` when you need to completely avoid asserting codecs (e.g., for cross-thread StoredFieldsReader usage) while maintaining production codec test coverage.

Applied to files:

  • plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
πŸ”‡ Additional comments (4)
plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java (4)

67-103: Well-structured cleanup with appropriate defensive patterns.

The cleanupAzureBlobs method correctly handles the CI teardown scenario:

  • Early returns prevent NPE when fixture properties are not set
  • try-with-resources ensures the AzureStorageService is properly closed
  • The existence check before cleanup prevents failures on non-existent containers
  • The broad exception catch with "CI teardown" comment is appropriate for best-effort cleanup in CI environments

The empty string fallback in line 83 is reasonable hereβ€”if credentials are unavailable, the Azure SDK call will fail gracefully and the exception will be caught and ignored, which is the intended behavior for optional cleanup.


110-150: Excellent two-phase fixture readiness check.

The waitForAzureFixtureReady method effectively addresses the flaky test issues described in the PR objectives by:

  1. First validating that all required system properties are set (lines 112-124)
  2. Then confirming the Azure container is actually accessible (lines 126-149)

The try-with-resources pattern correctly prevents resource leaks, and the 60-second timeout is appropriate for Docker fixture startup. The two-phase approach is well-designedβ€”property validation is fast, while the connection test may legitimately require retries as the fixture initializes.


152-156: LGTM: Proper test setup with cluster health check.

The setUp override correctly calls super.setUp() first and then ensures the cluster is green before each test begins. This provides good test isolation and prevents flakiness from cluster state issues.


158-167: LGTM: Comprehensive per-test cleanup with proper safeguards.

The tearDown override correctly implements cleanup to prevent stale state between tests:

  • Uses LENIENT_EXPAND_OPEN_CLOSED_HIDDEN to protect system indices
  • Removes all repositories to ensure clean state for next test
  • Uses try-finally to guarantee super.tearDown() always executes

This addresses the "cleaning stale state" objective mentioned in the PR description and follows OpenSearch testing best practices.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 05 '25 13:12 coderabbitai[bot]

:x: Gradle check result for 49ca72f89968e9623d5fe5008ff478472b2bfa49: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 05 '25 13:12 github-actions[bot]

:x: Gradle check result for 72ed735d523f21461a6c51290a67b9092c92af11: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 05 '25 14:12 github-actions[bot]

:x: Gradle check result for da47b310f60b6c3fd76a642de3c74b177fd8c43a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 05 '25 14:12 github-actions[bot]

The failure appears unrelated to this change. The failing task is :distribution:bwc:maintenance:buildBwcLinuxTar, not related to repository-azure.

All Azure tests passed, spotless and compilation passed. Requesting a CI re-run when convenient. πŸ‘

Gautam-aman avatar Dec 05 '25 14:12 Gautam-aman

Thanks @Gautam-aman for fixing this test. Can you ensure your commits are signed off correctly. This should give you context on how to: https://github.com/opensearch-project/OpenSearch/pull/20171/checks?check_run_id=57257172591

sandeshkr419 avatar Dec 10 '25 20:12 sandeshkr419

Hi @sandeshkr419, The Jenkins re-run failed again in :distribution:bwc:maintenance:buildBwcLinuxTar while building BWC artifacts for checkout-2.19 (build 68311). There are no test or compilation failures related to this PR; this appears to be an infra/pipeline issue. Could you please re-run or override the required check when convenient? Thanks.

Gautam-aman avatar Dec 14 '25 12:12 Gautam-aman

:x: Gradle check result for da47b310f60b6c3fd76a642de3c74b177fd8c43a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 15 '25 08:12 github-actions[bot]

This failure is unrelated to the changes in this PR.

The failing jobs are Jenkins / BWC infra:

  • distribution:bwc:maintenance:buildBwcLinuxTar
  • Gradle Check (Jenkins)

There are no test or compilation failures introduced by this PR. This appears to be a flaky infra issue.

Gautam-aman avatar Dec 15 '25 14:12 Gautam-aman

@Gautam-aman Looks like a compiler failure to me:

> Task :plugins:repository-azure:compileInternalClusterTestJava FAILED
/var/jenkins/workspace/gradle-check/search/plugins/repository-azure/src/internalClusterTest/java/org/opensearch/repositories/azure/AzureStorageCleanupThirdPartyTests.java:128: error: cannot find symbol
        final AtomicReference<T> result = new AtomicReference<>();

Also check out the failed DCO workflow and make sure all commits are properly signed.

andrross avatar Dec 18 '25 21:12 andrross

Addressed the compile-time error pointed out earlier by correcting the AtomicReference<T> usage. Rebased, signed off, and force-pushed to the PR branch. Requesting re-review.

Gautam-aman avatar Dec 19 '25 08:12 Gautam-aman

:x: Gradle check result for 9e8ae0384f72421a59df8aa6a66a210528d7a3d8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 08:12 github-actions[bot]

:x: Gradle check result for 4c38bfc1eb100d593ca7cea379cddc28e035b585: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 12:12 github-actions[bot]

:x: Gradle check result for d4ebd9affb899b3e8a62fbe67d1b239025b71119: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 12:12 github-actions[bot]

:x: Gradle check result for 413bfdde877e5db202d4d4c0fc403e0b5ff1d14f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 14:12 github-actions[bot]

:x: Gradle check result for d7fb0cfb4b065b75821ee7e30c7663598a6d65d2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 14:12 github-actions[bot]

Spotless is failing with: Failed to load eclipse jdt formatter: java.io.EOFException

This happens when Spotless runs under Java 25 and is a known formatter compatibility issue. Local formatting and spotless checks pass under JDK 21, which is the intended build JDK for OpenSearch. Please verify CI’s JDK or formatter tooling.

Gautam-aman avatar Dec 19 '25 16:12 Gautam-aman

:x: Gradle check result for 66f33d7fd0eeca518b830a02797ce9cbeb7eba9e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 16:12 github-actions[bot]

:x: Gradle check result for bc0b0a7592117037058be5cee444aacfbdf6c8c3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 19 '25 18:12 github-actions[bot]

@Gautam-aman Looks like a test failure related to this change:

AzureStorageCleanupThirdPartyTests > classMethod FAILED
    java.lang.IllegalArgumentException: Setting [azure.client.default.account] is a secure setting and must be stored inside the OpenSearch keystore, but was found inside opensearch.yml
        at __randomizedtesting.SeedInfo.seed([AE1EA76EF69A93A4]:0)
        at org.opensearch.common.settings.SecureSetting.get(SecureSetting.java:108)
        at org.opensearch.repositories.azure.AzureStorageSettings.getConfigValue(AzureStorageSettings.java:481)
        at org.opensearch.repositories.azure.AzureStorageSettings.getClientSettings(AzureStorageSettings.java:431)
        at org.opensearch.repositories.azure.AzureStorageSettings.load(AzureStorageSettings.java:415)
        at org.opensearch.repositories.azure.AzureStorageService.<init>(AzureStorageService.java:143)
        at org.opensearch.repositories.azure.AzureStorageCleanupThirdPartyTests.lambda$waitForAzureFixtureReady$1(AzureStorageCleanupThirdPartyTests.java:96)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1171)
        at org.opensearch.repositories.azure.AzureStorageCleanupThirdPartyTests.waitForAzureFixtureReady(AzureStorageCleanupThirdPartyTests.java:87)

Also there are still unsigned commits so the DCO check is still failing.

andrross avatar Dec 19 '25 21:12 andrross

:x: Gradle check result for bef96582c6521d396473507cc3042063f646597a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 11:12 github-actions[bot]

:x: Gradle check result for 3435b0082f526f8a92d4e8c788a11be957905fb8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 12:12 github-actions[bot]

:x: Gradle check result for 4026dac10762638b2cdb4748cdfaa6d8a3c5f476: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 12:12 github-actions[bot]

:x: Gradle check result for 6481297367fe1acd44a5e84abab051042aa07845: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 13:12 github-actions[bot]

:x: Gradle check result for 8be670dfc9e8f166858bde43f6b8ef0c17e2ea74: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 15:12 github-actions[bot]

:x: Gradle check result for 1437bc289b1380a354a9b22ff119dd28638d493b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 17:12 github-actions[bot]

:x: Gradle check result for 681e085f63070a21d645b7d62025ca0cc0b3a724: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 17:12 github-actions[bot]

:x: Gradle check result for 6d236017943a3c6a60def59f1801a2e040fd1ab4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 17:12 github-actions[bot]

:x: Gradle check result for 04129bd808ceb40b9e0208319d43dba83ac63ee4: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 20 '25 19:12 github-actions[bot]

Hello maintainers, This PR is blocked at the Gradle check step because the Jenkins workflow needs explicit approval for external execution (the jq parse errors are due to the job not having been started yet). Could someone with maintainer permissions please approve and run the workflow? Once that has been done, I will monitor the job and address any real test failures.

Thank you.

Gautam-aman avatar Dec 21 '25 02:12 Gautam-aman