OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

fix bug of warm index : FullFileCachedIndexInput was repeatedly closed

Open liuyonghengheng opened this issue 1 month ago • 11 comments

Description

check closed value in function FullFileCachedIndexInput.IndexInputHolder.run(),prevent FileCachedIndexInput closed repeatedly and duplicate deref the reference count of cache entry

Related Issues

Resolves #20054 https://github.com/opensearch-project/OpenSearch/issues/20054

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed an issue in warm index caching where cached index inputs could be incorrectly marked closed, improving lifecycle handling and ensuring proper resource cleanup for cached index clones.
  • Tests

    • Added end-to-end lifecycle/GC test coverage to validate clone/slice closing and cleanup behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

liuyonghengheng avatar Nov 19 '25 10:11 liuyonghengheng

:x: Gradle check result for dacd3a5a30eeb733a8a41173c0a93607199c756c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Nov 19 '25 11:11 github-actions[bot]

:grey_exclamation: Gradle check result for c3fc0f2d6e7d32f2f9ab563111125366e7a3097f: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Nov 20 '25 09:11 github-actions[bot]

Codecov Report

:x: Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 73.26%. Comparing base (1aed472) to head (033d8f2). :warning: Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...x/store/remote/filecache/FileCachedIndexInput.java 66.66% 0 Missing and 1 partial :warning:
...ore/remote/filecache/FullFileCachedIndexInput.java 90.90% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20055      +/-   ##
============================================
- Coverage     73.29%   73.26%   -0.03%     
+ Complexity    71780    71743      -37     
============================================
  Files          5795     5795              
  Lines        328297   328302       +5     
  Branches      47282    47283       +1     
============================================
- Hits         240612   240523      -89     
- Misses        68368    68431      +63     
- Partials      19317    19348      +31     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 20 '25 09:11 codecov[bot]

:x: Gradle check result for ef574256d5cd6562ba7b0607c8b26f92bf6641fc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Nov 26 '25 04:11 github-actions[bot]

:white_check_mark: Gradle check result for 844e7beabe5ef4646a7423184435d57e0f8a8fea: SUCCESS

github-actions[bot] avatar Nov 26 '25 08:11 github-actions[bot]

Walkthrough

Converted the closed flag from a volatile boolean to an AtomicBoolean across FileCachedIndexInput and FullFileCachedIndexInput, updated close/cleaner logic and IndexInputHolder to use the AtomicBoolean, added indexInputHolderRun() for explicit cleaner invocation, and added a test exercising clone/close/cleaner interactions. (50 words)

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Added a "Fixed" entry: "Fix bug of warm index: FullFileCachedIndexInput was closed error".
Closed flag atomicity
server/src/main/java/org/opensearch/index/store/remote/filecache/FileCachedIndexInput.java
Replaced protected volatile boolean closed with protected final AtomicBoolean closed = new AtomicBoolean(false) and updated reads/writes to atomic operations; added import for AtomicBoolean.
Lifecycle & cleaner coordination
server/src/main/java/org/opensearch/index/store/remote/filecache/FullFileCachedIndexInput.java
Propagated AtomicBoolean into IndexInputHolder constructor, switched close() and cleaner Runnable to use closed.get()/closed.set(...), added guards to avoid duplicate closures, and added public void indexInputHolderRun() to trigger the cleaner-held runnable.
Tests
server/src/test/java/org/opensearch/index/store/remote/filecache/FullFileCachedIndexInputTests.java
Added public void testClose() throws IOException to validate clone lifecycle, reference counting progression, explicit close behavior, and GC-like cleaner invocations via indexInputHolderRun().

Sequence Diagram

sequenceDiagram
    participant Test
    participant FullInput as FullFileCachedIndexInput
    participant Holder as IndexInputHolder
    participant Cleaner as CleanerRunnable
    participant Atomic as closed (AtomicBoolean)

    Note over Test,FullInput: Clone/close lifecycle with cleaner coordination
    Test->>FullInput: clone() x3
    FullInput->>Atomic: closed.get() -> false
    FullInput->>FullInput: increment refCount for clones

    Test->>FullInput: clone1.close()
    FullInput->>Atomic: closed.get() -> false
    FullInput->>Atomic: closed.set(true)
    FullInput->>Holder: schedule/hold cleanup
    Holder->>Cleaner: run()
    Cleaner->>Atomic: closed.get()
    alt not already cleaned
        Cleaner->>FullInput: deRef() (decrement refCount)
        Cleaner->>Atomic: closed.set(true)
    end

    Test->>FullInput: indexInputHolderRun() on clone1
    FullInput->>Holder: run() (explicit)
    Holder->>Atomic: closed.get() -> true
    Holder->>Cleaner: no-op if already cleaned

    Test->>FullInput: close remaining clones -> refCount -> 0
    Cleaner->>FullInput: final cleanup when refs reach 0

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify AtomicBoolean is used consistently for visibility and idempotence across all close paths.
  • Inspect Cleaner Runnable guard logic and interactions with indexInputHolderRun().
  • Check reference counting correctness when clones are closed vs cleaned by the cleaner.
  • Confirm the new indexInputHolderRun() method is intended and safe to expose (test usage).

Poem

🐇 A flag once fickle, now atomic and true,
Clones hop in, cleaners tidy the queue.
No double-close tumble, no racing fright,
Threads settle down in the soft moonlight,
Rabbit hums — everything’s snug tonight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main fix: preventing repeated closure of FullFileCachedIndexInput by using AtomicBoolean for thread-safe state management.
Description check ✅ Passed The description covers the core fix and references the related issue (#20054), but lacks detail about implementation approach, test coverage, and reasoning.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c0f95585441297e09e72542c77f348f2b6ff436 and 033d8f2d7f674e379c964e9b97202d424e20d6e2.

📒 Files selected for processing (1)
  • CHANGELOG.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: Analyze (java)
  • GitHub Check: Mend Security Check

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Nov 27 '25 00:11 coderabbitai[bot]

:white_check_mark: Gradle check result for 7014a56112cabf88d2fd18b753b5b14b90ab49e0: SUCCESS

github-actions[bot] avatar Nov 27 '25 02:11 github-actions[bot]

@gbbafna can you please review and merge this if it looks good

rayshrey avatar Dec 01 '25 12:12 rayshrey

:x: Gradle check result for 0c0f95585441297e09e72542c77f348f2b6ff436: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 08 '25 19:12 github-actions[bot]

Changes LGTM . Thanks @liuyonghengheng for this and @rayshrey for the review. Will merge once the build passes .

gbbafna avatar Dec 09 '25 13:12 gbbafna

:white_check_mark: Gradle check result for 0c0f95585441297e09e72542c77f348f2b6ff436: SUCCESS

github-actions[bot] avatar Dec 09 '25 14:12 github-actions[bot]

:x: Gradle check result for 033d8f2d7f674e379c964e9b97202d424e20d6e2: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 11 '25 10:12 github-actions[bot]

:x: Gradle check result for 033d8f2d7f674e379c964e9b97202d424e20d6e2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Dec 11 '25 12:12 github-actions[bot]

:white_check_mark: Gradle check result for 033d8f2d7f674e379c964e9b97202d424e20d6e2: SUCCESS

github-actions[bot] avatar Dec 11 '25 14:12 github-actions[bot]