OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[Snapshot Interop][Bug Fix] Ensure lock file gets deleted if snapshot …

Open harishbhakuni opened this issue 1 year ago • 8 comments

...shard md upload fails during snapshot creation.

Description

  • While working on another change, realized that in the following code, everything below snapshotRemoteStoreIndexShard call was not getting executed: https://github.com/opensearch-project/OpenSearch/blob/829215c4a4660ee65026c40a0a01ebe100b3ee2c/server/src/main/java/org/opensearch/snapshots/SnapshotShardsService.java#L427-L462

  • that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3.

  • as part of this PR, making sure those lines of code get executed even if shard md upload fails and adding IT to cover that scenario.

  • Also, Adding a log line to print the time taken in flush operation during snapshot creation and another minor change to filter lock files based on lock suffix while fetching lock files from lock directory.

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [x] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [x] Commits are signed per the DCO using --signoff
  • [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

harishbhakuni avatar Jan 25 '24 01:01 harishbhakuni

Compatibility status:

Checks if related components are compatible with change 197ccfd

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/k-nn.git]

github-actions[bot] avatar Jan 25 '24 01:01 github-actions[bot]

:x: Gradle check result for b8ec898c3292a5913b8f8abbb0b10474cd13c7ce: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jan 25 '24 02:01 github-actions[bot]

Test org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} is failing which is flaky and being tracked as part of #11875 .

harishbhakuni avatar Jan 25 '24 04:01 harishbhakuni

:grey_exclamation: Gradle check result for 197ccfd5814d1ff5bcdc3b590eb72bbd79741663: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Feb 02 '24 07:02 github-actions[bot]

Codecov Report

Attention: Patch coverage is 53.57143% with 26 lines in your changes missing coverage. Please review.

Project coverage is 71.42%. Comparing base (b15cb0c) to head (197ccfd). Report is 361 commits behind head on main.

:exclamation: Current head 197ccfd differs from pull request most recent head 57799ac

Please upload reports for the commit 57799ac to get more accurate results.

Files Patch % Lines
...rg/opensearch/snapshots/SnapshotShardsService.java 53.84% 18 Missing :warning:
...earch/index/store/RemoteSegmentStoreDirectory.java 50.00% 4 Missing :warning:
...re/lockmanager/RemoteStoreMetadataLockManager.java 55.55% 2 Missing and 2 partials :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##               main   #12016     +/-   ##
===========================================
  Coverage     71.42%   71.42%             
+ Complexity    59978    59511    -467     
===========================================
  Files          4985     4925     -60     
  Lines        282275   279574   -2701     
  Branches      40946    40647    -299     
===========================================
- Hits         201603   199696   -1907     
+ Misses        63999    63287    -712     
+ Partials      16673    16591     -82     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 02 '24 07:02 codecov[bot]

This PR is stalled because it has been open for 30 days with no activity.

This PR is stalled because it has been open for 30 days with no activity.

This PR is stalled because it has been open for 30 days with no activity.

:x: Gradle check result for 4cdd33ece679479bd71976855026dd37a9da60e4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 22:06 github-actions[bot]

org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction is flaky: https://github.com/opensearch-project/OpenSearch/issues/12197

harishbhakuni avatar Jun 04 '24 22:06 harishbhakuni

:x: Gradle check result for 203e404f6a08dd55999a6cb98f53f14409a71a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 04 '24 23:06 github-actions[bot]

:x: Gradle check result for 184132ebb40bf8a5cb9a65b722146e7cb8902c2e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 05 '24 20:06 github-actions[bot]

:x: Gradle check result for 1ebbc06d5a36384ebe81d630b36f107126ecdbcc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 04:06 github-actions[bot]

:x: Gradle check result for 57799ac66dfb6cc1c9b38614c0ed689b519ad42e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Jun 06 '24 19:06 github-actions[bot]

"that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3." - How is that happening ? https://github.com/opensearch-project/OpenSearch/blob/829215c4a4660ee65026c40a0a01ebe100b3ee2c/server/src/main/java/org/opensearch/snapshots/SnapshotShardsService.java#L438-L462

The above block is doing exactly that

gbbafna avatar Jun 11 '24 09:06 gbbafna

The above block is doing exactly that

not really, the problem was that caller method snapshotRemoteStoreIndexShard was catching the exception and passing it with listener.onFailure() method. so the catch block where we were releasing lock will never get executed. just moved that code to onFailure() method. so that it will get executed.

harishbhakuni avatar Jun 11 '24 20:06 harishbhakuni