[Snapshot Interop][Bug Fix] Ensure lock file gets deleted if snapshot …
...shard md upload fails during snapshot creation.
Description
-
While working on another change, realized that in the following code, everything below snapshotRemoteStoreIndexShard call was not getting executed: https://github.com/opensearch-project/OpenSearch/blob/829215c4a4660ee65026c40a0a01ebe100b3ee2c/server/src/main/java/org/opensearch/snapshots/SnapshotShardsService.java#L427-L462
-
that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3.
-
as part of this PR, making sure those lines of code get executed even if shard md upload fails and adding IT to cover that scenario.
-
Also, Adding a log line to print the time taken in flush operation during snapshot creation and another minor change to filter lock files based on lock suffix while fetching lock files from lock directory.
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [x] New functionality has been documented.
- [ ] New functionality has javadoc added
- [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [ ] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Compatibility status:
Checks if related components are compatible with change 197ccfd
Incompatible components
Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]
Skipped components
Compatible components
Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/k-nn.git]
:x: Gradle check result for b8ec898c3292a5913b8f8abbb0b10474cd13c7ce: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Test org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} is failing which is flaky and being tracked as part of #11875 .
:grey_exclamation: Gradle check result for 197ccfd5814d1ff5bcdc3b590eb72bbd79741663: UNSTABLE
- TEST FAILURES:
1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
Attention: Patch coverage is 53.57143% with 26 lines in your changes missing coverage. Please review.
Project coverage is 71.42%. Comparing base (
b15cb0c) to head (197ccfd). Report is 361 commits behind head on main.
:exclamation: Current head 197ccfd differs from pull request most recent head 57799ac
Please upload reports for the commit 57799ac to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## main #12016 +/- ##
===========================================
Coverage 71.42% 71.42%
+ Complexity 59978 59511 -467
===========================================
Files 4985 4925 -60
Lines 282275 279574 -2701
Branches 40946 40647 -299
===========================================
- Hits 201603 199696 -1907
+ Misses 63999 63287 -712
+ Partials 16673 16591 -82
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This PR is stalled because it has been open for 30 days with no activity.
This PR is stalled because it has been open for 30 days with no activity.
This PR is stalled because it has been open for 30 days with no activity.
:x: Gradle check result for 4cdd33ece679479bd71976855026dd37a9da60e4: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction is flaky: https://github.com/opensearch-project/OpenSearch/issues/12197
:x: Gradle check result for 203e404f6a08dd55999a6cb98f53f14409a71a92: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
https://github.com/opensearch-project/OpenSearch/issues/13600 org.opensearch.indices.IndicesRequestCacheIT.testDeleteAndCreateSameIndexShardOnSameNode {p0={"search.concurrent_segment_search.enabled":"false"}}, org.opensearch.indices.IndicesRequestCacheIT.testDeleteAndCreateSameIndexShardOnSameNode {p0={"opensearch.experimental.feature.pluggable.caching.enabled":"true"}}, org.opensearch.indices.IndicesRequestCacheIT.testDeleteAndCreateSameIndexShardOnSameNode {p0={"search.concurrent_segment_search.enabled":"true"}}, org.opensearch.indices.IndicesRequestCacheIT.testDeleteAndCreateSameIndexShardOnSameNode {p0={"opensearch.experimental.feature.pluggable.caching.enabled":"false"}}
https://github.com/opensearch-project/OpenSearch/issues/13939 org.opensearch.remotemigration.RemoteMigrationIndexMetadataUpdateIT.testRemoteIndexPathFileExistsAfterMigration:
:x: Gradle check result for 184132ebb40bf8a5cb9a65b722146e7cb8902c2e: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
org.opensearch.indices.IndicesRequestCacheIT.classMethod: test suite timed out.
All the below mentioned tests are flaky: org.opensearch.action.admin.indices.create.RemoteCloneIndexIT.testCreateCloneIndexFailure: https://github.com/opensearch-project/OpenSearch/issues/13476 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock: https://github.com/opensearch-project/OpenSearch/issues/10006 org.opensearch.indices.IndicesRequestCacheIT.testStaleKeysCleanupWithMultipleIndices {p0={"opensearch.experimental.feature.pluggable.caching.enabled":"false"}} org.opensearch.indices.IndicesRequestCacheIT.classMethod: https://github.com/opensearch-project/OpenSearch/issues/13437
:x: Gradle check result for 1ebbc06d5a36384ebe81d630b36f107126ecdbcc: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
org.opensearch.action.admin.indices.create.RemoteCloneIndexIT.testCreateCloneIndexFailure: https://github.com/opensearch-project/OpenSearch/issues/13476
org.opensearch.indices.IndicesRequestCacheIT.testCacheCleanupWithDefaultSettings {p0={"opensearch.experimental.feature.pluggable.caching.enabled":"true"}}
org.opensearch.indices.IndicesRequestCacheIT.classMethod: https://github.com/opensearch-project/OpenSearch/issues/13711
:x: Gradle check result for 57799ac66dfb6cc1c9b38614c0ed689b519ad42e: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
"that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3." - How is that happening ? https://github.com/opensearch-project/OpenSearch/blob/829215c4a4660ee65026c40a0a01ebe100b3ee2c/server/src/main/java/org/opensearch/snapshots/SnapshotShardsService.java#L438-L462
The above block is doing exactly that
The above block is doing exactly that
not really, the problem was that caller method snapshotRemoteStoreIndexShard was catching the exception and passing it with listener.onFailure() method. so the catch block where we were releasing lock will never get executed. just moved that code to onFailure() method. so that it will get executed.