Split the remote global metadata file to metadata attribute files
Description
The global metadata of a cluster state is now uploaded as a separate file per metadata attribute: coordination metadata, settings, templates, and each custom metadata attribute. The remote global state directory looks like this:
base folder/
|
|--> index/
| | --> index_UUID/
| | --> metadata__<inverted_index_metadata_version>__<inverted_codec_version>__<timestamp>.dat
| | --> metadata__<inverted_index_metadata_version>__<inverted_codec_version>__<timestamp>.dat
|
|--> global-metadata/
| | --> coordination__<inverted_metadata_version>__<inverted_codec_version>__<timestamp>.dat
| | --> settings__<inverted_metadata_version>__<inverted_codec_version>__<timestamp>.dat
| | --> templates__<inverted_metadata_version>__<inverted_codec_version>__<timestamp>.dat
| | --> custom__<type>__<inverted_metadata_version>__<inverted_codec_version>__<timestamp>.dat
|
|
|--> manifest/
| | --> manifest__<inverted_term>__<inverted_version>__<inverted_codec_version>__<timestamp>
| | --> manifest__<inverted_term>__<inverted_version>__<inverted_codec_version>__<timestamp>
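The naming scheme in the layout above can be sketched in Java. This is only an illustration, not the code in this PR: the class `RemoteMetadataFileNames` and its methods are hypothetical, and it assumes that an "inverted" value means `Long.MAX_VALUE - value`, zero-padded to 19 digits, so that newer versions sort first in a lexicographic listing.

```java
final class RemoteMetadataFileNames {
    private static final String DELIMITER = "__";

    // Assumption: "inverted" means Long.MAX_VALUE - value, left-padded to 19 digits,
    // so a larger version yields a lexicographically smaller string.
    static String invertLong(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative: " + value);
        }
        return String.format("%019d", Long.MAX_VALUE - value);
    }

    // Builds e.g. settings__<inverted_metadata_version>__<inverted_codec_version>__<timestamp>.dat
    static String attributeFileName(String attribute, long metadataVersion, int codecVersion, long timestampMillis) {
        return String.join(
            DELIMITER,
            attribute,
            invertLong(metadataVersion),
            invertLong(codecVersion),
            Long.toString(timestampMillis)
        ) + ".dat";
    }

    // Custom metadata carries its type after the "custom" prefix.
    static String customFileName(String type, long metadataVersion, int codecVersion, long timestampMillis) {
        return attributeFileName("custom" + DELIMITER + type, metadataVersion, codecVersion, timestampMillis);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(attributeFileName("coordination", 42L, 1, now));
        System.out.println(customFileName("repositories", 42L, 1, now));
    }
}
```

With this kind of inversion, listing the `global-metadata/` prefix in ascending order would return the file for the highest metadata version first.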
Splitting the global metadata into multiple files has improved incremental metadata upload time to S3 by roughly 50-70% for the global metadata attributes (coordination, settings, templates), and full metadata upload time by up to 5%, because the global metadata attribute files and index metadata files are uploaded in parallel. These numbers come from a microbenchmark run on main (shiv0408/OpenSearch@fe5fad8c4d4684182f7286ed8545819b13b387dd) and on the PR branch (shiv0408/OpenSearch@88ab1aca1743718c94c2a7753b999dd9e78a36e3).
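One way to picture why incremental uploads get cheaper: with one file per attribute, only the attributes whose content changed need to be written again, and those writes can run concurrently. The sketch below is a simplified model under that assumption. `IncrementalAttributeUpload`, `uploadChanged`, and the `uploader` callback are hypothetical names standing in for the real blob store write path, and attributes are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;

final class IncrementalAttributeUpload {

    // Uploads only the attributes whose serialized content differs from the previous
    // state, running the uploads in parallel. Returns the names of the uploaded files.
    static List<String> uploadChanged(
        Map<String, String> previous,
        Map<String, String> current,
        BiFunction<String, String, String> uploader,
        ExecutorService executor
    ) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            boolean changed = !Objects.equals(previous.get(entry.getKey()), entry.getValue());
            if (changed) {
                futures.add(
                    CompletableFuture.supplyAsync(() -> uploader.apply(entry.getKey(), entry.getValue()), executor)
                );
            }
        }
        // Wait for every upload to finish before the manifest would be written.
        List<String> uploaded = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            uploaded.add(future.join());
        }
        return uploaded;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            Map<String, String> previous = Map.of("coordination", "term=1", "settings", "a=1", "templates", "t1");
            Map<String, String> current = Map.of("coordination", "term=1", "settings", "a=2", "templates", "t1");
            // Stand-in uploader: pretends to write the blob and returns a file name.
            List<String> uploaded = uploadChanged(previous, current, (name, content) -> name + ".dat", executor);
            System.out.println("Uploaded: " + uploaded);
        } finally {
            executor.shutdown();
        }
    }
}
```

In this model a settings-only change results in a single small upload, which is consistent with the direction of the incremental results, although the actual implementation may differ in how it detects changes.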
Following are the benchmark results:
Benchmark on main
Benchmark (indicesAliasesTemplates) Mode Cnt Score Error Units
RemoteClusterStateBenchmark.measureFullMetadataUpload 1000| 100| 100| avgt 30 60.832 ± 0.642 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 10000| 1000| 1000| avgt 30 615.146 ± 1.765 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 20000| 2000| 2000| avgt 30 1227.299 ± 4.178 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 50000| 5000| 5000| avgt 30 3031.392 ± 19.117 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 1000| 100| 100| avgt 30 2.440 ± 0.014 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 10000| 1000| 1000| avgt 30 25.849 ± 0.105 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 20000| 2000| 2000| avgt 30 52.243 ± 0.476 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 50000| 5000| 5000| avgt 30 139.867 ± 1.062 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 1000| 100| 100| avgt 30 32.722 ± 0.541 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 10000| 1000| 1000| avgt 30 311.668 ± 2.694 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 20000| 2000| 2000| avgt 30 622.160 ± 3.091 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 50000| 5000| 5000| avgt 30 1578.523 ± 2.661 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 1000| 100| 100| avgt 30 2.470 ± 0.007 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 10000| 1000| 1000| avgt 30 26.391 ± 0.250 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 20000| 2000| 2000| avgt 30 53.320 ± 0.876 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 50000| 5000| 5000| avgt 30 144.819 ± 1.237 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 1000| 100| 100| avgt 30 2.814 ± 0.024 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 10000| 1000| 1000| avgt 30 29.080 ± 0.160 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 20000| 2000| 2000| avgt 30 60.032 ± 0.397 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 50000| 5000| 5000| avgt 30 155.970 ± 2.390 ms/op
Benchmark after splitting the global metadata
Benchmark (indicesAliasesTemplates) Mode Cnt Score Error Units
RemoteClusterStateBenchmark.measureFullMetadataUpload 1000| 100| 100| avgt 30 59.594 ± 0.323 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 10000| 1000| 1000| avgt 30 599.334 ± 2.941 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 20000| 2000| 2000| avgt 30 1198.450 ± 5.466 ms/op
RemoteClusterStateBenchmark.measureFullMetadataUpload 50000| 5000| 5000| avgt 30 2990.730 ± 15.318 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 1000| 100| 100| avgt 30 0.800 ± 0.019 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 10000| 1000| 1000| avgt 30 8.483 ± 0.059 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 20000| 2000| 2000| avgt 30 17.231 ± 0.271 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Coordination 50000| 5000| 5000| avgt 30 65.734 ± 1.375 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 1000| 100| 100| avgt 30 31.890 ± 0.295 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 10000| 1000| 1000| avgt 30 304.154 ± 0.994 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 20000| 2000| 2000| avgt 30 606.649 ± 1.042 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_IndexMetadata 50000| 5000| 5000| avgt 30 1530.920 ± 14.235 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 1000| 100| 100| avgt 30 0.832 ± 0.008 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 10000| 1000| 1000| avgt 30 8.253 ± 0.226 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 20000| 2000| 2000| avgt 30 20.208 ± 0.280 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Settings 50000| 5000| 5000| avgt 30 65.269 ± 0.439 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 1000| 100| 100| avgt 30 1.166 ± 0.005 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 10000| 1000| 1000| avgt 30 12.657 ± 0.245 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 20000| 2000| 2000| avgt 30 26.883 ± 0.419 ms/op
RemoteClusterStateBenchmark.measureIncrementalClusterStateUpdate_Templates 50000| 5000| 5000| avgt 30 88.283 ± 0.754 ms/op
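The relative improvement for any row can be derived from the two tables as (before - after) / before. The following sketch applies that formula to a few scores copied from the rows above; the class and method names are illustrative only.

```java
final class UploadImprovement {

    // Percentage reduction in average time per operation.
    static double improvementPercent(double beforeMs, double afterMs) {
        if (beforeMs <= 0) {
            throw new IllegalArgumentException("beforeMs must be positive");
        }
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ms/op) taken from the 1000 | 100 | 100 rows of both tables.
        System.out.printf("Full upload:        %.1f%%%n", improvementPercent(60.832, 59.594));
        System.out.printf("Coordination only:  %.1f%%%n", improvementPercent(2.440, 0.800));
        System.out.printf("Settings only:      %.1f%%%n", improvementPercent(2.470, 0.832));
        System.out.printf("Templates only:     %.1f%%%n", improvementPercent(2.814, 1.166));
        System.out.printf("Index metadata:     %.1f%%%n", improvementPercent(32.722, 31.890));
    }
}
```

Running the same calculation on the larger parameter sets shows how the gain changes as the number of indices, aliases, and templates grows.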
Related Issues
Resolves #12468 Resolves #10645
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [x] New functionality has been documented.
- [x] New functionality has javadoc added
- [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- [x] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [x] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Compatibility status:
Checks if related components are compatible with change 8244c6d
:x: Gradle check result for 6bf7bc963b0dd9b7a6153b675ae46801430dd1b3: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for d0875f97cd77fb5eaeec4254471a0f08743f3fc4: FAILURE
:x: Gradle check result for f6a243119d33cdd69c1701666897ce749ff4f29d: FAILURE
:x: Gradle check result for f3e853b23411842a71343acbdcc21ba47df1fa48: FAILURE
:x: Gradle check result for 5d6a0ad25b77c2ba69e8480559b428893df671b6: FAILURE
Looks good at a high level. Can you move it out of draft?
:x: Gradle check result for 279dbbe24de4283e7a93565a7e2f6483f90d6c88: FAILURE
@shiv0408 It seems some context is missing from this pull request: why break this apart? Can you articulate which use cases this change improves?
Tagging @sachinpkale for review
:x: Gradle check result for fc270d16a9dc770ebe747a4d8da5ce248c5911a5: FAILURE
:x: Gradle check result for adb4cf2d8bc1ecf11a3a565595b96868c8abc849: FAILURE
:x: Gradle check result for 0b3873655e5a18b20a98605a4473d7c0a3b02365: FAILURE
:x: Gradle check result for c86c0f1e7ec60de6ceb388b5937e5c7a32918bc9:
:x: Gradle check result for 3ed92e5f458585017a77235cd60c461aacb8345b: FAILURE
:x: Gradle check result for cd5c9a5b256119fe2dbe71c6832343e20d3b9ee6: FAILURE
:x: Gradle check result for cd5c9a5b256119fe2dbe71c6832343e20d3b9ee6: FAILURE
:x: Gradle check result for 2bd97d7fa1739252bd5cfc7c6906f41eef71f20c: FAILURE
:x: Gradle check result for 2bd97d7fa1739252bd5cfc7c6906f41eef71f20c: FAILURE
:white_check_mark: Gradle check result for 2bd97d7fa1739252bd5cfc7c6906f41eef71f20c: SUCCESS
:x: Gradle check result for 8244c6db0914b2d596c3b68d4e051f80fbebcfb5: FAILURE
:x: Gradle check result for 425cf2097a52d292bff2f532ea183030e3e737ce: FAILURE
:grey_exclamation: Gradle check result for 8efc1edd9bc96ad279d171e6e7dbbe637395fd34: UNSTABLE
- TEST FAILURES:
1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
:white_check_mark: Gradle check result for 6c637c038adcec6acbcd1c8d2d0e866abbae3118: SUCCESS
Codecov Report
Attention: Patch coverage is 78.85835%, with 100 lines in your changes missing coverage. Please review.
Project coverage is 71.55%. Comparing base (b15cb0c) to head (4f8a64e). Report is 285 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #12190 +/- ##
============================================
+ Coverage 71.42% 71.55% +0.13%
- Complexity 59978 61237 +1259
============================================
Files 4985 5060 +75
Lines 282275 287854 +5579
Branches 40946 41689 +743
============================================
+ Hits 201603 205965 +4362
- Misses 63999 64928 +929
- Partials 16673 16961 +288
:white_check_mark: Gradle check result for 494aacc7c73671d76e298284fcbcee1a3072636f: SUCCESS
:x: Gradle check result for 928b65036d200c15865f33994a88349875b7f6f0: FAILURE
:white_check_mark: Gradle check result for 2ebfc6de25614dc30604e469ed3f3d37df85ce9b: SUCCESS
[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12 13]
@shiv0408 Looking forward to seeing this improvement merged. Please add the updated release target version.
:x: Gradle check result for fb0b6aaaed7a53f4c6811db6f78c0f0e7f42e644: FAILURE