[BUGFIX] Fix missing fields to resolve Strict Dynamic Mapping issue when saving task result
Related Issues
Resolves #16060
Description
[The result of the execution]
| Before | After |
|---|---|
| `StrictDynamicMappingException` occurred | Not occurred |
- When applying and then deleting the auto-follow rule in the CCR plugin (cross-cluster-replication), the `.tasks` index is properly updated without a `StrictDynamicMappingException`.
[Background]
- OpenSearch allows a Follower Cluster to replicate indexes from a Leader Cluster through the OpenSearch CCR plugin.
- However, when cancelling the auto-follow rule in the CCR plugin, a `StrictDynamicMappingException` was previously encountered:
```
[2024-09-28T12:01:16,819][WARN ][o.o.r.t.a.AutoFollowTask ] [5460424aac92][my-connection-alias] Error storing result StrictDynamicMappingException[mapping set to strict, dynamic introduction of [cancellation_time_millis] within [task] is not allowed]
	at org.opensearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:876)
	at org.opensearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:722)
	at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:461)
	at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
	at org.opensearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:524)
	at org.opensearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:545)
	at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:447)
	at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
	at org.opensearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:138)
	at org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:93)
	at org.opensearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:256)
	at org.opensearch.index.shard.IndexShard.prepareIndex(IndexShard.java:1178)
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1135)
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:1052)
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:625)
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:471)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:535)
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:416)
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:125)
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:275)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
```
- When a task result is stored, OpenSearch's `TaskResultsService.storeResult(TaskResult taskResult, ActionListener<Void> listener)` is called.
- This method extracts a `TaskInfo` object from the `TaskResult` parameter. `TaskInfo` defines the necessary fields for a task, which are mapped to the `.tasks` index via `task-index-mapping.json`.
- So, when a task is cancelled, the `.tasks` index should be updated.
- Also, the `.tasks` index has strict dynamic mapping enabled.
- However, `cancellation_time_millis` and `resource_stats` were present in `TaskInfo` but missing from the mapping JSON.
- As a result, this was causing a `StrictDynamicMappingException`.
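The strict-mapping setup described above can be inspected directly. A minimal sketch (the live `curl` call is shown as a comment; for illustration, a trimmed sample response from my local setup is parsed instead, so the host, port, and `_meta` version are assumptions):

```shell
# On a running cluster you would fetch the real mapping with:
#   curl -s 'http://localhost:9200/.tasks/_mapping?pretty'
# A trimmed sample response is parsed here instead.
mapping='{".tasks":{"mappings":{"dynamic":"strict","_meta":{"version":4}}}}'

# With "dynamic": "strict", any TaskInfo field absent from
# task-index-mapping.json (here cancellation_time_millis and
# resource_stats) is rejected at index time.
dynamic=$(printf '%s' "$mapping" | python3 -c 'import sys, json; print(json.load(sys.stdin)[".tasks"]["mappings"]["dynamic"])')
echo "dynamic mapping: $dynamic"
```

Seeing `strict` here confirms why a document carrying an unmapped field cannot be indexed.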
[PR contents]
- Added the `cancellation_time_millis` and `resource_stats` fields to `task-index-mapping.json`:

```json
...
"cancellation_time_millis": {
  "type": "long"
},
"resource_stats": {
  "type": "object",
  "enabled": false
}
...
```
- Updated the mapping `version` in both `task-index-mapping.json` and `TaskResultsService` from 4 to 5. These versions must match for the mapping to apply correctly. When updating a mapping, increment the version so future operations can detect and apply changes.
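The version handshake itself is simple. A sketch of the comparison (the variable names below are illustrative, not the actual Java code in `TaskResultsService`):

```shell
# The service compares the version embedded in task-index-mapping.json
# against the version recorded in the live .tasks index metadata; when
# the JSON version is newer, the mapping gets (re)applied.
json_version=5    # version in task-index-mapping.json after this PR
index_version=4   # version currently stored in the .tasks index _meta

if [ "$json_version" -gt "$index_version" ]; then
  action="update mapping"
else
  action="skip"
fi
echo "$action"
```

This is why bumping only the JSON file (or only the service constant) would silently leave old clusters on the stale mapping.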
[Test]
I conducted two types of tests:
- Writing a test in OpenSearch
- Running local end-to-end (e2e) tests
Writing a test in OpenSearch
`TaskInfo` has two constructors: one that includes `cancellation_time_millis` (a) and one that does not (b).
In the existing `TasksIT`, the `resultsService.storeResult()` test did not include the appropriate `resource_stats` and `cancellation_time_millis` in the `TaskInfo`.
So, I added `testStoreResultWithAllFields`, using constructor (b) to create a test that only passes with the updated `task-index-mapping.json`.
Local End-to-End (e2e) Testing
- To perform the e2e test, I needed a setup with two custom OpenSearch 3.0.0 clusters and the CCR plugin 3.0.0 installed.
- I set up two clusters with the CCR plugin installed. I cloned the OpenSearch GitHub repository and modified `task-index-mapping.json`.
- After modifying OpenSearch, I ran `assemble`, generating the `opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz` file.
- I then cloned and assembled the OpenSearch CCR GitHub repository, which generated `opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip`.
- Using a `Dockerfile`, I built a Docker image. In the same directory as the `Dockerfile`, I included the following files: `opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz`, `opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip`, `opensearch.yml`, `opensearch-docker-entrypoint.sh`, `opensearch-onetime-setup.sh`.
- I used Docker Compose to create two clusters.
- Finally, I installed the CCR plugin on each cluster:

```shell
docker exec -it [container] /bin/bash
/usr/share/opensearch/bin/opensearch-plugin install file:/usr/share/opensearch/opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip
```
- I registered an auto-follow rule and then canceled it to verify the behavior. Set up a cross-cluster connection (get-started-with-auto-follow):

```shell
curl -XPUT -k -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-connection-alias": {
          "seeds": ["localhost:9300"]
        }
      }
    }
  }
}'

curl -XPUT -k -H 'Content-Type: application/json' 'http://localhost:9201/leader-01?pretty'

curl -XPOST -k -H 'Content-Type: application/json' -u 'admin:Yhj99!009' 'https://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
  "leader_alias": "my-connection-alias",
  "name": "my-replication-rule",
  "pattern": "movies*",
  "use_roles": {
    "leader_cluster_role": "all_access",
    "follower_cluster_role": "all_access"
  }
}'

curl -XDELETE -k -H 'Content-Type: application/json' 'http://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
  "leader_alias": "my-connection-alias",
  "name": "my-replication-rule"
}'
```
- I checked the logs to confirm whether a `StrictDynamicMappingException` occurred. After modifying `task-index-mapping.json` correctly, I observed that instead of the previous `StrictDynamicMappingException`, the task status was successfully updated:
```
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,764Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Going to mark AutoFollowTask:97 task as completed", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg" }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,766Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Completed the task with id:97", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg" }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,770Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Successfully persisted task status", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg" }
```
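A quick way to scan the follower logs for regressions (the `docker logs` invocation is from my setup and is shown as a comment; for illustration, a sample "after" log line from the run above is checked the same way):

```shell
# Against a running container you would use:
#   docker logs <follower-container> 2>&1 | grep -c 'StrictDynamicMappingException'
# Here the same check runs over a sample log line instead.
log='{"level": "INFO", "message": "[my-connection-alias] Successfully persisted task status"}'

if printf '%s\n' "$log" | grep -q 'StrictDynamicMappingException'; then
  result="exception still present"
else
  result="clean"
fi
echo "$result"
```

A count of zero matches (or `clean` here) is the expected outcome with the fixed mapping.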
Below are the files I used for testing, along with their sources:
| Name | File | Source |
|---|---|---|
| Dockerfile | link | "opensearch-build" al2023.dockerfile |
| docker-compose | link | "opensearch" distribution docker-compose |
| log4j2.properties | link | "opensearch-build" log4j2.properties |
| opensearch-docker-entrypoint-default.x.sh | link | "opensearch-build" entrypoint |
| opensearch-onetime-setup.sh | link | "opensearch-build" onetime |
| opensearch.yml | link | "opensearch" distribution yml |
| opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz | link | Generated by cloning the OpenSearch repository and assembling the project. |
| opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip | link | Generated by cloning the OpenSearch CCR Plugin repository and assembling the project. |
My test environment was macOS on an Apple M2. If you are using a different operating system, replace the `.tar.gz` file with the version that matches your system.
If needed, I am happy to provide the Docker image used in my test. Please feel free to request it if required.
Check List
- [X] Functionality includes testing.
- [ ] ~API changes companion pull request created, if applicable.~
- [ ] ~Public documentation issue/PR created, if applicable.~
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for 839a1bcda6d74522b733f7024901117a38cfa582: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 7946ad998323465e9f7292cf55212555e7178048: FAILURE
:x: Gradle check result for c15c90aeb97c235d298c7b395dfc51c9c580d4e9: FAILURE
:x: Gradle check result for 0cad64fc6c3792c492796a7c0f4a4778687b526b: FAILURE
:x: Gradle check result for 5b7688964bd13f46444f478310f6b7d58e225d54: FAILURE
Hello @dbwiddis,
I have updated the changes as follows! Please take a look when you have time 😄

- I added the missing field `resource_stats` to `task-index-mapping.json`. During debugging, I discovered that `task-index-mapping.json` was missing not only `cancellation_time_millis` but also `resource_stats`.
- I also updated the `version` in `TaskResultsService`. Referring to the previous commit where `task-index-mapping.json` was updated, I realized that this version update was necessary in the service as well.
- I added an entry to `CHANGELOG.md`.
- I added a test case in `TasksIT`. I introduced the `testStoreResultWithAllFields` test, which passes only when using the updated JSON. Initially, I wanted to create two tests, but I faced the following challenge, so I've included only the test that passes with the updated JSON.
I wanted to add a test to check if the old JSON throws a `StrictDynamicMappingException`.
However, the `TASK_RESULT_INDEX_MAPPING_FILE` field in `TaskResultsService` is `public static final`.
Therefore, I'm currently proceeding with option (c) below and would like to ask if options (a) or (b) would be better:

a. Use reflection to override `TASK_RESULT_INDEX_MAPPING_FILE`. However, reflection might have performance implications.
b. Refactor `TaskResultsService` to make `TASK_RESULT_INDEX_MAPPING_FILE` non-static. This field has been `public static final` since 2016 and is only used internally in `TaskResultsService`.
c. Proceed without verifying `StrictDynamicMappingException` and only run the test that passes with the updated JSON.
Thanks to your guidance, I’ve learned a lot about contributing to OpenSearch. Thank you for your support.
:x: Gradle check result for 7e66bb325c704d0d8dc1fbd57948a3c9d1291e83: FAILURE
:x: Gradle check result for 3c966efeec0534b5b2f4e50f3c9d7954696f8d22: FAILURE
:x: Gradle check result for 1cf934f77dac68e164d14445fa77b10a4cf86c43: FAILURE
:x: Gradle check result for b541bca6746347662a938ce26786fc6e6cdb4ba6: FAILURE
:x: Gradle check result for 5469e9671aa28e5729d51f6089a479152e22307c: FAILURE
:x: Gradle check result for f09fe85480dfd828d552d424d75b43d144a2ac4c: FAILURE
The local tests succeed, but they fail on “Gradle Check (Jenkins) / gradle-check (pull_request_target)” even though I haven’t modified them. What actions should be taken?
> What actions should be taken?
Looks like two test failures:
- There's no index `test`. Investigate why that is.
- Your max hits expected and actual mismatch. This is a change in Lucene 9, where search hits give a minimum bound (to save time) if an explicit value isn't requested. See https://github.com/opensearch-project/OpenSearch/pull/4270 for an example of how to properly test this.
:grey_exclamation: Gradle check result for 0b367cca32a39694d536191009309d469eede9c0: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 72.13%. Comparing base (`ad7f9e7`) to head (`ace68c7`). Report is 4 commits behind head on main.
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main   #16201      +/-   ##
============================================
+ Coverage     72.05%   72.13%    +0.07%
- Complexity    64861    64885       +24
============================================
  Files          5309     5309
  Lines        302734   302785       +51
  Branches      43733    43750       +17
============================================
+ Hits         218134   218404      +270
+ Misses        66731    66516      -215
+ Partials      17869    17865        -4
```
:umbrella: View full report in Codecov by Sentry.
:x: Gradle check result for 797821f0ce8c0ea89188894d5dbd71f01b67ddb4: FAILURE
:white_check_mark: Gradle check result for 6a5e60fa9e6601c9b871f806329e0cb155330a55: SUCCESS
The Gradle check has been completed. 😄🎉 Thank you, @dbwiddis ! When you have time, I’d appreciate it if you could merge it!
:white_check_mark: Gradle check result for ace68c73b0601e9d17c400bac74633453bfcf1ba: SUCCESS
The backport to 2.x failed:
```
The process '/usr/bin/git' failed with exit code 128
```
To backport manually, run these commands in your terminal:
```shell
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)

# Fetch latest updates from GitHub
git fetch

# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x

# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x

# Create a new branch
git switch --create backport/backport-16201-to-2.x

# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 322bdc42dab1d6d4fa021529057453afd5cb898e

# Push it to GitHub
git push --set-upstream origin backport/backport-16201-to-2.x

# Go back to the original working tree
popd

# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x
```
Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16201-to-2.x.