[BUGFIX] Fix missing fields to resolve Strict Dynamic Mapping issue when saving task result

Open inpink opened this issue 1 year ago • 18 comments

Related Issues

Resolves #16060

Description

[The result of the execution]

  • Before: StrictDynamicMappingException occurred (screenshot: before).
  • After: StrictDynamicMappingException did not occur (screenshot: after).
  • When an auto follow rule is applied and then deleted in the CCR (cross-cluster-replication) plugin, the .tasks index is now updated without a StrictDynamicMappingException.

[Background]

  • OpenSearch allows a Follower Cluster to replicate indexes from a Leader Cluster through the OpenSearch CCR Plugin.
  • However, when cancelling the auto follow rule in the CCR Plugin, a StrictDynamicMappingException was previously encountered:
[2024-09-28T12:01:16,819][WARN ][o.o.r.t.a.AutoFollowTask ] [5460424aac92][my-connection-alias] Error storing result StrictDynamicMappingException[mapping set to strict, dynamic introduction of [cancellation_time_millis] within [task] is not allowed]
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:876)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:722)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:461)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:524)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:545)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:447)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:419)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:138)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:93)
2024-09-28 21:01:16     at org.opensearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:256)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.prepareIndex(IndexShard.java:1178)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1135)
2024-09-28 21:01:16     at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:1052)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:625)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:471)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:535)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:416)
2024-09-28 21:01:16     at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:125)
2024-09-28 21:01:16     at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:275)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
2024-09-28 21:01:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
2024-09-28 21:01:16     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2024-09-28 21:01:16     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
2024-09-28 21:01:16     at java.base/java.lang.Thread.run(Thread.java:1583)
  • When a task is stored, OpenSearch’s TaskResultsService.storeResult(TaskResult taskResult, ActionListener<Void> listener) is called.
  • This method extracts a TaskInfo object from the TaskResult parameter. TaskInfo defines the fields of a task, which are mapped to the .tasks index via task-index-mapping.json.
  • So, when a task is cancelled, the .tasks index should be updated.
  • The .tasks index has strict dynamic mapping enabled, so a document containing a field that is absent from the mapping is rejected.
  • However, cancellation_time_millis and resource_stats were present in TaskInfo but missing from the mapping JSON.
  • As a result, storing the task result failed with a StrictDynamicMappingException.
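
The failure mode described above can be illustrated with a small standalone sketch. This is not OpenSearch code: `find_unmapped_fields`, the trimmed-down mappings, the sample document, and the action name are all hypothetical stand-ins, and the real DocumentParser does considerably more. It only shows why a field that exists in the serialized task but not in a strict mapping gets rejected.

```python
def find_unmapped_fields(mapping, doc, path=""):
    """Return dotted paths of fields in doc that a strict mapping would reject."""
    unmapped = []
    properties = mapping.get("properties", {})
    for name, value in doc.items():
        full_name = f"{path}.{name}" if path else name
        field = properties.get(name)
        if field is None:
            # Strict mode: a field missing from the mapping is an error.
            unmapped.append(full_name)
        elif field.get("enabled") is False:
            # Disabled object: its contents are stored but not parsed.
            continue
        elif isinstance(value, dict):
            unmapped.extend(find_unmapped_fields(field, value, full_name))
    return unmapped


# Hypothetical, heavily trimmed stand-in for the mapping before the fix.
old_mapping = {
    "properties": {
        "completed": {"type": "boolean"},
        "task": {"properties": {"id": {"type": "long"},
                                "action": {"type": "keyword"}}},
    }
}

# The same stand-in with the two fields this PR adds.
new_mapping = {
    "properties": {
        "completed": {"type": "boolean"},
        "task": {"properties": {"id": {"type": "long"},
                                "action": {"type": "keyword"},
                                "cancellation_time_millis": {"type": "long"},
                                "resource_stats": {"type": "object",
                                                   "enabled": False}}},
    }
}

# Hypothetical stored result of a cancelled task.
doc = {
    "completed": True,
    "task": {"id": 97,
             "action": "example/action",  # placeholder action name
             "cancellation_time_millis": 1727524876819,
             "resource_stats": {"total": {"cpu_time_in_nanos": 0}}},
}

print(find_unmapped_fields(old_mapping, doc))
print(find_unmapped_fields(new_mapping, doc))
```

With the old stand-in mapping the helper reports both new fields under `task`; with the updated one it reports nothing, which mirrors the before/after behaviour described in this PR.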

[PR contents]

  1. Added the cancellation_time_millis and resource_stats fields to task-index-mapping.json:
...
          "cancellation_time_millis": {
            "type": "long"
          },
          "resource_stats": {
            "type" : "object",
            "enabled" : false
          }
...
  2. Updated the mapping version in both task-index-mapping.json and TaskResultsService from 4 to 5. These versions must match for the mapping to apply correctly. When updating a mapping, increment the version so future operations can detect and apply changes.
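
To make the role of the version number concrete, here is a minimal Python sketch of a version-gated mapping update. It assumes the service compares a version stored with the index mapping against a constant in code and updates the mapping when the stored one is older; `mapping_action` and its return values are hypothetical names, and the actual logic in TaskResultsService may differ.

```python
# Version after this PR, per the description above (previously 4).
TASK_RESULT_MAPPING_VERSION = 5


def mapping_action(index_exists, stored_version):
    """Decide what to do with the .tasks mapping before storing a result.

    stored_version is the version recorded with the existing index mapping,
    or None if the index has no recorded version.
    """
    if not index_exists:
        return "create"   # index is created with the bundled mapping
    if stored_version is None or stored_version < TASK_RESULT_MAPPING_VERSION:
        return "update"   # older mapping: push the new fields first
    return "none"         # mapping is already current


for exists, version in [(False, None), (True, 4), (True, 5)]:
    print(exists, version, mapping_action(exists, version))
```

Under this assumption, an index created with version 4 would receive the new fields the first time a result is stored, whereas leaving the number unchanged would let the check conclude that nothing needs updating.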

[Test]

I conducted two types of tests:

  • Writing a test in OpenSearch
  • Running local end-to-end (e2e) tests
Writing a test in OpenSearch

TaskInfo has two constructors: (a) one that includes cancellation_time_millis and (b) one that does not. In the existing TasksIT, the resultsService.storeResult() test did not populate resource_stats and cancellation_time_millis in the TaskInfo. So I added testStoreResultWithAllFields, which uses constructor (a) to build a TaskInfo with every field set and passes only with the updated task-index-mapping.json.

Local End-to-End (e2e) Testing
  • To perform the e2e test, I needed two custom OpenSearch 3.0.0 clusters with the CCR Plugin 3.0.0 installed.
  1. I set up two clusters with the CCR Plugin installed. I cloned the OpenSearch GitHub repository and modified the task-index-mapping.json.

  2. After modifying OpenSearch, I assembled it, generating the opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz file.

  3. I then cloned and assembled the OpenSearch CCR GitHub repository, which generated opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip.

  4. Using a Dockerfile, I built a Docker image. In the same directory as the Dockerfile, I included the following files: opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz, opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip, opensearch.yml, opensearch-docker-entrypoint.sh, opensearch-onetime-setup.sh.

  5. I used Docker Compose to create two clusters.

  6. Finally, I installed the CCR plugin on each cluster:

docker exec -it [container] /bin/bash
/usr/share/opensearch/bin/opensearch-plugin install file:/usr/share/opensearch/opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip
  7. I registered an auto follow rule and then canceled it to verify the behavior (see the documentation: Set up a cross-cluster connection, Get started with auto-follow):
curl -XPUT -k -H 'Content-Type: application/json'  'http://localhost:9200/_cluster/settings?pretty' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "my-connection-alias": {
          "seeds": ["localhost:9300"]
        }
      }
    }
  }
}'
curl -XPUT -k -H 'Content-Type: application/json'  'http://localhost:9201/leader-01?pretty'
curl -XPOST -k -H 'Content-Type: application/json' -u 'admin:Yhj99!009' 'https://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
   "leader_alias" : "my-connection-alias",
   "name": "my-replication-rule",
   "pattern": "movies*",
   "use_roles":{
      "leader_cluster_role": "all_access",
      "follower_cluster_role": "all_access"
   }
}'
curl -XDELETE -k -H 'Content-Type: application/json'  'http://localhost:9200/_plugins/_replication/_autofollow?pretty' -d '
{
   "leader_alias" : "my-connection-alias",
   "name": "my-replication-rule"
}'
  8. I checked the logs to confirm whether StrictDynamicMappingException occurred. With the corrected task-index-mapping.json, the task status was persisted successfully instead of failing with StrictDynamicMappingException.
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,764Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Going to mark AutoFollowTask:97 task as completed", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,766Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Completed the task with id:97", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }
replication-node22-orin: {"type": "server", "timestamp": "2024-10-05T04:37:02,770Z", "level": "INFO", "component": "o.o.r.t.a.AutoFollowTask", "cluster.name": "follower-cluster", "node.name": "6b9ce94a6ee6", "message": "[my-connection-alias] Successfully persisted task status", "cluster.uuid": "6afNBEthSgaKVY55dxNk8Q", "node.id": "0Zrg1aZFS_Sov4YAJjCDAg"  }
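
For anyone repeating this check, scanning the logs can be scripted. Below is a rough Python sketch, assuming the two log shapes shown in this description: the plain-text WARN line from the failing run and the JSON lines prefixed with a container name from the fixed run. `summarize_task_logs` is a hypothetical helper, not part of OpenSearch or the CCR plugin, and the sample lines are shortened copies of the ones above.

```python
import json


def summarize_task_logs(lines):
    """Count strict-mapping failures and successful task-status writes."""
    summary = {"strict_mapping_errors": 0, "persisted": 0}
    for line in lines:
        if "StrictDynamicMappingException" in line:
            summary["strict_mapping_errors"] += 1
        # Docker Compose prefixes each line with "<container>: ";
        # the JSON payload, if any, starts at the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if "Successfully persisted task status" in record.get("message", ""):
            summary["persisted"] += 1
    return summary


sample = [
    "[2024-09-28T12:01:16,819][WARN ][o.o.r.t.a.AutoFollowTask ] "
    "Error storing result StrictDynamicMappingException[mapping set to strict]",
    'replication-node22-orin: {"type": "server", "level": "INFO", '
    '"message": "[my-connection-alias] Successfully persisted task status"}',
]
print(summarize_task_logs(sample))
```

A run against the fixed build should show zero strict-mapping errors and at least one persisted status after the auto follow rule is deleted.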

Below are the files I used for testing, along with their sources:

| Name | File | Source |
| --- | --- | --- |
| DockerFile | link | "opensearch-build" al2023.dockerfile |
| docker-compose | link | "opensearch" distribution docker-compose |
| log4j2.properties | link | "opensearch-build" log4j2.properties |
| opensearch-docker-entrypoint-default.x.sh | link | "opensearch-build" entrypoint |
| opensearch-onetime-setup.sh | link | "opensearch-build" onetime |
| opensearch.yml | link | "opensearch" distribution yml |
| opensearch-min-3.0.0-SNAPSHOT-linux-arm64.tar.gz | link | Generated by cloning the OpenSearch repository and assembling the project. |
| opensearch-cross-cluster-replication-3.0.0.0-SNAPSHOT.zip | link | Generated by cloning the OpenSearch CCR Plugin repository and assembling the project. |

My test environment was macOS on an Apple M2 (arm64). If you are using a different operating system or architecture, replace the .tar.gz file with the build that matches your system.

I am happy to provide the Docker image used in my test on request.

Check List

  • [X] Functionality includes testing.
  • [ ] ~API changes companion pull request created, if applicable.~
  • [ ] ~Public documentation issue/PR created, if applicable.~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

inpink avatar Oct 05 '24 16:10 inpink

:x: Gradle check result for 839a1bcda6d74522b733f7024901117a38cfa582: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 05 '24 17:10 github-actions[bot]

:x: Gradle check result for 7946ad998323465e9f7292cf55212555e7178048: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 07 '24 03:10 github-actions[bot]

:x: Gradle check result for c15c90aeb97c235d298c7b395dfc51c9c580d4e9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 08 '24 09:10 github-actions[bot]

:x: Gradle check result for 0cad64fc6c3792c492796a7c0f4a4778687b526b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 11 '24 10:10 github-actions[bot]

:x: Gradle check result for 5b7688964bd13f46444f478310f6b7d58e225d54: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 13 '24 10:10 github-actions[bot]

Hello @dbwiddis ,

I have updated the changes as follows! Please take a look when you have time 😄

  1. I added the missing field resource_stats to task-index-mapping.json. During debugging, I discovered that task-index-mapping.json was missing not only cancellation_time_millis but also resource_stats.

  2. I also updated the version in TaskResultsService. Referring to the previous commit where task-index-mapping.json was updated, I realized that this version update was necessary in the service as well.

  3. I added an entry to CHANGELOG.md.

  4. I added a test case in TasksIT. I introduced the testStoreResultWithAllFields test, which passes only when using the updated JSON. Initially, I wanted to create two tests, but I faced the following challenge, so I’ve included only the test that passes with the updated JSON.

I wanted to add a test to check if the old JSON throws a StrictDynamicMappingException. However, the TASK_RESULT_INDEX_MAPPING_FILE field in TaskResultsService is public static final.

Therefore, I’m currently proceeding with option (c) below and would like to ask if options (a) or (b) would be better:

  a. Use reflection to override TASK_RESULT_INDEX_MAPPING_FILE. However, reflection might have performance implications.
  b. Refactor TaskResultsService to make TASK_RESULT_INDEX_MAPPING_FILE non-static. This field has been public static final since 2016 and is only used internally in TaskResultsService.
  c. Proceed without verifying StrictDynamicMappingException and only run the test that passes with the updated JSON.

Thanks to your guidance, I’ve learned a lot about contributing to OpenSearch. Thank you for your support.

inpink avatar Oct 13 '24 11:10 inpink

:x: Gradle check result for 7e66bb325c704d0d8dc1fbd57948a3c9d1291e83: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 13 '24 14:10 github-actions[bot]

:x: Gradle check result for 3c966efeec0534b5b2f4e50f3c9d7954696f8d22: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 18 '24 12:10 github-actions[bot]

:x: Gradle check result for 1cf934f77dac68e164d14445fa77b10a4cf86c43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 18 '24 12:10 github-actions[bot]

:x: Gradle check result for 1cf934f77dac68e164d14445fa77b10a4cf86c43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 19 '24 05:10 github-actions[bot]

:x: Gradle check result for b541bca6746347662a938ce26786fc6e6cdb4ba6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 19 '24 05:10 github-actions[bot]

:x: Gradle check result for 5469e9671aa28e5729d51f6089a479152e22307c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 19 '24 09:10 github-actions[bot]

:x: Gradle check result for f09fe85480dfd828d552d424d75b43d144a2ac4c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 19 '24 10:10 github-actions[bot]

The local tests succeed, but they fail on “Gradle Check (Jenkins) / gradle-check (pull_request_target)” even though I haven’t modified them. What actions should be taken?

inpink avatar Oct 19 '24 10:10 inpink

> What actions should be taken?

Looks like two test failures:

  1. There's no index test. Investigate why that is.
  2. Your expected and actual max hits do not match. This is a change in Lucene 9: search hits report a lower bound (to save time) unless an explicit value is requested. See https://github.com/opensearch-project/OpenSearch/pull/4270 for an example of how to test this properly.
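
As a rough illustration of the second point (a standalone Python sketch, not the Java test code from the linked PR): a search response reports `hits.total` as a `value` together with a `relation`, and only the relation `"eq"` means the count is exact. The helper name and the sample responses below are hypothetical.

```python
def exact_total_hits(response):
    """Return the total hit count only when the response marks it as exact."""
    total = response["hits"]["total"]
    if total["relation"] != "eq":
        # "gte" means value is only a lower bound on the real count.
        raise ValueError(
            "total hits is a lower bound; request track_total_hits=true "
            "before asserting an exact count"
        )
    return total["value"]


exact = {"hits": {"total": {"value": 1, "relation": "eq"}}}
lower_bound = {"hits": {"total": {"value": 10000, "relation": "gte"}}}

print(exact_total_hits(exact))
try:
    exact_total_hits(lower_bound)
except ValueError as err:
    print("refused:", err)
```

A test that compares an expected count against `value` without checking `relation`, or without asking for exact tracking, can therefore fail even when the data is correct.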

dbwiddis avatar Oct 19 '24 17:10 dbwiddis

:grey_exclamation: Gradle check result for 0b367cca32a39694d536191009309d469eede9c0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Oct 19 '24 21:10 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72.13%. Comparing base (ad7f9e7) to head (ace68c7). Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16201      +/-   ##
============================================
+ Coverage     72.05%   72.13%   +0.07%     
- Complexity    64861    64885      +24     
============================================
  Files          5309     5309              
  Lines        302734   302785      +51     
  Branches      43733    43750      +17     
============================================
+ Hits         218134   218404     +270     
+ Misses        66731    66516     -215     
+ Partials      17869    17865       -4     


codecov[bot] avatar Oct 19 '24 21:10 codecov[bot]

:x: Gradle check result for 797821f0ce8c0ea89188894d5dbd71f01b67ddb4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Oct 20 '24 05:10 github-actions[bot]

:white_check_mark: Gradle check result for 6a5e60fa9e6601c9b871f806329e0cb155330a55: SUCCESS

github-actions[bot] avatar Oct 20 '24 07:10 github-actions[bot]

The Gradle check has been completed. 😄🎉 Thank you, @dbwiddis ! When you have time, I’d appreciate it if you could merge it!

inpink avatar Oct 21 '24 05:10 inpink

:white_check_mark: Gradle check result for ace68c73b0601e9d17c400bac74633453bfcf1ba: SUCCESS

github-actions[bot] avatar Oct 21 '24 18:10 github-actions[bot]

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16201-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 322bdc42dab1d6d4fa021529057453afd5cb898e
# Push it to GitHub
git push --set-upstream origin backport/backport-16201-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16201-to-2.x.