data-prepper
Reduce duplicates from the opensearch source for Scroll searches
Is your feature request related to a problem? Please describe.
I'm seeing duplicate documents being ingested into my sink when using an opensearch source that points to an Amazon OpenSearch domain running Elasticsearch 7.10. After looking into the Elastic documentation and the code, I believe the following is the root cause of the duplicates:
"We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits."
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/paginate-search-results.html
The scroll request that's being created in the ElasticsearchAccessor only sorts based on a single field: https://github.com/opensearch-project/data-prepper/blob/2919e9942e51dcb02547b209d6ee3a3fe420944f/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/source/opensearch/worker/client/ElasticsearchAccessor.java#L159-L164
Describe the solution you'd like
Add a secondary sort field to the Elasticsearch Scroll request.
Something similar is already done for OpenSearch PointInTime requests: https://github.com/opensearch-project/data-prepper/blob/87c560a3964175231b35f87fa6ab0cbc626271bb/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/source/opensearch/worker/client/OpenSearchAccessor.java#L114-L131
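To make the request concrete, here is a minimal sketch of what a search body with a tiebreaker could look like. It is not the Data Prepper implementation: the class and method names are hypothetical, and the field names `@timestamp` (primary sort) and `_id` (tiebreaker) are assumptions chosen only for illustration. The actual fields would need to match whatever the accessor sorts on today.

```java
import java.util.LinkedHashMap;
import java.util.stream.Collectors;

// Hypothetical helper: builds the JSON body of the initial scroll search
// with more than one sort field. Not part of Data Prepper.
class ScrollSortSketch {

    // Renders an ordered map of field -> order ("asc" or "desc") as a "sort" array.
    // LinkedHashMap keeps insertion order, so the tiebreaker stays last.
    static String sortClause(LinkedHashMap<String, String> sortFields) {
        return sortFields.entrySet().stream()
                .map(e -> "{\"" + e.getKey() + "\":{\"order\":\"" + e.getValue() + "\"}}")
                .collect(Collectors.joining(",", "[", "]"));
    }

    // Combines the page size and the sort clause into a search request body.
    static String searchBody(int size, LinkedHashMap<String, String> sortFields) {
        return "{\"size\":" + size + ",\"sort\":" + sortClause(sortFields) + "}";
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> sort = new LinkedHashMap<>();
        sort.put("@timestamp", "asc"); // assumed primary sort field
        sort.put("_id", "asc");        // assumed tiebreaker, unique per document
        System.out.println(searchBody(100, sort));
    }
}
```

With the tiebreaker in place, two documents that share the same primary sort value still have a well-defined relative order, which is the property the Elastic documentation quoted above asks for.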
Describe alternatives you've considered (Optional)
N/A