
Reduce duplicates from the opensearch source for Scroll searches

Open travisbenedict opened this issue 5 months ago • 0 comments

Is your feature request related to a problem? Please describe. I'm seeing duplicate documents ingested into my sink when using an opensearch source that points to an Amazon OpenSearch domain running Elasticsearch 7.10. After looking into the Elastic documentation and the code, I believe the missing tiebreaker described in the following passage is the root cause of the duplicates:

We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits.

https://www.elastic.co/guide/en/elasticsearch/reference/7.10/paginate-search-results.html

The scroll request that's being created in the ElasticsearchAccessor only sorts based on a single field: https://github.com/opensearch-project/data-prepper/blob/2919e9942e51dcb02547b209d6ee3a3fe420944f/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/source/opensearch/worker/client/ElasticsearchAccessor.java#L159-L164

Describe the solution you'd like Add a secondary sort field to the Elasticsearch Scroll request.

Something similar is already done for OpenSearch PointInTime requests: https://github.com/opensearch-project/data-prepper/blob/87c560a3964175231b35f87fa6ab0cbc626271bb/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/source/opensearch/worker/client/OpenSearchAccessor.java#L114-L131
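
As a rough illustration of the request shape this would produce, here is a minimal sketch that builds a search body with a primary sort field plus a tiebreaker. It is not the ElasticsearchAccessor code: the real accessor goes through the Elasticsearch Java client rather than string building, and the class name, method names, and the field names `primary_field` and `tiebreaker_field` are all hypothetical. The tiebreaker is assumed to hold a unique value per document, as the Elastic documentation quoted above recommends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shows only the shape of a scroll search body that
// sorts on two fields. Not taken from the data-prepper code base.
public class ScrollSortSketch {

    // Builds one sort clause, e.g. {"some_field":{"order":"asc"}}.
    public static String sortClause(String field, String order) {
        return "{\"" + field + "\":{\"order\":\"" + order + "\"}}";
    }

    // Builds a search body whose sort array lists the primary field first
    // and the tiebreaker second, so documents that share a primary value
    // still have a deterministic order across scroll pages.
    // Assumption: tiebreakerField is unique per document.
    public static String buildScrollBody(int batchSize, String primaryField, String tiebreakerField) {
        List<String> sorts = new ArrayList<>();
        sorts.add(sortClause(primaryField, "asc"));
        sorts.add(sortClause(tiebreakerField, "asc"));
        return "{\"size\":" + batchSize
                + ",\"query\":{\"match_all\":{}}"
                + ",\"sort\":[" + String.join(",", sorts) + "]}";
    }

    public static void main(String[] args) {
        // Field names here are placeholders, not the fields the source plugin uses.
        System.out.println(buildScrollBody(1000, "primary_field", "tiebreaker_field"));
    }
}
```

The order of the entries in `sort` matters: the second field is consulted only when two documents compare equal on the first.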

Describe alternatives you've considered (Optional) N/A


travisbenedict, Sep 18 '24 17:09