OpenSearch Bulk API Source

Open laneholloway opened this issue 3 years ago • 3 comments

Summary

This creates a new Data Prepper source which accepts data in the form of the OpenSearch Bulk API.

Configuration

source:
  opensearch_api:
    port: 9200
    path_prefix: opensearch/

Operations

The _bulk API supports:

  • index
  • create
  • update
  • delete

This source can behave similarly to the dynamodb source. Specifically, it should attach the opensearch_action metadata to each Event it creates.

Sample

POST opensearch/_bulk
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }

The above request is the simplest case since it is an index request.

It creates an Event with data such as:

{ "_id": "tt1979320", "title": "Rush", "year": 2013 }

Additionally, the event will need metadata that we can use in the opensearch sink.

opensearch_action: "index"
opensearch_index: "movies"
opensearch_id: "tt1979320"
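
As a rough illustration of the mapping above, here is a minimal Python sketch that turns a _bulk body containing only index actions into events with the metadata listed. The function name and the {"data", "metadata"} event layout are hypothetical and chosen for illustration only; this is not how Data Prepper represents Events internally.

```python
import json


def parse_bulk_index_body(body: str) -> list:
    """Parse an NDJSON _bulk body that contains only index actions.

    Hypothetical helper: the returned event layout is an assumption
    based on the sample in this issue, not Data Prepper's Event model.
    """
    # Ignore blank lines, including the trailing newline of the body.
    lines = [line for line in body.splitlines() if line.strip()]
    events = []
    i = 0
    while i < len(lines):
        # Each action line is an object with a single key: the action name.
        action, params = next(iter(json.loads(lines[i]).items()))
        if action != "index":
            raise ValueError(f"unsupported action in this sketch: {action}")
        if i + 1 >= len(lines):
            raise ValueError("index action is missing its document line")
        document = json.loads(lines[i + 1])

        metadata = {"opensearch_action": action}
        data = dict(document)
        if "_index" in params:
            metadata["opensearch_index"] = params["_index"]
        if "_id" in params:
            metadata["opensearch_id"] = params["_id"]
            # Mirror the sample event, which also carries _id in the data.
            data = {"_id": params["_id"], **document}

        events.append({"data": data, "metadata": metadata})
        i += 2  # Skip past the action line and its document line.
    return events
```

The sketch rejects create, update, and delete rather than guessing at their semantics; delete in particular has no document line, so a fuller parser would need to advance by one line for that action.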

Query parameters

The _bulk API supports a few query parameters. The source should also support most of these and provide some of them as metadata.

  • pipeline -> Sets metadata: opensearch_pipeline
  • routing -> Sets metadata: opensearch_routing
  • timeout -> Configures an alternate timeout for the request in the source. This probably doesn't need to be provided downstream.
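
The parameter handling described above could look roughly like the following Python sketch. The function name and return shape are hypothetical; only the parameter names and the metadata keys come from this issue.

```python
from urllib.parse import parse_qs

# Query parameter name -> event metadata key, as proposed in this issue.
PARAM_TO_METADATA = {
    "pipeline": "opensearch_pipeline",
    "routing": "opensearch_routing",
}


def metadata_from_query(query_string: str):
    """Split _bulk query parameters into event metadata and a request timeout.

    Hypothetical helper: returns (metadata, timeout), where timeout is the
    raw parameter value (or None) and is not passed downstream.
    """
    params = parse_qs(query_string)
    metadata = {}
    for name, key in PARAM_TO_METADATA.items():
        if name in params:
            # parse_qs returns a list per name; use the first value.
            metadata[key] = params[name][0]
    timeout = params.get("timeout", [None])[0]
    return metadata, timeout
```

For example, metadata_from_query("pipeline=my-pipeline&timeout=30s") yields the opensearch_pipeline metadata plus the raw timeout string, which the source would still need to parse into a duration.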

Some other parameters that we may wish to support:

  • refresh
  • require_alias
  • wait_for_active_shards

Finally, we should not support these parameters as they are being deprecated.

  • type

Response

Being able to provide the _bulk API response may be more challenging. There are a few reasons:

  1. Unless end-to-end acknowledgments are enabled, we won't have any knowledge of the writes.
  2. Even when acknowledgments are enabled, not all of the metadata needed in a typical response is available.

An initial version could provide responses that either have empty values (where appropriate) or use synthetic values.
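
One way to picture such a synthetic response is the Python sketch below. It assumes each event is a dict with a "metadata" entry holding opensearch_action, opensearch_index, and opensearch_id; that layout and the function name are hypothetical. The top-level took, errors, and items fields follow the shape of a _bulk response, while the status value is made up because the source has no knowledge of the actual write. Fields such as _version and _shards are left out for the same reason.

```python
def synthetic_bulk_response(events, took_ms=0):
    """Build a _bulk-style response without knowing the outcome of any write.

    Hypothetical helper: every per-item value that cannot be derived from
    the request is either omitted or synthetic.
    """
    items = []
    for event in events:
        meta = event["metadata"]
        action = meta.get("opensearch_action", "index")
        items.append({
            action: {
                "_index": meta.get("opensearch_index"),
                "_id": meta.get("opensearch_id"),
                # Synthetic: the source only knows the event was accepted.
                "status": 200,
            }
        })
    # "errors" is always False here because failures are not visible to the source.
    return {"took": took_ms, "errors": False, "items": items}
```

With end-to-end acknowledgments enabled, the errors flag and per-item status could in principle reflect the acknowledgment result instead of a fixed value.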

laneholloway, Sep 06 '21 15:09

I would like to work on this issue. Could you please assign this to me?

sb2k16, Apr 19 '24 20:04

For the first milestone, we are going to support only the index action of the OpenSearch Bulk API. The other actions (create, update, and delete) will follow in later milestones.

sb2k16, May 03 '24 18:05

Thanks @sb2k16 for picking this up. This is of interest to me because I am working on OpenSearch UBI.

I don't want to restrict where UBI events and queries are indexed, because there can be valid reasons for storing them on a different OpenSearch instance than the one where the query ran. Allowing the user to specify an OpenSearch API-compatible endpoint to receive that data would let UBI store data in any instance of OpenSearch with minimal overhead.

The Bulk API will be helpful because the UBI OpenSearch module can use that endpoint directly to send data to another instance of OpenSearch via Data Prepper. Additionally, using Data Prepper is valuable because of the flexibility it gives the user.

I hope that gives some insight into one use-case for this feature request. If it would be helpful to chat more about it please let me know.

jzonthemtn, May 03 '24 22:05