data-prepper
OpenSearch Bulk API Source
Summary
This creates a new Data Prepper source which accepts data in the form of the OpenSearch Bulk API.
Configuration
source:
  opensearch_api:
    port: 9200
    path_prefix: opensearch/
Operations
The _bulk API supports:
- index
- create
- update
- delete
This source can do something similar to what the dynamodb source does. Specifically, it should include the opensearch_action metadata.
Sample
POST opensearch/_bulk
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }
The above request is the simplest case since it is an index request.
It creates an Event with data such as:
{ "_id": "tt1979320", "title": "Rush", "year": 2013 }
Additionally, the event will need metadata that we can use in the opensearch sink.
opensearch_action: "index"
opensearch_index: "movies"
opensearch_id: "tt1979320"
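To make the mapping from request to event concrete, here is a minimal Python sketch of how a _bulk body could be split into events. It is only an illustration of the idea, not Data Prepper code: the function name parse_bulk and the dict layout with "data" and "metadata" keys are assumptions made for this example, and it handles only the index action.

```python
import json


def parse_bulk(body):
    """Turn an NDJSON _bulk body into a list of event dicts (index action only).

    Hypothetical helper: the {"data": ..., "metadata": ...} layout is an
    assumption for illustration, not the actual Data Prepper Event model.
    """
    lines = [line for line in body.splitlines() if line.strip()]
    events = []
    i = 0
    while i < len(lines):
        action_line = json.loads(lines[i])
        if len(action_line) != 1:
            raise ValueError("an action line must contain exactly one action")
        action, params = next(iter(action_line.items()))
        if action != "index":
            raise ValueError("unsupported action in this sketch: " + action)
        if i + 1 >= len(lines):
            raise ValueError("index action is missing its document line")
        document = json.loads(lines[i + 1])
        i += 2

        data = {}
        metadata = {"opensearch_action": action}
        if "_id" in params:
            # The issue's sample event carries _id in the data as well.
            data["_id"] = params["_id"]
            metadata["opensearch_id"] = params["_id"]
        if "_index" in params:
            metadata["opensearch_index"] = params["_index"]
        data.update(document)
        events.append({"data": data, "metadata": metadata})
    return events
```

Passing the two-line sample request body above to parse_bulk yields one event whose data and metadata match the values listed in this section.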
Query parameters
The _bulk API supports a few query parameters. The source should also support most of these and provide some of them as metadata.
- pipeline -> Sets metadata:opensearch_pipeline
- routing -> Sets metadata:opensearch_routing
- timeout -> Configures an alternate timeout for the request in the source. This probably doesn't need to be provided downstream.
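The parameter handling described in the list could look roughly like the following Python sketch. The function name split_query_params and the METADATA_PARAMS table are hypothetical names introduced here; only the parameter names and the opensearch_pipeline / opensearch_routing metadata keys come from the proposal.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters that the proposal forwards downstream as event metadata.
METADATA_PARAMS = {
    "pipeline": "opensearch_pipeline",
    "routing": "opensearch_routing",
}


def split_query_params(url):
    """Return (metadata, timeout) extracted from a _bulk request URL.

    metadata holds the values to attach to every event in the request;
    timeout is kept separate because it only affects the source itself.
    """
    query = parse_qs(urlsplit(url).query)
    metadata = {
        meta_key: query[param][0]
        for param, meta_key in METADATA_PARAMS.items()
        if param in query
    }
    # Returned as the raw string (for example "30s"); parsing the duration
    # is left out of this sketch.
    timeout = query.get("timeout", [None])[0]
    return metadata, timeout
```

Keeping timeout out of the metadata dict mirrors the note above that it probably doesn't need to be provided downstream.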
Some other parameters that we may wish to support:
- refresh
- require_alias
- wait_for_active_shards
Finally, we should not support these parameters as they are being deprecated.
- type
Response
Being able to provide the _bulk API response may be more challenging. There are a few reasons:
- Unless end-to-end acknowledgments are enabled, we won't have any knowledge of the writes.
- Even when acknowledgments are enabled, not all of the metadata needed in a typical response is available.
An initial version could provide responses that either have empty values (where appropriate) or use synthetic values.
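As a rough illustration of what a synthetic response might contain, the Python sketch below builds a _bulk-shaped reply from event metadata alone. The took, errors, and items fields and the per-item _index, _id, result, and status fields follow the general shape of a Bulk API response; the function name synthetic_bulk_response, the input layout (a list of dicts each holding a "metadata" dict with the opensearch_* keys), and every synthetic value are assumptions made for this example.

```python
def synthetic_bulk_response(events, took_ms=0):
    """Build a _bulk-style response without knowing the real write results.

    Assumes each event is a dict with a "metadata" dict that carries
    opensearch_action, opensearch_index and opensearch_id.
    """
    items = []
    for event in events:
        meta = event["metadata"]
        action = meta["opensearch_action"]
        items.append({
            action: {
                "_index": meta.get("opensearch_index"),
                "_id": meta.get("opensearch_id"),
                # Synthetic values: the source cannot know whether the
                # document was created or updated, so this sketch always
                # reports a successful create.
                "result": "created",
                "status": 201,
            }
        })
    # "errors" is always False here because no real outcome is available.
    return {"took": took_ms, "errors": False, "items": items}
```

Fields such as _version, _seq_no, and _shards are omitted here; an implementation would have to decide whether to leave them out or fill them with placeholder values.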
I would like to work on this issue. Could you please assign this to me?
For the first milestone, we are going to support the OpenSearch Bulk API Index action. All other actions like create, update and delete will be available in later milestones.
Thanks @sb2k16 for picking this up. This is of interest to me working on OpenSearch UBI.
I don't want to restrict where UBI events and queries are indexed because there can be valid reasons for wanting to store those items on a different OpenSearch instance (different meaning different from where the query was done). Allowing the user to specify an OpenSearch API-compatible endpoint to receive that data would allow UBI to store data in any instance of OpenSearch with minimal overhead.
The Bulk API will be helpful because the UBI OpenSearch module can use that endpoint directly to send data to another instance of OpenSearch via Data Prepper. Additionally, using Data Prepper is valuable because of the flexibility it gives the user.
I hope that gives some insight into one use case for this feature request. If it would be helpful to chat more about it, please let me know.