[FEATURE] Add support for the AWS OpenSearch Ingestion Pipeline
Is your feature request related to a problem?
No
What solution would you like?
Native access to the newly released AWS feature, OpenSearch Ingestion Pipeline
What alternatives have you considered?
Our current workaround uses requests-aws4auth but unnecessarily adds another layer of complexity to our codebase.
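For reference, the workaround looks roughly like the sketch below: signing each request by hand and posting directly to the pipeline endpoint. This is a minimal illustration, assuming SigV4 signing against the osis service; the endpoint URL and sample event are placeholders.

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Pull credentials from the default chain and sign for the "osis" service.
credentials = boto3.Session().get_credentials()
auth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    "us-east-1",
    "osis",
    session_token=credentials.token,
)

# POST a JSON array of events directly to the pipeline endpoint (placeholder URL).
response = requests.post(
    "https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path",
    auth=auth,
    json=[{"time": "2014-08-11T11:40:13+00:00", "status": "404"}],
)
response.raise_for_status()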
Do you have any additional context?
No
@wbeckler, please take a look at this feature request.
@adilnaimi Would you explain more how you would use the opensearch client for accessing the ingestion service?
@wbeckler -- This may not be the right repository for my issue; I'm happy to move it to the appropriate repository if suggested. Currently, we use the opensearch-py client to establish a connection with the AWS OpenSearch cluster. However, with the introduction of the AWS ingestion pipeline service (powered by Data Prepper), we are interested in leveraging its features, such as the DLQ (Dead Letter Queue). To achieve this, we have set up a pipeline between our application and the OpenSearch cluster as follows: app -> pipeline -> OpenSearch.
Currently, there is no native way to connect to the OpenSearch Ingestion Pipeline (I'm unsure why): both opensearch-py and boto3 lack built-in support for the pipeline functionality. I'm looking for the appropriate approach to accessing the OpenSearch Ingestion Pipeline and would appreciate any suggestions or recommendations.
I'm not sure where that request would belong either. It sounds like you're looking for a data-prepper version of this: https://github.com/vklochan/python-logstash
Maybe you could fork it or start from scratch and make the first client for data-prepper? If so, leave a comment here so if anyone else wants to help out they'll see this and find you.
@wbeckler @adilnaimi - Just to re-confirm my understanding: we are looking for a method in the opensearch-py package that ingests data into the pipeline, instead of calling the HTTPS endpoint directly via curl or any other method -
awscurl --service osis --region us-east-1 \
-X POST \
-H "Content-Type: application/json" \
-d '[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]' \
https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path
Just like using client.cat.indices instead of curl https://{domain-endpoint}/_cat/indices
I ask this because if my understanding is correct, I would be interested to be part of this.
@Utkarsh-Aga do you think this should be some kind of option, a new namespace, a separate client instance, or a separate client altogether, given that it's specific to an AWS service?
client = Client(ingestion_pipeline="https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path")
client.ingestion_pipeline(...)
...
- something else?
@dblock I believe that, while creating the authentication for the client, we need to provide the service as osis, and then we can leverage options like client.ingestion_pipeline(...).
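As a rough sketch of what that could look like, assuming opensearch-py's existing AWSV4SignerAuth helper (newer versions accept a service name) and treating client.ingestion_pipeline(...) as a hypothetical method that does not exist yet:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Sign requests for the "osis" service instead of the default "es".
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "osis")

client = OpenSearch(
    hosts=[{"host": "{pipeline-endpoint}.us-east-1.osis.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# client.ingestion_pipeline(...)  # hypothetical method under discussion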
So we would treat osis as a plugin? My question is whether we're better off writing another library that depends on opensearch-py. Either way, since the service is not available in open source, it should be behind import aws.ingestion_pipeline.
Want to contribute @Utkarsh-Aga?
@dblock -
Yeah, we can have it as a plugin, because it would only have one piece of functionality: sending data over HTTP/HTTPS to the Ingestion Pipeline.
Further, I would be happy to contribute to it.
But I did not get this statement: "it should be behind import aws.ingestion_pipeline". Can you please elaborate a bit more on this?
> Yeah, we can have it as a plugin, because it would only have one piece of functionality: sending data over HTTP/HTTPS to the Ingestion Pipeline. Further, I would be happy to contribute to it.
Awesome.
> But I did not get this statement: "it should be behind import aws.ingestion_pipeline". Can you please elaborate a bit more on this?
Since it's an AWS service and not a generic feature of open-source OpenSearch, you can't treat it like other plugins. We can't have client.ingestion_pipeline; it would need to be client.aws_ingestion_pipeline, but that's really ugly. So as a developer I think I'd like to be able to do something like this:
from opensearch.aws import IngestionPipeline

client = Client(
    plugins=[IngestionPipeline],
)
client.ingestion_pipeline(...)
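To make that concrete, here is one entirely hypothetical shape for such a plugin: a thin namespace class that wraps the client's transport to POST events to a configured pipeline path, mirroring the awscurl example above. All names here are illustrative, not an agreed design.

import json

class IngestionPipeline:
    # Hypothetical namespace for the AWS OpenSearch Ingestion service.
    def __init__(self, client, path="/log-pipeline/test_ingestion_path"):
        self._client = client
        self._path = path

    def send(self, documents):
        # POST a JSON array of events to the pipeline's ingestion path.
        return self._client.transport.perform_request(
            "POST",
            self._path,
            body=json.dumps(documents),
            headers={"Content-Type": "application/json"},
        )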
Does it make sense?
Got it, thanks a lot for the details @dblock, that makes things much clearer. I will start researching how to implement this.
Since I would be contributing for the first time, should I just follow the CONTRIBUTING guide?
Yes! Let us know if you need help. A good place to ask general questions is the public Slack - https://opensearch.org/slack.html
Sure, thanks.
@adilnaimi, @Utkarsh-Aga,
As noted above, there is an existing http source you can use. But it seems you are interested in having APIs that look similar to the OpenSearch APIs. We have an existing issue to support the _bulk API - https://github.com/opensearch-project/data-prepper/issues/248. Another feature we've discussed is an API that looks like the existing OpenSearch document APIs, for example POST {index}/_doc. Is this something along the lines of what you are looking for?
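For illustration, if the pipeline exposed those OpenSearch-compatible endpoints, the existing opensearch-py helpers could target the pipeline endpoint unchanged. This is speculative until the linked issue is implemented:

# Speculative: the same calls developers already use against a domain
# would work against the pipeline endpoint.
client.index(index="logs", body={"message": "hello"})  # POST logs/_doc
client.bulk(body=[
    {"index": {"_index": "logs"}},
    {"message": "hello via _bulk"},
])  # POST _bulk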
Hello @dlvenable - Yes, if we have support for _bulk or _doc, that would help, and we would not need to create a separate plugin in opensearch-py to send data to the http source.
I would also like @adilnaimi to confirm whether that is what they were looking for.