[FEATURE] Add support for the AWS OpenSearch Ingestion Pipeline
Is your feature request related to a problem?
No
What solution would you like?
Native access to the newly released AWS feature, OpenSearch Ingestion Pipeline
What alternatives have you considered?
Our current workaround uses requests-aws4auth but unnecessarily adds another layer of complexity to our codebase.
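For reference, the workaround looks roughly like the sketch below: signing each request by hand and posting directly to the pipeline endpoint. This is a minimal illustration, assuming SigV4 signing against the osis service; the endpoint URL and sample event are placeholders.

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Pull credentials from the default chain and sign for the "osis" service.
credentials = boto3.Session().get_credentials()
auth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    "us-east-1",
    "osis",
    session_token=credentials.token,
)

# POST a JSON array of events directly to the pipeline endpoint (placeholder URL).
response = requests.post(
    "https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path",
    auth=auth,
    json=[{"time": "2014-08-11T11:40:13+00:00", "status": "404"}],
)
response.raise_for_status()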
Do you have any additional context?
No
@wbeckler, please take a look at this feature request.
@adilnaimi Would you explain more how you would use the opensearch client for accessing the ingestion service?
@wbeckler -- This may not be the right repository for my issue; I'm happy to move it to the appropriate repository if suggested. Currently, we use the opensearch-py client to establish a connection with the AWS OpenSearch cluster. However, with the introduction of the AWS ingestion pipeline service (powered by Data Prepper), we are interested in leveraging its features, such as the DLQ (Dead Letter Queue). To achieve this, we have set up a pipeline between our application and the OpenSearch cluster as follows: app -> pipeline -> OpenSearch.
Currently, there is no native way to connect to the OpenSearch Ingestion Pipeline (I'm unsure why): both opensearch-py and boto3 lack built-in support for the pipeline functionality. I'm looking for the appropriate approach to accessing the OpenSearch Ingestion Pipeline and would appreciate any suggestions or recommendations.
I'm not sure where that request would belong either. It sounds like you're looking for a data-prepper version of this: https://github.com/vklochan/python-logstash
Maybe you could fork it or start from scratch and make the first client for data-prepper? If so, leave a comment here so if anyone else wants to help out they'll see this and find you.
@wbeckler @adilnaimi - Just to re-confirm my understanding: we are looking for a method in the opensearch-py package that ingests data into the pipeline, instead of calling the HTTPS endpoint directly via curl or any other method -
awscurl --service osis --region us-east-1 \
-X POST \
-H "Content-Type: application/json" \
-d '[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]' \
https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path
Just like using client.cat.indices instead of curl https://{domain-endpoint}/_cat/indices
I ask this because if my understanding is correct, I would be interested to be part of this.
@Utkarsh-Aga do you think this should be some kind of option, a new namespace, a separate client instance, or a separate client altogether, given that it's specific to an AWS service?
client = Client(ingestion_pipeline="https://{pipeline-endpoint}.us-east-1.osis.amazonaws.com/log-pipeline/test_ingestion_path")
client.ingestion_pipeline(...)
...
- something else?
@dblock I believe that, while creating the authentication for the client, we need to provide the service as osis, and then we can leverage options like client.ingestion_pipeline(...).
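As a rough sketch of what that could look like, assuming opensearch-py's existing AWSV4SignerAuth helper (newer versions accept a service name) and treating client.ingestion_pipeline(...) as a hypothetical method that does not exist yet:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Sign requests for the "osis" service instead of the default "es".
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "osis")

client = OpenSearch(
    hosts=[{"host": "{pipeline-endpoint}.us-east-1.osis.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# client.ingestion_pipeline(...)  # hypothetical method under discussion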
So we would treat osis as a plugin? My question is whether we're better off writing another library that depends on opensearch-py. Either way, since the service is not available in open source, it should be behind import aws.ingestion_pipeline.
Want to contribute @Utkarsh-Aga?
@dblock -
Yeah, we can have it as a plugin, because it would only have one piece of functionality: sending data over HTTP/HTTPS to the Ingestion Pipeline.
Further, I would be happy to contribute to it.
But I did not get this statement: "it should be behind import aws.ingestion_pipeline". Can you please elaborate a bit more on this?
> Yeah, we can have it as a plugin, because it would only have one piece of functionality: sending data over HTTP/HTTPS to the Ingestion Pipeline. Further, I would be happy to contribute to it.
Awesome.
> But I did not get this statement: "it should be behind import aws.ingestion_pipeline". Can you please elaborate a bit more on this?
Since it's an AWS service and not a generic feature of open-source OpenSearch, you can't treat it like other plugins. We can't have client.ingestion_pipeline; it would need to be client.aws_ingestion_pipeline, but that's really ugly. So as a developer I think I'd like to be able to do something like this:
from opensearch.aws import IngestionPipeline

client = Client(
    plugins=[IngestionPipeline],
)
client.ingestion_pipeline(...)
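To make that concrete, here is one entirely hypothetical shape for such a plugin: a thin namespace class that wraps the client's transport to POST events to a configured pipeline path, mirroring the awscurl example above. All names here are illustrative, not an agreed design.

import json

class IngestionPipeline:
    # Hypothetical namespace for the AWS OpenSearch Ingestion service.
    def __init__(self, client, path="/log-pipeline/test_ingestion_path"):
        self._client = client
        self._path = path

    def send(self, documents):
        # POST a JSON array of events to the pipeline's ingestion path.
        return self._client.transport.perform_request(
            "POST",
            self._path,
            body=json.dumps(documents),
            headers={"Content-Type": "application/json"},
        )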
Does it make sense?
Got it, thanks a lot for the details @dblock, that makes things much clearer. I will start researching how to implement this.
Since I would be contributing for the first time, should I just follow the CONTRIBUTING guide?
Yes! Let us know if you need help. A good place to ask general questions is the public Slack - https://opensearch.org/slack.html
Sure, thanks.
@adilnaimi, @Utkarsh-Aga,
As noted above, there is an existing http source you can use. But it seems you are interested in having APIs that look similar to the OpenSearch APIs. We have an existing issue to support the _bulk API - https://github.com/opensearch-project/data-prepper/issues/248. Another feature we've discussed is an API that looks like the existing OpenSearch document APIs, for example POST {index}/_doc. Is this something along the lines of what you are looking for?
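For illustration, if the pipeline exposed those OpenSearch-compatible endpoints, the existing opensearch-py helpers could target the pipeline endpoint unchanged. This is speculative until the linked issue is implemented:

# Speculative: the same calls developers already use against a domain
# would work against the pipeline endpoint.
client.index(index="logs", body={"message": "hello"})  # POST logs/_doc
client.bulk(body=[
    {"index": {"_index": "logs"}},
    {"message": "hello via _bulk"},
])  # POST _bulk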
Hello @dlvenable - Yes, if we have support for _bulk or _doc, that would help, and we would not need to create a separate plugin in opensearch-py to send data to the http source.
I would also like @adilnaimi to confirm whether that is what they were looking for.