
Dynamic Index Name in OpenSearch sink

Open dlvenable opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

Some use-cases require placing different documents in different dynamic indices. Pipeline authors want to configure a dynamic index name that is derived from a property (or multiple properties) in a Data Prepper event.

Describe the solution you'd like

Support dynamic index names using a format string. This format string can use ${} to signal string interpolation and use JSON Pointer for getting fields from events. For example: metadata-${metadataType}.

pipeline:
  ...
  sink:
    opensearch:
      hosts: ["https://opensearch-host"]
      index_type: custom
      index: "metadata-${metadataType}"
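
As a rough illustration of the proposed interpolation, here is a minimal, language-neutral sketch in Python. It is not Data Prepper code: the names `format_index_name` and `resolve_pointer` are hypothetical, the event is assumed to be a plain nested mapping, and only object keys (not array indices) are resolved.

```python
import re
from typing import Any, Mapping

# Matches ${...}; the captured text is treated as a field reference.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve_pointer(event: Mapping[str, Any], pointer: str) -> Any:
    """Look up a field by a JSON Pointer-like path such as 'a/b' or '/a/b'."""
    current: Any = event
    for token in pointer.strip("/").split("/"):
        # Unescape as in RFC 6901: '~1' is '/', then '~0' is '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, Mapping) and token in current:
            current = current[token]
        else:
            raise KeyError(f"field not found in event: {pointer}")
    return current


def format_index_name(template: str, event: Mapping[str, Any]) -> str:
    """Replace each ${field} in the template with the event's value."""
    return _PLACEHOLDER.sub(
        lambda match: str(resolve_pointer(event, match.group(1))), template
    )
```

With this sketch, `format_index_name("metadata-${metadataType}", {"metadataType": "logs"})` would yield `metadata-logs`. A missing field raises `KeyError` here; how a real sink should handle that case (drop, default index, fail) is left open.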

Describe alternatives you've considered (Optional)

Conditional routing could support a predefined number of indices. However, that approach is bounded in practice, because every target index has to be listed in the pipeline configuration. A format string allows for an unlimited number of indices (from Data Prepper's perspective).

Additional context

N/A

dlvenable, Jun 01 '22 17:06

I am looking at this issue to see if we can come up with a fix. It looks like this feature means that, in theory, we may be sending each request to a different index in OpenSearch. The code has background flush logic to send bulk requests, etc. Should we keep any limit on how many different indexes we may be using? OpenSearch itself does not seem to have any limit on the number of indexes that can be created, but we do create internal data structures (like BulkRequestStrategy, BulkRequestSupplier, etc.) per index, so I am wondering how we should protect against creating a huge number of indexes (either deliberately or due to bad input).

kkondaka, Oct 21 '22 04:10

> It looks like this feature means that, in theory, we may be sending each request to a different index in OpenSearch. The code has background flush logic to send bulk requests, etc.

We should be able to continue to use the _bulk API. Each index request can supply its own index, so these can be mixed.

> Should we keep any limits on how many different indexes we may be using?

I'm not sure we should. A pipeline author could be using this feature to support an arbitrary number of indexes, and I don't think we can say what that number is. Also, this could get tricky in a distributed Data Prepper cluster. Maybe there is some value in having a boolean flag that enables this feature? This way, pipeline authors need to be explicit when using it.

> OpenSearch itself does not seem to have any limit on the number of indexes that can be created, but we do create internal data structures (like BulkRequestStrategy, BulkRequestSupplier, etc.) per index

It appears that Data Prepper is currently using the <index>/_bulk API. You can see this here: https://github.com/opensearch-project/data-prepper/blob/bd8a7fa6950f954737c2a9c77875eaaa7b735872/data-prepper-plugins/opensearch/src/main/java/org/opensearch/dataprepper/plugins/sink/opensearch/OpenSearchSink.java#L110

We should be able to remove the specification of the index here, so that the sink uses the plain _bulk API instead.
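
To illustrate how a single `_bulk` request can mix indices, here is a minimal Python sketch that builds the newline-delimited body, with each action line naming its own `_index`. The function name `build_bulk_body` and the sample index names are assumptions for illustration; this is not how the Data Prepper sink is implemented.

```python
import json
from typing import Any, Iterable, Mapping, Tuple


def build_bulk_body(docs: Iterable[Tuple[str, Mapping[str, Any]]]) -> str:
    """Build an NDJSON _bulk body from (index_name, document) pairs."""
    lines = []
    for index_name, document in docs:
        # Action line: names the target index for the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(document))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent to `_bulk` rather than `<index>/_bulk`, so documents bound for, say, `metadata-a` and `metadata-b` can travel in the same request.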

I expect the challenging part will be tracking which indices have already been created. Right now Data Prepper creates the index when it starts. With dynamic indices, it will need to detect if the index exists and then create it if it doesn't. It should include a cache that tracks which indexes it has created (this would be per node). This could be configurable, and I'd think a maximum value on the order of hundreds would make sense. It should also use expiration since these indexes might change over time (e.g. perhaps pipeline authors want to use a timestamp in it).
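
The per-node cache described above could look roughly like the following Python sketch: a size-bounded map of index names with an expiry time per entry. Everything here is an assumption for illustration (the class name `CreatedIndexCache`, the defaults of 500 entries and one hour, and the `create_if_missing` callback that stands in for the real "check whether the index exists, create it if not" call).

```python
import time
from collections import OrderedDict
from typing import Callable


class CreatedIndexCache:
    """Tracks index names known to exist, with a size bound and expiration."""

    def __init__(self, max_size: int = 500, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = OrderedDict()  # index name -> expiry time

    def ensure_exists(self, index_name: str,
                      create_if_missing: Callable[[str], None]) -> None:
        now = self._clock()
        expires_at = self._expiry.get(index_name)
        if expires_at is not None and expires_at > now:
            self._expiry.move_to_end(index_name)  # mark as recently used
            return
        # Unknown or expired: check/create in the cluster, then remember it.
        create_if_missing(index_name)
        self._expiry[index_name] = now + self._ttl
        self._expiry.move_to_end(index_name)
        while len(self._expiry) > self._max_size:
            self._expiry.popitem(last=False)  # evict least recently used
```

A sink would call `ensure_exists(name, create_fn)` before adding a document to a bulk request. Evicting or expiring an entry only costs one extra existence check later, so a small bound does not limit how many indices a pipeline can write to.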

dlvenable, Oct 21 '22 12:10

As this issue is closed: @dlvenable, is it possible to add a timestamp or date to the index name dynamically? I want a different index for each day.

So, is something like the configuration below possible?

  sink:
    - opensearch:
        index: vpc_flow_logs-%{+YYYY.MM.dd}

Hardik-Parikh, Mar 01 '23 13:03