OpenSearch [Feature Request] Introduce Document Processors

Is your feature request related to a problem? Please describe

When developing the ml_inference ingest processor and the ml_inference search processor, an interesting question came up: why can't a single ml_inference processor work in both the ingest and search phases? The API requests look very similar.

The answer is no: it has to be two processors, because the ingest phase and the search phase are handled in different requests.

However, the question prompted a good idea: ingest processors and search processors actually share logic for document manipulation and processing. @msfroh suggested introducing a kind of Document Processor that is responsible for document manipulation. Document Processor Factories would produce different types of processors, for example split document processors, that can be used in both Search Response Processor Factories and Ingest Processor Factories.

Describe the solution you'd like

A split document processor takes in the document object and performs the split, and the ingest processor and the search processor share that split document processor in the middle:

split ingest processor <- split document processor -> split search processor
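
For illustration, here is a minimal sketch of that shape, assuming a document can be treated as a plain Map<String, Object>. The DocumentProcessor interface, SplitDocumentProcessor, and the two wrapper classes are hypothetical names invented for this sketch, not existing OpenSearch APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical interface: operates on a document represented as a plain map.
interface DocumentProcessor {
    Map<String, Object> process(Map<String, Object> document);
}

// Shared logic: splits a string field into a list of strings.
class SplitDocumentProcessor implements DocumentProcessor {
    private final String field;
    private final String separator;

    SplitDocumentProcessor(String field, String separator) {
        this.field = field;
        this.separator = separator;
    }

    @Override
    public Map<String, Object> process(Map<String, Object> document) {
        // Work on a copy so the caller's map is left untouched.
        Map<String, Object> result = new HashMap<>(document);
        Object value = result.get(field);
        if (value instanceof String) {
            // Pattern.quote treats the separator as a literal, not a regex.
            String[] parts = ((String) value).split(Pattern.quote(separator));
            result.put(field, new ArrayList<>(Arrays.asList(parts)));
        }
        return result;
    }
}

// Stand-in for the ingest side: one document per call.
class SplitIngestProcessorSketch {
    private final DocumentProcessor delegate;

    SplitIngestProcessorSketch(DocumentProcessor delegate) {
        this.delegate = delegate;
    }

    Map<String, Object> execute(Map<String, Object> ingestDocument) {
        return delegate.process(ingestDocument);
    }
}

// Stand-in for the search side: a list of hit sources per response.
class SplitSearchResponseProcessorSketch {
    private final DocumentProcessor delegate;

    SplitSearchResponseProcessorSketch(DocumentProcessor delegate) {
        this.delegate = delegate;
    }

    List<Map<String, Object>> processResponse(List<Map<String, Object>> hitSources) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> source : hitSources) {
            out.add(delegate.process(source));
        }
        return out;
    }
}
```

Both wrappers hold the same SplitDocumentProcessor instance, so the split logic lives in one class only.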

There are some benefits in introducing document processors:

There are more than 30 ingest processors but fewer than 10 search processors. With document processors, it becomes easy to mirror ingest processors as search processors: we can create a search processor to match each type of ingest processor.
There is a centralized place for processing documents, so when a type of processor changes, developers don't have to maintain the code in two classes.

Related component

Search:Relevance

Describe alternatives you've considered

No response

Additional context

No response

May 07 '24 17:05 mingshl

ingest processors and search processors share logic for document manipulation and processing.

Can the shared logic of document processing be handled by the J-J transformer? If not, should we create another processor for this task? Piling up multiple processors might become cumbersome for customers.

May 07 '24 18:05 jackiehanyang

ingest processors and search processors share logic for document manipulation and processing.

Can the shared logic of document processing be handled by the J-J transformer? If not, should we create another processor for this task? Piling up multiple processors might become cumbersome for customers.

The J-J transformer method should be added to Document Processors so that it can also be shared by ingest and search processors. Hope this makes sense.

May 07 '24 18:05 mingshl

ingest processors and search processors share logic for document manipulation and processing.

Can the shared logic of document processing be handled by the J-J transformer? If not, should we create another processor for this task? Piling up multiple processors might become cumbersome for customers.

The J-J transformer method should be moved to Document Processors so that it can also be shared by ingest and search processors. Hope this makes sense.

The J-J transformer functions as a standalone utility within the Core package, making it adaptable for use by any processor. To clarify, we are not moving the J-J transformer to Document Processors. Instead, any processor, including Document Processors, can integrate the J-J transformer into its own implementation if desired.

May 07 '24 18:05 jackiehanyang

ingest processors and search processors share logic for document manipulation and processing.

Can the shared logic of document processing be handled by the J-J transformer? If not, should we create another processor for this task? Piling up multiple processors might become cumbersome for customers.

The J-J transformer method should be moved to Document Processors so that it can also be shared by ingest and search processors. Hope this makes sense.

The J-J transformer functions as a standalone utility within the Core package, making it adaptable for use by any processor. To clarify, we are not moving the J-J transformer to Document Processors. Instead, any processor, including Document Processors, can integrate the J-J transformer into its own implementation if desired.

I don't mean moving the method into the Document Processors. But if the parameters are added to a document processor that uses the J-J transform, then it can be shared by search and ingest processors.

It makes more sense for all document-related transformation to happen in document processors, and then we don't have to copy the same code between search processors and ingest processors.

May 07 '24 18:05 mingshl

@mingshl, @jackiehanyang is building the J-J transform as a utility function in Core, to be used by any processor or any feature. How would that fit in with this document processor?

May 07 '24 20:05 minalsha

But if the parameters are added to a document processor that uses the J-J transform, then it can be shared by search and ingest processors.

@mingshl Could you please provide further elaboration on this? What are the parameters and how will they be used in the J-J transformer?

May 07 '24 20:05 jackiehanyang

But if the parameters are added to a document processor that uses the J-J transform, then it can be shared by search and ingest processors.

@mingshl Could you please provide further elaboration on this? What are the parameters and how will they be used in the J-J transformer?

It depends on the J-J transform use case. Since the J-J transformer functions as a standalone utility, it can be used on its own in a search processor or an ingest processor. If it's used in both ingest and search, then it makes more sense to put it in a document processor so that it can be shared by both ingest and search processors.

I will leave this choice to the builders and users of the different processors.
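
As a rough sketch of the two options, assuming the transform is exposed as a static utility: JsonTransformUtil below is a hypothetical stand-in (the real J-J transformer API is not shown in this thread), and DocumentProcessor and RemapDocumentProcessor are illustrative names only. A processor can call the utility directly, or a document processor can wrap it so that ingest and search share one configured instance:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a standalone transform utility.
// It does NOT reflect the actual J-J transformer API.
final class JsonTransformUtil {
    private JsonTransformUtil() {}

    // Copies the input and moves values from old keys to new keys per the mapping.
    static Map<String, Object> remapKeys(Map<String, Object> input, Map<String, String> mapping) {
        Map<String, Object> out = new HashMap<>(input);
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            if (out.containsKey(entry.getKey())) {
                out.put(entry.getValue(), out.remove(entry.getKey()));
            }
        }
        return out;
    }
}

// Hypothetical interface: operates on a document represented as a plain map.
interface DocumentProcessor {
    Map<String, Object> process(Map<String, Object> document);
}

// Holds the transform parameters once, so ingest and search wrappers can share them.
class RemapDocumentProcessor implements DocumentProcessor {
    private final Map<String, String> mapping;

    RemapDocumentProcessor(Map<String, String> mapping) {
        this.mapping = Map.copyOf(mapping);
    }

    @Override
    public Map<String, Object> process(Map<String, Object> document) {
        // Delegate to the standalone utility; no transform logic is duplicated here.
        return JsonTransformUtil.remapKeys(document, mapping);
    }
}
```

The utility stays usable on its own, and the document processor only adds the shared configuration.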

May 07 '24 20:05 mingshl

[Triage - attendees 1 2 3 4] @mingshl Thanks for filing. Looking forward to seeing the outcome here.

May 08 '24 15:05 andrross

Document Processor Factories would produce different types of processors, for example split document processors, that can be used in both Search Response Processor Factories and Ingest Processor Factories.

Search Response and Ingest Processors expect SearchResponses and IngestDocuments respectively, and processors are implemented based on those interfaces. Just curious how Document Processors would be chained to these processors if the inputs don't line up.

Would a Search Response processor feed hits to a document processor and re-format the modified hits back into a Search Response?

May 09 '24 17:05 joshpalis

@mingshl

When developing the ml_inference ingest processor and the ml_inference search processor, an interesting question came up: why can't a single ml_inference processor work in both the ingest and search phases? The API requests look very similar.

One thing to consider here: during ingest you have a document, but during search you will not always have documents. For example, a textEmbedding ingest processor can convert one or more fields of a document to embeddings, whereas a SearchRequestProcessor works on a field of the query request. Also, when a search request completes, what you get back in the response are the fields of the documents, not the documents themselves. We generally say that a search response has a list of documents, but if you remove _source from the search response, the hits are just _ids. So the two are fundamentally different.

I think what you are looking for here is transformers, or maybe converters (+1 on @minalsha's point), each of which does one particular task. Maybe something like generic transformers that can be called by ingest processors or search processors to do a specific task.

May 12 '24 20:05 navneet1v

For example, a textEmbedding ingest processor can convert one or more fields of a document to embeddings, whereas a SearchRequestProcessor works on a field of the query request.

The common document processor logic isn't applicable to SearchRequestProcessor (since a search request doesn't have documents).

Search Response and Ingest Processors expect SearchResponses and IngestDocuments respectively, and processors are implemented based on those interfaces.

We will create a pair of adapters (with a single implementation of each) that extract the "documents" from a SearchResponse or IngestDocument, pass them through the DocumentProcessor, and return the modified SearchResponse or IngestDocument, respectively.
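
A minimal sketch of what such a pair of adapters could look like. IngestDocStub, HitStub, and SearchResponseStub are simplified stand-ins written for this example, not the real org.opensearch classes, and the adapter names are hypothetical. The sketch also assumes that a hit returned without _source is passed through unchanged:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: operates on a document represented as a plain map.
interface DocumentProcessor {
    Map<String, Object> process(Map<String, Object> document);
}

// Simplified stand-in for an ingest document (NOT the real IngestDocument).
final class IngestDocStub {
    final Map<String, Object> source;

    IngestDocStub(Map<String, Object> source) {
        this.source = source;
    }
}

// Simplified stand-in for a search hit; source is null when _source was not returned.
final class HitStub {
    final String id;
    final Map<String, Object> source;

    HitStub(String id, Map<String, Object> source) {
        this.id = id;
        this.source = source;
    }
}

// Simplified stand-in for a search response (NOT the real SearchResponse).
final class SearchResponseStub {
    final List<HitStub> hits;

    SearchResponseStub(List<HitStub> hits) {
        this.hits = hits;
    }
}

// Ingest side: unwrap the single document, process it, wrap it again.
final class IngestAdapter {
    private final DocumentProcessor delegate;

    IngestAdapter(DocumentProcessor delegate) {
        this.delegate = delegate;
    }

    IngestDocStub execute(IngestDocStub doc) {
        return new IngestDocStub(delegate.process(doc.source));
    }
}

// Search side: process each hit's source and rebuild the response.
final class SearchResponseAdapter {
    private final DocumentProcessor delegate;

    SearchResponseAdapter(DocumentProcessor delegate) {
        this.delegate = delegate;
    }

    SearchResponseStub processResponse(SearchResponseStub response) {
        List<HitStub> out = new ArrayList<>();
        for (HitStub hit : response.hits) {
            if (hit.source == null) {
                out.add(hit); // no document body to process, keep the hit as-is
            } else {
                out.add(new HitStub(hit.id, delegate.process(hit.source)));
            }
        }
        return new SearchResponseStub(out);
    }
}
```

Each adapter is written once and works with any DocumentProcessor, which is what keeps the per-processor code down to a single class.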

Once you have a DocumentProcessorFactory, you'd be able to register the ingest Processor and SearchResponseProcessor via a plugin like:

class RenameFieldDocumentProcessor implements DocumentProcessor {
  // Implementation
}

class RenameFieldDocumentProcessorFactory implements DocumentProcessorFactory {
  @Override
  public RenameFieldDocumentProcessor create(Map<String, Object> config) {
    // Parse config, return processor
  }
}

class DocumentProcessorPlugin implements IngestPlugin, SearchPipelinePlugin {

  // Wrap the shared factory in the ingest adapter.
  @Override
  public Map<String, org.opensearch.ingest.Processor.Factory> getProcessors(org.opensearch.ingest.Processor.Parameters parameters) {
    return Map.of(
      "rename_field", new DocumentIngestProcessorFactory(new RenameFieldDocumentProcessorFactory())
    );
  }

  // Wrap the same shared factory in the search response adapter.
  @Override
  public Map<String, Processor.Factory<SearchResponseProcessor>> getResponseProcessors(Parameters parameters) {
    return Map.of(
      "rename_field", new DocumentSearchResponseProcessorFactory(new RenameFieldDocumentProcessorFactory())
    );
  }
}

You get two processors for the price of one.
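
To make the "two for one" point concrete, here is a self-contained sketch that runs without OpenSearch on the classpath. IngestProcessorStub and SearchResponseProcessorStub are stand-ins for the real processor types, the two wrapping factories are simplified versions of the hypothetical adapters named above, and the "field" / "target_field" config keys are an assumption modeled on the existing rename ingest processor:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical shared abstractions.
interface DocumentProcessor {
    Map<String, Object> process(Map<String, Object> document);
}

interface DocumentProcessorFactory {
    DocumentProcessor create(Map<String, Object> config);
}

// The only place where the rename logic is written.
class RenameFieldDocumentProcessorFactory implements DocumentProcessorFactory {
    @Override
    public DocumentProcessor create(Map<String, Object> config) {
        // Config keys are an assumption for this sketch.
        String from = (String) config.get("field");
        String to = (String) config.get("target_field");
        return document -> {
            Map<String, Object> out = new HashMap<>(document);
            if (out.containsKey(from)) {
                out.put(to, out.remove(from));
            }
            return out;
        };
    }
}

// Stand-ins for the real ingest and search response processor types.
interface IngestProcessorStub {
    Map<String, Object> execute(Map<String, Object> document);
}

interface SearchResponseProcessorStub {
    List<Map<String, Object>> processResponse(List<Map<String, Object>> hitSources);
}

// Wraps any document processor factory as an ingest processor factory.
class DocumentIngestProcessorFactory {
    private final DocumentProcessorFactory inner;

    DocumentIngestProcessorFactory(DocumentProcessorFactory inner) {
        this.inner = inner;
    }

    IngestProcessorStub create(Map<String, Object> config) {
        DocumentProcessor processor = inner.create(config);
        return processor::process;
    }
}

// Wraps any document processor factory as a search response processor factory.
class DocumentSearchResponseProcessorFactory {
    private final DocumentProcessorFactory inner;

    DocumentSearchResponseProcessorFactory(DocumentProcessorFactory inner) {
        this.inner = inner;
    }

    SearchResponseProcessorStub create(Map<String, Object> config) {
        DocumentProcessor processor = inner.create(config);
        return hits -> hits.stream().map(processor::process).collect(Collectors.toList());
    }
}
```

Adding another shared processor in this design means writing one more DocumentProcessorFactory; the two wrapping factories stay unchanged.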

May 21 '24 18:05 msfroh