
[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map

Open reuschling opened this issue 10 months ago • 1 comments

As in my feature request https://github.com/opensearch-project/ml-commons/issues/2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot make sure that there is valid data for each document. Nevertheless, it would be nice if e.g. 'description' were considered, when present, for generating an answer in e.g. hybrid search.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error: {"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body but with a title only, which is a real blocker for configuring embeddings for body alone. Also, specifying several text_embedding processors, one for each field, is not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...

I tried adding empty strings as fields, but sadly it makes no difference: the processor recognizes them and still fails.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}

reuschling avatar Apr 12 '24 12:04 reuschling

@reuschling The current implementation doesn't allow empty strings: an empty string can be embedded successfully, but the resulting embedding only consumes more disk space and doesn't provide any search relevance improvement. Null values, by contrast, usually aren't indexed in OpenSearch, so we allow null values here. So you could pre-process your data to replace all empty strings with null.

We also support partial presence of the fields, e.g. even if you configured both title and body but title is not present in the document, the body can still be embedded successfully.

> I tried adding empty Strings as fields, but sadly it makes no difference, the processor recognize it.

Empty string is not allowed, but null is allowed.
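
A minimal sketch of such a pre-processing step, assuming documents are plain Python dicts on the client side before they are sent to OpenSearch. The helper name `blank_to_null` is hypothetical, and treating whitespace-only strings like empty ones is an assumption rather than confirmed processor behavior:

```python
from typing import Any, Dict, Iterable


def blank_to_null(doc: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of doc in which blank strings in the given fields become None.

    Hypothetical client-side helper; it is not part of OpenSearch or any plugin.
    Fields that are absent from the document are left absent.
    """
    cleaned = dict(doc)  # shallow copy so the caller's document is not modified
    for name in fields:
        value = cleaned.get(name)
        # Assumption: whitespace-only strings are treated the same as empty ones.
        if isinstance(value, str) and not value.strip():
            cleaned[name] = None
    return cleaned


doc = {"title": "", "body": "Some crawled text", "description": "   "}
cleaned = blank_to_null(doc, ["title", "body", "description"])
```

Here `cleaned` keeps 'body' unchanged and carries null (None) for 'title' and 'description', which is the shape described above as allowed.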

zane-neo avatar Apr 22 '24 11:04 zane-neo

@zane-neo are you planning to release this in 2.15? What's the plan?

dhrubo-os avatar May 07 '24 18:05 dhrubo-os

This doesn't look like a bug. We need @reuschling to confirm whether the above response solved the issue.

zane-neo avatar May 08 '24 14:05 zane-neo

I think this is not a good solution currently, and it is not documented either. Preprocessing the documents to add fields that do not exist but have to appear with null values can be a huge effort. You have to write code for all existing document suppliers whenever your mapping changes. Sometimes you no longer even have access to the document supplier code.

Why not change the default behavior of the text_embedding ingest processor to interpret a non-existing field as field with null value? I.e. that simply nothing should be done? Then no existing applications have to be changed in order to make embeddings in OpenSearch work.
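
As a rough illustration of the semantics proposed here (not the processor's actual implementation), the following sketch selects only those "field_map" entries whose source field is present and holds non-blank text, and silently skips the rest. The function name `fields_to_embed` is hypothetical, and skipping whitespace-only values is an assumption:

```python
from typing import Any, Dict


def fields_to_embed(doc: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, str]:
    """Map each target embedding field to the text that should be embedded.

    Source fields that are missing, null, or blank are skipped instead of
    raising an error. Illustrative only; this is not neural-search code.
    """
    selected: Dict[str, str] = {}
    for source, target in field_map.items():
        value = doc.get(source)  # a missing field behaves like a null value
        if isinstance(value, str) and value.strip():
            selected[target] = value
    return selected


field_map = {
    "title": "embedding_tns_title",
    "body": "embedding_tns_body",
    "description": "embedding_tns_description",
}
selected = fields_to_embed({"body": "Some crawled text", "description": ""}, field_map)
```

With this input only 'embedding_tns_body' is selected, so a document without 'title' and with an empty 'description' would still get its body embedding.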

reuschling avatar May 15 '24 13:05 reuschling

Should move to neural-search repo

ylwu-amzn avatar Jun 04 '24 17:06 ylwu-amzn

I created this issue also in the neural-search repo now, thanks for the hint: https://github.com/opensearch-project/neural-search/issues/774

reuschling avatar Jun 05 '24 09:06 reuschling

This issue is more related to neural-search. Closing this one in ml-commons.

rbhavna avatar Jun 18 '24 17:06 rbhavna