[BUG] Cannot iterate over nested list of objects to apply text embeddings processor

Open krishy91 opened this issue 1 year ago • 10 comments

What is the bug?

When a nested field contains a list of objects and a foreach processor is used to generate text embeddings for the text in each nested object, the pipeline does not work.

How can one reproduce the bug?

  1. Create a pipeline with a foreach processor that iterates over a nested list of objects and applies the text embedding processor to each nested object. NOTE: The text embedding model also has to be present and loaded.

This is the way to use the foreach processor (accessing the inner nested objects through _ingest._value). Since the text embedding processor does not currently support dot (.) notation for fields, I had to write the field_map in the way shown below.

{
    "processors": [
        {
            "foreach": {
                "field": "nested-field",
                "processor": {
                    "text_embedding": {
                        "model_id": "7l8rQ4sB2f-9nv8R9Veh",
                        "field_map": {
                            "_ingest": {
                                "_value" :{
                                    "text-field": "vector-field"
                                }
                            }
                        }
                    }
                }
            }
        }
    ]
}
  2. Test the pipeline with the following nested document.
{
    "docs": [
        {
            "_index": "test-index",
            "_id": "1",
            "_source": {
                "nested-field": [
                    {
                        "text-field": "This is a test"
                    },
                    {
                        "text-field": "This is another test"
                    }
                ]
            }
        }
    ]
}

The above does not work. However, neural search does support querying nested fields (as a nested query), and it is possible to retrieve only the actual matches using inner_hits.
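For anyone trying to reproduce this, here is a minimal sketch of how the pipeline and the test document from the two steps above could be combined into a single request body. It assumes the inline form of the ingest simulate API (POST /_ingest/pipeline/_simulate with a "pipeline" key next to "docs"); the helper name build_simulate_body is hypothetical, and the model ID is the one from this report, so it will differ on other clusters.

```python
import json

# Pipeline definition copied from step 1 of this report.
pipeline = {
    "processors": [
        {
            "foreach": {
                "field": "nested-field",
                "processor": {
                    "text_embedding": {
                        "model_id": "7l8rQ4sB2f-9nv8R9Veh",
                        "field_map": {
                            "_ingest": {"_value": {"text-field": "vector-field"}}
                        },
                    }
                },
            }
        }
    ]
}

# Test document copied from step 2 of this report.
docs = [
    {
        "_index": "test-index",
        "_id": "1",
        "_source": {
            "nested-field": [
                {"text-field": "This is a test"},
                {"text-field": "This is another test"},
            ]
        },
    }
]


def build_simulate_body(pipeline: dict, docs: list) -> dict:
    """Hypothetical helper: wrap an inline pipeline and docs into one body."""
    return {"pipeline": pipeline, "docs": docs}


body = build_simulate_body(pipeline, docs)
# Print the JSON that would be sent as the request body.
print(json.dumps(body, indent=4))
```

The printed JSON can be pasted into any HTTP client pointed at the cluster; nothing here contacts OpenSearch itself.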

What is the expected behavior?

For the above example, a text embedding (vector-field) should be generated for each of the two nested objects.

I understand that multi-value fields aren't currently supported for generating text embeddings, but since we are dealing here with nested objects, which are independent objects in Lucene, this should work.
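To make the expected result concrete, here is a small sketch of the document shape I would expect after the pipeline runs. This is not the plugin's implementation: fake_embed is a hypothetical stand-in for the real model and returns a dummy one-element vector, and embed_nested is a hypothetical helper that only mimics "for each nested object, embed text-field into vector-field".

```python
import copy
from typing import Callable, List


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for the embedding model (returns a dummy vector)."""
    return [float(len(text))]


def embed_nested(
    source: dict,
    list_field: str,
    text_field: str,
    vector_field: str,
    embed: Callable[[str], List[float]],
) -> dict:
    """Return a copy of source with a vector added to every nested object."""
    result = copy.deepcopy(source)  # leave the input document untouched
    for item in result.get(list_field, []):
        if text_field in item:
            item[vector_field] = embed(item[text_field])
    return result


source = {
    "nested-field": [
        {"text-field": "This is a test"},
        {"text-field": "This is another test"},
    ]
}

expected = embed_nested(
    source, "nested-field", "text-field", "vector-field", fake_embed
)
# Every nested object should now carry its own vector-field.
for item in expected["nested-field"]:
    print(item["text-field"], "->", "vector-field" in item)
```

Each nested object keeps its text-field and gains its own vector-field, which is the per-object layout a nested query with inner_hits would need.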

What is your host/environment?

I tested with the official Docker image of OpenSearch 2.9 running on Windows.

krishy91 · Oct 19 '23 13:10