[BUG] error on complex types `list type field [category] has empty string, cannot process it`

Open toyaokeke opened this issue 10 months ago • 12 comments

Initial bug reported in https://github.com/opensearch-project/ml-commons/issues/2303

What is the bug? I am creating a text embedding processor that generates vectors for a nested field. However, I receive an illegal_argument_exception because not every field in the object is one of the supported types:

  • string
  • map
  • string list

Here is the explanation from the AWS support specialist:

Our internal team informed me that this exception happened when the “id” under “brand” field has int value that is not supported by the text embedding processor from ingestion pipeline, and the fields inside the complex type must be of types: string, map or list.

However, I am not creating vectors for `id`, so I don't understand why it must meet these requirements. Is this expected behaviour or is this a bug?

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create the ingest pipeline
PUT /_ingest/pipeline/neural-search-pipeline-v2
{
  "description": "An example neural search pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "WeliNowB6EaQJ_XFf05V",
        "field_map": {
          "category": {
            "name": {
              "en": "category_name_vector"
            }
          }
        }
      }
    }
  ]
}
  2. Simulate the ingest pipeline
POST _ingest/pipeline/neural-search-pipeline-v2/_simulate
{
  "docs": [
    {
      "_index": "neural-search-index-v2",
      "_id": "1",
      "_source": {
        "category": {
          "id": 1,
          "name": {
            "en": "category 1"
          }
        }
      }
    }
  ]
}
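
To make the mismatch concrete, here is a minimal Python sketch that compares the source fields named in the `field_map` with the leaf fields present in the simulated document. The helper names `source_paths` and `leaf_paths` are hypothetical and are not part of the neural-search plugin; the sketch only illustrates that `category.id` is never requested for embedding.

```python
from typing import Any, Dict, Set


def source_paths(field_map: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Dotted paths of the text fields that the field_map asks to embed."""
    paths: Set[str] = set()
    for key, target in field_map.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(target, dict):
            # Nested mapping: keep descending into the object.
            paths |= source_paths(target, path)
        else:
            # Leaf: key is the text field, target is the vector field name.
            paths.add(path)
    return paths


def leaf_paths(doc: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Dotted paths of every non-object value in the document."""
    paths: Set[str] = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths |= leaf_paths(value, path)
        else:
            paths.add(path)
    return paths


# field_map and _source copied from the reproduction steps.
field_map = {"category": {"name": {"en": "category_name_vector"}}}
source = {"category": {"id": 1, "name": {"en": "category 1"}}}

mapped = source_paths(field_map)
unmapped = leaf_paths(source) - mapped

print("mapped for embedding:", sorted(mapped))
print("present but not mapped:", sorted(unmapped))
```

Only `category.name.en` is mapped, yet the integer at `category.id` is what support identified as the cause of the rejection.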

What is the expected behavior? The pipeline should create a vector for the category name only and leave `category.id` unchanged:

{
    "docs": [
      {
        "doc": {
          "_index": "neural-search-index-v2",
          "_id": "1",
          "_source": {
            "category": {
              "name": {
                "category_name_vector": [
                  0.019107267,
                  -0.029297447,
                  0.0070927013,
                  -0.022105217,
                  ...
                ],
                "en": "category 1"
              },
              "id": 1
            }
          },
          "_ingest": {
            "timestamp": "2024-01-08T17:59:39.543401762Z"
          }
        }
      }
    ]
  }

What is your host/environment?

  • OS: AWS Opensearch Service Managed Cluster
  • Version 2.11

Do you have any screenshots?

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "list type field [category] has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}

Do you have any additional context?

invalid doc

{
   "brand": {
      "id": 123, // cannot be integer
      "description": {
         "en": "en description female",
         "fr": "" // cannot be empty string
      }
      ...
   },
   "category": {
      "id": "123", // valid string
      "sizes": [
         "XS",
         "XL",
         "", // elements in list cannot be empty strings
         123 // elements in list cannot be integers
         ...
      ]
   }
}

valid doc

{
   "brand": {
      "id": "123",
      "description": {
         "en": "en description"
      }
      ...
   },
   "category": {
      "id": "123",
      "sizes": [ ], // empty list is valid
      "description": {
         // empty object is valid
      }
   }
}
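
As a rough aid for checking documents before ingestion, the rules annotated in the two examples above can be sketched in Python. This is an approximation inferred from the observed behaviour in this report, not the plugin's actual validation code; the function name `find_violations` is hypothetical, and it is meant to be applied to the nested objects touched by the processor.

```python
from typing import Any, List


def find_violations(value: Any, path: str = "") -> List[str]:
    """List every value that the rules observed in this report would reject."""
    problems: List[str] = []
    if isinstance(value, dict):
        # Maps are allowed (even empty ones); check each entry recursively.
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            problems.extend(find_violations(child, child_path))
    elif isinstance(value, list):
        # Lists are allowed (even empty ones), but only with non-empty strings.
        for index, item in enumerate(value):
            item_path = f"{path}[{index}]"
            if not isinstance(item, str):
                problems.append(f"{item_path}: list element is not a string")
            elif item == "":
                problems.append(f"{item_path}: list element is an empty string")
    elif isinstance(value, str):
        if value == "":
            problems.append(f"{path}: empty string")
    else:
        # Integers and any other scalar types are rejected.
        problems.append(f"{path}: unsupported type {type(value).__name__}")
    return problems


invalid_doc = {
    "brand": {"id": 123, "description": {"en": "en description female", "fr": ""}},
    "category": {"id": "123", "sizes": ["XS", "XL", "", 123]},
}
valid_doc = {
    "brand": {"id": "123", "description": {"en": "en description"}},
    "category": {"id": "123", "sizes": [], "description": {}},
}

for problem in find_violations(invalid_doc):
    print(problem)
print("valid doc violations:", find_violations(valid_doc))
```

Run against the invalid doc, the sketch flags `brand.id`, `brand.description.fr`, `category.sizes[2]` and `category.sizes[3]`; the valid doc produces no violations.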

toyaokeke • Apr 09 '24 18:04