[BUG] error on complex types `list type field [category] has empty string, cannot process it`

Open toyaokeke opened this issue 1 year ago • 12 comments

Initial bug reported in https://github.com/opensearch-project/ml-commons/issues/2303

What is the bug? I am creating a text embedding processor that creates vectors on a nested field. However, I receive an illegal_argument_exception because not every field in the object is one of the required types:

  • string
  • map
  • string list

Here is the explanation from the AWS support specialist

Our internal team informed me that this exception happened when the “id” under “brand” field has int value that is not supported by the text embedding processor from ingestion pipeline, and the fields inside the complex type must be of types: string, map or list.

However, I am not creating vectors on id so I don't understand why it must follow these requirements. Is this expected behaviour or is this a bug?

How can one reproduce the bug? Steps to reproduce the behavior:

  1. create ingest pipeline
PUT /_ingest/pipeline/neural-search-pipeline-v2
{
  "description": "An example neural search pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "WeliNowB6EaQJ_XFf05V",
        "field_map": {
          "category": {
            "name": {
              "en": "category_name_vector"
            }
          }
        }
      }
    }
  ]
}
  2. simulate ingest pipeline
POST _ingest/pipeline/neural-search-pipeline-v2/_simulate
{
  "docs": [
    {
      "_index": "neural-search-index-v2",
      "_id": "1",
      "_source": {
        "category": {
          "id": 1,
          "name": {
            "en": "category 1"
          }
        }
      }
    }
  ]
}

What is the expected behavior? It should create vectors on the category name.

{
    "docs": [
      {
        "doc": {
          "_index": "neural-search-index-v2",
          "_id": "1",
          "_source": {
            "category": {
              "name": {
                "category_name_vector": [
                  0.019107267,
                  -0.029297447,
                  0.0070927013,
                  -0.022105217,
                  ...
                ],
                "en": "category 1"
              },
              "id": 1
            }
          },
          "_ingest": {
            "timestamp": "2024-01-08T17:59:39.543401762Z"
          }
        }
      }
    ]
  }

What is your host/environment?

  • OS: AWS Opensearch Service Managed Cluster
  • Version 2.11

Do you have any screenshots?

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "list type field [category] has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}

Do you have any additional context?

invalid doc

{
   "brand": {
      "id": 123, // cannot be integer
      "description": {
         "en": "en description female",
         "fr": "" // cannot be empty string
      }
      ...
   },
   "category": {
      "id": "123", // valid string
      "sizes": [
         "XS",
         "XL",
         "", // elements in list cannot be empty strings
         123 // elements in list cannot be integers
         ...
      ]
   }
}

valid doc

{
   "brand": {
      "id": "123",
      "description": {
         "en": "en description"
      }
      ...
   },
   "category": {
      "id": "123",
      "sizes": [ ], // empty list is valid
      "description": {
         // empty object is valid
      }
   }
}
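
The rules illustrated by the two documents above can be modelled with a small recursive check. This is only a Python sketch of the behaviour as described in this report, not the plugin's actual validation code. The function name `is_accepted` is hypothetical, and the elided (`...`) parts of the documents are left out.

```python
def is_accepted(value):
    """Model of the reported rules; not the plugin's actual validation code."""
    if isinstance(value, str):
        return value != ""  # empty strings were rejected
    if isinstance(value, dict):
        # an empty map passes because all() over nothing is True
        return all(is_accepted(child) for child in value.values())
    if isinstance(value, list):
        # list elements must be non-empty strings; an empty list passes
        return all(isinstance(item, str) and item != "" for item in value)
    return False  # integers and any other type were rejected


# Documents from the examples above, without the elided ("...") parts.
invalid_doc = {
    "brand": {"id": 123, "description": {"en": "en description female", "fr": ""}},
    "category": {"id": "123", "sizes": ["XS", "XL", "", 123]},
}
valid_doc = {
    "brand": {"id": "123", "description": {"en": "en description"}},
    "category": {"id": "123", "sizes": [], "description": {}},
}

print(is_accepted(invalid_doc))  # False
print(is_accepted(valid_doc))    # True
```

Note that the check walks every field of the object, including fields such as `id` that are not part of the `field_map`, which is the behaviour this issue questions.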

toyaokeke avatar Apr 09 '24 18:04 toyaokeke

@zane-neo can you look into this issue?

navneet1v avatar Apr 09 '24 18:04 navneet1v

@toyaokeke From your example, I see two different cases:

  1. category is a map type instead of list:
{
  "category": {
    "id": 1,
    "name": {
      "en": "category 1"
    }
  }
}
  2. category doesn't have name.en:
"category": {
      "id": "123",
      "sizes": [ ] // empty list is valid
   }

Can you confirm which case is your real production case?

zane-neo avatar Apr 10 '24 00:04 zane-neo

I did find an issue in the code, but the error message seems to differ from yours. I created this ticket to track it: https://github.com/opensearch-project/ml-commons/issues/2309. I am still trying to understand your case to see if there are other issues here.

zane-neo avatar Apr 10 '24 01:04 zane-neo

Hi @zane-neo and thank you for looking into this.

The invalid doc and valid doc examples in my description are all possible scenarios.

  • Empty strings in an array do not occur often, but they are possible
  • name.zh missing is also possible. In my production environment, not all documents have translations for every supported language

toyaokeke avatar Apr 10 '24 02:04 toyaokeke

I also see the error from the ticket you created when I run the simulator:

https://github.com/opensearch-project/ml-commons/issues/2309

  • when category.id is a number, the error you mentioned occurs
  • when category.description.en, for example, is an empty string, I get the error in this ticket, which is unexpected since the text embedding is on category.name, not category.description

The main issue with both is that I am creating embeddings on category.name.en, which does exist. However other fields that I am not creating embeddings on are causing these errors.

toyaokeke avatar Apr 10 '24 02:04 toyaokeke

@toyaokeke, I would like to double-check the category data structure: it's a map type instead of a list, right?

{
  "category": {
    "name": {
      ...
    }
  }
}

NOT

{
  "category": [
    {
      "name": {
        ...
      }
    }
  ]
}

The reason I ask is that a list type would need a much more complex fix, while a map type would be easier to fix. I created this issue to track the list-of-maps case: https://github.com/opensearch-project/neural-search/issues/686

zane-neo avatar Apr 11 '24 09:04 zane-neo

Correct, it is a map type.

toyaokeke avatar Apr 11 '24 09:04 toyaokeke

@toyaokeke the new ml_inference processor supports nested object types. Check out this unit test.

You can try using the ml_inference processor in version 2.14 and see if that solves your issue. https://github.com/opensearch-project/documentation-website/pull/7095

I am still working on the tutorial for using ml_inference processors with neural search queries and will post here once I have it.

mingshl avatar May 07 '24 18:05 mingshl

Hi @mingshl thank you for directing me to this! If I understand correctly this ml_inference processor supports nested type, but is only available in 2.14?

I am using AWS Managed Service, and it currently only supports up to 2.11. I would be more than happy to test that processor once AWS releases support for that version 🙏🏿

toyaokeke avatar May 07 '24 20:05 toyaokeke

@mingshl considering what you shared, is this a bug that will still be fixed for the text_embedding processor, or are users encouraged to switch to the ml_inference processor for nested fields?

toyaokeke avatar May 10 '24 18:05 toyaokeke

Hello @zane-neo, I saw this PR was recently merged. I just wanted to confirm whether a fix has been merged and, if so, whether you know which version it will be released in?

toyaokeke avatar Jun 10 '24 15:06 toyaokeke

Hi @zane-neo, just checking in to see if this bug has been resolved and can be closed?

toyaokeke avatar Aug 01 '24 06:08 toyaokeke

@zane-neo can you pls validate and confirm if the bug has been fixed ? Thanks!

naveentatikonda avatar Sep 18 '24 00:09 naveentatikonda

as part of the fix, I was also wondering if more detail could be provided in the error message? for example, which field within a nested attribute is causing the error?

for example,

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "list type field [category] has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}

I do not know which field within [category] (e.g. name, description) is causing the error unless I look through the document myself. it would be great if the error mentioned for example

{
   "failures": {
        "index": "neural-search-index-v2",
        "id": "5302821",
        "cause": {
          "type": "illegal_argument_exception",
          "reason": "[name] field within [category] within ... [rootEntity] entity has empty string, cannot process it"
        },
        "status": 400
   },
   ...
}
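
One way such a path-qualified message could be produced is to carry the field path along while walking the document. This is a hypothetical Python sketch, not the plugin's code. The function name `first_violation` and the dotted-path message format are assumptions made for illustration, and the sample document is based on the empty `description.en` case mentioned earlier in this thread.

```python
def first_violation(value, path):
    """Return a message naming the first offending field, or None if none is found."""
    if isinstance(value, str):
        return f"field [{path}] has empty string" if value == "" else None
    if isinstance(value, dict):
        for key, child in value.items():
            message = first_violation(child, f"{path}.{key}")
            if message:
                return message
        return None
    if isinstance(value, list):
        for index, child in enumerate(value):
            message = first_violation(child, f"{path}[{index}]")
            if message:
                return message
        return None
    return f"field [{path}] has unsupported type {type(value).__name__}"


category = {"name": {"en": "category 1"}, "description": {"en": ""}}
print(first_violation(category, "category"))
# field [category.description.en] has empty string
```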

toyaokeke avatar Sep 24 '24 19:09 toyaokeke

cc : @model-collapse

jmazanec15 avatar Oct 02 '24 00:10 jmazanec15

@toyaokeke Sorry, I missed updating this issue. This is already fixed in this PR: https://github.com/opensearch-project/neural-search/pull/687. The root cause was that, when validating a map type field, fields not listed in the configuration were also validated. The fix removed the validation on those non-embedding fields, so you should no longer see this error and do not need to worry about which field name is causing it.
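
The idea behind the fix can be illustrated by walking the `field_map` rather than the whole document, so that only configured fields are checked. This is a minimal Python sketch under stated assumptions, not the code from the PR. The name `validate_configured` is hypothetical, and skipping fields that are missing from the document is an assumption of this sketch.

```python
def validate_configured(source, field_map, path=""):
    """Check only the source fields that field_map names; ignore everything else."""
    for key, mapping in field_map.items():
        field_path = f"{path}.{key}" if path else key
        if key not in source:
            continue  # assumption: a missing field (e.g. an absent translation) is skipped
        value = source[key]
        if isinstance(mapping, dict):
            if not isinstance(value, dict):
                raise ValueError(f"[{field_path}] is configured as a map but is not one")
            validate_configured(value, mapping, field_path)
        else:
            # leaf: a non-empty string, or a list of non-empty strings
            items = value if isinstance(value, list) else [value]
            if any(not isinstance(item, str) or item == "" for item in items):
                raise ValueError(f"[{field_path}] must contain only non-empty strings")


# field_map and document taken from the reproduction steps in this issue.
field_map = {"category": {"name": {"en": "category_name_vector"}}}
source = {"category": {"id": 1, "name": {"en": "category 1"}}}

validate_configured(source, field_map)  # no exception: category.id is never inspected
```

With this approach the integer `category.id` is never visited, because it does not appear in the `field_map`.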

zane-neo avatar Oct 09 '24 05:10 zane-neo