
[Doc] Add tutorial for remote hosted Named Entity Recognition models

q-andy opened this issue 8 months ago · 1 comment

In neural-search we have an issue tracking several customer requests for adding NER model support to ingest/search pipelines for documents. The use case is to infer named entities from document text and append them to the document so they can be searched, filtered, or visualized.

This is possible out of the box using ML inference ingest processors and connector blueprints, so I wrote up a quick tutorial using a locally hosted base-bert-NER remote model with some example queries in this comment.

Since this solely uses ml-commons features, would it be helpful for me to open a PR turning the comment into a tutorial doc in ml-commons? I could rewrite it to use a remotely hosted model on AWS as an example, since AWS offers base-bert-NER on SageMaker through the marketplace. If this would be useful, I'd be happy to contribute.

For reference, here's the tutorial:

Tutorial

Integrating NER and other kinds of metadata-extracting models at ingest time and search time is possible as of 2.16 using the ingest ML inference processor and the ML inference search request processor, integrated with remote models through ML connectors. These are tools from ml-commons that let you hook into any remote model and append the inference results to a document at ingest time, or rewrite a query at search time.

Here's a quick POC using a self-hosted base-bert-NER model. Note that if you're using a remotely hosted model on AWS SageMaker, the HuggingFace Inference API, or some other third-party platform, the format of the API call and output may differ. For reference, here's what the POC API format looks like:

curl -X POST -H "Content-Type: application/json" http://localhost:5000/classify -d '{"text": "Day off in Kyoto"}'

{
  "output": [
    {
      "end": 16,
      "entity": "B-LOC",
      "index": 4,
      "score": 0.998819887638092,
      "start": 11,
      "word": "Kyoto"
    }
  ]
}
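As a quick illustration of working with this output shape, here is a small Python sketch that filters the POC response by the model's confidence score. The response literal mirrors the example above; the `confident_entities` helper and its 0.9 threshold are illustrative choices, not part of the model API.

```python
# Example response in the POC model's output shape (copied from above)
response = {
    "output": [
        {"end": 16, "entity": "B-LOC", "index": 4,
         "score": 0.998819887638092, "start": 11, "word": "Kyoto"}
    ]
}

def confident_entities(resp, threshold=0.9):
    """Keep only entities whose confidence score clears the threshold."""
    return [e for e in resp["output"] if e["score"] >= threshold]

print([e["word"] for e in confident_entities(response)])  # ['Kyoto']
```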

Creating and deploying the model group, ML connector, and model

1. Create ML connector for remote hosted model

First, create a connector blueprint for the remote NER model. For more information, see https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/blueprints/

POST /_plugins/_ml/connectors/_create

{
  "name": "base-bert-ner",
  "description": "base-bert-NER named entity recognition model",
  "version": 1,
  "protocol": "http",
  "credential": {
    "secretArn": "",
    "roleArn": ""
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "headers": {
        "content-type": "application/json",
        "Authorization": ""
      },
      "url": "<Your endpoint>",
      "request_body": "{ \"text\": \"${parameters.text_input}\" }"
    }
  ]
}
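To make the `request_body` template concrete, here is a rough Python sketch of the substitution that happens at predict time: the `${parameters.text_input}` placeholder is replaced with the caller's input. The placeholder name comes from the blueprint above; the `expand_template` helper is an illustration of the idea, not ml-commons internals.

```python
import json

# Template from the connector blueprint above
request_body_template = '{ "text": "${parameters.text_input}" }'

def expand_template(template, parameters):
    body = template
    for key, value in parameters.items():
        # JSON-escape the value so embedded quotes don't break the request body
        escaped = json.dumps(value)[1:-1]
        body = body.replace("${parameters." + key + "}", escaped)
    return body

body = expand_template(request_body_template, {"text_input": "Day off in Kyoto"})
print(body)  # { "text": "Day off in Kyoto" }
```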

The response includes the connector ID, which we will use later to register the model to a model group.

{
  "connector_id": "GOKO1JUBS1ZxFLQmG0lD"
}

2. Register model group

Next, we register the model group to which the model will belong.

POST /_plugins/_ml/model_groups/_register
{
  "name": "remote_model_group",
  "description": "A model group for external models"
}

The response includes the model group ID.

{
  "model_group_id": "DOJF1JUBS1ZxFLQmEUm7",
  "status": "CREATED"
}

3. Register Model

Using the model group ID and the connector ID from the previous steps, we register the model to get the model ID.

POST /_plugins/_ml/models/_register

{
  "name": "base_bert_ner",
  "function_name": "remote",
  "model_group_id": "DOJF1JUBS1ZxFLQmEUm7",
  "description": "test model",
  "connector_id": "GOKO1JUBS1ZxFLQmG0lD"
}

Once the registration task completes, the response includes the model ID, which we will use in the following steps.

{
  "task_id": "GeKO1JUBS1ZxFLQmIknw",
  "status": "CREATED",
  "model_id": "GuKO1JUBS1ZxFLQmI0kx"
}

4. (Optional) Test inference

We can test the model through OpenSearch using the Predict API.

POST /_plugins/_ml/models/GuKO1JUBS1ZxFLQmI0kx/_predict
{
  "parameters": {
    "text_input": "You're a wizard, Harry!"
  }
}

The output may differ depending on the type of remote model used.

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "output": [
              {
                "end": 22,
                "entity": "B-PER",
                "index": 7,
                "score": 0.9568714499473572,
                "start": 17,
                "word": "Harry"
              }
            ]
          }
        }
      ],
      "status_code": 200
    }
  ]
}
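The entity list sits several levels deep in this response. Here is a small sketch of pulling it out in Python; the `inference_results → output → dataAsMap → output` path matches the example above, though other remote models may nest their payloads differently. The `extract_entities` helper is illustrative.

```python
# Predict response in the shape shown above, trimmed to the relevant fields
predict_response = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "response",
                    "dataAsMap": {
                        "output": [
                            {"end": 22, "entity": "B-PER", "index": 7,
                             "score": 0.9568714499473572, "start": 17,
                             "word": "Harry"}
                        ]
                    }
                }
            ],
            "status_code": 200,
        }
    ]
}

def extract_entities(resp):
    """Walk the nested predict response down to the entity list."""
    return resp["inference_results"][0]["output"][0]["dataAsMap"]["output"]

print([e["word"] for e in extract_entities(predict_response)])  # ['Harry']
```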

Create Index and Pipelines

Next, we create the ingest pipeline and the index into which documents will be ingested.

1. Create Pipeline and ML Inference Processor

We create an ingest pipeline using the ML inference processor, adapting the input_map and output_map fields to the format of our document and the model API.

PUT /_ingest/pipeline/ml_inference_pipeline
{
  "description": "Generate named_entities for ingested documents",
  "processors": [
    {
      "ml_inference": {
        "model_id": "GuKO1JUBS1ZxFLQmI0kx",
        "input_map": [
          {
            "text_input": "passage_text"
          }
        ],
        "output_map": [
          {
            "named_entities": "output"
          }
        ]
      }
    }
  ]
}
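Conceptually, `input_map` feeds a document field into a model parameter, and `output_map` writes a model output field back onto the document. The Python sketch below mimics that flow for this pipeline; `fake_model` is a hypothetical stand-in for the remote NER call, and the mapping logic is a simplification of what the processor does, not ml-commons internals.

```python
# Maps taken from the pipeline definition above:
# model parameter <- document field, and new document field <- model output field
input_map = {"text_input": "passage_text"}
output_map = {"named_entities": "output"}

def fake_model(parameters):
    # Hypothetical stand-in for the remote model call: ignores its input
    # and returns a canned response in the documented output shape.
    return {"output": [{"word": "Romeo", "entity_group": "PER",
                        "start": 19, "end": 24, "score": 0.64, "index": 6}]}

def run_ml_inference(doc):
    # Build model parameters from the document per input_map
    params = {model_field: doc[doc_field]
              for model_field, doc_field in input_map.items()}
    result = fake_model(params)
    # Append model output to a copy of the document per output_map
    enriched = dict(doc)
    for doc_field, model_field in output_map.items():
        enriched[doc_field] = result[model_field]
    return enriched

doc = run_ml_inference({"passage_text": "Wherefore art thou Romeo?"})
print(doc["named_entities"][0]["word"])  # Romeo
```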

2. (Optional) Test ML inference pipeline

We can test our inference pipeline using the _simulate endpoint.

GET /_ingest/pipeline/ml_inference_pipeline/_simulate
{
  "docs": [
    {
      "_id": "1",
      "_source": {
         "passage_text": "Wherefore art thou Romeo?"
      }
    }
  ]
}

The output displays the document with the appended information.

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "1",
        "_source": {
          "passage_text": "Wherefore art thou Romeo?",
          "named_entities": [
            {
              "start": 19,
              "score": 0.6366085410118103,
              "index": 6,
              "end": 24,
              "word": "Romeo",
              "entity_group": "PER"
            }
          ]
        },
        "_ingest": {
          "timestamp": "2025-03-27T18:36:28.835328Z"
        }
      }
    }
  ]
}

3. Index creation

Here we create an index to store the inference output. We use an explicit mapping matched to the model output, storing it in a nested object so we can run inner-hit queries. We use the format of the named_entities field seen in the previous step to define the index mapping.

PUT /test-index-1
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.default_pipeline": "ml_inference_pipeline"
  },
  "mappings": {
    "properties": {
      "passage_text": { "type": "text" },
      "named_entities": {
        "type": "nested",
        "properties": {
          "start": { "type": "integer" },
          "end": { "type": "integer" },
          "score": { "type": "float" },
          "index": { "type": "integer" },
          "word": { "type": "keyword" },
          "entity_group": { "type": "keyword" }
        }
      }
    }
  }
}

4. Bulk Ingest Documents

Here, we ingest several documents whose passage text contains the word "Forks," used in different contexts.

POST /_bulk
{"index": {"_index": "test-index-1"}}
{"passage_text": "The Columbia river splits. Forks in the middle."}
{"index": {"_index": "test-index-1"}}
{"passage_text": "Forks or chopsticks? Elise prefers forks."}
{"index": {"_index": "test-index-1"}}
{"passage_text": "Josephine wants to travel to Forks, Washington"}
{"index": {"_index": "test-index-1"}}
{"passage_text": "Forks has a population of about three thousand people"}
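The _bulk body is newline-delimited JSON, alternating an action line with a source line per document, with a trailing newline. The sketch below assembles the body above programmatically; the `bulk_body` helper is an illustrative convenience, not an OpenSearch client API.

```python
import json

passages = [
    "The Columbia river splits. Forks in the middle.",
    "Forks or chopsticks? Elise prefers forks.",
    "Josephine wants to travel to Forks, Washington",
    "Forks has a population of about three thousand people",
]

def bulk_body(index, texts):
    """Build an NDJSON _bulk body: one action line plus one source line per doc."""
    lines = []
    for text in texts:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({"passage_text": text}))
    # The _bulk API requires a trailing newline
    return "\n".join(lines) + "\n"

body = bulk_body("test-index-1", passages)
print(len(body.splitlines()))  # 8
```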

Querying document text and NER data

1. Querying based on passage text (without NER data)

If you want to find documents relating to the city Forks, you might try a regular match query on the text field. However, this returns all documents that include the term Forks, regardless of context.

GET /test-index-1/_search
{
  "query": {
    "match": {"passage_text": "Forks"}
  }
}

Output (including source only)

{
  ...
  "hits": {
    ...
    "hits": [
      {
        ...
        "_source": {
          "passage_text": "Forks or chopsticks? Elise prefers forks."
        }
      },
      {
        ...
        "_source": {
          "passage_text": "Josephine wants to travel to Forks, Washington"
        }
      },
      {
        ...
        "_source": {
          "passage_text": "The Columbia river splits. Forks in the middle."
        }
      },
      {
        ...
        "_source": {
          "passage_text": "Forks has a population of about three thousand people"
        }
      }
    ]
  }
}

2. Querying based on model output (using NER data)

Instead, we can query the model output appended to the document to narrow the results to documents that specifically refer to Forks the city:

GET /test-index-1/_search
{
  "query": {
    "nested": {
      "path": "named_entities",
      "query": {
        "match": { "named_entities.word": "Forks" }
      }
    }
  }
}

By doing a nested match query on the NER data, our results don't include the usage of the term in non-named contexts.

{
  ...
  "hits": [
    {
      ...
      "_source": {
        "passage_text": "Josephine wants to travel to Forks, Washington",
        "named_entities": [
          {
"score": 0.9974512457847595,
            "entity_group": "PER",
            "start": 0,
            "end": 9,
            "word": "Josephine"
          },
          {
            "score": 0.9987255334854126,
            "entity_group": "LOC",
            "start": 29,
            "end": 34,
            "word": "Forks"
          },
          {
            "score": 0.9993244409561157,
            "entity_group": "LOC",
            "start": 36,
            "end": 46,
            "word": "Washington"
          }
        ]
      }
    },
    {
      ...
      "_source": {
        "passage_text": "Forks has a population of about three thousand people",
        "named_entities": [
          {
            "score": 0.9588472843170166,
            "entity_group": "LOC",
            "start": 0,
            "end": 5,
            "word": "Forks"
          }
        ]
      }
    }
  ]
}
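Since only the field and match text change between these searches, a small helper can build the query body. The sketch below is an illustrative convenience (not an OpenSearch client API); it uses a match query as above, though a term query also works for keyword fields.

```python
def nested_entity_query(field, value, inner_hits=False):
    """Build a nested query against a subfield of named_entities."""
    query = {
        "nested": {
            "path": "named_entities",
            "query": {"match": {f"named_entities.{field}": value}},
        }
    }
    if inner_hits:
        # Ask OpenSearch to return the specific nested objects that matched
        query["nested"]["inner_hits"] = {}
    return {"query": query}

print(nested_entity_query("word", "Forks"))
```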

3. Inner hit nested query based on entity group

You can also query on the entity groups identified by the model, and their associated words, using inner hits:

GET /test-index-1/_search
{
  "query": {
    "nested": {
      "path": "named_entities",
      "inner_hits": {},
      "query": { 
        "term": { "named_entities.entity_group": "PER" }
      }
    }
  }
}

Here, the output includes every document containing an entity tagged PER, along with the inner hits results showing the matching text for that tag.

{
  ...
  "hits": {
    ...
    "hits": [
      {
        ...
        "_source": {
          "passage_text": "Forks or chopsticks? Elise prefers forks."
        },
        "inner_hits": {
          "named_entities": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.46800882,
              "hits": [
                {
                  "_index": "test-index-1",
                  "_id": "VeKY2ZUBS1ZxFLQmY0nr",
                  "_nested": {
                    "field": "named_entities",
                    "offset": 0
                  },
                  "_score": 0.46800882,
                  "_source": {
                    "score": 0.9696950912475586,
                    "entity_group": "PER",
                    "start": 21,
                    "end": 26,
                    "word": "Elise"
                  }
                }
              ]
            }
          }
        }
      },
      {
        ...
        "_source": {
          "passage_text": "Josephine wants to travel to Forks, Washington"
        },
        "inner_hits": {
          "named_entities": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.46800882,
              "hits": [
                {
                  "_index": "test-index-1",
                  "_id": "VuKY2ZUBS1ZxFLQmY0nr",
                  "_nested": {
                    "field": "named_entities",
                    "offset": 0
                  },
                  "_score": 0.46800882,
                  "_source": {
                    "score": 0.9974512457847595,
                    "entity_group": "PER",
                    "start": 0,
                    "end": 9,
                    "word": "Josephine"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
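Consuming this response means walking both the outer hits and the inner_hits section. Here is a sketch that pairs each matching document with the specific entity that matched; the response literal below is trimmed to the fields the loop touches, and `matched_entities` is an illustrative helper.

```python
# Search response in the shape shown above, trimmed to the relevant fields
search_response = {
    "hits": {
        "hits": [
            {
                "_source": {"passage_text": "Forks or chopsticks? Elise prefers forks."},
                "inner_hits": {
                    "named_entities": {
                        "hits": {
                            "hits": [
                                {"_source": {"entity_group": "PER", "word": "Elise"}}
                            ]
                        }
                    }
                }
            }
        ]
    }
}

def matched_entities(resp):
    """Pair each document's passage text with the nested entities that matched."""
    pairs = []
    for hit in resp["hits"]["hits"]:
        passage = hit["_source"]["passage_text"]
        for inner in hit["inner_hits"]["named_entities"]["hits"]["hits"]:
            pairs.append((passage, inner["_source"]["word"]))
    return pairs

print(matched_entities(search_response))
# [('Forks or chopsticks? Elise prefers forks.', 'Elise')]
```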

To take this further, we could also rewrite queries based on model output using the ML inference search request processor. This may not fit the NER use case, however, since these models typically determine entity groups from surrounding context. A query is usually a concise collection of specific terms, so the model likely wouldn't infer entities correctly. For example, a match query containing only the text "Forks" may not be interpreted by the model as referring to the location without surrounding information.

Let me know if you have any additional use cases for NER here; I'm happy to expand on this idea or explore different ways to simplify the process.

q-andy avatar Mar 27 '25 23:03 q-andy