
Support for specific tag formats in named entity recognition

Open henchaves opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. I followed the tutorial on Named Entity Recognition and noticed that its tag format does not follow any of the most widely used tag formats (standoff, IOB2, BILUO, etc.). Would it be possible to add support for at least one of these formats?

Describe the use case I tried to train a model after replacing the tag format presented in the tutorial (too simple and ambiguous for my project) with the IOB2 format.

For example, I replaced the tags Movie Movie O O Date O O O O O O Person Person with B-Movie I-Movie O O B-Date O O O O O O B-Person I-Person for the text "Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott".
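A minimal sketch of that replacement, under the assumption that adjacent tokens with the same label belong to a single entity (which is exactly the ambiguity of the flat format). `flat_to_iob2` is a hypothetical helper written for illustration, not part of Ludwig:

```python
from typing import List


def flat_to_iob2(tags: List[str], outside: str = "O") -> List[str]:
    """Convert flat per-token labels to IOB2.

    Assumption: consecutive tokens with the same label form one entity,
    so two adjacent entities of the same type cannot be told apart.
    """
    converted = []
    previous = outside
    for tag in tags:
        if tag == outside:
            converted.append(outside)
        elif tag == previous:
            converted.append(f"I-{tag}")  # continues the current entity
        else:
            converted.append(f"B-{tag}")  # starts a new entity
        previous = tag
    return converted


flat = "Movie Movie O O Date O O O O O O Person Person".split()
print(" ".join(flat_to_iob2(flat)))
# B-Movie I-Movie O O B-Date O O O O O O B-Person I-Person
```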

But when I ran predictions with the trained model, some of the output tags did not follow the IOB2 positioning rule, for example (hypothetical): "Harrison Ford and Rutger Hauer starred in it" - I-Person I-Person O B-Person I-Person O O O

Looking at the output for the "Harrison Ford" text slice, we can see that the model incorrectly gives the first token ("Harrison") an I-Person tag even though no B-Person tag precedes it. The expected output is what I got for "Rutger Hauer", where the tags for both tokens follow the positioning rule correctly.
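The positioning rule can be checked mechanically. The sketch below flags every `I-X` tag that is not preceded by `B-X` or `I-X`; `find_iob2_violations` is a hypothetical name used only for this illustration, and the tags are the hypothetical prediction from above:

```python
from typing import List


def find_iob2_violations(tags: List[str]) -> List[int]:
    """Return the indices of I- tags that do not continue an entity."""
    violations = []
    previous = "O"  # treat the start of the sequence as outside any entity
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                violations.append(i)
        previous = tag
    return violations


tags = "I-Person I-Person O B-Person I-Person O O O".split()
print(find_iob2_violations(tags))  # [0]
```

Only index 0 ("Harrison") is reported: the second I-Person follows a tag of the same type, so it becomes valid once the first tag is repaired to B-Person.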

Describe the solution you'd like The expected result is that tags always appear in valid positions, so it would be great if we could specify a tag format (e.g. IOB2) and have the model automatically verify the tag positions of its outputs, making the necessary replacements when it finds errors.

Describe alternatives you've considered I can fix these errors on my side, but a named-entity recognition model is expected to return clean output automatically, without needing extra verification.

Additional context A real example of model results not following the tag format (IOB2) on which it was trained: https://prnt.sc/9SFilgpvlusX

henchaves, Jul 25 '22 14:07

@henchaves thanks for sharing this. It's a good point. The reason no specific format is enforced is flexibility. For instance, one could build an extractive summarization model by using 0 and 1, or keep and not, as labels, so constraining the output to a specific format would be limiting. On the other hand, I believe we could introduce a postprocessing function to handle these kinds of scenarios, a function that fixes IOB prefixes. If you have a function that already performs the fix, would you consider contributing it?

w4nderlust, Jul 25 '22 19:07

Hey @w4nderlust. Sure! I'm using the IOB2 tag format, so I've built a function to fix the wrongly predicted IOB2 tags:

from typing import List


def split_to_tags(prediction: str) -> List[str]:
  tags = prediction.split(",")
  return tags


def join_tags(tags: List[str]) -> str:
  prediction = ",".join(tags)
  return prediction


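def fix_iob2_tags(tags: List[str]) -> List[str]:
  # Repairs the list in place: an "I-X" tag that does not continue an
  # entity of type X (previous tag is neither "B-X" nor "I-X") becomes "B-X".
  # Index 0 is skipped; in the example below it holds "<SOS>".
  for i in range(1, len(tags)):
    current_tag = tags[i]
    previous_tag = tags[i-1]  # Already repaired by the previous iteration

    if current_tag == "<EOS>":  # End of sentence; tags after it are left untouched
      break

    if current_tag.startswith("I-"):
      entity_type = current_tag[2:]
      # Accept both "B-X" and "I-X" as predecessors, so entities longer
      # than two tokens (B-X I-X I-X) are not split.
      if previous_tag not in (f"B-{entity_type}", f"I-{entity_type}"):
        tags[i] = f"B-{entity_type}"

  return tags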


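def postprocessing_ner_tags(prediction_list: List[str], tag_format: str) -> List[str]:
  # Select the repair function for the requested tag format
  if tag_format == "iob2":
    fix_func = fix_iob2_tags
  else:
    # Without this branch, fix_func would be unbound for any other format
    raise ValueError(f"Unsupported tag format: {tag_format}")

  prediction_list_processed = []

  # Each prediction is a comma-separated string of tags
  for prediction in prediction_list:
    tags = split_to_tags(prediction)
    tags_fixed = fix_func(tags)
    prediction_fixed = join_tags(tags_fixed)
    prediction_list_processed.append(prediction_fixed)

  return prediction_list_processed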

For example, the following snippet:

prediction_list = ["<SOS>,O,O,I-PER,I-PER,B-MISC,B-PER,B-PER,O,O,<EOS>,I-PER,O,O"]
prediction_list_postprocessed = postprocessing_ner_tags(prediction_list, tag_format="iob2")
print(prediction_list_postprocessed)

Has the output: ['<SOS>,O,O,B-PER,I-PER,B-MISC,B-PER,B-PER,O,O,<EOS>,I-PER,O,O']

henchaves, Jul 27 '22 12:07

Hi @henchaves, would you like to submit a PR for the fix you proposed?

dalianaliu, Sep 21 '22 20:09