ludwig
Support for specific tag formats in named entity recognition
Is your feature request related to a problem? Please describe. I followed the tutorial on Named Entity Recognition. Then I noticed that the tag format was not following any of the most used tag formats (standoff, IOB2, BILUO, etc.). Is it possible to add support for at least one of these tag formats?
Describe the use case I tried to train a model replacing the tag format presented in the tutorial (too simple and ambiguous for my project) with the IOB2 format.
For example, I replaced the tags Movie Movie O O Date O O O O O O Person Person
with B-Movie I-Movie O O B-Date O O O O O O B-Person I-Person
for the text "Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott".
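As an illustration of the replacement described above, here is a minimal sketch that converts the tutorial's plain tags to IOB2. The function name `plain_to_iob2` and the `outside` parameter are hypothetical and not part of Ludwig. The sketch assumes that consecutive identical tags belong to the same entity, which is exactly the ambiguity of the plain format: two adjacent entities of the same type cannot be told apart.

```python
from typing import List


def plain_to_iob2(tags: List[str], outside: str = "O") -> List[str]:
    """Convert plain per-token entity tags to IOB2 (hypothetical helper)."""
    converted = []
    previous = outside
    for tag in tags:
        if tag == outside:
            converted.append(outside)
        elif tag == previous:
            # Same type as the previous token: assume the entity continues
            converted.append(f"I-{tag}")
        else:
            # A new entity starts here
            converted.append(f"B-{tag}")
        previous = tag
    return converted


plain = "Movie Movie O O Date O O O O O O Person Person".split()
print(" ".join(plain_to_iob2(plain)))
# B-Movie I-Movie O O B-Date O O O O O O B-Person I-Person
```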
But when I predicted some data points with this trained model, I got tags that were not following the positioning rule, for example (hypothetical):
"Harrison Ford and Rutger Hauer starred in it" - I-Person I-Person O B-Person I-Person O O O
Visualising the output for the "Harrison Ford" text slice, we can see that the model incorrectly gives the first token ("Harrison") an I-Person tag even though no B-Person tag precedes it. The expected output is what I got for "Rutger Hauer", where the tags for both tokens follow the positioning rule correctly.
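The positioning rule can be checked mechanically. The sketch below is only an illustration (the name `find_iob2_violations` is hypothetical): it flags every I- tag that is not immediately preceded by a B- or I- tag of the same entity type, using the hypothetical prediction from above.

```python
from typing import List


def find_iob2_violations(tags: List[str]) -> List[int]:
    """Return the indices of I- tags that do not continue an entity."""
    violations = []
    previous_tag = "O"  # Treat the start of the sequence as outside any entity
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous_tag not in (f"B-{entity}", f"I-{entity}"):
                violations.append(i)
        previous_tag = tag
    return violations


tags = ["I-Person", "I-Person", "O", "B-Person", "I-Person", "O", "O", "O"]
print(find_iob2_violations(tags))  # [0]
```

Only index 0 ("Harrison") is reported: the second I-Person follows another I-Person, so it becomes valid once the first tag is corrected to B-Person.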
Describe the solution you'd like The expected solution is to always get tags in the correct positions, so it would be great if we could specify a tag format (e.g. IOB2) and have the model automatically verify the tag positions of its outputs, making the necessary replacements when it finds errors.
Describe alternatives you've considered I can fix these errors on my side, but a named-entity recognition model is expected to return clean output automatically, without needing extra verification.
Additional context A real example of model results not following the tag format (IOB2) on which it was trained: https://prnt.sc/9SFilgpvlusX
@henchaves thanks for sharing this, it's a good point. The reason no specific format is enforced is flexibility. For instance, one could build an extractive summarization model with 0 and 1, or keep and not, as labels, so constraining tags to a specific format would be limiting. On the other hand, I believe we could introduce a postprocessing function to handle this kind of scenario, one that fixes IOB prefixes. If you already have a function that performs the fix, would you consider contributing it?
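One way to reconcile the two concerns is to make the fix opt-in. This is a rough sketch under my own assumptions, not Ludwig's API: `TAG_REPAIRERS`, `repair_iob2` and `postprocess` are hypothetical names. With no tag format given, labels pass through untouched, so tasks with arbitrary labels such as keep/not are unaffected.

```python
from typing import Callable, Dict, List, Optional


def repair_iob2(tags: List[str]) -> List[str]:
    """Return a copy of tags in which every entity starts with a B- tag."""
    repaired: List[str] = []
    previous_tag = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous_tag not in (f"B-{entity}", f"I-{entity}"):
                tag = f"B-{entity}"
        repaired.append(tag)
        previous_tag = tag
    return repaired


# Hypothetical registry: one repair function per supported tag format
TAG_REPAIRERS: Dict[str, Callable[[List[str]], List[str]]] = {"iob2": repair_iob2}


def postprocess(tags: List[str], tag_format: Optional[str] = None) -> List[str]:
    if tag_format is None:
        return list(tags)  # No format requested: leave the labels as they are
    if tag_format not in TAG_REPAIRERS:
        raise ValueError(f"Unsupported tag format: {tag_format}")
    return TAG_REPAIRERS[tag_format](tags)


print(postprocess(["keep", "not", "keep"]))
# ['keep', 'not', 'keep']
print(postprocess(["I-Person", "I-Person", "O"], tag_format="iob2"))
# ['B-Person', 'I-Person', 'O']
```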
Hey @w4nderlust. Sure! I'm using the IOB2 tag format, so I've built a function to fix the wrongly predicted IOB2 tags:
from typing import List


def split_to_tags(prediction: str) -> List[str]:
    # Split a comma-separated prediction string into a list of tags
    return prediction.split(",")


def join_tags(tags: List[str]) -> str:
    # Join a list of tags back into a comma-separated prediction string
    return ",".join(tags)


def fix_iob2_tags(tags: List[str]) -> List[str]:
    # Starts at index 1 because index 0 holds the <SOS> marker
    for i in range(1, len(tags)):
        current_tag = tags[i]
        previous_tag = tags[i - 1]
        if current_tag == "<EOS>":  # Stop at end of sentence
            break
        if current_tag.startswith("I-"):
            entity = current_tag[2:]
            # An I- tag is valid only right after a B- or I- tag of the same
            # type; accepting I- keeps entities of three or more tokens intact
            if previous_tag not in (f"B-{entity}", f"I-{entity}"):
                tags[i] = f"B-{entity}"
    return tags


def postprocessing_ner_tags(prediction_list: List[str], tag_format: str) -> List[str]:
    if tag_format == "iob2":
        fix_func = fix_iob2_tags
    else:
        raise ValueError(f"Unsupported tag format: {tag_format}")
    prediction_list_processed = []
    for prediction in prediction_list:
        tags = split_to_tags(prediction)
        tags_fixed = fix_func(tags)
        prediction_fixed = join_tags(tags_fixed)
        prediction_list_processed.append(prediction_fixed)
    return prediction_list_processed
For example, the following snippet:
prediction_list = ["<SOS>,O,O,I-PER,I-PER,B-MISC,B-PER,B-PER,O,O,<EOS>,I-PER,O,O"]
prediction_list_postprocessed = postprocessing_ner_tags(prediction_list, tag_format="iob2")
print(prediction_list_postprocessed)
Has the output: ['<SOS>,O,O,B-PER,I-PER,B-MISC,B-PER,B-PER,O,O,<EOS>,I-PER,O,O']
Hi @henchaves, would you like to submit a PR for the fix you proposed?