
BIO/IOB Tagging Text and Vice Versa


Are there any plans to add a BIO tagging layer to KerasNLP? This layer could take sentences and their relevant spans as input and output the corresponding BIO representations. BIO tags are used extensively in span-recognition tasks and appear throughout the literature. Some libraries, such as AllenNLP, already offer this functionality.

aflah02 avatar Apr 13 '22 17:04 aflah02

@aflah02 Thanks for reporting this feature request!

I would like to understand this better: you are proposing to add a preprocessing layer that constructs the BIO representations, which can then be consumed by downstream models? It seems to me that the tagging process is manual work, so what would your layer look like?

chenmoneygithub avatar Apr 13 '22 19:04 chenmoneygithub

@chenmoneygithub Yup, that's exactly what I'm proposing. In my experience annotating span-based data, annotators aren't expected to tag a sentence as BIO directly; rather, they might annotate it as follows:

Sentence: India is in Asia
Spans: ['India', 'Asia']

Essentially listing the spans.

The task is then to generate the BIO tags from this data, or to do the opposite: you have the BIO tags and the original sentence but need the spans, in which case you just invert the process.

So the layer I'm proposing performs these two operations: it takes a string and a set of spans as input when tagging, or a string and its BIO representation when detagging.
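
A rough sketch of what calling such a layer might look like (BIOTagger and its methods are hypothetical names, just to illustrate the two directions):

```python
# Hypothetical interface, not an existing KerasNLP API.
bio_tagger = BIOTagger()

# Tagging: sentence + spans -> token-level BIO tags.
tags = bio_tagger.tag("India is in Asia", spans=["India", "Asia"])
# -> ["B", "O", "O", "B"]

# Detagging: sentence + BIO tags -> spans.
spans = bio_tagger.detag("India is in Asia", tags=["B", "O", "O", "B"])
# -> ["India", "Asia"]
```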

aflah02 avatar Apr 13 '22 20:04 aflah02

In hindsight, it doesn't necessarily have to be a layer; it could instead be part of a set of helper functions which we might offer and develop over time. However, a layer also aligns with a workflow: the user can batch strings and spans, pass them through this layer to get the BIO tags, and then feed those through the other layers.

aflah02 avatar Apr 13 '22 20:04 aflah02

Thanks! How would the BIO tags get used in the downstream model? Are they the prediction target? Also curious: is BIO always at the character level?

chenmoneygithub avatar Apr 13 '22 21:04 chenmoneygithub

@chenmoneygithub One use case for BIO tags is in named entity recognition models. The use case is basically a sequence-to-sequence learning setting where each token is tagged according to the BIO scheme, and for unseen samples the model predicts whether each token is B, I, or O. I think these are token-level and not character-level. For example:

Sentence: France and Germany are a part of the European Union
Spans: ['France', 'Germany', 'European Union']
BIO Representation: B O B O O O O O B I

So during training the model gets the sentence and BIO representation to learn from, and during testing it takes the sentence and tries to predict the BIO representation, from which the spans can be reconstructed.
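
For concreteness, here is a minimal sketch of the span-to-BIO direction, assuming simple whitespace tokenization (a real layer would have to handle subword tokenizers and character offsets):

```python
def spans_to_bio(sentence, spans):
    # Tag each whitespace token as B (span start), I (span continuation),
    # or O (outside any span).
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for span in spans:
        span_tokens = span.split()
        n = len(span_tokens)
        # Find the first occurrence of the span's tokens in the sentence.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == span_tokens:
                tags[i:i + n] = ["B"] + ["I"] * (n - 1)
                break
    return tags

print(spans_to_bio(
    "France and Germany are a part of the European Union",
    ["France", "Germany", "European Union"],
))
# ['B', 'O', 'B', 'O', 'O', 'O', 'O', 'O', 'B', 'I']
```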

aflah02 avatar Apr 13 '22 22:04 aflah02

Thanks! I think we want this tool. One more question: what does a dataset for NER tasks look like? Does it always come with sentences and spans but missing BIO tags? I am trying to understand how this tool will fit in the e2e workflow.

chenmoneygithub avatar Apr 14 '22 20:04 chenmoneygithub

@chenmoneygithub Oh that's nice! I can't speak in general, but in my experience working with annotators and annotating data myself, the sentences and spans are present. The BIO tags are then precomputed and used in the model, rather than generated as part of the pipeline itself, but I think that's primarily because there are no tools that do it as part of the workflow, such as a layer. The process of creating these BIO tags is also often tedious, as the annotators might make small mistakes such as:

Sentence: France and Germany are a part of the European Union
Spans: ['France', 'Germany', 'European union']
BIO Representation: error, because when the tool looks for 'European union' it cannot find it, since the 'u' should have been capitalized

If we have this as a layer, we can raise errors that help the user along in the workflow, stating something like:

"Span 'European union' not found in the original sentence for sample X. Please check the spans or the sentences."

This makes it much faster to trace back to the exact data point that is causing issues and fix it.
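
A sketch of what that validation could look like on top of the whitespace-tokenized helper above (the message format is just illustrative):

```python
def validate_spans(sentence, spans, sample_id):
    # Raise early if any annotated span can't be located in the sentence,
    # pointing the user at the offending sample.
    tokens = sentence.split()
    for span in spans:
        span_tokens = span.split()
        n = len(span_tokens)
        found = any(
            tokens[i:i + n] == span_tokens
            for i in range(len(tokens) - n + 1)
        )
        if not found:
            raise ValueError(
                f"Span {span!r} not found in the original sentence for "
                f"sample {sample_id}. Please check the spans or the sentence."
            )
```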

Your question did intrigue me, so I looked up some NER datasets and noticed that they are often shared already BIO-tagged, such as this one. Even so, that fits the second proposed use case, where the layer acts analogously to a detokenizer and generates the spans from the text if needed. It also retains its original use: those BIO tags had to be created somehow, so our tool could kick in at that point to tag the entire dataset, and it can also work as a layer during training for the same reasons.

It also turns out there are several variations of the BIO tagging scheme, such as IOB1, IOB2, and BILOU, which are all minor variations of each other and hence could be offered as choices in the layer.
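
To illustrate the differences, here is how a single entity span would be tagged under two of those schemes (a sketch; IOB1 differs from IOB2 in that it only uses B to separate adjacent spans of the same type):

```python
def entity_tags(n_tokens, scheme="IOB2"):
    # Tags for one entity span of length n_tokens under a given scheme.
    if scheme == "BILOU":
        if n_tokens == 1:
            return ["U"]  # Unit-length entity.
        return ["B"] + ["I"] * (n_tokens - 2) + ["L"]  # Begin/Inside/Last.
    # IOB2 (the "BIO" used above): every entity starts with B.
    return ["B"] + ["I"] * (n_tokens - 1)

print(entity_tags(3, "IOB2"))   # ['B', 'I', 'I']
print(entity_tags(3, "BILOU"))  # ['B', 'I', 'L']
print(entity_tags(1, "BILOU"))  # ['U']
```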

Also feel free to ask any more questions you have on this!

aflah02 avatar Apr 14 '22 20:04 aflah02

As a side note, with the advent of better annotation tools such as Prodigy this is often handled by the tool itself, but I still think there are labs using Google Sheets to do the same (p.s. I was told to xD, and only later found out these tools exist, partly because they are often paid or unmaintained).

aflah02 avatar Apr 14 '22 21:04 aflah02

Thanks for the detailed explanation!

Could you write a Colab to showcase how this layer would be used in an e2e flow? Since this is not a feature planned by the KerasNLP team, we need to first settle the use case and determine the API interface.

chenmoneygithub avatar Apr 15 '22 18:04 chenmoneygithub

@chenmoneygithub You're welcome! Sure, I'll share a Colab for the same.

aflah02 avatar Apr 15 '22 20:04 aflah02

I haven't had time to read this paper yet, but open question for me...

Do we need the ability to map from "token index spans" to "source text spans" and vice versa? E.g., if a model predicts a span of interest from token 5 to token 10, do we need to say that it maps to the characters "hello there" in some input text?

If so, there might be a bit of basic infrastructure we need to build out first for our tokenizers. Essentially, we would need the functionality described here: https://www.tensorflow.org/text/api_docs/python/text/TokenizerWithOffsets.
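
For reference, the tensorflow_text tokenizers behind that interface expose this via tokenize_with_offsets, which returns character start/end offsets alongside the tokens:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(["hello there general"])
# tokens[0] -> [b'hello', b'there', b'general']
# starts[0] -> [0, 6, 12]
# ends[0]   -> [5, 11, 19]

# A predicted token span [0, 2) therefore maps back to the characters
# [0, 11) of the source text, i.e. "hello there".
```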

We could start work on that today if so. It's something I expect we will need at some point for named entity recognition, part-of-speech tagging, and question answering.

I'll read the paper, so I can comment more. But dropping the question here in case people have thoughts!

mattdangerw avatar Apr 15 '22 21:04 mattdangerw

@mattdangerw I think we might need that, but I'm not totally sure, since a lot of this (I feel) depends on how the data is being formatted. I've also had some second thoughts while putting together the colab, so I'll try to address them as I build it. Sadly I've only had limited experience with NER tasks :(

aflah02 avatar Apr 15 '22 21:04 aflah02

@aflah02 For this tagging layer, I think we can assume that we have spans at the token level. Even if the dataset comes with spans at the original-sentence level, the tokenizer and some helper functions can produce token-level spans. In short, handling the relationship between tokens and the original characters/words is outside the scope of this tagging layer, though it is very important to the e2e workflow.
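
For example, given character offsets from a tokenize_with_offsets-style tokenizer, a small helper could convert character-level spans into token-level ones (a sketch that assumes spans align cleanly with token boundaries):

```python
def char_span_to_token_span(starts, ends, char_start, char_end):
    # starts/ends: per-token character offsets from the tokenizer.
    # Returns a half-open [token_start, token_end) index range.
    token_start = next(i for i, s in enumerate(starts) if s >= char_start)
    token_end = next(i for i, e in enumerate(ends) if e >= char_end) + 1
    return token_start, token_end

# With starts=[0, 6, 12] and ends=[5, 11, 19] (from "hello there general"),
# the character span [0, 11) ("hello there") maps to tokens [0, 2).
print(char_span_to_token_span([0, 6, 12], [5, 11, 19], 0, 11))  # (0, 2)
```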

chenmoneygithub avatar Apr 16 '22 00:04 chenmoneygithub