
[FEATURE] Auto-annotation of Repeated Tokens

Open bikash119 opened this issue 7 months ago • 2 comments

Is your feature request related to a problem? Please describe.

The same bigrams, trigrams, and longer n-grams often appear multiple times in a text being annotated. The annotator has to annotate every occurrence by hand; otherwise the tokens of the unannotated occurrences are labelled "O" by default under the IOB scheme.

Describe the solution you'd like

The Argilla UI already provides an easy-to-use interface for labelling tokens in a text. However, I've identified a use case where an additional feature could improve efficiency:

Sample claim text:

The method of claim 1 that includes the step of locating said pillow directly between a tympanic membrane and a round window membrane, but without contacting the round window membrane to block the approach of the tympanic membrane into close proximity to said round window membrane.

Assume we have labels like: ["method of use", "product", "machine", "system"]. Here the annotator labels the first occurrence of "tympanic membrane" as product. Because "tympanic membrane" appears more than once, the annotator must label every occurrence; otherwise each token of an unlabelled occurrence is implicitly tagged 'O' under the IOB scheme. This makes it harder for the model to learn that "tympanic membrane" is a product and should be treated as a single span rather than as two separate tokens, "tympanic" and "membrane".
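To make the problem concrete, here is a minimal sketch of how character-level span annotations turn into IOB tags. It assumes a simple regex word tokenizer and a hypothetical `iob_tags` helper; neither is part of Argilla, and the text is a shortened excerpt of the sample claim.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: word tokens with their character offsets.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def iob_tags(text, spans):
    """Convert (start, end, label) character spans into per-token IOB tags."""
    tagged = []
    for token, tok_start, tok_end in tokenize(text):
        tag = "O"  # default for any token not covered by an annotated span
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                prefix = "B-" if tok_start == span_start else "I-"
                tag = prefix + label
                break
        tagged.append((token, tag))
    return tagged

EXCERPT = (
    "locating said pillow between a tympanic membrane and a round window "
    "membrane to block the approach of the tympanic membrane"
)

# The annotator labels only the FIRST occurrence of the bigram.
first = EXCERPT.find("tympanic membrane")
annotated = [(first, first + len("tympanic membrane"), "product")]

for token, tag in iob_tags(EXCERPT, annotated):
    if token in ("tympanic", "membrane"):
        print(f"{token:10s} {tag}")
```

The first "tympanic membrane" comes out as `B-product` / `I-product`, while the second, unlabelled occurrence falls back to `O` for both tokens, which is the inconsistency described above.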

Proposed Improvement

When an annotator labels a span (e.g., "tympanic membrane" as "product"), the system could automatically identify all other exact matches of that span in the text and suggest the same label for them. This would:

- Reduce repetitive labeling actions
- Save significant time, especially in longer texts with recurring terms
- Ensure consistency in labeling across the document
- Prevent accidental omissions that could lead to incorrect 'O' labels in the IOB scheme
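The matching step of this proposal could look roughly like the sketch below. It is only an illustration: `suggest_repeated_spans` is a hypothetical helper, not an existing Argilla function, and it assumes case-sensitive exact matching on word boundaries, with results offered as suggestions for the annotator to accept or reject.

```python
import re

def suggest_repeated_spans(text, start, end, label):
    """Return (start, end, label) suggestions for every other exact
    occurrence of the span text[start:end]."""
    phrase = text[start:end]
    # Word boundaries keep the phrase from matching inside a longer word.
    # Assumes the span begins and ends with a word character.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
    return [
        (m.start(), m.end(), label)
        for m in pattern.finditer(text)
        if (m.start(), m.end()) != (start, end)  # skip the annotated span itself
    ]

CLAIM = (
    "The method of claim 1 that includes the step of locating said pillow "
    "directly between a tympanic membrane and a round window membrane, but "
    "without contacting the round window membrane to block the approach of "
    "the tympanic membrane into close proximity to said round window membrane."
)

# The annotator labels the first "tympanic membrane" as "product".
first = CLAIM.find("tympanic membrane")
suggestions = suggest_repeated_spans(
    CLAIM, first, first + len("tympanic membrane"), "product"
)

for s, e, label in suggestions:
    print(s, e, label, repr(CLAIM[s:e]))
```

Returning suggestions rather than writing labels directly leaves the annotator in control, which matters when the same surface form should carry different labels in different contexts.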

bikash119, Jul 23 '24 08:07