Match URLs and mentions correctly in NER

Open paulthemagno opened this issue 4 years ago • 4 comments

I'm running some of the available NER models on different texts.

URLs, emails, mentions and similar items appear in these texts many times, and I've seen that, not surprisingly, the model tends to make mistakes on them. For example, sometimes @.... is not tagged at all, sometimes it is tagged as ORG (keeping @ in the entity), and sometimes as PER (not including @ in the entity).

Visit my website https://mysite.com and follow me on Instagram as @veryImportantUser

I've seen that CoreNLP now has URL, EMAIL and HANDLE classes that solve this problem by using a regex annotator, so I'm asking whether such labels are planned for Stanza. It would be very helpful, because I wouldn't have to preprocess the text to remove these items and keep it clean. If that is not possible, a method to force Stanza not to tag these words would at least help.
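For illustration, a regex-based pass of the kind described above could be sketched in plain Python as follows. This is only a minimal sketch: the patterns, the `PATTERNS` list and the `find_special_spans` helper are assumptions made up for this example. They are not CoreNLP's rules and not part of Stanza. The dictionary keys mirror the `text`, `type`, `start_char` and `end_char` fields that Stanza entities expose, so the result could be compared with NER output by character offset.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not CoreNLP's rules).
# Order matters: earlier patterns win when two matches overlap.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s<>\"']+")),
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # The lookbehind keeps the "@domain" part of an email from counting as a handle.
    ("HANDLE", re.compile(r"(?<![A-Za-z0-9._%+-])@\w+")),
]


def find_special_spans(text):
    """Return non-overlapping URL/EMAIL/HANDLE spans, sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip a match that overlaps a span found by a higher-priority pattern.
            if any(m.start() < s["end_char"] and s["start_char"] < m.end()
                   for s in spans):
                continue
            spans.append({
                "text": m.group(),
                "type": label,
                "start_char": m.start(),
                "end_char": m.end(),
            })
    return sorted(spans, key=lambda s: s["start_char"])


if __name__ == "__main__":
    sample = ("Visit my website https://mysite.com and follow me "
              "on Instagram as @veryImportantUser")
    for span in find_special_spans(sample):
        print(span)
```

Patterns this simple will mislabel edge cases (for example, a URL followed directly by a full stop keeps the full stop), so they would need tuning for real data.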

Otherwise, does a smart and documented approach already exist? I could remove these words from the original text and run NER on that, but then I would have to re-integrate the removed words into the text and adjust all the character indexes of the resulting entities.
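One way to avoid the re-indexing step might be to mask these items instead of removing them, replacing each match with a filler of the same length so that every character offset stays valid for the original text. The sketch below assumes this approach; the `SPECIAL` pattern and the `mask_special` helper are hypothetical names for this example, and how a given NER model reacts to the filler is untested here.

```python
import re

# Assumed, simplified pattern for URLs, emails and @handles (illustration only).
SPECIAL = re.compile(
    r"https?://\S+"                   # URL
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # email
    r"|(?<![\w.+-])@\w+"              # @handle not preceded by an email local part
)


def mask_special(text, filler="X"):
    """Replace each match with an equally long run of `filler`.

    Because the length of the text never changes, entity offsets computed
    on the masked text can be used directly on the original text.
    """
    if len(filler) != 1:
        raise ValueError("filler must be a single character")
    return SPECIAL.sub(lambda m: filler * len(m.group()), text)


if __name__ == "__main__":
    original = ("Visit my website https://mysite.com and follow me "
                "on Instagram as @veryImportantUser")
    masked = mask_special(original)
    print(masked)
    # NER would then run on `masked`; the start and end offsets of any
    # entity it returns index into `original` without adjustment.
```

The choice of filler is a design decision: a run of identical characters keeps the masked item a single whitespace-delimited chunk, but the model may still assign it a label, so filtering out entities that overlap a masked region could still be needed.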

paulthemagno avatar Apr 06 '21 11:04 paulthemagno

There are no short-term plans for making such a change, although if you keep this open, we'll eventually work on it.

The most recent version at least uses regex to tokenize URLs and emails as single tokens, so that's an improvement.

AngledLuffa avatar Apr 08 '21 06:04 AngledLuffa

Thank you, it would be great 🥳

paulthemagno avatar Apr 08 '21 15:04 paulthemagno

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 07 '21 16:06 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Jun 14 '21 18:06 stale[bot]