stanza
Match URLs and mentions correctly in NER
I'm running some of the available NER models on different texts.
These texts often contain URLs, email addresses, mentions, and so on, and, not surprisingly, the model tends to make mistakes on them. For example, an @.... mention is sometimes not tagged at all, sometimes tagged as ORG (with the @ kept in the entity), and sometimes tagged as PER (without the @ in the entity).
Visit my website https://mysite.com and follow me on Instagram as @veryImportantUser
I've seen that CoreNLP now handles this with URL, EMAIL, and HANDLE classes assigned by a regex annotator, so I'm asking whether such labels are planned for Stanza. It would be very helpful, because I could avoid preprocessing the text to strip these items out. If this is not possible, a way to force Stanza not to tag these words would at least help.
Otherwise, is there already a smart, documented approach? I could remove these words from the original text and run NER on the result, but then I would have to re-insert the removed words and adjust the character offsets of all the resulting entities.
There are no short-term plans for making such a change, although if you keep this open, we'll eventually work on it.
The most recent version at least uses regex to tokenize URLs and emails as single tokens, so that's an improvement.
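In the meantime, one possible workaround that avoids re-indexing is to run NER on the unmodified text and then overlay regex matches on the result, dropping any model entity that overlaps a URL, email, or handle. The sketch below is only an illustration, not part of Stanza: the `Entity` class, the `PATTERNS` list, and both function names are hypothetical, and the regexes are deliberately simple (for example, a URL followed directly by punctuation would include that punctuation). It relies only on `start_char` and `end_char` attributes, which Stanza's entity spans in `doc.ents` also expose, so those could probably be passed in directly.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for an NER span with character offsets."""
    text: str
    type: str
    start_char: int
    end_char: int


# Illustrative patterns only; tune them for your own data.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+|www\.\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # A handle is an @ that does not follow a word character or another @.
    ("HANDLE", re.compile(r"(?<![\w@])@\w+")),
]


def overlaps(a, b):
    """True if the two spans share at least one character."""
    return a.start_char < b.end_char and b.start_char < a.end_char


def find_special_spans(text):
    """Return regex-based URL/EMAIL/HANDLE spans; earlier patterns take priority."""
    spans = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            candidate = Entity(m.group(), label, m.start(), m.end())
            if not any(overlaps(candidate, s) for s in spans):
                spans.append(candidate)
    return spans


def overlay_special_entities(text, model_entities):
    """Drop model entities that overlap a special span, then add the special spans."""
    special = find_special_spans(text)
    kept = [e for e in model_entities
            if not any(overlaps(e, s) for s in special)]
    return sorted(kept + special, key=lambda e: e.start_char)
```

Because the original text is never modified, the character offsets of the remaining model entities stay valid. To simply stop these words from being tagged, return `kept` alone instead of `kept + special`.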
Thank you, it would be great 🥳
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.