No assignment of tags because of Named-Entity Recognition parameters

Open · leonstiegler opened this issue 1 year ago · 1 comment

Problem:

Some items don't have tags assigned, even though the model does recognize words as tags.

Findings:

The item associated with this article had no tags: https://ooe.orf.at/stories/3240633/

A manual test with the Inference API of flair/ner-multi gave the following results:

image

JSON:

image

The tags "Ebensee" (LOC), "Eibenberg" (LOC) and "Hartmuth Hofstätter" (PER) were correctly recognized.

The problem lies within the nlp_bot.py:

https://github.com/taranis-ai/taranis-ai/blob/958f4ff40adbf3779c07f6c3744908a4c6ebdc49/src/worker/worker/bots/nlp_bot.py#L67-L77

Tags can only be assigned if there are more than 2 instances of the word and the score is above 0.97.

"Ebensee" occurs twice, "Eibenberg" and "Hartmuth Hofstätter" once each, so in the eyes of the bot none of them qualifies as a tag.

Even if "Ebensee" got mentioned three times, the score would probably not be high enough to be recognized as a tag.

Solution:

Because these parameters are hardcoded and cannot be adjusted by the user (which shouldn't be the case), their appropriateness needs to be reconsidered.

leonstiegler · Jan 16 '24 15:01

len(tag) > 2 here refers to the length of the word, not the number of occurrences, so words consisting of only one or two characters are ignored right now. (Company X, formerly Twitter, would be an example where this could be relevant.)

The score threshold of 0.97 is something that could actually make sense to expose as a configuration variable to users.
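To make the two conditions concrete, here is a minimal sketch of such a filter with both values exposed as parameters. It is not the actual code from nlp_bot.py: the `Entity` class, the `select_tags` function, the parameter names `min_score` and `min_length`, and the scores in the example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for one NER result (not the bot's real type)."""
    text: str
    label: str
    score: float


def select_tags(entities, min_score=0.97, min_length=3):
    """Return {text: label} for entities that pass both filters.

    min_length=3 mirrors `len(tag) > 2`; min_score mirrors the 0.97 threshold.
    Both names are assumptions, not existing taranis-ai settings.
    """
    tags = {}
    for entity in entities:
        # len() counts the characters of the word, not its occurrences in the text
        if len(entity.text) < min_length:
            continue
        # Keep only predictions whose score is strictly above the threshold
        if entity.score <= min_score:
            continue
        tags[entity.text] = entity.label
    return tags


# Illustrative scores only; these are not the values returned by flair/ner-multi.
entities = [
    Entity("Ebensee", "LOC", 0.99),
    Entity("Eibenberg", "LOC", 0.90),
    Entity("X", "ORG", 0.99),
]

default_tags = select_tags(entities)
relaxed_tags = select_tags(entities, min_score=0.85)
```

With these made-up scores, the default call keeps only "Ebensee", while lowering `min_score` to 0.85 also keeps "Eibenberg". "X" is dropped in both cases because of its length, regardless of its score.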

That the two words in this example weren't picked up as tags seems to be an error. I will investigate in the next couple of days and publish my findings here.

b3n4kh · Jan 16 '24 17:01