taranis-ai
No assignment of tags because of Named-Entity Recognition parameters
Problem:
Some items don't have tags assigned, even though the model does recognize words as tags.
Findings:
The item associated with this article had no tags: https://ooe.orf.at/stories/3240633/
A manual test with the Inference API of flair/ner-multi gave the following results:
JSON:
The tags "Ebensee" (LOC), "Eibenberg" (LOC) and "Hartmuth Hofstätter" (PER) were correctly recognized.
The problem lies within nlp_bot.py:
https://github.com/taranis-ai/taranis-ai/blob/958f4ff40adbf3779c07f6c3744908a4c6ebdc49/src/worker/worker/bots/nlp_bot.py#L67-L77
Tags can only be assigned if there are more than 2 instances of the word and the score is above 0.97.
"Ebensee" occurs twice, "Eibenberg" and "Hartmuth Hofstätter" once each; in the eyes of the bot, none of them therefore qualifies as a tag.
Even if "Ebensee" got mentioned three times, the score would probably not be high enough to be recognized as a tag.
Solution:
These parameters are hardcoded and cannot be adjusted by the user, which shouldn't be the case. Their appropriateness needs to be reconsidered.
len(tag) > 2 refers to the length of the word, not to the number of occurrences, so words consisting of one or two characters are currently ignored. (The company X, formerly Twitter, would be an example where this could be relevant.)
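To make the length point concrete, here is a minimal sketch. It only mirrors the comparison quoted above; the helper name passes_length_check is hypothetical and not taken from nlp_bot.py. Python's len() on a string counts characters, so the check says nothing about how often a word appears in the article.

```python
# Hypothetical helper: mirrors only the comparison quoted in this issue,
# not the actual code in nlp_bot.py.
def passes_length_check(tag: str) -> bool:
    # len() counts the characters of the tag text, not its occurrences.
    return len(tag) > 2


if __name__ == "__main__":
    for tag in ["Ebensee", "Eibenberg", "Hartmuth Hofstätter", "EU", "X"]:
        print(f"{tag!r}: {passes_length_check(tag)}")
```

Under this reading, all three entities from the article pass the length check regardless of how often they are mentioned, while "X" and "EU" would be dropped.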
The score of 0.97 is something that could actually make sense to expose as a configuration variable to users.
That the two words in this example weren't picked up as tags seems to be an error. I will investigate over the next couple of days and publish the findings here.
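As a rough sketch of what exposing the score threshold as a configuration variable could look like: the environment variable NER_MIN_SCORE, the Entity class and both function names below are assumptions made for illustration, not existing taranis-ai settings or code, and the scores in the sample are invented, not the values returned by flair/ner-multi.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for one NER result."""
    text: str
    label: str
    score: float


def load_min_score(default: float = 0.97) -> float:
    """Read the threshold from the (assumed) NER_MIN_SCORE variable."""
    raw = os.environ.get("NER_MIN_SCORE")
    if raw is None:
        return default
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"NER_MIN_SCORE must be in [0, 1], got {value}")
    return value


def filter_entities(entities, min_score, min_length=3):
    """Keep entities that are long enough and scored above the threshold.

    min_length=3 is equivalent to the current len(tag) > 2 check.
    """
    return [
        e for e in entities
        if len(e.text) >= min_length and e.score > min_score
    ]


# Invented scores, for illustration only.
sample = [
    Entity("Ebensee", "LOC", 0.99),
    Entity("Eibenberg", "LOC", 0.95),
    Entity("X", "ORG", 0.99),
]

if __name__ == "__main__":
    kept = filter_entities(sample, min_score=load_min_score())
    print([e.text for e in kept])
```

With the default of 0.97 only the first sample entity survives; lowering the threshold to 0.9 would also keep the second, while "X" is still dropped by the length check.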