Arabic words being changed from the original text in the lemmatization process
If you run the commands below, the lemma returned for every word differs from the original text, which looks like a bug to me. Please correct me if this is not the case and Arabic follows a different pattern from Western languages.
import stanza
nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
nlp('السيارة جميلة جدا')
In [12]: nlp('السيارة جميلة جدا')
Out[12]:
[
  [
    {
      "id": 1,
      "text": "السيارة",
      "lemma": "سَيَّارَة",
      "misc": "start_char=0|end_char=7"
    },
    {
      "id": 2,
      "text": "جميلة",
      "lemma": "جمِيلَة",
      "misc": "start_char=8|end_char=13"
    },
    {
      "id": 3,
      "text": "جدا",
      "lemma": "جِدّ",
      "misc": "start_char=14|end_char=17"
    }
  ]
]
To Reproduce
Steps to reproduce the behavior:
import stanza
stanza.download('ar')
nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
nlp('السيارة جميلة جدا')
We don't really have any Arabic experts in the stanza group, so would you explain a bit more what you see happening?
The idea of the lemma annotator is to condense different forms of the same word (in different parts of the text) into a single word. For example, in English: run, runs, ran, running -> run
Is something like that happening here, or is it going haywire and replacing a word with a completely different meaning?
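For example, something like this with the English pipeline should condense all four forms into one lemma (the printed result is what I'd expect, not a verified run):

import stanza

# Download the English models, then run the lemmatizer
# (the neural lemmatizer uses POS tags, so 'pos' is included)
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
doc = nlp('run runs ran running')
print([word.lemma for sent in doc.sentences for word in sent.words])
# expected: ['run', 'run', 'run', 'run']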
Hi @AngledLuffa, thank you for your answer. I am not an Arabic expert either; I am a software developer checking these cases during development using Google Translate. What I see is that any lemma produced by Stanza for Arabic is identified by Google as Sindhi rather than Arabic. I don't see this in other languages. Could it be a software error?
I literally know nothing about this, and I am equally sure that none of the original stanza developers do either, but I will ask a broader audience in our research group to see if they can identify a problem with these lemmas. In the meantime, I would encourage you to look elsewhere if you know anyone with Arabic expertise. If there is in fact a problem, we'll take a look, of course.
Speaking with some more distantly related colleagues of mine who speak Arabic, they think this lemmatization is close enough, but they point out that the model is adding diacritics. Perhaps the reason is that the model was trained only on text containing diacritics. I don't know why Google is identifying the different texts as different languages, but it could be because of the different diacritics.
The Arabic model was trained on the Prague Arabic Dependency Treebank as prepared by Universal Dependencies; info on that is here: https://github.com/UniversalDependencies/UD_Arabic-PADT/tree/master. It says the lemmas are vocalized, so you'll likely want to strip off the diacritics.
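For example, a minimal sketch of stripping the diacritics using Python's standard unicodedata module (the helper name strip_diacritics is mine, not part of stanza):

import unicodedata

def strip_diacritics(text):
    # Decompose, drop combining marks (Arabic harakat are category 'Mn'), recompose
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return unicodedata.normalize('NFC', stripped)

print(strip_diacritics('سَيَّارَة'))  # should print: سيارة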
I continue to know less about Arabic lemmatization than any other NLP researcher in the world, but at least we're upgrading the lemmatizer with a copy mechanism in the coming release, which should hopefully improve things a little. If not, perhaps it would make sense to check the original treebank to see whether it follows a reasonable standard. If the problem is still with the model, we can always revisit other ways to make it better.
The model is doing exactly what it was trained to do. My understanding is that Arabic lemmatization involves diacritization because of the ambiguity when diacritics are left out. Anyone who wants to use the lemmas for an application like IR probably wants to strip them. I just tried pasting some Arabic text with full diacritics into Google Translate, and its language ID does return Sindhi; manually setting the language does produce a reasonable translation.
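As a concrete sketch for the IR case (reusing the strip_diacritics helper from above; the exact stripped forms are my assumption):

import stanza

nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
doc = nlp('السيارة جميلة جدا')
# Strip the diacritics from each lemma before indexing
lemmas = [strip_diacritics(word.lemma) for sent in doc.sentences for word in sent.words]
print(lemmas)  # roughly: ['سيارة', 'جميلة', 'جد']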
I think this issue can be closed, but you may want to add some documentation for lemmatization: the forms returned depend on the training data, and some models might produce unexpected results, such as diacritization with Arabic; with Persian (the PerDT model) you get # characters in some lemmas, because the training data has entries like نشود mapping to شد#شو. (A sketch of handling those lemmas follows below.)
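If documentation gets added, here is a small sketch of how a downstream user might handle those PerDT-style lemmas (the helper name and the assumption that # joins the two verb roots are mine):

def split_perdt_lemma(lemma):
    # PerDT verb lemmas can encode two roots joined by '#', e.g. 'شد#شو'
    # (per the mapping mentioned above); split them so each root is usable alone
    if '#' in lemma:
        return lemma.split('#')
    return [lemma]

print(split_perdt_lemma('شد#شو'))  # ['شد', 'شو']
print(split_perdt_lemma('کتاب'))   # ['کتاب']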
I think the model is doing fine, but it could be fine-tuned for Arabic. In Arabic, almost all words (nouns, adjectives, and verbs) come from the simple form of the verb, like:
- جَميلَة --> جَمَلَ (beautiful)
- جَمال --> جَمَل (something that is the cause of your beauty)
- تَجَمُّل --> جَمَلَ (paying extra attention to something)
where جَمَلَ is the verb of making something or somebody beautiful.
So it seems the lemmatizer is doing a great job for Arabic.
As @cash mentioned, Persian has the same structure, but in a slightly more complex way: Persian words fall into three categories, and in all of them you can find either the present or the past root of the verb, like:
- ن + شو + د = نشود (it's not possible)
- ن + شد = نشد (it didn't happen)
- شد + ن + ی = شدنی (possible)
where شد means "happened" and شو is the commanding form (the present root of the verb), meaning "be".
I guess this is a pretty stale issue, but with the copy mechanism in the lemmatizer, is there anything left to do for this issue?