Arabic words being changed from the original text in the lemmatization process
If you run the commands below, the lemma returned for every word differs from the original text, which looks like a bug to me. Please correct me if this is not the case and Arabic follows a different pattern from Western languages.
import stanza
nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
nlp('السيارة جميلة جدا')
In [12]: nlp('السيارة جميلة جدا')
Out[12]:
[
  [
    {
      "id": 1,
      "text": "السيارة",
      "lemma": "سَيَّارَة",
      "misc": "start_char=0|end_char=7"
    },
    {
      "id": 2,
      "text": "جميلة",
      "lemma": "جمِيلَة",
      "misc": "start_char=8|end_char=13"
    },
    {
      "id": 3,
      "text": "جدا",
      "lemma": "جِدّ",
      "misc": "start_char=14|end_char=17"
    }
  ]
]
To Reproduce
Steps to reproduce the behavior:
import stanza
stanza.download('ar')
nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
nlp('السيارة جميلة جدا')
We don't really have any Arabic experts in the stanza group, so would you explain a bit more what you see happening?
The idea of the lemma annotator is to condense different forms of the same word (in different parts of the text) into a single word. For example, in English: run, runs, ran, running -> run
Is something like that happening here, or is it going haywire and replacing a word with a completely different meaning?
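For example, something like this with the English pipeline should condense all four forms into one lemma (the printed result is what I'd expect, not a verified run):

import stanza

# Download the English models, then run the lemmatizer
# (the neural lemmatizer uses POS tags, so 'pos' is included)
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
doc = nlp('run runs ran running')
print([word.lemma for sent in doc.sentences for word in sent.words])
# expected: ['run', 'run', 'run', 'run']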
Hi @AngledLuffa, thank you for your answer. I am not an Arabic expert either; I am a software developer checking these cases during development using Google Translate. What I see is that any lemma produced by Stanza for Arabic is identified by Google as Sindhi rather than Arabic. I don't see this in other languages. Could it be a software error?
I literally know nothing about this, and I am equally sure that none of the original stanza developers do either, but I will ask a broader audience in our research group to see if they can identify a problem with these lemmas. In the meantime, I would encourage you to look elsewhere if you know anyone with Arabic expertise. If there is in fact a problem, we'll take a look, of course.
Speaking with some more distantly related colleagues of mine who speak Arabic, they think this lemmatization is close enough, but they point out that the model is adding diacritics. Perhaps the reason is that the model was trained only on text containing diacritics. I don't know why Google is identifying the different texts as different languages, but it could be because of the different diacritics.
The Arabic model was trained on the Prague Arabic Dependency Treebank as prepared by Universal Dependencies; info on that is here: https://github.com/UniversalDependencies/UD_Arabic-PADT/tree/master. It says the lemmas are vocalized, so you'll likely want to strip off the diacritics.
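For example, a minimal sketch of stripping the diacritics using Python's standard unicodedata module (the helper name strip_diacritics is mine, not part of stanza):

import unicodedata

def strip_diacritics(text):
    # Decompose, drop combining marks (Arabic harakat are category 'Mn'), recompose
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return unicodedata.normalize('NFC', stripped)

print(strip_diacritics('سَيَّارَة'))  # should print: سيارة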
I continue to know less about Arabic lemmatization than any other NLP researcher in the world, but at least we're upgrading the lemmatizer with a copy mechanism in the coming release, which should hopefully improve things a little. If not, perhaps it would make sense to check the original treebank to see whether it follows a reasonable standard. If the problem is still with the model, we can always revisit other ways to make it better.
The model is doing exactly what it was trained to do. My understanding is that Arabic lemmatization involves diacritization because of the ambiguity when diacritics are left out. Anyone who wants to use the lemmas for an application like IR probably wants to strip them. I just tried pasting some Arabic text with full diacritics into Google Translate, and its language ID does return Sindhi; manually setting the language does produce a reasonable translation.
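As a concrete sketch for the IR case (reusing the strip_diacritics helper from above; the exact stripped forms are my assumption):

import stanza

nlp = stanza.Pipeline('ar', processors='tokenize, lemma')
doc = nlp('السيارة جميلة جدا')
# Strip the diacritics from each lemma before indexing
lemmas = [strip_diacritics(word.lemma) for sent in doc.sentences for word in sent.words]
print(lemmas)  # roughly: ['سيارة', 'جميلة', 'جد']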
I think this issue can be closed, but you may want to add some documentation for lemmatization: the forms returned depend on the training data, and some models might produce unexpected results, such as diacritization with Arabic; with Persian (the PerDT model) you get # characters in some lemmas, because the training data has entries like نشود mapping to شد#شو. (A sketch of handling those lemmas follows below.)
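If documentation gets added, here is a small sketch of how a downstream user might handle those PerDT-style lemmas (the helper name and the assumption that # joins the two verb roots are mine):

def split_perdt_lemma(lemma):
    # PerDT verb lemmas can encode two roots joined by '#', e.g. 'شد#شو'
    # (per the mapping mentioned above); split them so each root is usable alone
    if '#' in lemma:
        return lemma.split('#')
    return [lemma]

print(split_perdt_lemma('شد#شو'))  # ['شد', 'شو']
print(split_perdt_lemma('کتاب'))   # ['کتاب']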
I think the model is doing fine, but it could be fine-tuned for Arabic. In Arabic, almost all words (nouns, adjectives, and verbs) come from the simple form of the verb, like:
- جَميلَة --> جَمَلَ (beautiful)
- جَمال --> جَمَل (something that is the cause of your beauty)
- تَجَمُّل --> جَمَلَ (paying extra attention to something)
where جَمَلَ is the verb of making something or somebody beautiful.
So it seems the lemmatizer is doing a great job for Arabic.
As @cash mentioned, Persian has the same structure, but in a slightly more complex way: Persian words fall into three categories, and in all of them you can find either the present or the past root of the verb, like:
- ن + شو + د = نشود (it's not possible)
- ن + شد = نشد (it didn't happen)
- شد + ن + ی = شدنی (possible)
where شد means "happened" and شو is the commanding form (the present root of the verb), meaning "be".
I guess this is a pretty stale issue, but with the copy mechanism in the lemmatizer, is there anything left to do for this issue?