stanza [QUESTION] Arabic model not recognizing words

I've been using several of the Stanza models for POS tagging in other languages (English, Spanish, Mandarin, German, Japanese), and these all seem to work great. But with the Arabic model ('ar'), I'm getting "X" for every POS tag on the transcripts that I'm currently using. I'm not sure whether this is an issue with Stanza itself, with the formatting of the transcripts, or with how I'm reading in the data.

Here is what I am doing:

import pandas as pd
import stanza

nlp = stanza.Pipeline("ar")

file = "4023.txt"
transcript_raw = pd.read_csv(file, sep="\t", encoding="utf-8")
concat_sentences = []
for _, row in transcript_raw.iterrows():
    line = str(row.iloc[0])  # first column of the tab-separated row
    if line.startswith("***"):
        # a row starting with "***" closes the current utterance
        if concat_sentences:
            utterance = " ".join(concat_sentences)
        concat_sentences = []
    else:
        concat_sentences.append(line)
        break
nlp_out = nlp(concat_sentences[0])

Note that the break at the end is just to get an example case here.

Does anyone have any idea what is going wrong here? I've also tried running arabic_reshaper.reshape on the string first, but this does not resolve the issue. The head and deprel attributes do seem to be present, so I was wondering if this was specifically an issue with POS tagging, but it seems others have been able to successfully use this pipeline.

The transcript that I'm using comes from the CALLHOME corpus, and I've attached an example file (4023.txt) to be used in the code snippet above.

Mar 16 '23 09:03 jptrujillo

For reference, here is what we trained on:

https://github.com/UniversalDependencies/UD_Arabic-PADT

Let's start from some random sentence in the text file you sent

ﺍﻧﺍ ﺳﻣﻋﺗ ؛ﻣﻫﺍ ﻛﺍﻧﺗ ﺑﺗﺣﻛﻯ ؛ﻟﺷﻳﺭﻳﻧ ﻭ ؛ﺷﻓﻳﻋﺓ ﻭ ﻣﺍﻋﺭﻓﺷ ﻣﻳﻧ ﻭ ﻫﻣ ﻗﺍﻟﻭﺍ_ﻟﻯ

The first character looks a lot like ا from a sentence in the PADT training data such as

برلين 15-7 (اف ب) - افادت صحيفة الاحد الالمانية "ويلت ام سونتاغ" في عددها الصادر غدا، ان المستشار غيرهارد شرودر يرفض حصول المجموعة ميركية "جنرال ديناميكس" على رخصة لتصنيع الدبابة الالمانية "ليوبارد 2" عبر شراء المجموعة الحكومية الاسبانية للاسلحة "سانتا بربارة".ة".

but watch what happens when I do this:

>>> ord(text[0])   # text is your sentence
65165
>>> ord(text[0])   # text is the sentence from the training data
1575

Looking that up:

https://www.utf8icons.com/character/1575/arabic-letter-alef
https://www.utf8icons.com/character/65165/arabic-letter-alef-isolated-form
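
The same lookup can be done locally instead of through a website. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of Stanza.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# 65165 is the first character of the transcript sentence,
# 1575 is the first letter of the word picked from the training data.
for code_point in (65165, 1575):
    print(describe(chr(code_point)))
# U+FE8D ARABIC LETTER ALEF ISOLATED FORM
# U+0627 ARABIC LETTER ALEF
```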

The next letter in the random bit of text I grabbed is

https://www.utf8icons.com/character/65255/arabic-letter-noon-initial-form

So it looks like the text is composed of real Arabic letters, but the important thing here is that they don't show up anywhere in the training data, which means the models won't recognize them whatsoever. The letter "noon" (normally written "nun"?) that shows up in our training data is

https://www.utf8icons.com/character/1606/arabic-letter-noon

Is there some way to unify these to a canonical version, or perhaps the version used in the training set?
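
Before unifying anything, it may help to check how much of a given text is affected. The sketch below counts characters that fall in the Unicode "Arabic Presentation Forms-A" (U+FB50 to U+FDFF) and "Arabic Presentation Forms-B" (U+FE70 to U+FEFC) ranges, which is where the isolated/initial/medial/final letter forms live. The function name is hypothetical, and the sample strings are built by hand from the code points discussed above rather than read from the attached file.

```python
def count_presentation_forms(text):
    """Count characters in the Arabic Presentation Forms-A/B ranges."""
    return sum(
        1
        for ch in text
        # Forms-B is cut off at U+FEFC so that U+FEFF (the BOM) is not counted.
        if 0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
    )

shaped = "\uFE8D\uFEE7\uFE8D"      # alef (isolated), noon (initial), alef (isolated)
canonical = "\u0627\u0646\u0627"   # the same letters from the base Arabic block

print(count_presentation_forms(shaped))     # 3
print(count_presentation_forms(canonical))  # 0
```

A nonzero count on input text would suggest it needs to be mapped back to the base letters before the current models can make sense of it.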

Mar 16 '23 17:03 AngledLuffa

A brief look at arabic_reshaper suggests that it turns "no form" text into text with the proper positional forms, which is the opposite direction from what the current models need. So basically we would need to either

- convert all of Arabic PADT using arabic_reshaper, then train models using both the "no form" and the proper form text so that we can handle either writing style, or
- somehow invert the arabic_reshaper operation so that the current models can be used

Mar 16 '23 17:03 AngledLuffa

Perhaps the easiest thing to do would be to put everything through an unshaper such as in this Stack Overflow question:

https://stackoverflow.com/questions/33718144/do-arabic-characters-have-different-unicode-code-points-based-on-position-in-str

Mar 16 '23 17:03 AngledLuffa

It's not perfect, but if I start from the SHAPING map described in that Stack Overflow post, reverse it

>>> unshape = {}
>>> for x in SHAPING:
...   for y in SHAPING[x]:
...     unshape[y] = x
...

then use it to map the text

>>> text2 = "".join([unshape.get(x, x) for x in text])

the POS tagger at least tags most of the words with something intelligible instead of X
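
Since the SHAPING map itself lives in the Stack Overflow post and is not reproduced here, a self-contained alternative might look like the following. Note that this is a different technique from the reversed map above: it relies on the compatibility decompositions that Unicode defines for the presentation forms, via `unicodedata.normalize("NFKC", ...)` from the standard library. The function name `unshape_text` is hypothetical, and this is only a sketch; it has not been checked against the full CALLHOME transcripts.

```python
import unicodedata

def unshape_text(text):
    """Map Arabic presentation-form characters back to base Arabic letters.

    Only characters in the Arabic Presentation Forms-A/B ranges are
    normalized; everything else is passed through unchanged.
    """
    out = []
    for ch in text:
        if 0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC:
            # NFKC replaces a presentation form with its base letter(s);
            # ligatures such as lam-alef expand to two letters.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

shaped = "\uFE8D\uFEE7\uFE8D"  # alef (isolated), noon (initial), alef (isolated)
print([ord(c) for c in unshape_text(shaped)])  # [1575, 1606, 1575]
```

The result could then be passed to the pipeline in place of the raw string, e.g. `nlp(unshape_text(text))`. Restricting the normalization to the presentation-form ranges is a deliberate choice here, so that NFKC does not rewrite unrelated characters elsewhere in the transcript.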

Mar 17 '23 03:03 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

May 18 '23 20:05 stale[bot]

Is this something you think should be a default behavior of the Arabic pipeline? I have close to zero Arabic knowledge and cannot comment on how frequently this kind of text shows up in everyday NLP usage.

May 18 '23 20:05 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jul 19 '23 03:07 stale[bot]

This issue has been automatically closed due to inactivity.

Aug 07 '23 03:08 stale[bot]