[QUESTION] Arabic model not recognizing words
I've been using several of the Stanza models for POS tagging in other languages (English, Spanish, Mandarin, German, Japanese) and these all seem to work great. But with the Arabic model ('ar'), I'm getting "X" for every POS tag, with the transcripts that I'm currently using. I'm not sure if this is an issue with Stanza itself, or if it is with the formatting of the transcripts, or how I'm reading in the data.
Here is what I am doing:
import pandas as pd
import stanza

nlp = stanza.Pipeline("ar")
file = "4023.txt"
transcript_raw = pd.read_csv(file, sep="\t", encoding="utf-8")
concat_sentences = []
for row in transcript_raw.iterrows():
    # row is (index, Series); the first column holds the transcript line
    if str(row[1][0]).startswith("***"):
        # a "***" line ends the current utterance
        if concat_sentences:
            utterance = " ".join(concat_sentences)
            concat_sentences = []
    else:
        concat_sentences.append(row[1][0])
        break
nlp_out = nlp(concat_sentences[0])
Note that the break at the end is just to get an example case here.
Does anyone have any idea what is going wrong here? I've also tried running arabic_reshaper.reshape on the string first, but this does not resolve the issue. The head and deprel attributes do seem to be present, so I was wondering if this was specifically an issue with POS tagging, but it seems others have been able to successfully use this pipeline.
The transcript that I'm using comes from the CALLHOME corpus, and I've attached an example file (4023.txt) to use with the code snippet above.
For reference, here is what we trained on:
https://github.com/UniversalDependencies/UD_Arabic-PADT
Let's start from some random sentence in the text file you sent
ﺍﻧﺍ ﺳﻣﻋﺗ ؛ﻣﻫﺍ ﻛﺍﻧﺗ ﺑﺗﺣﻛﻯ ؛ﻟﺷﻳﺭﻳﻧ ﻭ ؛ﺷﻓﻳﻋﺓ ﻭ ﻣﺍﻋﺭﻓﺷ ﻣﻳﻧ ﻭ ﻫﻣ ﻗﺍﻟﻭﺍ_ﻟﻯ
The first character looks a lot like ا from a sentence in the PADT training data such as
برلين 15-7 (اف ب) - افادت صحيفة الاحد الالمانية "ويلت ام سونتاغ" في عددها الصادر غدا، ان المستشار غيرهارد شرودر يرفض حصول المجموعة ميركية "جنرال ديناميكس" على رخصة لتصنيع الدبابة الالمانية "ليوبارد 2" عبر شراء المجموعة الحكومية الاسبانية للاسلحة "سانتا بربارة".
but watch what happens when I do this:
>>> ord(text[0])  # this was your sentence
65165
>>> ord(text[0])  # this is from the training data
1575
Looking that up:
https://www.utf8icons.com/character/1575/arabic-letter-alef
https://www.utf8icons.com/character/65165/arabic-letter-alef-isolated-form
The next letter in the random bit of text I grabbed is
https://www.utf8icons.com/character/65255/arabic-letter-noon-initial-form
So the text is composed of real Arabic letters, but in positional presentation forms (isolated, initial, and so on) that don't show up anywhere in the training data, which means the models won't recognize them at all. The letter "noon" (also transliterated "nun") that shows up in our training data is
https://www.utf8icons.com/character/1606/arabic-letter-noon
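For anyone who wants to check their own text without looking characters up by hand, here is a minimal sketch using only the standard library. The helper name `describe` is made up for illustration; the four code points are the ones discussed above.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# The two alefs and the two noons from this thread:
# 65165 / 65255 come from the transcript, 1575 / 1606 from the training data.
for codepoint in (65165, 1575, 65255, 1606):
    print(codepoint, describe(chr(codepoint)))
```

Characters whose names end in ISOLATED FORM, INITIAL FORM, MEDIAL FORM, or FINAL FORM are presentation forms rather than the base letters found in PADT.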
Is there some way to unify these to a canonical version, or perhaps the version used in the training set?
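One candidate worth trying, offered as a sketch rather than a tested fix for this corpus: Unicode compatibility normalization (NFKC) maps the positional presentation forms back to their base letters. The function name `to_base_letters` is an assumption for illustration. Note that NFKC also rewrites unrelated compatibility characters (full-width Latin, some ligatures), so it may change more than just the Arabic.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Apply NFKC so positional Arabic forms collapse to their base letters."""
    return unicodedata.normalize("NFKC", text)


# alef (isolated form), noon (initial form), alef (isolated form)
shaped = "\uFE8D\uFEE7\uFE8D"
plain = to_base_letters(shaped)
print([ord(c) for c in plain])  # [1575, 1606, 1575]
```

The normalized string could then be passed to the pipeline in place of the raw transcript line.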
A brief look at arabic_reshaper makes it look like it turns "no form" text into having the proper form. So basically we would need to either
- convert all of Arabic PADT using arabic_reshaper, then train models using both the "no form" and the proper form text so that we can handle either writing style
- somehow invert the arabic_reshaper operation so that the current models can be used
Perhaps the easiest thing to do would be to put everything through an unshaper such as in this Stack Overflow question:
https://stackoverflow.com/questions/33718144/do-arabic-characters-have-different-unicode-code-points-based-on-position-in-str
It's not perfect, but if I start from the SHAPING map described in that Stack Overflow post, reverse it
>>> unshape = {}
>>> for x in SHAPING:
... for y in SHAPING[x]:
... unshape[y] = x
...
then use it to map the text
text2 = "".join([unshape.get(x, x) for x in text])
the POS tagger at least tags most of the words with something intelligible instead of X.
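The SHAPING table itself lives in the Stack Overflow post and is not reproduced here. As a self-contained stand-in, the sketch below builds a similar reverse map from Python's `unicodedata` decomposition data instead. This is an assumption-laden substitute, not the exact map used above: `build_unshape_map`, `UNSHAPE`, and `unshape_text` are hypothetical names, and the map is deliberately limited to the two Arabic Presentation Forms blocks so other text is left alone.

```python
import unicodedata

# Tags the Unicode database uses for positional Arabic forms.
POSITIONAL_TAGS = {"<isolated>", "<initial>", "<medial>", "<final>"}

# Arabic Presentation Forms-A and Forms-B.
PRESENTATION_BLOCKS = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def build_unshape_map() -> dict:
    """Map each positional form to the base letter(s) it decomposes to."""
    unshape = {}
    for start, end in PRESENTATION_BLOCKS:
        for cp in range(start, end + 1):
            parts = unicodedata.decomposition(chr(cp)).split()
            if not parts or parts[0] not in POSITIONAL_TAGS:
                continue
            targets = [int(h, 16) for h in parts[1:]]
            # Keep only mappings that land entirely in the basic Arabic block,
            # which skips forms that would decompose to a space plus a mark.
            if all(0x0600 <= t <= 0x06FF for t in targets):
                unshape[chr(cp)] = "".join(chr(t) for t in targets)
    return unshape


UNSHAPE = build_unshape_map()


def unshape_text(text: str) -> str:
    """Replace positional forms with base letters; leave everything else."""
    return "".join(UNSHAPE.get(ch, ch) for ch in text)
```

With a pipeline already loaded, usage would look like `nlp(unshape_text(text))`. Ligatures such as lam-alef expand to two letters, so the output can be longer than the input.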
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is this something you think should be a default behavior of the Arabic pipeline? I have close to zero Arabic knowledge and cannot comment on how frequently this kind of text shows up in everyday NLP usage.
This issue has been automatically closed due to inactivity.