spaCy
Sentencizer fails with Armenian, Gujarati, and Icelandic
Hi, the outputs of sentence segmentation with Armenian, Gujarati, and Icelandic texts are not as expected.
Note that the Armenian full stop ։ (U+0589), which looks like a colon, is used as the sentence terminator in Armenian.
Related issue: #4269
>>> import spacy
>>> TEXT_HYE = 'Հայոց լեզվով ստեղծվել է մեծ գրականություն։ Գրաբարով է ավանդված հայ հին պատմագրությունը, գիտափիլիսոփայական, մաթեմատիկական, բժշկագիտական, աստվածաբանական-դավանաբանական գրականությունը։ Միջին գրական հայերենով են մեզ հասել միջնադարյան հայ քնարերգության գլուխգործոցները, բժշկագիտական, իրավագիտական նշանակալի աշխատություններ։ Գրական նոր հայերենի արևելահայերեն ու արևմտահայերեն գրական տարբերակներով ստեղծվել է գեղարվեստական, հրապարակախոսական ու գիտական բազմատիպ ու բազմաբնույթ հարուստ գրականություն։'
>>> TEXT_GUJ = 'ગુજરાતી (/ɡʊdʒəˈrɑːti/[૭], રોમન લિપિમાં: Gujarātī, ઉચ્ચાર: [ɡudʒəˈɾɑːtiː]) ભારત દેશના ગુજરાત રાજ્યની ઇન્ડો-આર્યન ભાષા છે, અને મુખ્યત્વે ગુજરાતી લોકો દ્વારા બોલાય છે. તે બૃહદ ઇન્ડો-યુરોપિયન ભાષા કુટુંબનો ભાગ છે. ગુજરાતીનો ઉદ્ભવ જૂની ગુજરાતી ભાષા (આશરે ઇ.સ. ૧૧૦૦-૧૫૦૦)માંથી થયો છે. તે ગુજરાત રાજ્ય અને દીવ, દમણ અને દાદરા-નગર હવેલી કેન્દ્રશાસિત પ્રદેશોની અધિકૃત ભાષા છે.'
>>> TEXT_ISL = 'Íslenska er vesturnorrænt, germanskt og indóevrópskt tungumál sem er einkum talað og ritað á Íslandi og er móðurmál langflestra Íslendinga.[5] Það hefur tekið minni breytingum frá fornnorrænu en önnur norræn mál[5] og er skyldara norsku og færeysku en sænsku og dönsku.[2][3]'
>>> nlp = spacy.blank('hy'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_HYE); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB330380>
1
>>> nlp = spacy.blank('gu'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_GUJ); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB40B040>
1
>>> nlp = spacy.blank('is'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_ISL); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB3FCA40>
1
- Operating System: Windows 10 x64
- Python Version Used: Python 3.8.7 x64
- spaCy Version Used: 3.0.6
Thanks for the report, it's useful to see these kinds of test cases. The underlying cause is an interaction with the tokenizer: it isn't splitting the sentence-final punctuation off into separate tokens, so the sentencizer never sees the punctuation tokens it needs to split on.
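The interaction can be illustrated outside spaCy with a toy suffix rule (the regexes below are hypothetical simplifications, not spaCy's actual suffix patterns): an English-like rule only treats ASCII terminators as token suffixes, so the Armenian full stop ։ stays glued to the preceding word and never becomes a separate token.

```python
import re

# Hypothetical, simplified stand-ins for tokenizer suffix rules (not
# spaCy's real patterns): the "English-like" rule only recognizes ASCII
# sentence terminators, so it never splits the Armenian full stop off.
english_like_suffix = re.compile(r"[.!?]$")
extended_suffix = re.compile(r"[.!?։]$")  # also matches U+0589

token = "գրականություն։"
print(bool(english_like_suffix.search(token)))  # False: ։ is not split off
print(bool(extended_suffix.search(token)))      # True: ։ would become its own token
```

With the English-like rule the whole string remains one token, and a sentencizer that matches punctuation tokens by exact text has nothing to split on.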
I think Icelandic should be okay for cases without the Wikipedia footnote marks like [5].
Armenian and Gujarati don't have any language-specific tokenizer settings, so they fall back to a relatively English-like set of defaults, which don't split ։ off as a separate token and which assume that short strings followed by periods, like છે., are abbreviations.
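As a workaround until language-specific punctuation rules exist, the default suffix patterns can be extended at runtime so that ։ is split off as its own token; the sentencizer's default punct_chars already include ։, so segmentation then works. A sketch, assuming spaCy v3 (the example sentence below is made up for illustration):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("hy")
# Extend the default suffix patterns so the Armenian full stop (U+0589)
# is split off the preceding word as a separate token.
suffixes = list(nlp.Defaults.suffixes) + ["։"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
nlp.add_pipe("sentencizer")

doc = nlp("Սա առաջին նախադասությունն է։ Սա երկրորդն է։")
print(len(list(doc.sents)))  # 2
```

The same approach should carry over to Gujarati by adding the relevant terminators to the suffix patterns, though abbreviation handling would still need language-specific tokenizer exceptions.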
What is your goal/task? The sentencizer is pretty simple, and at best you're going to get sentence-ish chunks rather than high-quality sentence boundaries, especially once the punctuation gets more complicated or is potentially missing.
I use spaCy for word and sentence tokenization across all the languages it supports in my project, a multilingual corpus processing and analysis tool packaged for non-technical users. I don't have a hard requirement that sentence or word tokenization be nearly 100% correct, but since I run exhaustive tests for each task and language in my project, I report whenever assertion errors are thrown during testing.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.