spaCy
Sentencizer fails with Armenian, Gujarati, and Icelandic
Hi, the outputs of sentence segmentation with Armenian, Gujarati, and Icelandic texts are not as expected.
Note that the Armenian full stop ։ (U+0589), which looks like a colon, is used as the sentence terminator in Armenian.
Related issue: #4269
>>> import spacy
>>> TEXT_HYE = 'Հայոց լեզվով ստեղծվել է մեծ գրականություն։ Գրաբարով է ավանդված հայ հին պատմագրությունը, գիտափիլիսոփայական, մաթեմատիկական, բժշկագիտական, աստվածաբանական-դավանաբանական գրականությունը։ Միջին գրական հայերենով են մեզ հասել միջնադարյան հայ քնարերգության գլուխգործոցները, բժշկագիտական, իրավագիտական նշանակալի աշխատություններ։ Գրական նոր հայերենի արևելահայերեն ու արևմտահայերեն գրական տարբերակներով ստեղծվել է գեղարվեստական, հրապարակախոսական ու գիտական բազմատիպ ու բազմաբնույթ հարուստ գրականություն։'
>>> TEXT_GUJ = 'ગુજરાતી (/ɡʊdʒəˈrɑːti/[૭], રોમન લિપિમાં: Gujarātī, ઉચ્ચાર: [ɡudʒəˈɾɑːtiː]) ભારત દેશના ગુજરાત રાજ્યની ઇન્ડો-આર્યન ભાષા છે, અને મુખ્યત્વે ગુજરાતી લોકો દ્વારા બોલાય છે. તે બૃહદ ઇન્ડો-યુરોપિયન ભાષા કુટુંબનો ભાગ છે. ગુજરાતીનો ઉદ્ભવ જૂની ગુજરાતી ભાષા (આશરે ઇ.સ. ૧૧૦૦-૧૫૦૦)માંથી થયો છે. તે ગુજરાત રાજ્ય અને દીવ, દમણ અને દાદરા-નગર હવેલી કેન્દ્રશાસિત પ્રદેશોની અધિકૃત ભાષા છે.'
>>> TEXT_ISL = 'Íslenska er vesturnorrænt, germanskt og indóevrópskt tungumál sem er einkum talað og ritað á Íslandi og er móðurmál langflestra Íslendinga.[5] Það hefur tekið minni breytingum frá fornnorrænu en önnur norræn mál[5] og er skyldara norsku og færeysku en sænsku og dönsku.[2][3]'
>>> nlp = spacy.blank('hy'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_HYE); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB330380>
1
>>> nlp = spacy.blank('gu'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_GUJ); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB40B040>
1
>>> nlp = spacy.blank('is'); nlp.add_pipe('sentencizer'); doc = nlp(TEXT_ISL); len(list(doc.sents))
<spacy.pipeline.sentencizer.Sentencizer object at 0x000001C1FB3FCA40>
1
- Operating System: Windows 10 x64
- Python Version Used: Python 3.8.7 x64
- spaCy Version Used: 3.0.6
Thanks for the report, it's useful to see these kinds of test cases. The underlying cause is an interaction with the tokenizer: it isn't splitting the sentence-final punctuation off into separate tokens, so the sentencizer never sees the punctuation tokens it needs to split on.
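The interaction can be illustrated outside spaCy with a toy suffix rule (the regexes below are hypothetical simplifications, not spaCy's actual suffix patterns): an English-like rule only treats ASCII terminators as token suffixes, so the Armenian full stop ։ stays glued to the preceding word and never becomes a separate token.

```python
import re

# Hypothetical, simplified stand-ins for tokenizer suffix rules (not
# spaCy's real patterns): the "English-like" rule only recognizes ASCII
# sentence terminators, so it never splits the Armenian full stop off.
english_like_suffix = re.compile(r"[.!?]$")
extended_suffix = re.compile(r"[.!?։]$")  # also matches U+0589

token = "գրականություն։"
print(bool(english_like_suffix.search(token)))  # False: ։ is not split off
print(bool(extended_suffix.search(token)))      # True: ։ would become its own token
```

With the English-like rule the whole string remains one token, and a sentencizer that matches punctuation tokens by exact text has nothing to split on.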
I think Icelandic should be okay for cases without the Wikipedia footnote marks like [5].
Armenian and Gujarati don't have any language-specific tokenizer settings, so they fall back to a relatively English-like set of defaults, which don't split ։ off as a separate token and which assume that short strings followed by periods, like છે., are abbreviations.
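As a workaround until language-specific punctuation rules exist, the default suffix patterns can be extended at runtime so that ։ is split off as its own token; the sentencizer's default punct_chars already include ։, so segmentation then works. A sketch, assuming spaCy v3 (the example sentence below is made up for illustration):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("hy")
# Extend the default suffix patterns so the Armenian full stop (U+0589)
# is split off the preceding word as a separate token.
suffixes = list(nlp.Defaults.suffixes) + ["։"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
nlp.add_pipe("sentencizer")

doc = nlp("Սա առաջին նախադասությունն է։ Սա երկրորդն է։")
print(len(list(doc.sents)))  # 2
```

The same approach should carry over to Gujarati by adding the relevant terminators to the suffix patterns, though abbreviation handling would still need language-specific tokenizer exceptions.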
What is your goal/task? The sentencizer is pretty simple, and at best you're going to get sentence-ish chunks rather than high-quality sentence boundaries, especially once the punctuation gets more complicated or is potentially missing.
I use spaCy for word and sentence tokenization across all the languages it supports in my project, a multilingual corpus processing and analysis tool packaged for non-technical users. I don't have a hard requirement that sentence or word tokenization be nearly 100% correct, but since I run exhaustive tests for each task and language in my project, I report whenever assertion errors are thrown during testing.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.