
luganda language extension

Open tobiusaolo opened this issue 3 years ago • 10 comments

This is an initiative to add Luganda, a language from Uganda, East Africa, to spaCy.

tobiusaolo avatar May 25 '22 08:05 tobiusaolo

You can click on the red X to see why a build failed, though in this case the most recent build succeeded so you're fine. We'll take a look at this, thanks for the submission!

polm avatar May 26 '22 02:05 polm

Thank you, I am looking forward to your feedback.

tobiusaolo avatar May 26 '22 07:05 tobiusaolo

Aside from the ordinal English number issue mentioned above, I think this PR is looking good for initial support for Luganda.

Were you planning on adding any custom tokenizer settings (in punctuation.py or tokenizer_exceptions.py) or do the current defaults work well enough for now?

I think it would be nice to have a few example sentences in examples.py. You can choose your own sentences or translate sentences from another language like the English examples:

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/examples.py
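As a rough sketch of what such a file could look like, assuming the same layout as the linked English file (a module docstring followed by a `sentences` list). The two sentences are the ones that come up later in this thread; the final file may well contain different ones.

```python
"""
Example sentences to test spaCy and its language models.
(Sketch of a possible spacy/lang/lg/examples.py; contents are illustrative.)
"""

# Plain list of example sentences, one string per sentence.
sentences = [
    "Abooluganda ab’emmamba ababiri",
    "Ekisaawe ky'ebyenjigiriza kya mugaso nnyo",
]

# Quick sanity check: every entry is a non-empty string.
for sentence in sentences:
    print(len(sentence.split()), sentence)
```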

adrianeboyd avatar Jun 20 '22 07:06 adrianeboyd

Alright, let me do that.

On Mon, Jul 4, 2022 at 5:09 PM Sofie Van Landeghem @.***> wrote:

@.**** commented on this pull request.

In spacy/lang/lg/stop_words.py https://github.com/explosion/spaCy/pull/10847#discussion_r913042334:

+STOP_WORDS=set(
+ "wa lwa si ebyo nti anti nanti okutuusa tu wandi wa kiki kki dda"
+ "a singa oluvannyuma neera yenna nze ne kyonna ba nga ku beera kubanga"
+ "byombi naye osobola buli okuva kuva teyalina talina bayina byonna yonna byaffe be"
+ "bombi tebaalina tayina bonna zonna tayina tebaalina teyayina tetulina alina wano bimu abadde waliwo"
+ "bangi wakati ejja omuli ebyo nabo balina kuwa kyaffe olwekyo"
+ "buva bwaffe yonna ddala liryo yaffe terina kennyini ye bwonna bokka abalala bulungi kirungi ebweru"
+ "obulungi leero bya kikye yina atya munda ziba byabwe tewali erimu engeri ffenna lyange okudda kudda ebiri twafuna nnyingi lyabwe"
+ "zaabwe mu endala lyaffe kye nnyini tebayina yennyini ga bibye ayinza ali kikino nandi"
+ "ye nyinza ateekeddwa tetuteekeddwa neetaaga seetaaga nedda edda kati ku gumu gujja oba ekirala wabweru waggulu"
+ "nnina byebimu n'olwekyo ekyo bo abava bingi abangi ojja bangi waliyo bino bwabwe bandi bajja ajja wansi bulijjo kaseera ba"
+ "balina kino ebyo ku nnyo ennyo okutuusa bwayo yabadde ffe tu-yina kyekimu"
+ "oyo babadde baali tebaali ki kiki ddi wa ani lwaki ne gwe wandi oli oyina kikyo e mu wange ku bwe wa bajja"
+ "newankubade sinakindi n'olwekyo okuggyako gunno guno bateekeddwa oba gwe mwe"
+ "gyabwe erina tolina ebimu mingi zijja ffe nanti anti naye ate"
+ "wamu awamu baweebwa aweebwa weebwa era wadde mpozzi ekyo oyo kati  kyekyo oluvannyuma kwegamba nandiyagadde wadde kubanga"
+ "olwokuba wabula nnyo nnyini nnyinza tuyina tulina tayina balina bali okuwa twetaaga okugenda bayina alina mulina"
+ "oyina olina abamu bano ye otya ki ono gwa nabadde mbadde".split()
+)

I would format this as

STOP_WORDS = set(
    """
wa lwa si ...
...
...
nabadde mbadde
""".split()
)

and sort this alphabetically
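A minimal sketch of why the single triple-quoted string helps, using only a handful of words from the diff above. The names `RAW_WORDS` and `joined` are hypothetical and not part of the PR, and the real list is far longer. Adjacent string literals are concatenated before `.split()` runs, so a missing space at a line boundary silently fuses two words.

```python
# Adjacent literals are joined first, so "dda" and "a" fuse into "ddaa".
joined = "kiki kki dda" "a singa oluvannyuma".split()
print(joined)

# Hypothetical, shortened word list in the suggested layout.
RAW_WORDS = """
wa lwa si ebyo nti anti nanti okutuusa tu wandi wa kiki
nabadde mbadde
"""

# Splitting one multi-line string on whitespace keeps every word separate;
# the set drops the duplicate "wa".
STOP_WORDS = set(RAW_WORDS.split())

# Print the entries alphabetically so they can be pasted back in sorted order.
for word in sorted(STOP_WORDS):
    print(word)
```

Sorting only affects how the source file reads; the resulting set is unordered either way.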


tobiusaolo avatar Jul 04 '22 14:07 tobiusaolo

Thanks for the updates, this is looking good! In a second I'll try to make a few minor edits and reformat so this is ready to merge...

adrianeboyd avatar Jul 14 '22 07:07 adrianeboyd

Actually, one more question: what is the intended tokenization of strings like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples (thanks for adding a few!), I get the tokenization:

Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']

From the stop words, it looks like you're expecting "ky'" to be a separate token?

If I know what the tokenization is intended to be, I can add a few tokenizer tests and help adjust the tokenizer settings.

adrianeboyd avatar Jul 14 '22 07:07 adrianeboyd

Thank you for your feedback. First, about tokenization: I have consulted the language expert, and we can take the sentence below as an example.

Sentence: Abooluganda ab’emmamba ababiri

We can tokenize the above sentence as ['Abooluganda', 'ab’emmamba', 'ababiri']

As for 'ky', I will remove it and update the repo. Otherwise, thank you for the guidance.

Regards


tobiusaolo avatar Jul 15 '22 07:07 tobiusaolo

Do you have a source for the stop words?

I'm still a bit confused about the tokenizer settings vs. stop words.

Is ky' ever a separate token and not just a prefix? With the current tokenizer settings, none of the stop words with ' will end up as separate tokens, so the stop words with apostrophes might not make sense.

For example:

import spacy

nlp = spacy.blank("lg")

doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc]) # ['Ekiwandiiko', "ky'olunaku"]

I will add some basic tokenizer tests in a minute with the example above.

adrianeboyd avatar Jul 27 '22 07:07 adrianeboyd

You're right, ky' is not supposed to be a separate token. According to our discussion with the Luganda experts, the word should remain "ky'olunaku" when tokenized. As for the source of the stop words, the list has not been published yet; it was generated by the experts here.


tobiusaolo avatar Jul 29 '22 12:07 tobiusaolo

I'm worried that users will be confused in the future because "ky'" is a stop word but never a separate token that could be marked as a stop word. Does it make sense to remove all these stop words?

contractions = [
    "b'",
    "bw'",
    "by'",
    "eky'",
    "ey'",
    "ez'",
    "g'",
    "gw'",
    "gy'",
    "ky'",
    "lw'",
    "ly'",
    "n'",
    "ng'",
    "olw'",
    "ow'",
    "w'",
    "y'",
    "z'",
]
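To illustrate the concern, here is a small sketch that uses plain whitespace splitting as a stand-in for the tokenizer. That is an assumption: spaCy's tokenizer applies more rules, but it matches the output shown earlier in this thread for this input. The `stop_words` set is a hypothetical mix chosen for illustration.

```python
# Subset of the contraction prefixes listed above.
contractions = ["b'", "ky'", "n'", "olw'"]

# Hypothetical stop word set that still contains two contraction prefixes.
stop_words = {"wa", "nnyo", "kya", "ky'", "n'"}

# Whitespace splitting stands in for the tokenizer (assumption, see above).
tokens = "Ekiwandiiko ky'olunaku".split()

# "ky'" never appears as its own token, so it can never match.
matches = [t for t in tokens if t in stop_words]
print(matches)

# Removing the contraction prefixes leaves only entries that can match.
cleaned = stop_words - set(contractions)
print(sorted(cleaned))
```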

adrianeboyd avatar Aug 03 '22 10:08 adrianeboyd

I sent a new stop word list in the latest PR, which does not include "ky'" or "b'". Those words were moved to the contractions. I suggest that the stop words should stand, since the contractions are distinct. Kind regards


tobiusaolo avatar Aug 16 '22 11:08 tobiusaolo

Sorry for the delay, I thought I should wait on an update because in the current version the contractions are still added to the stop words. If the contractions are removed, then I think this is fine to merge. Let me go ahead and do that...

We're actually planning to remove the default stop word lists for v4, but I was hoping to leave all the stop words in v3 as a useful reference for users.

adrianeboyd avatar Aug 23 '22 08:08 adrianeboyd

Thanks again for the PR! We'll mention Luganda in the release notes for the next release (probably v3.4.2).

adrianeboyd avatar Aug 23 '22 11:08 adrianeboyd