Create Scribe-Data Swahili data process queries

Open andrewtavis opened this issue 1 year ago • 34 comments
Terms

Description

This issue would create the queries for Swahili in the src/scribe_data/language_data_extraction directory. To start we can make a nouns query and a verbs query in two separate PRs, and from there we can make new issues for other types of data. These queries can be based on the already existing queries for other languages 😊

Data types to include:

  • [x] Nouns
  • [x] Verbs
  • [x] Adjectives
  • [x] Adverbs
  • [x] Prepositions
  • [ ] Emoji keywords

Contribution

Happy to support and answer any questions that might come up in this process! Can also review when the PRs are up :)

andrewtavis avatar Oct 03 '24 09:10 andrewtavis

CC @LevisNgigi who expressed interest in working on this :) Can you write in here and I'll assign? Feel free to make the directory structure as you see the other languages are structured!

andrewtavis avatar Oct 03 '24 09:10 andrewtavis

Yes, I can write in here, and I will be glad to be assigned this. I will make the directory structure as I have seen it for other languages.

LevisNgigi avatar Oct 03 '24 09:10 LevisNgigi

Fantastic, @LevisNgigi! Looking forward to the contribution :)

andrewtavis avatar Oct 03 '24 10:10 andrewtavis

Question: I am currently querying data from the Wikidata Query Service, and the singular and plural columns are empty for the Swahili language. Is it possible to get clarification on how to proceed?

LevisNgigi avatar Oct 04 '24 08:10 LevisNgigi

Can you paste your query, @LevisNgigi? Maybe there's little data on Wikidata right now, or it's not categorized correctly 🤔 You can also try to remove everything from the query and just get Swahili words to check if there's info there :)

Nouns at the very least are usually consistent, so you'll still be able to send along your code that will work when there is data :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis
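
A minimal sketch of the stripped-down check suggested above: it asks only for Swahili noun lexemes and their lemmas, with no form or grammatical-feature constraints. It assumes the Wikidata Query Service's predefined prefixes and reuses the QIDs from the query posted in this thread (Q7838 for Swahili, Q1084 for noun); the LIMIT value is arbitrary.

```sparql
# Sanity check: are there any Swahili noun lexemes at all?
SELECT ?lexeme ?lemma
WHERE {
  ?lexeme dct:language wd:Q7838 ;              # Swahili
          wikibase:lexicalCategory wd:Q1084 ;  # noun
          wikibase:lemma ?lemma .
}
LIMIT 20
```

If this returns rows while the singular and plural columns of the fuller query stay empty, the lexemes exist but their forms are probably not yet tagged with those grammatical features.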

SELECT DISTINCT ?lexeme ?lemma ?singular ?plural

WHERE {
    ?lexeme dct:language wd:Q7838 ;              # Swahili
        wikibase:lexicalCategory wd:Q1084 ;      # noun
        wikibase:lemma ?lemma .

    # Singular form, if one is recorded.
    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?singularForm .
        ?singularForm ontolex:representation ?singular ;
            wikibase:grammaticalFeature wd:Q110786 .
    }

    # Plural form, if one is recorded.
    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?pluralForm .
        ?pluralForm ontolex:representation ?plural ;
            wikibase:grammaticalFeature wd:Q146786 .
    }
}

LIMIT 100

Above is my query. The lemma column has data, but the singular and plural columns are currently empty. Yes, there are Swahili words; I had checked that before adding singular and plural to the query.

LevisNgigi avatar Oct 04 '24 08:10 LevisNgigi

You can also use src/scribe_data/check_language_data.sparql to check the data totals :) It looks like there are 203 Swahili nouns using Q7838 :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis
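
A total like this can be reproduced with a short count query. The following is only a sketch of that kind of check, not the contents of check_language_data.sparql; it assumes the same QIDs used elsewhere in this thread, and the ?total variable name is illustrative.

```sparql
# Count the distinct Swahili noun lexemes.
SELECT (COUNT(DISTINCT ?lexeme) AS ?total)
WHERE {
  ?lexeme dct:language wd:Q7838 ;              # Swahili
          wikibase:lexicalCategory wd:Q1084 .  # noun
}
```

Swapping Q1084 for the QID of another lexical category would give the corresponding total for that data type.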

By the looks of it singulars and plurals haven't been added for them yet, which is ok 😊 When I started Scribe years ago so many languages had no data. French only had two verbs with conjugations, and now there are thousands. Can you convert the lexeme over to just the LID instead of the URI, and from there I think we should be good for now :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis

You can see the conversion in other queries :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis
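
One way to do that conversion, sketched here on the assumption that lexeme URIs begin with http://www.wikidata.org/entity/: cast the URI to a string with STR and strip the prefix with REPLACE, leaving just the LID (for example L123). The ?lexemeID variable name is illustrative, and the existing queries in the repository remain the reference for the exact form.

```sparql
# Return the bare lexeme ID instead of the full entity URI.
SELECT DISTINCT
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?lemma
WHERE {
  ?lexeme dct:language wd:Q7838 ;              # Swahili
          wikibase:lexicalCategory wd:Q1084 ;  # noun
          wikibase:lemma ?lemma .
}
LIMIT 100
```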

Yes, I just checked and there are only 203 nouns and 20 verbs. Should I proceed, or does it need more data?

LevisNgigi avatar Oct 04 '24 08:10 LevisNgigi

Proceed by all means, @LevisNgigi! There will be more data eventually :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis

For now do your best, and we can revisit the queries later 😊

andrewtavis avatar Oct 04 '24 08:10 andrewtavis

Thank you. Really appreciate your help.

LevisNgigi avatar Oct 04 '24 08:10 LevisNgigi

To quote you: The pleasure is mine :)

andrewtavis avatar Oct 04 '24 08:10 andrewtavis

Hey, I would like to also work on this issue

VNW22 avatar Oct 04 '24 15:10 VNW22

Hey @VNW22 👋 I think that @LevisNgigi has nouns covered. Would you want to make an adjectives query?

andrewtavis avatar Oct 04 '24 17:10 andrewtavis

yeah, happy to work on the adjectives query.

VNW22 avatar Oct 04 '24 18:10 VNW22

Ok, check the Bengali adjectives query and make something similar in a Swahili directory :)

andrewtavis avatar Oct 04 '24 18:10 andrewtavis

Hey @andrewtavis, I would also like to work on this. Kindly let me know if there is any way I could contribute.

GicharuElvis avatar Oct 08 '24 07:10 GicharuElvis

I'll leave it to the other contributors to say if there's more work to do here :) We'll make more issues soon.

andrewtavis avatar Oct 08 '24 07:10 andrewtavis

Sure, no worries. I'll be on the lookout

GicharuElvis avatar Oct 08 '24 07:10 GicharuElvis

I'll leave it to the other contributors to say if there's more work to do here :) We'll make more issues soon.

I think we have the nouns, verbs and adjectives queries, so there is no more work here for now, @GicharuElvis.

LevisNgigi avatar Oct 08 '24 08:10 LevisNgigi

No worries. Let me have a look at the other issues. :)

GicharuElvis avatar Oct 08 '24 08:10 GicharuElvis

Okay :) You can also check in Scribe-Android: https://github.com/scribe-org/Scribe-Android

LevisNgigi avatar Oct 08 '24 08:10 LevisNgigi

Or for this we could also do an adverbs one or prepositions :) Not sure on that for Swahili, but could be something to look into 😊

andrewtavis avatar Oct 08 '24 08:10 andrewtavis

Hey all 👋 In regards to the Swahili work, I did some filtering for sw so it's Latin script as that's what a quick search showed is mostly used (sorry if this isn't the case!). Does it make sense to also make queries for the Arabic-letter style? And are there different names for these types of written Swahili?

andrewtavis avatar Oct 08 '24 23:10 andrewtavis
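
For illustration, a filter of the kind described might look like the following. This is an assumption about its shape, not the exact change made in the repository: it keeps only lemmas whose language tag is sw.

```sparql
# Keep only lemmas tagged with the "sw" language code.
SELECT DISTINCT ?lexeme ?lemma
WHERE {
  ?lexeme dct:language wd:Q7838 ;              # Swahili
          wikibase:lexicalCategory wd:Q1084 ;  # noun
          wikibase:lemma ?lemma .
  FILTER(LANG(?lemma) = "sw")
}
LIMIT 100
```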

Hey @andrewtavis, the filtering you used works perfectly, as written Swahili today uses Latin script. The Arabic style of writing fizzled out with the coming of the missionaries, who introduced Latin script, and the Arabic-letter style is no longer in use. Also, there are no other names for Swahili; it is just that Swahili borrowed a lot from the Arabic language, hence the use of the Arabic-letter style back in the day.

LevisNgigi avatar Oct 09 '24 07:10 LevisNgigi

Thanks for letting me know, @LevisNgigi!

andrewtavis avatar Oct 09 '24 07:10 andrewtavis

Just added a list of data types that we want to include to this issue :) Have marked those that are already done or have PRs open, and we can work on the others 😊 If the data type can't work, then we can move to the others and open up specific issues later :)

andrewtavis avatar Oct 09 '24 08:10 andrewtavis

Sounds great, let me have a look at them now.

LevisNgigi avatar Oct 09 '24 08:10 LevisNgigi